Topic 28: What is Mixture-of-Mamba?

In this episode, we discuss how to enable the Mamba Selective State Space Model (SSM) to handle multimodal data using the Mixture-of-Transformers concept and modality-aware sparsity.

At Turing Post, we are particularly excited about exploring LLM architectures that differ from widespread approaches like transformers. One of them is the Mamba Selective State Space Model (SSM), which we covered in one of our first AI 101 episodes. It is one of the main competitors to transformers thanks to its efficient handling of long sequences, high speed, and reduced memory use. What's especially intriguing is watching how different architectures get upgraded to keep pace with emerging trends. Mamba, for example, is not an efficient option for processing multimodal data, and this is where Mixture-of-Mamba (MoM) comes in. It borrows the Mixture-of-Experts (MoE) idea from the transformer world to enhance SSMs for multimodal tasks. MoM's main feature, modality-aware sparsity, turns the Mamba core into a powerful new architecture built for multimodality. Let's explore how MoM changes Mamba and how this fascinating system works.

In today’s episode, we will cover:

  • Mixture-of-Mamba (MoM): what’s the idea?

  • How does MoM work?

  • How good is MoM?

  • MoM’s advantages

  • Not without limitations

  • Conclusion: Why does Mixture-of-Mamba stand out?

  • Bonus: Resources to dive deeper

Mixture-of-Mamba: what’s the idea?

Mamba is one of the most powerful Selective State Space Models (SSMs). At their core, SSMs are a type of AI model that can efficiently process sequences of data, such as sentences or videos. They have been explored as a competitive alternative to transformers, which are powerful but computationally expensive. Mamba is especially efficient and has the following advantages over transformers (a minimal code sketch of the underlying recurrence follows the list):

  • Efficient handling of long sequences: Mamba achieves linear scaling with sequence length compared to transformers, which scale quadratically.

  • Faster inference: Due to its linear-time processing, Mamba can perform inference up to five times faster than transformers.

  • Reduced memory use: It avoids the extensive memory requirements of the attention mechanisms in transformers.

  • Parallelizable training: Although Mamba's input-dependent (selective) parameters rule out the convolutional training trick used by earlier SSMs, it relies on a hardware-aware parallel scan to keep training parallel and fast.
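
Since the advantages above all rest on Mamba's linear-time recurrence, here is a minimal, unoptimized PyTorch sketch of a selective state-space step. This is our own illustration, not the actual Mamba kernel (which uses a fused, hardware-aware parallel scan rather than a Python loop), and names such as `selective_ssm_scan` are ours:

```python
import torch

def selective_ssm_scan(x, A, B_proj, C_proj, dt_proj):
    """Illustrative selective SSM recurrence (not the optimized Mamba kernel).

    x: (batch, seq_len, d_model) input sequence
    A: (d_model, d_state) state-transition parameters
    The B, C, and dt projections make the dynamics input-dependent,
    which is the "selective" part of Mamba.
    """
    batch, seq_len, d_model = x.shape
    d_state = A.shape[1]
    h = torch.zeros(batch, d_model, d_state)            # hidden state
    outputs = []
    for t in range(seq_len):                             # one step per token -> linear in seq_len
        xt = x[:, t, :]                                  # (batch, d_model)
        dt = torch.nn.functional.softplus(dt_proj(xt))   # input-dependent step size
        B = B_proj(xt)                                   # (batch, d_state)
        C = C_proj(xt)                                   # (batch, d_state)
        # Discretize and update the state: h = exp(dt * A) * h + dt * B * x
        dA = torch.exp(dt.unsqueeze(-1) * A)             # (batch, d_model, d_state)
        dB = dt.unsqueeze(-1) * B.unsqueeze(1)           # (batch, d_model, d_state)
        h = dA * h + dB * xt.unsqueeze(-1)
        y = (h * C.unsqueeze(1)).sum(-1)                 # read out with input-dependent C
        outputs.append(y)
    return torch.stack(outputs, dim=1)                   # (batch, seq_len, d_model)

# Toy usage with arbitrary sizes, just to show the shapes involved
d_model, d_state = 16, 4
x = torch.randn(2, 10, d_model)
A = -torch.rand(d_model, d_state)                        # negative values keep the state stable
B_proj = torch.nn.Linear(d_model, d_state)
C_proj = torch.nn.Linear(d_model, d_state)
dt_proj = torch.nn.Linear(d_model, d_model)
print(selective_ssm_scan(x, A, B_proj, C_proj, dt_proj).shape)  # torch.Size([2, 10, 16])
```

The key point is that the hidden state h is a fixed-size summary of everything seen so far, so cost grows linearly with sequence length instead of quadratically as with attention.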

However, there's one big problem: Mamba doesn't make good use of different types of data, treating all inputs, whether text, images, or speech, with the same parameters. This limits its effectiveness for multimodal tasks.

The question arises: How can we expand Mamba’s benefits to multimodal data and make it an even more powerful architecture?

Researchers from Stanford University, Carnegie Mellon University, and FAIR at Meta found a solution. They turned to the idea of Mixture-of-Experts (MoE), which allows models to activate only parts of their structure for specific inputs. In particular, they were inspired by Mixture-of-Transformers (MoT), which selectively activates different processing components based on input type. Building on it, they proposed a new SSM architecture, Mixture-of-Mamba (MoM), which makes the model more "aware" of different data types while keeping it computationally efficient. Let's explore how exactly MoM makes Mamba multimodal.
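
To make the modality-aware sparsity idea concrete, here is a rough, simplified sketch, our own illustration with hypothetical names like `ModalityAwareLinear` rather than the paper's code: a projection keeps a separate set of weights per modality, and every token is routed to the weights of its own modality while the others stay inactive.

```python
import torch
import torch.nn as nn

class ModalityAwareLinear(nn.Module):
    """Sketch of modality-aware sparsity: one projection per modality,
    with each token deterministically routed by its modality label."""

    def __init__(self, d_in, d_out, num_modalities):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_in, d_out) for _ in range(num_modalities)]
        )

    def forward(self, x, modality_ids):
        # x: (batch, seq_len, d_in); modality_ids: (batch, seq_len), values in [0, num_modalities)
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features, device=x.device)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m          # tokens belonging to modality m
            if mask.any():
                out[mask] = expert(x[mask])   # only this modality's weights are applied
        return out

# Toy usage: 0 = text tokens, 1 = image tokens (hypothetical labels)
proj = ModalityAwareLinear(d_in=16, d_out=32, num_modalities=2)
x = torch.randn(2, 8, 16)
modality_ids = torch.randint(0, 2, (2, 8))
print(proj(x, modality_ids).shape)  # torch.Size([2, 8, 32])
```

Roughly speaking, MoM applies this kind of decoupling to the projection layers inside each Mamba block: each modality gets its own parameters there, yet every token still activates only one set of weights, so the per-token compute stays sparse.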

How does MoM work?

Upgrade if you want to be the first to receive the full articles with detailed explanations and curated resources directly in your inbox. Simplify your learning journey →

Or follow us on Hugging Face; this article will appear there for free tomorrow.
