What is MoE 2.0? Update Your Knowledge about Mixture-of-experts
A fresh angle on current Mixture-of-Experts: we discuss what new MoE techniques like S'MoRE, Symbolic-MoE, and others mean for the next generation of AI
Even the most powerful techniques require rethinking to align with new trends. MoE is a fascinating framework that reshaped how we build and understand scalable AI systems. It has rapidly gained attention because it enables massive model growth (think trillion-parameter models) without overwhelming hardware. What makes MoE especially powerful is its ability to dynamically select experts based on the input, allowing the model to specialize in different subdomains or tasks. It's already a backbone of many systems: DeepSeek-V3 incorporates an impressive 671 billion parameters using MoE; Google's Gemini 1.5 Pro employs a sparse MoE Transformer to handle a million-token context efficiently; Mistral's Mixtral 8×22B routes tokens across 8 experts per layer and outperforms dense models on cost and speed; Alibaba's Qwen2.5-Max, a 325B MoE trained on 20T tokens, ranks near the top of Chatbot Arena with standout reasoning and coding skills; and Meta's Llama 4 introduces a MoE architecture across its models, including the 400B-parameter Maverick and the 2T-parameter Behemoth, both designed for multimodal and multilingual tasks.
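To make that "dynamically select experts" idea concrete, here is a minimal sketch of a sparse top-k MoE layer in PyTorch. It is a toy illustration, not the routing code of any model mentioned above; the class name `SparseMoELayer` and its parameters are our own inventions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse MoE layer: a learned router sends each token to its top-k experts."""
    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_logits = self.router(x)                          # (num_tokens, num_experts)
        weights, expert_idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                        # plain loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e               # tokens that picked expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Every token touches only top_k of the num_experts MLPs, so total parameters
# can grow with num_experts while per-token compute stays roughly constant.
tokens = torch.randn(16, 512)
print(SparseMoELayer(d_model=512)(tokens).shape)  # torch.Size([16, 512])
```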
We started this AI 101 series by explaining what Mixture-of-Experts (MoE) is. Today, we discuss a fresh angle on current MoE developments that most readers haven't seen dissected yet. Why is MoE suddenly on fire again?
A lot of lab chatter and industry roadmaps right now revolve around next-generation MoE designs. A pair of brand-new papers dropped this month: 1) Structural Mixture of Residual Experts (S'MoRE), Meta's April release, which shows how you can fuse LoRA-style low-rank adapters with a hierarchical MoE tree, introducing an exponential "structural flexibility" gain that dense models can't match; 2) Symbolic-MoE from UNC Chapel Hill, which moves MoE out of gradient space and into pure language space, beating GPT-4o-mini on accuracy while running 16 experts on a single GPU thanks to batched inference. There is also a batch of fresh work optimizing inference for MoE models, such as eMoE, MoEShard, Speculative-MoE, and MoE-Gen.
What can these innovative methods teach us about rethinking the efficiency of next-gen MoE models? Let's break down what makes these developments special and why they might be the clearest path to open-source models that scale.
Welcome to MoE 2.0!
In today's episode, we will cover:
Structural Mixture of Residual Experts (S'MoRE)
How does S'MoRE work?
Performance of S'MoRE
Not without limitations
Symbolic-MoE
How does Symbolic-MoE work?
Results and advantages of Symbolic-MoE
Limitations
What these two methods buy you
Other notable shifts to MoE 2.0
Conclusion: Why does this new MoE shift matter right now?
Sources and further reading
Structural Mixture of Residual Experts (S'MoRE)
Meta AI's April 8 release introduced a new approach to efficient LLM learning and fine-tuning. It takes two popular techniques that can fairly be called fundamental in AI, LoRA (Low-Rank Adaptation) and MoE, and combines them into an interesting, nontrivial design: Structural Mixture of Residual Experts (S'MoRE). It fuses LoRA-style low-rank adapters with a hierarchical MoE tree. This lets the model benefit from both approaches: efficiency from LoRA, because everything remains low-rank, and flexibility and power from MoE, plus some additional advantageous upgrades. Let's see how this works together.
But first, a quick reminder about LoRA. It's a lightweight and efficient way to fine-tune LLMs with minimal added parameters and computation. Instead of changing all the millions, or even billions, of parameters in a model, LoRA freezes the original weights and adds small, trainable layers (in the form of low-rank matrices) that adjust the model's behavior.
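Here is a minimal LoRA-style adapter sketched in PyTorch: the pretrained weight is frozen, and only two small matrices A (r × d_in) and B (d_out × r) are trained, so the learned update BA stays low-rank. This illustrates the general LoRA idea under our own naming (`LoRALinear`, `rank`, `alpha`); it is not code from the S'MoRE paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small trainable low-rank update (the LoRA idea)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze the original weights
        # Low-rank factors: the learned update is B @ A, whose rank is at most `rank`
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)   # (r, d_in)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))         # (d_out, r), zero-init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen W x  +  scaled low-rank correction B(Ax)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Only A and B train: for a 4096x4096 layer at rank 8 that is ~65K parameters,
# versus ~16.8M if we fine-tuned the full weight matrix.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```

In S'MoRE's framing, many such low-rank residuals play the role of experts, arranged in a hierarchical MoE tree that routes between them, which is what the next section unpacks.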
How does S'MoRE work?