
What is MoE 2.0? Update Your Knowledge about Mixture-of-experts

A fresh angle on current Mixture-of-Experts. We discuss what new MoE techniques like S'MoRE, Symbolic-MoE, and others mean for the next generation of AI


Even the most powerful techniques require rethinking to align with new trends. MoE is a fascinating framework that reshaped how we build and understand scalable AI systems. It has rapidly gained attention because it enables massive model growth – like trillion-parameter models – without overwhelming hardware. What makes MoE especially powerful is its ability to dynamically select experts based on the input, allowing the model to specialize in different subdomains or tasks. It’s already a backbone of many systems:

  • DeepSeek-V3 incorporates an impressive 671 billion parameters using MoE.

  • Google’s Gemini 1.5 Pro employs a sparse MoE Transformer to handle a million-token context efficiently.

  • Mistral’s Mixtral 8×22B routes tokens across 8 experts per layer and outperforms dense models on cost and speed.

  • Alibaba’s Qwen2.5-Max, a 325B MoE trained on 20T tokens, ranks near the top of Chatbot Arena with standout reasoning and coding skills.

  • Meta’s Llama 4 introduces a MoE architecture across its models, including the 400B-parameter Maverick and the 2T-parameter Behemoth, both designed for multimodal and multilingual tasks.
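
To make the “dynamically select experts” idea concrete, here is a minimal sketch of a sparse MoE layer with top-k routing, in the spirit of the systems above. Class and parameter names (`SparseMoE`, the expert FFN shape, top-2 routing) are illustrative assumptions, not taken from any of these models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative sparse MoE layer: a router picks the top-k experts per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)            # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                       # x: (tokens, d_model)
        logits = self.router(x)                                 # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)       # keep only k experts per token
        weights = F.softmax(weights, dim=-1)                    # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                        # tokens that chose expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Each token activates only k of n_experts feed-forward blocks, which is why the
# total parameter count can grow far faster than the per-token compute cost.
tokens = torch.randn(16, 512)
print(SparseMoE()(tokens).shape)   # torch.Size([16, 512])
```

Production MoE layers typically add load-balancing losses and expert-capacity limits on top of this skeleton, but the routing idea is the same.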

We started this AI 101 series by explaining what Mixture-of-Experts (MoE) is. Today, we discuss a fresh angle on current MoE developments that most readers haven’t seen dissected yet. Why is MoE suddenly back in the spotlight?

A lot of lab chatter and industry road‑maps right now revolve around next‑generation MoE designs. A pair of brand‑new papers dropped this month: 1) Structural Mixture of Residual Experts (S’MoRE), April’s release from Meta, shows how you can fuse LoRA‑style low‑rank adapters with a hierarchical MoE tree, introducing an exponential gain in “structural flexibility” that dense models can’t match; 2) Symbolic‑MoE, from UNC Chapel Hill, moves MoE out of gradient space into pure language space, beating GPT‑4o‑mini on accuracy while running 16 experts on a single GPU thanks to batched inference. There is also a batch of fresh developments that optimize inference of MoE models, such as eMoE, MoEShard, Speculative-MoE, and MoE-Gen.
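
To see what “language space instead of gradient space” could look like mechanically, here is a hedged sketch of skill-based expert recruitment and per-expert batching. It is an illustration of the idea only, not the paper’s actual pipeline: `EXPERT_PROFILES`, `infer_skills`, `recruit`, and `group_by_expert` are placeholder names, and the skill matching is faked with keywords.

```python
from collections import defaultdict

# Hypothetical skill profiles: which expert LLM is strong at which skill.
# In a real system these would come from measured model performance;
# here they are hard-coded placeholders.
EXPERT_PROFILES = {
    "math-expert": {"algebra", "arithmetic"},
    "bio-expert": {"genetics", "biochemistry"},
    "code-expert": {"python", "algorithms"},
}

def infer_skills(question: str) -> set:
    """Placeholder skill router: in a real system an LLM names the skills a
    question needs; here we fake it with keyword matching."""
    keywords = {"solve": "algebra", "gene": "genetics", "function": "python"}
    return {skill for word, skill in keywords.items() if word in question.lower()}

def recruit(question: str, k: int = 2) -> list:
    """Pick the k experts whose skill profiles overlap most with the question."""
    skills = infer_skills(question)
    ranked = sorted(EXPERT_PROFILES, key=lambda m: len(EXPERT_PROFILES[m] & skills), reverse=True)
    return ranked[:k]

def group_by_expert(questions: list) -> dict:
    """Group questions by recruited expert so each model is loaded and run once
    per batch; this batching is what lets many experts share a single GPU."""
    jobs = defaultdict(list)
    for q in questions:
        for expert in recruit(q):
            jobs[expert].append(q)
    return dict(jobs)

print(group_by_expert(["Solve 2x + 3 = 9", "Which gene regulates insulin?"]))
```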

What can these innovative methods teach us about rethinking the efficiency of next-gen MoE models? Let’s break down what makes these developments special and why they might be the clearest path to open-source models that scale.

Welcome to MoE 2.0!


In today’s episode, we will cover:

  • Structural Mixture of Residual Experts (S’MoRE)

    • How does S’MoRE work?

    • Performance of S’MoRE

    • Not without limitations

  • Symbolic-MoE

    • How does Symbolic-MoE work?

    • Results and advantages of Symbolic-MoE

    • Limitations

  • What these two methods buy you

  • Other notable shifts to MoE 2.0

  • Conclusion: Why does this new MoE shift matter right now?

  • Sources and further reading

Structural Mixture of Residual Experts (S’MoRE)

Meta AI’s April 8 release presents a new approach to efficient LLM learning and fine-tuning. It takes two popular techniques that can fairly be called fundamental in AI, LoRA (Low-Rank Adaptation) and MoE, and mixes them together. The result is an interesting, nontrivial development: Structural Mixture of Residual Experts (S’MoRE). It fuses LoRA-style low-rank adapters with a hierarchical MoE tree, which lets it benefit from both approaches: efficiency from LoRA, because everything remains low-rank, and flexibility and power from MoE, with some additional advantageous upgrades. Let’s see how this works.
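
As a rough mental model of “low-rank adapters arranged in a hierarchical MoE tree,” here is a hedged two-level sketch: at each level a router mixes a few low-rank residual branches, and the selected residuals are added onto the frozen base output. All names, ranks, and branch counts (`ResidualBranch`, `HierarchicalResidualExperts`, rank 8, 4 branches per level) are illustrative assumptions, not the paper’s actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBranch(nn.Module):
    """One low-rank residual expert: a LoRA-style down/up projection."""
    def __init__(self, d_model, rank):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, h):
        return self.up(self.down(h))

class HierarchicalResidualExperts(nn.Module):
    """Two routing levels: each level softly mixes a few low-rank residuals.
    Composing choices across levels yields many distinct expert 'paths'
    (branches_l1 * branches_l2) while storing only a handful of adapters."""
    def __init__(self, d_model=512, rank=8, branches_l1=4, branches_l2=4):
        super().__init__()
        self.level1 = nn.ModuleList(ResidualBranch(d_model, rank) for _ in range(branches_l1))
        self.level2 = nn.ModuleList(ResidualBranch(d_model, rank) for _ in range(branches_l2))
        self.router1 = nn.Linear(d_model, branches_l1)
        self.router2 = nn.Linear(d_model, branches_l2)

    def forward(self, h_frozen):                                 # h_frozen: frozen base layer output
        w1 = F.softmax(self.router1(h_frozen), dim=-1)           # (tokens, branches_l1)
        r1 = sum(w1[:, i:i+1] * b(h_frozen) for i, b in enumerate(self.level1))
        w2 = F.softmax(self.router2(h_frozen + r1), dim=-1)      # second level conditions on level 1
        r2 = sum(w2[:, i:i+1] * b(h_frozen + r1) for i, b in enumerate(self.level2))
        return h_frozen + r1 + r2                                # frozen path plus stacked residuals

h = torch.randn(16, 512)
print(HierarchicalResidualExperts()(h).shape)                    # torch.Size([16, 512])
```

The point of the hierarchy is combinatorial: with 4 branches at each of 2 levels there are 16 possible residual compositions but only 8 small adapters to store, which is roughly the kind of structural-flexibility gain S’MoRE is after.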

But first, a quick reminder about LoRA. It’s a lightweight and efficient way to fine-tune LLMs with minimal added parameters and computation. Instead of changing all the millions, or even billions, of parameters in a model, LoRA freezes the original weights and adds small, trainable layers (in the form of low-rank matrices) that adjust the model’s behavior.
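
In code, that reminder looks roughly like this: a frozen linear layer plus a trainable low-rank update. The `LoRALinear` class, rank, and scaling below are illustrative choices for the sketch, not any specific library’s API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                  # original weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8,192 trainable parameters vs. 262,656 frozen ones in the base layer
```

Initializing B to zero means the wrapped layer starts out identical to the frozen base, so fine-tuning begins exactly from the pretrained behavior.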

How does S’MoRE work?

