
What is MoE 2.0? Update Your Knowledge about Mixture-of-experts

A fresh angle on current Mixture-of-Experts. We discuss what new MoE techniques like S'MoRE, Symbolic-MoE, and others mean for the next generation of AI


Even the most powerful techniques require rethinking to align with new trends. MoE is a fascinating framework that reshaped how we build and understand scalable AI systems. It has rapidly gained attention because it enables massive model growth – like trillion-parameter models – without overwhelming hardware. What makes MoE especially powerful is its ability to dynamically select experts based on the input, allowing the model to specialize in different subdomains or tasks. It’s already a backbone of many systems:

  • DeepSeek-V3 incorporates an impressive 671 billion parameters using MoE.

  • Google’s Gemini 1.5 Pro employs a sparse MoE Transformer to handle a million-token context efficiently.

  • Mistral’s Mixtral 8×22B routes tokens across 8 experts per layer and outperforms dense models on cost and speed.

  • Alibaba’s Qwen2.5-Max, a 325B MoE trained on 20T tokens, ranks near the top of Chatbot Arena with standout reasoning and coding skills.

  • Meta’s Llama 4 introduces a MoE architecture across its models, including the 400B-parameter Maverick and the 2T-parameter Behemoth, both designed for multimodal and multilingual tasks.
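
To make the “dynamically select experts” idea concrete, here is a minimal sketch of a sparse MoE layer with top-k routing, in the spirit of the systems above. Class and parameter names (`SparseMoE`, the expert FFN shape, top-2 routing) are illustrative assumptions, not taken from any of these models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative sparse MoE layer: a router picks the top-k experts per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)            # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                       # x: (tokens, d_model)
        logits = self.router(x)                                 # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)       # keep only k experts per token
        weights = F.softmax(weights, dim=-1)                    # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                        # tokens that chose expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Each token activates only k of n_experts feed-forward blocks, which is why the
# total parameter count can grow far faster than the per-token compute cost.
tokens = torch.randn(16, 512)
print(SparseMoE()(tokens).shape)   # torch.Size([16, 512])
```

Production MoE layers typically add load-balancing losses and expert-capacity limits on top of this skeleton, but the routing idea is the same.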

We started this AI 101 series by explaining what Mixture-of-Experts (MoE) is. Today, we discuss a fresh angle on current MoE developments that most readers haven’t seen dissected yet. Why is MoE suddenly back in the spotlight?

A lot of lab chatter and industry road‑maps right now revolve around next‑generation MoE designs. A pair of brand‑new papers dropped this month: 1) Structural Mixture of Residual Experts (S’MoRE), April’s release from Meta, shows how you can fuse LoRA‑style low‑rank adapters with a hierarchical MoE tree, introducing an exponential gain in “structural flexibility” that dense models can’t match; 2) Symbolic‑MoE, from UNC Chapel Hill, moves MoE out of gradient space into pure language space, beating GPT‑4o‑mini on accuracy while running 16 experts on a single GPU thanks to batched inference. There is also a batch of fresh developments that optimize inference of MoE models, such as eMoE, MoEShard, Speculative-MoE, and MoE-Gen.
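
To see what “language space instead of gradient space” could look like mechanically, here is a hedged sketch of skill-based expert recruitment and per-expert batching. It is an illustration of the idea only, not the paper’s actual pipeline: `EXPERT_PROFILES`, `infer_skills`, `recruit`, and `group_by_expert` are placeholder names, and the skill matching is faked with keywords.

```python
from collections import defaultdict

# Hypothetical skill profiles: which expert LLM is strong at which skill.
# In a real system these would come from measured model performance;
# here they are hard-coded placeholders.
EXPERT_PROFILES = {
    "math-expert": {"algebra", "arithmetic"},
    "bio-expert": {"genetics", "biochemistry"},
    "code-expert": {"python", "algorithms"},
}

def infer_skills(question: str) -> set:
    """Placeholder skill router: in a real system an LLM names the skills a
    question needs; here we fake it with keyword matching."""
    keywords = {"solve": "algebra", "gene": "genetics", "function": "python"}
    return {skill for word, skill in keywords.items() if word in question.lower()}

def recruit(question: str, k: int = 2) -> list:
    """Pick the k experts whose skill profiles overlap most with the question."""
    skills = infer_skills(question)
    ranked = sorted(EXPERT_PROFILES, key=lambda m: len(EXPERT_PROFILES[m] & skills), reverse=True)
    return ranked[:k]

def group_by_expert(questions: list) -> dict:
    """Group questions by recruited expert so each model is loaded and run once
    per batch; this batching is what lets many experts share a single GPU."""
    jobs = defaultdict(list)
    for q in questions:
        for expert in recruit(q):
            jobs[expert].append(q)
    return dict(jobs)

print(group_by_expert(["Solve 2x + 3 = 9", "Which gene regulates insulin?"]))
```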

What can these innovative methods teach us about rethinking the efficiency of next-gen MoE models? Let’s break down what makes these developments special and why they might be the clearest path to open-source models that scale.

Welcome to MoE 2.0!


In today’s episode, we will cover:

  • Structural Mixture of Residual Experts (S’MoRE)

    • How does S’MoRE work?

    • Performance of S’MoRE

    • Not without limitations

  • Symbolic-MoE

    • How does Symbolic-MoE work?

    • Results and advantages of Symbolic-MoE

    • Limitations

  • What these two methods buy you

  • Other notable shifts to MoE 2.0

  • Conclusion: Why does this new MoE shift matter right now?

  • Sources and further reading

Structural Mixture of Residual Experts (S’MoRE)

Meta AI’s April 8 release presents a new approach to efficient LLM learning and fine-tuning. It takes two popular techniques that can fairly be called fundamental in AI, LoRA (Low-Rank Adaptation) and MoE, and mixes them together. The result is an interesting, nontrivial development: Structural Mixture of Residual Experts (S’MoRE). It fuses LoRA-style low-rank adapters with a hierarchical MoE tree, which lets it benefit from both approaches: efficiency from LoRA, because everything remains low-rank, and flexibility and power from MoE, with some additional advantageous upgrades. Let’s see how this works.
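
As a rough mental model of “low-rank adapters arranged in a hierarchical MoE tree,” here is a hedged two-level sketch: at each level a router mixes a few low-rank residual branches, and the selected residuals are added onto the frozen base output. All names, ranks, and branch counts (`ResidualBranch`, `HierarchicalResidualExperts`, rank 8, 4 branches per level) are illustrative assumptions, not the paper’s actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBranch(nn.Module):
    """One low-rank residual expert: a LoRA-style down/up projection."""
    def __init__(self, d_model, rank):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, h):
        return self.up(self.down(h))

class HierarchicalResidualExperts(nn.Module):
    """Two routing levels: each level softly mixes a few low-rank residuals.
    Composing choices across levels yields many distinct expert 'paths'
    (branches_l1 * branches_l2) while storing only a handful of adapters."""
    def __init__(self, d_model=512, rank=8, branches_l1=4, branches_l2=4):
        super().__init__()
        self.level1 = nn.ModuleList(ResidualBranch(d_model, rank) for _ in range(branches_l1))
        self.level2 = nn.ModuleList(ResidualBranch(d_model, rank) for _ in range(branches_l2))
        self.router1 = nn.Linear(d_model, branches_l1)
        self.router2 = nn.Linear(d_model, branches_l2)

    def forward(self, h_frozen):                                 # h_frozen: frozen base layer output
        w1 = F.softmax(self.router1(h_frozen), dim=-1)           # (tokens, branches_l1)
        r1 = sum(w1[:, i:i+1] * b(h_frozen) for i, b in enumerate(self.level1))
        w2 = F.softmax(self.router2(h_frozen + r1), dim=-1)      # second level conditions on level 1
        r2 = sum(w2[:, i:i+1] * b(h_frozen + r1) for i, b in enumerate(self.level2))
        return h_frozen + r1 + r2                                # frozen path plus stacked residuals

h = torch.randn(16, 512)
print(HierarchicalResidualExperts()(h).shape)                    # torch.Size([16, 512])
```

The point of the hierarchy is combinatorial: with 4 branches at each of 2 levels there are 16 possible residual compositions but only 8 small adapters to store, which is roughly the kind of structural-flexibility gain S’MoRE is after.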

But first, a quick reminder about LoRA. It’s a lightweight and efficient way to fine-tune LLMs with minimal added parameters and computation. Instead of changing all the millions, or even billions, of parameters in a model, LoRA freezes the original weights and adds small, trainable layers (in the form of low-rank matrices) that adjust the model’s behavior.
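
In code, that reminder looks roughly like this: a frozen linear layer plus a trainable low-rank update. The `LoRALinear` class, rank, and scaling below are illustrative choices for the sketch, not any specific library’s API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                  # original weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8,192 trainable parameters vs. 262,656 frozen ones in the base layer
```

Initializing B to zero means the wrapped layer starts out identical to the frozen base, so fine-tuning begins exactly from the pretrained behavior.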

How does S’MoRE work?

