
FOD#71: Matryoshka against Transformers

We explore the new Matryoshka State Space Model and its advantages over Transformers, and offer a carefully curated list of recent news and papers.

This Week in Turing Post:

  • Wednesday, AI 101: Everything about Whisper Model

  • Friday, Agentic Workflows series: Use cases

The main topic

If in my childhood someone had told me I would set Matryoshka against Transformer, I would have been puzzled. After all, one is a symbol of traditional Russian craftsmanship – stacking dolls within dolls, each revealing something hidden beneath. The other? A futuristic robot capable of morphing into various forms, epitomizing adaptability. Yet here we are, years later, first using 'Matryoshka' to describe layered, nested representation learning within 'Transformer' architectures. And then – using Matryoshka in a rival architecture!

The first merging of the two concepts happened in 2023, when researchers from Google Research presented MatFormer. In it, each Transformer block is designed with nested sub-blocks, so that smaller submodels (like the inner dolls of a Matryoshka) are contained within larger ones. This lets a single universal model yield submodels of varying sizes on demand, without separate training, enabling flexible scaling and elastic inference across tasks and modalities. The underlying technique is called Matryoshka Representation Learning.

This approach allows scaling the model down by using only specific parts, while still retaining the necessary knowledge and performance. These smaller submodels work efficiently without requiring additional training, as they share the same underlying space as the larger model.
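
To make the nesting concrete, here is a minimal, hypothetical PyTorch sketch (not the MatFormer or MatMamba code) of a feed-forward block whose smaller submodels are prefix slices of the full weight matrices, so any width can be served from one shared set of parameters:

```python
import torch
import torch.nn as nn

class MatryoshkaFFN(nn.Module):
    """Toy feed-forward block with nested ("Matryoshka") hidden widths.

    The full block owns one set of weights; a submodel of width d_ff simply
    uses the first d_ff rows/columns, so every size shares the same parameters.
    Hypothetical illustration only, not the authors' implementation.
    """

    def __init__(self, d_model=512, d_ff_full=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff_full)
        self.down = nn.Linear(d_ff_full, d_model)

    def forward(self, x, d_ff=None):
        # Pick how much of the hidden dimension to use at inference time.
        d_ff = d_ff or self.up.out_features
        w_up, b_up = self.up.weight[:d_ff], self.up.bias[:d_ff]
        w_down = self.down.weight[:, :d_ff]
        h = torch.relu(x @ w_up.T + b_up)
        return h @ w_down.T + self.down.bias

block = MatryoshkaFFN()
x = torch.randn(1, 16, 512)
full = block(x)                 # full-width submodel (d_ff = 2048)
small = block(x, d_ff=512)      # quarter-width submodel, same weights
print(full.shape, small.shape)  # both torch.Size([1, 16, 512])
```

In Matryoshka-style training, the loss is computed at several nested widths so each submodel stays accurate; at inference you simply pick the width that fits your budget.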

Recently, however, Transformers have been facing increasing criticism. AI21 CEO Ori Goshen challenges their supremacy, arguing that agents relying on these models struggle with efficiency and cost. He – understandably – advocates for AI21's Jamba architecture, based on Mamba, claiming it promises faster, more reliable AI agents with better memory performance.

Well, Mamba, as we’ve explained before, is indeed a legitimate candidate to rival Transformers. But what if we combine it with the good old Matryoshka to deal an even bigger blow to Transformers?

Researchers from Scaled Foundations and the University of Washington did exactly that. MatMamba integrates Matryoshka Representation Learning into Mamba2's State Space Model (SSM), creating a flexible, nested architecture across its parameters. This design allows for the extraction of multiple smaller models from a single, large model without retraining. Each submodel retains critical learned representations, ensuring consistent performance across varying sizes.

Compared with MatFormer and Transformers, MatMamba offers faster inference – especially on long sequences – thanks to its SSM backbone, plus more granular, adaptive scaling across compute budgets.

For example, on edge devices with limited resources, MatMamba can dynamically extract smaller models without retraining, allowing inference to adjust to available memory or compute power – something Transformers struggle with due to their rigid architecture.

In cloud inference scenarios, where compute resources fluctuate, MatMamba’s ability to flexibly switch between submodels allows for efficient, real-time scaling. While Transformers dominate general-purpose tasks, MatMamba could surpass them in domains where long context and elastic deployment are needed, such as real-time video analysis or large-scale image retrieval.
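
As a rough illustration of what such elastic deployment could look like, here is a hypothetical Python sketch that picks the largest nested submodel width fitting the memory available at serve time. The widths, footprint estimates, and helper names are made up for the example, not taken from the MatMamba code:

```python
import torch

# Nested widths trained jointly inside one model, smallest to largest
# (illustrative values, not MatMamba's actual granularities).
NESTED_WIDTHS = [256, 512, 1024, 2048]

# Crude per-width weight footprints in bytes; a real deployment would profile these.
FOOTPRINT = {256: 0.4e9, 512: 0.9e9, 1024: 2.1e9, 2048: 4.8e9}

def pick_width(free_bytes):
    """Pick the largest nested submodel whose weight footprint fits in memory."""
    fitting = [w for w in NESTED_WIDTHS if FOOTPRINT[w] <= free_bytes]
    return fitting[-1] if fitting else NESTED_WIDTHS[0]

if torch.cuda.is_available():
    free_bytes, _ = torch.cuda.mem_get_info()  # free GPU memory right now
else:
    free_bytes = int(2 * 1024**3)              # assume ~2 GB on a CPU-only edge box

width = pick_width(free_bytes)
print(f"Serving the nested submodel of width {width} ({free_bytes / 1e9:.1f} GB free)")
```

Because every width shares the same underlying weights, switching submodels is just a matter of slicing, with no retraining and no separate checkpoints to ship.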

Realistically, MatMamba is unlikely to replace Transformers entirely, since each excels at different tasks. Instead, it may carve out a niche in applications that demand both high efficiency and adaptive, scalable inference.

As multi-agent ecosystems emerge, we will see more attempts to create alternatives to Transformers that may steal the spotlight.

💎 We recommend - Expert insights at GenAI Productionize 2.0

Don’t miss GenAI Productionize 2.0 – the premier conference for GenAI application development, featuring AI experts from leading brands, startups, and research labs!

Learn actionable insights, strategies, and techniques for generative AI stack design, governance, evaluation, and observability.

But don’t take our word for it; here are real quotes from previous attendees:

  • "I'm blown away by the high quality and value of this event." - Ricardo B.

  • "Great event - worth getting up at 4am in the morning for!" - Sandy A.

  • "Spectacular and very insightful summit! Very well done!" - Chad B.


News from The Usual Suspects ©


  • Adobe Unleashes Generative Fireworks at MAX 

    • Adobe drops major updates at its MAX conference, expanding its Firefly AI with the first video model safe for commercial use. New AI tools in Premiere Pro help smooth transitions and extend clips, while over 100 new Creative Cloud features land across flagship apps. Also in the mix: collaborative creativity via Project Concept and the GenStudio platform for marketing pros. Oh, and Gatorade bottles—now personalized with Firefly.  

  • Two Nobel Prizes (in Chemistry and Physics) were awarded for achievements rooted in Deep Learning! We explained what they were awarded for in our ML flashcards.

  • OpenAI’s Swarm of AI Workers

    • OpenAI's latest cookbook introduces "routines" and "handoffs" to orchestrate AI agents more efficiently, making the leap from flashy demos to robust multi-agent workflows. With tools like Swarm, AI agents can now smoothly pass conversations to each other, handling tasks such as refunds, sales, and support, all while minimizing bottlenecks in the process. Enterprise AI just got smarter (a minimal handoff sketch follows after this list).

  • TSMC: AI's Chip Champion

    • TSMC's third-quarter profits are set to soar 40%, fueled by surging AI chip demand from tech giants like Apple and Nvidia. As the world’s leading contract chipmaker, TSMC is expanding globally, spending $65 billion on U.S. factories, but keeping most production in Taiwan. With shares up 77% this year, TSMC is riding high on the AI boom.

  • Anthropic in its Loving Grace

    • Dario Amodei's 15,000-word investor pitch introduces a new term, 'Powerful AI', instead of AGI.

    • More practical: Anthropic rolls out the Message Batches API, cutting costs by 50% for developers dealing with massive datasets. Now, you can batch up to 10,000 queries with Claude 3.5 Sonnet, Opus, and Haiku, processed within 24 hours. Perfect for non-time-sensitive work, this API offers scalable data analysis minus infrastructure headaches. Quora’s already onboard, loving the smooth ride.  

  • Gradio 5: Web Apps on Rocket Fuel 

    • Hugging Face launches Gradio 5, amping up ML web apps with sleek design, server-side rendering for lightning-fast loads, and real-time streaming. You get low-latency, production-ready apps with just a few lines of Python, plus an AI playground that lets you create apps right in your browser.

  • Writer’s Palmyra X 004 Takes Action

    • Writer introduces Palmyra X 004, a powerhouse AI model built to handle enterprise tasks with finesse. Now with tool-calling capabilities, it automates workflows across apps, pulling data, running code, and even sending emails. This LLM also leads the pack in performance benchmarks, showing up OpenAI and Anthropic.

  • Wondering what Inflection AI has been up to?

    • Inflection AI, in collaboration with Intel, launches Inflection for Enterprise, running on Intel® Gaudi® 3 and powered by the high-performing Inflection 3.0 model. Designed for businesses that need more than a chatbot, it offers full control over data, models, and architecture – on-prem, cloud, or hybrid.
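
To ground the routines-and-handoffs idea from the OpenAI item above, here is a minimal sketch using the experimental Swarm library. It assumes an OPENAI_API_KEY is set in the environment; Swarm is explicitly educational, so the exact API surface may change:

```python
# pip install git+https://github.com/openai/swarm.git  (experimental library)
from swarm import Swarm, Agent

refunds_agent = Agent(
    name="Refunds Agent",
    instructions="Help the user process a refund quickly and politely.",
)

def transfer_to_refunds():
    """Hand the conversation off to the refunds agent."""
    return refunds_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Figure out what the user needs and route them to a specialist.",
    functions=[transfer_to_refunds],
)

client = Swarm()  # requires OPENAI_API_KEY in the environment
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I'd like a refund for my last order."}],
)
print(response.agent.name)               # which agent ended the conversation
print(response.messages[-1]["content"])  # its final reply
```

The pattern in a nutshell: each agent is a routine (instructions plus tools), and a handoff is simply a tool call that returns another agent.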

We are reading

The freshest research papers, categorized for your convenience

Our TOP

AI Model Architectures & Optimization

  • Retrieval-Augmented Decision Transformer: External Memory for In-Context RL
    Incorporates external memory into reinforcement learning, improving in-context learning with reduced reliance on long episodes.
    Read the paper

  • OPTIMA: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
    Improves LLM-based multi-agent systems by reducing communication complexity and token usage while increasing task performance.
    Read the paper

  • Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
    Proposes larger convolutional kernels for ConvNets to improve spatial information capture and outperform vision transformers in various tasks.
    Read the paper

  • TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
    Improves LLM decoding efficiency by employing sparse attention, reducing memory and computational costs.
    Read the paper

  • SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
    Introduces a novel instruction-tuning approach that improves LLM performance on instruction-following tasks by mitigating overfitting.
    Read the paper

  • MathCoder2: Better Math Reasoning from Continued Pretraining on Model-Translated Mathematical Code
    Enhances LLM mathematical reasoning by pretraining on a math-focused dataset, improving performance on math-related tasks.
    Read the paper

  • ϵ-VAE: Denoising as Visual Decoding
    Proposes a new visual autoencoder method that improves both image reconstruction and generation through an iterative denoising process.
    Read the paper

  • One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
    Introduces a new fine-tuning method that redistributes ranks in activation vectors to maximize explained variance, improving task performance.
    Read the paper

  • ONLY-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization
    Demonstrates that diverse instruction types are essential for LLMs to generalize well to new tasks, highlighting the importance of varied datasets.
    Read the paper

  • Inference Scaling for Long-Context Retrieval Augmented Generation
    Optimizes retrieval-augmented generation by scaling inference parameters, improving performance for long-context and multi-hop queries.
    Read the paper

AI Agents & Agentic Frameworks

  • AGENT S: An Open Agentic Framework that Uses Computers Like a Human
    Mimics human interaction with computers through a GUI, performing complex multi-step tasks autonomously using memory-based learning.
    Read the paper

  • WALL-E: World Alignment by Rule Learning Improves World Model-Based LLM Agents
    Aligns LLMs with environment dynamics through rule learning, improving decision-making and reducing errors in real-world tasks.
    Read the paper

  • Emergent Properties with Repeated Examples
    Demonstrates that repeated training examples can significantly enhance model performance, especially in tasks with smaller datasets.
    Read the paper

Learning, Safety & Alignment in AI

  • DATA ADVISOR: Dynamic Data Curation for Safety Alignment of Large Language Models
    Improves the safety of LLMs by dynamically refining data generation, targeting underrepresented safety issues.
    Read the paper

  • Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning
    Uses Monte Carlo Tree Search to enable LLMs to self-improve in reasoning tasks by refining stepwise training.
    Read the paper

  • Self-Boosting Large Language Models with Synthetic Preference Data
    Enables LLMs to improve themselves by generating synthetic preference data for better task performance.
    Read the paper

  • Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
    Explores vulnerabilities in LLMs where they can unintentionally recommend malicious code, emphasizing the need for improved safeguards.
    Read the paper

Multimodal and Multitasking Capabilities

  • Everything Everywhere All At Once: LLMs Can In-Context Learn Multiple Tasks in Superposition
    Reveals that LLMs can perform multiple distinct tasks simultaneously during a single inference, offering insights into task superposition capabilities.
    Read the paper

  • Token-Level Detective Reward Model for Large Vision Language Models
    Introduces a reward model that provides fine-grained feedback at the token level for multimodal models, enhancing error diagnosis and correction.
    Read the paper

  • Personalized Visual Instruction Tuning
    Enhances LLMs' ability to conduct personalized conversations by training models to recognize specific individuals in images.
    Read the paper

Novel AI Capabilities & Creativity

  • Diversity-Rewarded CFG Distillation
    Promotes creativity in generative models by distilling Classifier-Free Guidance into model weights, reducing computational cost while maintaining high diversity in outputs.
    Read the paper

  • SUPERCORRECT: Supervising and Correcting Language Models with Error-Driven Insights
    Improves reasoning in smaller LLMs by using hierarchical guidance from larger models and enhancing error correction.
    Read the paper

  • LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
    Explores how LLMs internally encode truthfulness information and how this data can be leveraged to reduce hallucinations.
    Read the paper

Specialized AI Systems & Task-Specific Performance

  • F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
    Introduces a text-to-speech model that achieves high-quality, zero-shot speech generation and code-switching by using a non-autoregressive approach.
    Read the paper

  • Erasing Conceptual Knowledge from Language Models
    Proposes a framework for selectively erasing specific conceptual knowledge from LLMs while preserving overall fluency and accuracy in other tasks.
    Read the paper

  • STUFFED MAMBA: State Collapse and State Capacity of RNN-based Long-Context Modeling
    Explores challenges in RNN-based models for long-context modeling, proposing solutions to mitigate performance degradation over long sequences.
    Read the paper

  • Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
    Studies the complementary strengths of humans and AI in question answering, showing where each excels in different reasoning tasks.
    Read the paper

Models

TinyEmo: Scaling Down Emotional Reasoning via Metric Projection – a small multimodal model for emotion classification, leveraging a synthetic emotional dataset and a Metric Projector for efficient task handling, outperforming much larger models in emotion-related tasks →read the paper

Falcon Mamba: The First Competitive Attention-Free 7B Language Model – a 7B model that achieves superior performance in long-context processing and inference speed, all without attention mechanisms, surpassing larger models across benchmarks →read the paper

Pixtral 12B – a 12B-parameter multimodal model excelling in both image and text understanding, offering state-of-the-art performance on multimodal and text-only tasks and outperforming similarly sized and larger models →read the paper

Baichuan-Omni Technical Report – a 7B open-source multimodal model processing text, images, videos, and audio, excelling particularly in Chinese benchmarks and providing robust performance across diverse modalities →read the paper

ARIA: An Open Multimodal Native Mixture-of-Experts Model – excels in multimodal tasks, with competitive performance on both language and multimodal benchmarks, offering enhanced long-context handling and surpassing proprietary models like GPT-4o →read the paper


Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!
