FOD#32: Mixture of Experts – What is it?
A history of MoE and concise coverage of the remarkably rich week in ML research and innovations
It should be illegal to ship that many updates and releases so close to the holidays, but here we are, two weeks before Christmas, with our hands full of news and research papers (thank you, Conference on Neural Information Processing Systems (NeurIPS), very much!). Let’s dive in; it was truly a fascinating week.
But first, a reminder: we are piecing together expert views on the trajectory of ML&AI for 2024. Send your thoughts on what you believe 2024 will bring to [email protected], or just reply to this email.
Many, many thanks to those who have already shared their views.
Now, to the news. Everybody will be talking about Mixture of Experts these days, thanks to Mistral AI’s almost punkish release of their new model on torrent, which they announced simply like this:
magnet:?xt=urn:btih:5546272da9065eddeb6fcd7ffddeef5b75be79a7&dn=mixtral-8x7b-32kseqlen&tr=udp%3A%2F%2Fopentracker.i2p.rocks%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce
RELEASE a6bbd9affe0c2725c1b7410d66833e24
— Mistral AI (@MistralAI)
3:44 PM • Dec 8, 2023
The concept of MoE, though, has been around for a while. To be exact, it was first presented in 1988 at the Connectionist Models Summer School. The idea, introduced by Robert Jacobs and Geoffrey Hinton, involves using several specialized 'expert' networks, each handling a different subtask, along with a gating network that chooses the right expert for each input. This approach was suggested because using one network for all tasks often leads to interference and slow learning. By dividing the work among experts, learning becomes faster and more efficient. This idea is the basis of the Mixture of Experts model, where different networks specialize and learn their parts of the problem more effectively, emphasizing specialized learning over a one-size-fits-all strategy in neural networks. The first paper, ‘Adaptive Mixtures of Local Experts’, was published in 1991.
Despite its initial promise, MoE's complexity and computational demands led to it being overshadowed by more straightforward algorithms during the early days of AI's resurgence. However, with the advent of more powerful computing resources and vast datasets, MoE has experienced a renaissance, proving integral to advancements in neural network architectures.
The MoE Framework
The essence of MoE lies in its unique structure. Unlike traditional neural networks that rely on a singular, monolithic approach to problem-solving, MoE employs a range of specialized sub-models. Each 'expert' is adept at handling specific types of data or tasks. A gating network then intelligently directs input data to the most appropriate expert(s). This division of labor not only enhances model accuracy but also scales efficiently, as experts can be trained in parallel.
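To make the division of labor concrete, here is a minimal sketch of a dense (softly gated) MoE layer in plain NumPy. The class name, shapes, and linear "experts" are our own illustrative choices, not any particular library's implementation: a gating network scores the experts for each input, and the output is the gate-weighted blend of all experts' outputs.

```python
# Minimal sketch of a softly gated Mixture-of-Experts layer (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    def __init__(self, d_in, d_out, n_experts):
        # One linear "expert" per slot, plus a gating matrix that scores them.
        self.experts = [rng.normal(size=(d_in, d_out)) * 0.02 for _ in range(n_experts)]
        self.gate = rng.normal(size=(d_in, n_experts)) * 0.02

    def forward(self, x):
        # x: (batch, d_in). The gate turns each input into a weight per expert.
        weights = softmax(x @ self.gate)                       # (batch, n_experts)
        expert_outs = np.stack([x @ W for W in self.experts])  # (n_experts, batch, d_out)
        # Dense mixture: every expert runs, and outputs are blended by the gate weights.
        return np.einsum("be,ebd->bd", weights, expert_outs)

layer = MoELayer(d_in=16, d_out=16, n_experts=4)
print(layer.forward(rng.normal(size=(2, 16))).shape)  # (2, 16)
```

In this dense form every expert processes every input; the sparse variants discussed below save compute by activating only a few experts per input.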
Google Research has been especially dedicated to researching the topic:
2013 (with the participation of Ilya Sutskever, OpenAI co-founder): Learning Factored Representations in a Deep Mixture of Experts
2017 (with the participation of Noam Shazeer, co-inventor of the Transformer and co-founder of Character.AI): Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
2021: Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference.
January 2022: ‘Learning to Route by Task for Efficient Inference’. That same month, Microsoft Research also published a paper on MoE, ‘DeepSpeed: Advancing MoE inference and training to power next-generation AI scale’.
November 2022: ‘Mixture-of-Experts with Expert Choice Routing’.
and others
Sparse Mixture-of-Experts model Mixtral 8×7B by Mistral
This week, MoE is on the rise thanks to Mistral’s release of their open-source Sparse Mixture-of-Experts model, Mixtral 8x7B. It outperforms Llama 2 70B and GPT-3.5 on most benchmarks, with roughly six times faster inference than Llama 2 70B. Licensed under Apache 2.0, Mixtral strikes an efficient balance between cost and performance. It handles five languages, excels at code generation, and can be fine-tuned for instruction-following tasks.
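The "sparse" part refers to routing: per layer, a router reportedly sends each token to only two of the eight experts, so only those experts do any work. The snippet below is a hedged, NumPy-only illustration of top-k routing in that spirit; the shapes, loop-based dispatch, and linear experts are hypothetical and far simpler than Mistral's actual implementation.

```python
# Illustrative sketch of sparse top-k expert routing (not Mistral's code).
import numpy as np

rng = np.random.default_rng(0)

def sparse_moe(x, experts, gate_W, k=2):
    # x: (batch, d). experts: list of (d, d) matrices. gate_W: (d, n_experts).
    logits = x @ gate_W                              # router scores, (batch, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]       # indices of the k best experts per input
    out = np.zeros_like(x)
    for b in range(x.shape[0]):
        chosen = logits[b, topk[b]]
        w = np.exp(chosen - chosen.max())
        w /= w.sum()                                 # softmax over the k chosen experts only
        for weight, e in zip(w, topk[b]):
            out[b] += weight * (x[b] @ experts[e])   # only k experts run per input
    return out

experts = [rng.normal(size=(32, 32)) * 0.02 for _ in range(8)]  # "8 experts"
gate_W = rng.normal(size=(32, 8)) * 0.02
print(sparse_moe(rng.normal(size=(4, 32)), experts, gate_W).shape)  # (4, 32)
```

This is why a model can carry far more parameters than it spends compute on at inference: the parameter count grows with the number of experts, but each token only pays for the few experts it is routed to.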
The community is buzzing with excitement, and Mistral, meanwhile, is raising another $415 million, hitting a $2 billion valuation and instantly joining the AI Unicorn Family. (Welcome! We'll be covering you shortly.)
Additional read: to nerd out more on Mixtral and MoE, please refer to Hugging Face’s blog, Interconnects newsletter, and Mistral’s own release post.
Turing Post is a reader-supported publication. To have full access to our archive and support our work, please become a Premium member →
We recommend: Manual annotation is dead. One-shot is the future!*
Manual annotation takes time, and your innovation can’t wait.
That’s why SuperAnnotate is excited to announce that you can now annotate images in bulk with one-shot annotation, significantly increasing the speed of the image annotation process.
Instead of annotating images one by one, simply choose a reference image, and images with a similar composition will receive annotation suggestions.
News from The Usual Suspects ©
Elon Musk's "Grok" is available for Premium+
Musk's foray into AI with "Grok" is a classic Musk move – disruptive and headline-grabbing. Its integration with Twitter/X for real-time data sourcing is a game changer (many like how well it summarizes the daily news). However, its positioning as an uncensored, 'anti-woke' alternative raises questions about content moderation and the handling of misinformation.
Google: A Mosaic of Success and Shortcomings
With demand for GPUs overwhelming supply, Google is working on alternatives, and pretty successfully. Google Cloud just announced TPU v5p and AI Hypercomputer, enhancing AI workloads with powerful, scalable accelerators and integrated systems. TPU v5p offers improved training speeds for large models, while AI Hypercomputer combines optimized hardware, open software, and ML frameworks for efficient AI management.
Google's NotebookLM is also a solid addition. It introduces many new features to enhance the process of combining ideas from various sources, including a noteboard space for pinning quotes and notes, and dynamic suggested actions for reading and note-taking. NotebookLM also offers formats for different writing projects, and ensures personal data remains private. This AI-native application continues to evolve with user feedback.
However, Google's journey with Gemini has not been entirely rosy. The advancements, particularly in leveraging TPUs and in multimodal capabilities, are impressive. Still, the controversy surrounding the demo and the delayed release of Gemini Ultra highlight the challenges even tech giants face in the rapidly evolving AI landscape. As the MoE papers above show, Google is (or was) ahead of the research on many fronts. But the competition is no longer just about technological prowess or innovative research; it is also about demonstrating leadership.
HoneyBee from Intel Labs and Mila
This new LLM for materials science is notable for being the first billion-parameter scale open-source LLM in this field, delivering top-notch performance on the MatSci-NLP benchmark. This collaboration aims to advance AI tools for materials discovery, tackling challenges like climate change and sustainable semiconductor production. Available on Hugging Face.
CoreWeave's Funding
GPU-rich CoreWeave, having washed its hands of crypto just in time, is seeing remarkable results from dedicating itself to AI: its valuation hit $7 billion after a $642 million minority investment led by Fidelity Management and Research Co. The deal underscores the growing interest in AI infrastructure, and CoreWeave's focus on GPUs and AI cloud services is a testament to the increasing demand for high-powered computing in AI development.
Meta AI's Codec Avatars
Meta's advancement in creating relightable Gaussian Codec Avatars is a fascinating development in the realm of virtual reality and 3D modeling. The level of detail and real-time performance capabilities they're achieving could have far-reaching implications for VR and AR experiences.
E.U.'s AI Act
This is a significant development, marking a bold step by the EU in regulating AI technologies. The focus on high-risk and commercial AI applications, coupled with stringent regulations, is a clear indicator of the EU's commitment to ethical AI practices. The implications for open-source AI projects are particularly intriguing, as they could reshape the landscape of AI development beyond commercial entities.
Additional Read: A Framework for U.S. AI Governance from MIT
Twitter Library
Other news, categorized for your convenience
An exceptionally rich week! As always, we offer you only the freshest and most relevant research papers. Truly, the best curated selection:
Language Models and Code Generation
Magicoder: Introduces the open-source code LLMs Magicoder and MagicoderS, which offer superior performance on various coding benchmarks →paper
Chain of Code (CoC): Combines code-writing with code execution emulation for enhanced reasoning in LMs →paper
CYBERSECEVAL: Evaluates the cybersecurity aspects of LLMs used as coding assistants →paper
Video and Image Synthesis
DeepCache: Accelerates diffusion models in image synthesis by caching and reusing features across stages →paper
Alchemist: Edits material attributes in real images using generative text-to-image models →paper
Kandinsky 3.0: A large-scale text-to-image generation model developed in Russia, representing a significant advancement in image generation quality and realism →paper
Alpha-CLIP: Focuses on specific regions in images, improving the region-based recognition capabilities of the original CLIP model →paper
Advances in Learning and Training Methods
URIAL by Allen AI: A tuning-free method for aligning LLMs through in-context learning →paper
Nash Learning from Human Feedback: Utilizes human preference data for fine-tuning LLMs →paper
GIVT: A new approach for generative modeling using real-valued vector sequences →paper
Analyzing and Improving the Training Dynamics of Diffusion Models: Proposes modifications to stabilize ADM diffusion model training →paper
SPARQ ATTENTION: Reduces memory bandwidth requirements in LLMs during inference →paper
Efficient Monotonic Multihead Attention: Improves simultaneous translation performance with stable and unbiased monotonic alignment estimation →paper
Multimodal and General AI
OneLLM: An MLLM that aligns multiple modalities to language using a unified framework →paper
Concordia by Google DeepMind: Integrates LLMs into Generative Agent-Based Models for advanced simulations →paper
Multimodal Data and Resource Efficient Device-directed Speech Detection: Explores natural interaction with virtual assistants using a multimodal approach →paper
Reinforcement Learning and Reranking
Pathfinding and Reasoning
PATHFINDER: A tree-search-based method for generating reasoning paths in language models →paper
Thank you for reading; please feel free to share with your friends and colleagues. In the next couple of weeks, we will be announcing our referral program 🤍
*We thank SuperAnnotate for their insights and ongoing support of Turing Post.
Another week with fascinating innovations! We call this overview “Froth on the Daydream” – or simply, FOD. It’s a reference to the surrealistic and experimental novel by Boris Vian – after all, AI is experimental and feels quite surrealistic, and a lot of writing on this topic is just froth on the daydream.
How was today's FOD? Please give us some constructive feedback.