FOD#41: GPU's rival? What is Language Processing Unit (LPU)

We explore how it makes inference 10x faster, plus we offer a curated list of the freshest ML news and papers

Ksenia Se
February 19, 2024

Next Week in Turing Post:

Wednesday, Token 1.21: Model Safety and Data Privacy
Friday, AI Unicorns: Scale AI

This week, a largely unknown company, Groq, demonstrated unprecedented speed running open-source LLMs on its Language Processing Unit (LPU): Llama-2 (70 billion parameters) at more than 100 tokens per second, and Mixtral at nearly 500 tokens per second per user.

For comparison:

“According to Groq, in similar tests, ChatGPT loads at 40-50 tokens per second, and Bard at 70 tokens per second on typical GPU-based computing systems.
Context for 100 tokens per second per user – A user could generate a 4,000-word essay in just over a minute.”
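To put these throughput numbers side by side, here is a minimal back-of-the-envelope sketch. The tokens-per-word ratio is our assumption, not a figure from Groq: roughly 1.3 tokens per English word is a common rule of thumb, and it varies by tokenizer and text. The function name `seconds_to_generate` is purely illustrative.

```python
# Rough estimate of how long a text takes to generate at a given decoding speed.
# ASSUMPTION: TOKENS_PER_WORD is a rule of thumb, not a number from Groq;
# the real ratio depends on the tokenizer and on the text itself.
TOKENS_PER_WORD = 1.3


def seconds_to_generate(words: int, tokens_per_second: float,
                        tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Return the estimated generation time in seconds for `words` words."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_tokens = words * tokens_per_word
    return total_tokens / tokens_per_second


if __name__ == "__main__":
    # Decoding speeds quoted in the text above, in tokens per second.
    for speed in (40, 70, 100, 500):
        t = seconds_to_generate(4000, speed)
        print(f"{speed:>4} tokens/s -> {t:6.1f} s for a 4,000-word essay")
```

With the assumed ratio of 1.3, a 4,000-word essay at 100 tokens per second comes out at about 52 seconds; a ratio of 1.5 gives exactly one minute. Either way, the order of magnitude matches the quote, and the exact figure depends on the ratio you assume.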

So: What is an LPU, how does it work, and where did Groq (such an unfortunate name, given Musk's Grok is all over the media) come from?

Remember that game of Go in 2016 when AlphaGo played against the world champion Lee Sedol and won? Well, about a month before the competition, there was a test game which AlphaGo lost. The researchers from DeepMind ported AlphaGo to the Tensor Processing Unit (TPU), and the program was then able to win by a wide margin.

The realization that computational power was a bottleneck for AI's potential led to the inception of Groq and the creation of the LPU. This realization came to Jonathan Ross, who had started what became the TPU project at Google. He founded Groq in 2016.

The LPU is a special kind of computer brain designed to handle language tasks very quickly. Unlike other computer chips that do many things at once (parallel processing), the LPU works on tasks one after the other (sequential processing), which is perfect for understanding and generating language. Imagine it like a relay race where each runner (chip) passes the baton (data) to the next, making everything run super fast. The LPU is designed to overcome the two LLM bottlenecks: compute density and memory bandwidth.
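To see why memory bandwidth is such a bottleneck, here is a rough sketch of a standard back-of-the-envelope bound. It assumes a dense model generating one token at a time for a single user, so every weight has to be read from memory once per token. The 2 bytes per parameter (16-bit weights) and the bandwidth values are illustrative assumptions, not specifications of any Groq or GPU product, and the function names are hypothetical.

```python
# Upper bound on decoding speed when generation is limited by memory bandwidth.
# ASSUMPTIONS (ours, for illustration only):
#   - dense model, batch size 1: all weights are read once per generated token
#   - 2 bytes per parameter (16-bit weights)
#   - the bandwidth values below are hypothetical, not real hardware specs


def max_tokens_per_second(n_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling: bytes available per second / bytes needed per token."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


def bandwidth_needed(n_params: float, bytes_per_param: float,
                     target_tokens_per_s: float) -> float:
    """Aggregate memory bandwidth (bytes/s) required to sustain a target speed."""
    return n_params * bytes_per_param * target_tokens_per_s


if __name__ == "__main__":
    n_params = 70e9  # a 70-billion-parameter model, as in the Llama-2 example above
    for tb_per_s in (1, 2, 5, 20):  # hypothetical aggregate bandwidths
        ceiling = max_tokens_per_second(n_params, 2, tb_per_s * 1e12)
        print(f"{tb_per_s:>3} TB/s -> at most {ceiling:6.1f} tokens/s")
    need = bandwidth_needed(n_params, 2, 100)
    print(f"100 tokens/s needs about {need / 1e12:.0f} TB/s of aggregate bandwidth")
```

Under these assumptions, 100 tokens per second on a 70-billion-parameter model requires moving about 14 TB of weights per second, which is one way to see why spreading a model across many chips that pass data along to each other can pay off. This is only a ceiling estimate; it ignores compute time, caching, and communication overhead.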

Groq took a novel approach right from the start, focusing on software and compiler development before even thinking about the hardware. They made sure the software could guide how the chips talk to each other, ensuring they work together seamlessly like a team in a factory. This makes the LPU really good at processing language efficiently and at high speed, ideal for AI tasks that involve understanding or creating text.

This led to a highly optimized system that not only runs circles around traditional setups in terms of speed but does so with greater cost efficiency and lower energy consumption. This is big news for industries like finance, government, and tech, where quick and accurate data processing is key.

Now, don't go tossing out your GPUs just yet! While the LPU is a beast when it comes to inference, making light work of applying trained models to new data, GPUs still reign supreme in the training arena. The LPU and GPU might become the dynamic duo of AI hardware, each excelling in its own role.

As Elvis Saravia put it: “With breakthroughs in inference and long context understanding, we are officially entering a new era in LLMs.”

To better understand the architecture, Groq offers two papers: one from 2020 (Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads) and one from 2022 (A Software-defined Tensor Streaming Multiprocessor for Large-scale Machine Learning). The term “LPU” must be a recent addition to Groq’s narrative, since it’s never mentioned in the papers.

Additional read:

Compute is also the subject of the paper Computing Power and the Governance of Artificial Intelligence, which discusses managing AI development through control of compute, covering its potential for regulation, its benefits and risks, and balanced governance approaches.
Meanwhile, the U.S. awards GlobalFoundries, the world's third-largest contract chipmaker, $1.5 billion to boost semiconductor production and strengthen domestic supply chains, with expansions in New York and Vermont.
The paper published by Berkeley Artificial Intelligence Research (BAIR) argues that “compound AI systems will likely be the best way to maximize AI results in the future, and might be one of the most impactful trends in AI in 2024.”

Twitter Library

Key Insights from Dr. Andrew Ng's Stanford Talk

From the rise of LLMs and generative AI to ethical innovation and the future of AI startups.

www.turingpost.com/p/andrew-ng-opportunities-in-ai

News from The Usual Suspects ©

Y Combinator

Since 2009, Y Combinator has published its Request for Startups, which hints at what “ideas we’d want to see made real, in spaces that we believe will be important in the coming decades”. This year, the list contains 20 categories.

20 big names

Twenty tech giants, including Adobe, Amazon, Google, IBM, Meta, Microsoft, OpenAI, and TikTok, have agreed to take "reasonable precautions" to prevent the misuse of AI in disrupting elections worldwide.

OpenAI

OpenAI completes a deal that values the company at $80 billion, nearly tripling its valuation in less than 10 months.

Models making headlines:

Introducing Aya

Aya model is a new open-source massively multilingual language model.
It was instruction fine-tuned by people from all over the world through one year!
Aya is the unique model supporting 101 languages!
It's the next step in building truly multilingual models.
— TuringPost (@TheTuringPost)
2:03 PM • Feb 13, 2024

Aya’s dataset: https://arxiv.org/pdf/2402.06619.pdf

Introducing Sora: This paper introduces Sora, a breakthrough in video generation technology by OpenAI, capable of producing high-fidelity videos. It leverages spacetime patches to handle videos of varying durations and resolutions, making strides toward simulating the physical world with impressive 3D consistency and long-range coherence. It represents a leap in the ability to create detailed simulations that could be used for a myriad of applications, from entertainment to virtual testing environments →read the paper

Additional read:
- Take on the Sora technical report
- Jim Fan on why he believes Sora is learning physics
- Yann LeCun on why Sora doesn’t understand the physical world and why “modeling the world for action by generating pixel is as wasteful and doomed to failure as the largely-abandoned idea of "analysis by synthesis"”
- Francois Chollet on why “the inner physics model doesn't generalize to novel situations at all”
- Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more by Interconnects

Introducing V-JEPA (Yann LeCun’s vision of advanced machine intelligence, AMI): Meta's V-JEPA model revolutionizes unsupervised learning from videos by using feature prediction as its sole objective. This approach bypasses the need for pre-trained image encoders or text annotations, relying instead on the intrinsic dynamics of video data to learn versatile visual representations. It's a significant contribution to the field of unsupervised visual learning, promising advancements in how machines understand motion and appearance without explicit guidance →read the paper

Introducing Gemini 1.5: Google DeepMind's Gemini 1.5 introduces a Mixture-of-Experts architecture, enhancing the model's performance across a broader array of tasks. Notably, it expands the context window to 1 million tokens, enabling deep analysis over large datasets. Gemini 1.5 represents a significant step forward in AI's capability to process and understand extensive contexts, marking a milestone in the development of multimodal models →read the paper

Introducing Stable Cascade: Stable Cascade from Stability AI introduces a novel text-to-image generation framework that prioritizes efficiency, ease of training, and fine-tuning on consumer-grade hardware. The model's hierarchical compression technique represents a significant reduction in the resources required for training high-quality generative models, providing a pathway for wider accessibility and experimentation in the AI community →read the paper

The freshest research papers, categorized for your convenience

Language Understanding and Generation

OpenToM: Explores evaluating Theory-of-Mind reasoning in LLMs, addressing their capability to understand complex social and psychological narratives. Read the paper
In Search of Needles in a 10M Haystack: Demonstrates the capability of NLP models to process exceptionally long documents, pushing the boundaries of document length comprehension. Read the paper
Premise Order Matters in Reasoning with LLMs: Investigates the sensitivity of LLMs to the order of premises, revealing implications for reasoning tasks. Read the paper
Chain-of-Thought Reasoning Without Prompting: Uncovers the inherent ability of LLMs to generate reasoning paths, suggesting an alternative to explicit prompting. Read the paper
Suppressing Pink Elephants with Direct Principle Feedback: Addresses the challenge of topic avoidance in LLMs, proposing a novel fine-tuning method for enhanced controllability. Read the paper
GhostWriter: Develops an AI-powered writing environment focusing on personalization and increased user control in collaborative writing. Read the paper

Speech and Text-to-Speech Technologies

BASE TTS: Presents a billion-parameter TTS model, showcasing advancements in speech synthesis through large-scale training. Read the paper

Mathematical and Scientific Reasoning

OpenMathInstruct-1: Develops a dataset for math instruction tuning, aiming to improve LLMs' mathematical reasoning capabilities. Read the paper
InternLM-Math: Introduces a specialized LLM for math reasoning, incorporating various techniques for enhanced problem-solving in mathematics. Read the paper
ChemLLM: Creates the first LLM dedicated to chemistry, transforming structured chemical data into dialogue for diverse chemical tasks. Read the paper

Efficiency and Data Utilization in AI

How to Train Data-Efficient LLMs: Proposes sampling methods for enhancing data efficiency in LLM training, optimizing example selection. Read the paper
FIDDLER: Introduces a system for efficient inference of MoE models, leveraging CPU-GPU orchestration for improved performance in resource-limited settings. Read the paper
Tandem Transformers: Presents an architecture for improving inference efficiency of LLMs, utilizing a dual-model system for faster and accurate predictions. Read the paper
Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers: Proposes an advanced PTQ algorithm for efficient deployment of large Transformer models on edge devices. Read the paper

Multimodal and Vision-Language Models

Lumos: Details the first end-to-end multimodal question-answering system with enhanced text understanding from images, advancing MM-LLMs. Read the paper

Reinforcement Learning and Model Behavior

ODIN: Addresses reward hacking in RLHF, proposing a method to mitigate verbosity bias in LLMs for more concise and content-focused responses. Read the paper
Mixtures of Experts Unlock Parameter Scaling for Deep RL: Shows the impact of MoE modules on deep RL networks, enhancing parameter scalability and performance. Read the paper

Operating Systems and Generalist Agents

OS-COPILOT: Proposes a framework for developing generalist computer agents, enabling automation of tasks across different applications with minimal supervision. Read the paper

Graph Learning and State Space Models

Graph Mamba: Explores applying State Space Models to graph learning, addressing challenges like over-squashing and long-range dependencies. Read the paper

Challenges and Innovations in AI

A Tale of Tails: Explores the effects of synthetic data on neural model performance, theorizing potential risks of model collapse with synthetic data reliance. Read the paper
Transformers Can Achieve Length Generalization But Not Robustly: Investigates Transformers' ability to generalize to longer sequences, highlighting the challenge of maintaining robust performance. Read the paper
