FOD#95: Llama. Kind AI. And Landscape of Foundation Agents
Our Research recommendation section is back!
This Week in Turing Post:
Wednesday, AI 101, Models: What are World Models?
Friday, Interview: Mati Staniszewski, CEO at ElevenLabs
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.
For all who voted – thank you! Our Research recommendation section is back!
1. Llama 4: Meta's Surprising New AI Models and Mixed Reception
Meta made a splash on April 5, 2025, unexpectedly releasing a whole herd of new Llama models on a Saturday, igniting initial excitement followed by a harder wave of criticism. Their new models are Llama 4 Scout, Llama 4 Maverick, and the still-in-training Llama 4 Behemoth – the Llama 4 herd.
The most significant architectural innovation is the adoption of a mixture-of-experts (MoE) approach, where only a fraction of the model's parameters are activated for any given input. This allows for massive total parameter counts while keeping computational requirements manageable.
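To make the idea concrete, here is a minimal sketch of top-k expert routing in PyTorch – with toy dimensions and expert counts, not Meta's actual implementation: a small router scores the experts for each token, only the top-scoring experts run, and their outputs are combined, so per-token compute stays roughly constant even as total parameters grow.

```python
# Minimal mixture-of-experts sketch (illustrative toy, not Llama 4's actual code).
# A router picks the top-k experts per token; only those experts run, so the
# active parameter count per token stays small while total parameters can be huge.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = self.router(x)                    # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gate = F.softmax(top_scores, dim=-1)       # weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # each token's k chosen experts
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gate[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)                     # torch.Size([10, 64])
```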
Initial Disappointment in Performance
Many users have expressed significant disappointment with Llama 4's actual performance, particularly in coding tasks. Dr. Kaminski on Reddit noted:
"I just finished my KCORES LLM Arena tests, adding Llama-4-Scout & Llama-4-Maverick to the mix. My conclusion is that they completely surpassed my expectations... in a negative direction."
The disappointment is particularly acute given the models' size. According to DRMCC0Y on Reddit: "In my testing it performed worse than Gemma 3 27B in every way, including multimodal. Genuinely astonished how bad it is."
Despite Maverick having 402 billion parameters, many users found it performing on par with much smaller models like Qwen-QwQ-32B.
Controversy Over Benchmark Claims
And then – benchmarks. The company showcased impressive results on the LMArena leaderboard, but with an experimental chat-optimized variant of Maverick rather than the model it actually released, while also touting benchmark numbers for Llama 4 Behemoth – a model that isn’t publicly available and is still in training. That gap between the marketing and the downloadable models has damaged trust and led to accusations of misleading marketing.
As noted by Nathan Lambert in his analysis: "Sneaky. The results below are fake, and it is a major slight to Meta's community to not release the model they used to create their major marketing push."
Hardware Limitations and Context Window Reality
Despite claims of a 10 million token context window for Scout, current providers significantly limit this in practice (to around 128,000-328,000 tokens depending on the provider). Additionally, the MoE architecture makes these models challenging to run on consumer hardware – even with 4-bit quantization, the 109B Scout model is "far too big to fit on a 4090 – or even a pair of them," according to Jeremy Howard.
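A rough back-of-the-envelope check (weights only, ignoring the KV cache and any runtime overhead) shows why:

```python
# Weights-only VRAM estimate for the 109B-parameter Scout at 4-bit quantization.
# KV cache, activations, and framework overhead would add more on top of this.
params = 109e9                      # total parameters
bytes_per_param = 4 / 8             # 4-bit quantization -> 0.5 bytes per weight
weight_gib = params * bytes_per_param / 1024**3
print(f"~{weight_gib:.0f} GiB of weights")           # ~51 GiB
print("vs. 24 GiB on one RTX 4090, 48 GiB on two")   # still doesn't fit
```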
Why the Disappointment?
Several factors may have contributed to Llama 4's underwhelming reception:
Rushed Release: The unexpected Saturday release and reports of internal "war rooms" at Meta suggest competitive pressure, particularly from Chinese models like DeepSeek's R1.
Architectural Trade-offs: While the MoE architecture enables massive parameter counts, it introduces complexity that may affect performance consistency if not perfectly tuned.
Misaligned Strategy: The focus on ultra-large models doesn't serve the open-source community's needs for efficient, accessible implementations that can run on consumer hardware.
License Restrictions: The increasingly restrictive Llama license, including EU exclusion and commercial usage limitations, has alienated parts of the open-source community. (DeepSeek, for example, has a much more permissive license).
Some excitement is coming from Hugging Face (where you can try Llama) and, of course, Meta:
TLDR of Llama 4:
- 10M context length.
- Memory requirements? No problem: they use chunked attention (block mask) on the layers that do apply RoPE (3 in 4). This means only 8K tokens are kept in cache for most of the layers!
- Mega smart iRoPE, so QK norm post RoPE, and query…
— Arthur Zucker (@art_zucker)
8:05 PM • Apr 5, 2025
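To illustrate the chunked attention mentioned in the tweet: on the RoPE layers, a token attends only to earlier tokens inside its own fixed-size chunk, which is why the KV cache for those layers is bounded by the chunk size (8K tokens in Llama 4). Here is a toy mask sketch – with a tiny chunk size for readability, and no claim to match Meta's exact block mask:

```python
# Toy chunked (block-diagonal, causal) attention mask: a token attends only to
# earlier tokens in the same chunk, so a layer using this mask never needs to
# cache more than one chunk of keys/values.
import numpy as np

def chunked_causal_mask(seq_len: int, chunk_size: int) -> np.ndarray:
    pos = np.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return same_chunk & causal          # True = attention allowed

print(chunked_causal_mask(seq_len=8, chunk_size=4).astype(int))
```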
There’s a huge difference between the earlier Llama releases and this current flop. The competition in the “open-source” space is fierce now – DeepSeek, Gemma, Mistral, QwQ – all delivering better results with much smaller models. Meta’s strategy feels completely convoluted, probably driven by Mark Zuckerberg’s newly adopted masculine energy that demands beating everyone on benchmarks while forgetting what made Llama popular in the first place.
Meta, in trying to dazzle with size and swagger, may have built a herd for the wrong pasture.
2. Kind AI
After that snappy ending, let’s talk about kindness. Last week, Google DeepMind published a very interesting paper, "An Approach to Technical AGI Safety and Security," which discusses technical approaches to keeping powerful AI systems from causing severe harm, with a focus on misuse and misalignment.

They say “We are highly uncertain about the timelines until powerful AI systems are developed, but crucially, we find it plausible that they will be developed by 2030.” They also reference "Position: Levels of AGI for Operationalizing Progress on the Path to AGI". Looking at the levels, I wondered why we never speak about kindness in AI.

It would be funny to turn the narrative around and, instead of AGI, start discussing KAGI – Kind Artificial General Intelligence. Yes, ChatGPT/Claude/Gemini etc. are trained to refuse harmful instructions and respond with civility and care. But why don’t we use the word “kind” more?
I looked it up, by the way – 鍵 (Kagi) means “key” or “lock” in Japanese. Symbolically, keys represent security, secrets, or unlocking hidden potential. Kagi often represents opening or closing spiritual gateways and access to hidden or mystical knowledge – definitely the kind of things I would like my AI to do. Just some food for thought.
3. Advances and Challenges In Foundation Agents
20 prominent research labs got together and published a 264-page survey that is basically a deep dive into how we’re building super-smart AI agents. They try to answer the big questions, like "How close are these things to actually thinking or acting like humans?" and "What are they currently capable of, what's still tricky, and how do we make sure they don't go rogue?"

Image Credit: the project’s GitHub
If you're curious about where AI is headed beyond just chatbots – AI that can plan, remember things long-term, use tools, learn on its own, and even work in teams – this survey will be of interest. It connects agent research to how our brains work, and it covers the super important angle of how to build these things safely and ethically. Do they use the term “kind”? No, they don’t, but it's nonetheless a solid read.
Welcome to Monday. Or as AI labs call it now: Surprise Release Recovery Day.
Curated Collections
We are reading/watching
2027 Intelligence Explosion: Month-by-Month Model — Scott Alexander & Daniel Kokotajlo at Dwarkesh Patel’s Podcast
Jorge Luis Borges’s and Herbert A. Simon’s understanding of free will
Co-Agency as The Ultimate Extension of Human (we’ve opened our latest article to all readers)
News from The Usual Suspects ©
A horse from Kawasaki
Demis Hassabis’s Isomorphic Labs has raised $600 million in its first external funding round, led by Thrive Capital with GV and Alphabet joining in. The DeepMind-born biotech firm is pushing its AI drug discovery engine closer to clinical impact, with programs spanning multiple therapeutic areas.
Genspark – a cool new AI kid on the block – has launched Super Agent, a new entrant in the increasingly crowded world of autonomous AI agents. With orchestration across 9 LLMs and 80 tools, the system executes complex real-world tasks – from trip planning to synthetic voice calls. It edges past competitors on the GAIA benchmark, raising the bar for task automation. It works pretty well.
OpenAI’s PaperBench – their new evaluation suite – challenges AI agents to replicate ICML papers from scratch, no code given. Claude 3.5 Sonnet tops the leaderboard but still scores just 21%. Even the best models falter at long-horizon reasoning and execution, reminding us that real research isn't easily outsourced.
DeepMind’s Dreamer, its latest reinforcement learning agent, has successfully collected diamonds in Minecraft without human demonstrations. It does so by imagining future actions through a learned world model. A small task with big implications: generalisation and planning in novel environments.
Anthropic announces Claude for Education, launching AI access across Northeastern, LSE, and Champlain College. A new “learning mode” shifts Claude from answer engine to reasoning coach, while partnerships with Internet2 and Canvas aim to integrate AI into academic workflows – thoughtfully and at scale.
Models to pay attention to:
HallOumi (Oumi AI) – open-source 8B model for claim verification and hallucination detection. Outperforms top models with sentence-level citations, scoring, and explanations. Includes classifier variant and custom benchmark →read their blog
ScholarCopilot – a retrieval-aware LLM fine-tuned on 500K arXiv papers. It dynamically retrieves citations mid-generation to enhance academic writing, outperforming much larger models like Qwen-2.5-72B in citation and coherence metrics →read the paper
🌟Command A (Cohere) – an enterprise-optimized LLM trained for RAG, tool use, and multilingual capabilities (23 languages). Uses decentralized training, agent optimization, and novel model merging techniques →read the paper
OThink-MR1 – a multimodal LLM enhanced with Group Relative Policy Optimization (GRPO-D) for general reasoning across tasks. Shows significant cross-task generalization over SFT models →read the paper
RIG – an end-to-end generalist policy model for embodied agents that synergizes reasoning and imagination to achieve 17x sample efficiency →read the paper
Z1 – a code-reasoning-optimized LLM designed to reduce "thinking token" overhead via efficient trajectory training and the Shifted Thinking Window mechanism →read the paper
🌟TransMamba – a hybrid sequence model unifying Transformer and Mamba with shared parameter matrices. Dynamically switches mechanisms depending on sequence length →read the paper
📚 Surveys and overviews
🌟AI for Software Engineering – a high-level paper categorizing tasks, challenges, and promising research directions in automated software engineering →read the paper
Trustworthy GUI Agents explores five trustworthiness dimensions (security, reliability, explainability, ethics, and evaluation) for agents operating GUIs →read the paper
Harnessing the Reasoning Economy surveys methods to balance reasoning quality with computational cost in LLMs, contrasting System 1 (fast) and System 2 (deep) reasoning →read the paper
Test-Time Scaling in LLMs proposes a multidimensional taxonomy of test-time scaling (TTS) approaches: what/how/where/how well to scale →read the paper
Efficient Inference for Large Reasoning Models categorizes inference-efficient reasoning into explicit (compact CoT) and implicit (latent CoT) methods. Includes comparative analysis + future challenges →read the paper
Efficient Reasoning for LRMs: Language, Multimodality & Beyond tackles redundancy and inefficiency in CoT traces from DeepSeek-R1 and o1-style models →read the paper
There were quite a few top research papers this week; we mark them with 🌟 in each section.
🧠 Reasoning & Inference
🌟Inference-Time Scaling for Reward Modeling – DeepSeek-GRM introduces SPCT training for scalable reward models. Uses meta-RMs and generative critiques →read the paper
GenPRM – a generative process reward model that performs CoT with code verification and achieves performance beyond GPT-4o and Qwen2.5 PRMs →read the paper
Thinking Intervention enables fine-grained control over LLM reasoning by modifying internal thinking tokens. Shows boosts in instruction hierarchy reasoning and unsafe prompt handling →read the paper
Recitation over Reasoning demonstrates that SOTA models regress drastically on elementary problems when conditions are perturbed, indicating strong recitation bias →read the paper
JudgeLRM trains LLMs as evaluators using outcome-driven RL. Outperforms GPT-4 and DeepSeek-R1 in judgment accuracy →read the paper
🌟Inference-Time Scaling for Complex Tasks benchmarks scaling strategies across tasks and models, showing mixed results depending on domain and verifier use →read the paper
🌟Open-Reasoner-Zero – open-source implementation of a minimalist RL training approach for LLMs, mirroring DeepSeek-R1-Zero’s effectiveness with fewer steps →read the paper
Improved Visual-Spatial Reasoning fine-tunes Qwen2-VL using GRPO for better spatial intelligence from video data; outperforms GPT-4o →read the paper
Interpreting Emergent Planning in RL provides the first mechanistic evidence that model-free agents internally plan using learned representations →paper
🤖 Multimodal & Agents
🌟SynWorld equips LLM-based agents with scenario simulation and MCTS-based refinement of action knowledge →read the paper
Scaling Visual SSL compares CLIP and visual-only SSL on the same data; shows SSL can match or surpass CLIP when scaled to 7B params→read the paper
Multi-Token Attention proposes convolution-based attention allowing multi-query/key representations, outperforming transformers in long-context benchmarks →read the paper
🌟Agent S2 – a compositional generalist-specialist framework for GUI automation. Introduces Mixture-of-Grounding and Proactive Hierarchical Planning →read the paper
🌟KnowSelf proposes “agentic knowledgeable self-awareness” for agent models, enabling dynamic self-regulation of knowledge use during planning and decision-making →read the paper
VerifiAgent – a unified verification agent integrating meta-verification and tool-based methods to validate model reasoning across types →read the paper
🧪 Training & Scaling
🌟ZClip – a z-score-based dynamic gradient clipping method that reduces loss spikes during LLM pretraining more effectively than traditional methods →paper
Scaling Speech-Text Models – finds that interleaved SLMs scale better than speech-only ones, requiring less data and compute for comparable performance →read the paper
🌟MegaScale-Infer redesigns large-scale MoE serving with disaggregated attention and FFN modules to boost GPU utilization and cut costs through a tailored pipeline and M2N communication → read the paper
MegaMath – a massive 371B-token math-centric corpus combining web data, code, and synthetic QA data for LLM math pretraining→read the paper
RLHF Data Scaling examines prompt quality and reward hacking in RLHF; proposes RTV + GenRM hybrid reward and Pre-PPO prompt selection →paper
Massive Activations refines assumptions around large activation spikes in LLMs and proposes hybrid mitigation techniques like Target Variance Rescaling →read the paper
Instruction-Guided Parameter Generation (IGPG) autoregressively generates model weights from instructions using VQ-VAE and token-level synthesis. Promises scalable and coherent weight generation →read the paper
Efficient Model Selection via LLMs uses LLMs to eliminate the need for costly meta-learning matrices in time series forecasting model selection →paper
🌟Scaling Laws in Scientific Discovery introduces the Autonomous Generalist Scientist (AGS) and proposes the possibility of new scaling laws in research →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
Leave a review!