
FOD#68: Vibe Check and Benchmarks: Are We Capable of Measuring AI Progress?

Keep the perspective: every week we connect the dots to provide a comprehensive understanding of the AI world

This Week in Turing Post:

  • Wednesday, AI 101: we do a deep dive into OLMoE.

  • Friday, Global AI Affairs: the past six months of AI development in China

If you like Turing Post, consider becoming a paid subscriber, checking out our partner’s webinar, or sharing it with a friend. It helps us keep Monday digests free →

The main topic

A few things last week made me think about “vibes” and evaluations. First, one of the AI newsletters said that “o1-preview consistently beats out our vibe check evals,” and then Microsoft published its new Eureka framework, which challenges AI's glossy leaderboard scores by revealing critical inconsistencies across 12 top models.

And I’d like to rant a little.

Vibe Check and Benchmarks: Are We Capable of Measuring AI Progress?

Let’s take a step back in time, all the way to the mid-20th century. AI was in its infancy, and back then, benchmarks were simple: games like chess and checkers, along with tasks like Optical Character Recognition (OCR), became the go-to tests for proving AI's capabilities. It wasn’t until the 1980s that benchmarking in its modern form started to take over, largely driven by the advent of dataset competitions like speech recognition challenges and, later, ImageNet. These competitions, while invaluable, also set the stage for something insidious: the over-reliance on benchmarks as the sole measure of AI progress.

Fast forward to today, and we are surrounded by benchmarks. LLMs, particularly, are evaluated through a dizzying array of them: MMLU, GSM8K, and others. Companies are laser-focused on beating each other on leaderboards, chasing marginal improvements that boost their standing. But as AI’s capabilities have expanded, so too has the complexity of how we measure them, and therein lies the problem.

The Reality of Benchmark Worship

Here’s the uncomfortable truth: most of the time, AI evaluation is no longer purely about measurable, quantitative progress. Much of it boils down to what’s colloquially called the “vibe check.” It sounds unserious – and I rather resent the term – but it’s the best way to describe how people interact with and assess AI models today. Does it feel right? Do you “speak” the same language, or is it just... off?

In reality, this “vibe” reflects the way users intuitively assess an LLM’s strengths and weaknesses. Formal benchmarks, on the other hand, often fall short in capturing these nuances. Take summarization, for instance. It’s impossible to perfectly evaluate an LLM’s ability to condense a text without injecting human judgment. How do you quantify elegance, relevance, or tone? (and my constant yelling at ChatGPT: tone down!)

The Benchmark Trap

The problem with benchmarks is that they can be gamed. It’s not hard for a model to memorize test data it has seen during training, and no matter how much effort goes into filtering exact matches, LLMs can still absorb benchmark content indirectly through synthetic rewrites or related online discussions. This leads to the core issue: many AI models aren’t evaluated on their real-world adaptability but on how well they perform on pre-determined benchmarks.
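As an aside, the “filtering exact matches” that labs do is typically some form of n-gram overlap check between training documents and benchmark items. Here is a minimal sketch of that idea – the function names and the 8-gram / 50% threshold are illustrative assumptions, not any lab’s actual pipeline – which also shows why it is a leaky filter: only near-verbatim copies get flagged, so paraphrases and synthetic rewrites of test questions pass straight through.

```python
# Illustrative n-gram contamination check (names and threshold are hypothetical).
# A training document is flagged if it shares a large fraction of 8-grams with
# any benchmark item. Paraphrases share few exact n-grams, so they slip through --
# exactly the loophole discussed above.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    doc_grams = ngrams(train_doc, n)
    if not doc_grams:
        return False
    for item in benchmark_items:
        item_grams = ngrams(item, n)
        if not item_grams:
            continue
        # fraction of the benchmark item's n-grams that appear verbatim in the doc
        overlap = len(doc_grams & item_grams) / len(item_grams)
        if overlap >= threshold:
            return True
    return False

# An exact copy is caught; a rewritten version of the same question is not.
benchmark = ["what is the capital of france answer paris is the capital of france"]
print(is_contaminated(
    "trivia dump: what is the capital of france answer paris is the capital of france",
    benchmark))  # True
print(is_contaminated(
    "a rewritten item asking which french city serves as the seat of government",
    benchmark))  # False
```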

Microsoft’s Eureka, mentioned above, tries to uncover exactly that, offering radar charts that clearly show where each model excels and where it lags, thus breaking the illusion that a single model is “best” just because it ranks higher on certain benchmarks.

But even when models excel on tests, are they truly innovative? No! The famous move 37 in the AlphaGo match against Lee Sedol shocked the world because it was a move no human had anticipated. It was a game between the two of them: a human, and a machine that didn’t realize it had made a completely novel move. A benchmark would not recognize such a move. A human would.

Benchmarks as a Bottleneck to Progress

Tesla’s Andrej Karpathy once remarked that he spent one-third of his time building good evaluation systems. The effort is immense, and yet, despite all this work, even the best benchmarks don’t align with the qualitative experience of using an AI model. The gap between what benchmarks can measure and what really matters – how models perform in messy, real-world scenarios – continues to widen.

This brings us back to the vibe check. It’s not as methodologically rigorous as a benchmark, but it captures something that numbers often miss: how humans actually interact with AI.

The last thing: I am often asked, "What is the best model?" The best model is the one you’ve learned how to "work" with. The one you’ve – pardon – built rapport with. For me, it’s ChatGPT. No other model – and I’ve played with many – can do what I can make ChatGPT do for me. I just know how to prompt it. I know its language. But for my husband, the best model is Claude because it’s better for coding and is integrated with Cursor, and Llama because it’s open-source. He’s learned how to prompt them.

What does it say about AI progress if the best models are the ones we personally "connect" with, rather than the ones that score highest on a leaderboard?


đź’Ž We recommend: Fine-tuning foundation models on your data with SuperAnnotate and AWS (Free Webinar)

Model performance on specific use cases is a key blocker to deploying production-ready LLMs in enterprises. The biggest reason performance suffers is that available models are not trained on company- and use-case-specific data. It’s estimated that as much as 40% of LLM initiatives are stalled by training data quality.

While many enterprises have a lot of internal data, it is most often not of the quality required to be used for fine-tuning language models. With SuperAnnotate and AWS, enterprises can easily build proprietary datasets for fine-tuning, dramatically improve LLM model performance, and deploy LLMs into the enterprise faster than ever.

News from The Usual Suspects ©

  • Sam Altman published a modern campfire story

    • Imagine: a night in the woods, the air chilly, a few stars sparkling overhead. A fire crackles at your feet, making it cozy, and in front of you sits Sam Altman, reading the recent OpenAI post, “The Intelligence Age.” That’s the vibe. In a nutshell, it’s this: AI will revolutionize human progress, bringing unprecedented prosperity and transformative capabilities through deep learning.

  • Microsoft Research: Eureka Reveals AI's Hidden Flaws

    • As we’ve mentioned, Microsoft’s new Eureka framework challenges AI's glossy leaderboard scores, revealing critical inconsistencies across 12 top models. Multimodal tasks like object recognition and spatial reasoning remain Achilles' heels, with models struggling more in height than depth perception. Language capabilities? Better, but long-context reasoning and factual accuracy falter – fact precision dips below 55%. Even top-tier models like GPT-4o and Claude 3.5 Sonnet show worrying backward incompatibility and output randomness. Eureka offers a fresh lens, aiming to elevate AI evaluation standards.


  • Microsoft, the behemoth:

    • In a heavyweight partnership, Microsoft, BlackRock, Global Infrastructure Partners, and MGX have launched the Global AI Infrastructure Investment Partnership (GAIIP) to pump $100 billion into U.S. data centers and power sources. With NVIDIA onboard, they're ensuring the AI revolution has the muscle—and the energy—to reshape the digital economy. Who says AI doesn't run on volts?

    • And if not on volts, it runs on nuclear! Microsoft struck a 20-year deal with Constellation Energy to power its data centers using Three Mile Island's nuclear energy. While regulatory permits are pending, this move reflects Big Tech’s growing reliance on nuclear energy to meet AI’s rising power demands, despite new plant construction being unlikely.

  • Groq and Aramco Power Saudi AI Ambitions:

    • They are building the world’s largest AI inferencing data center in Saudi Arabia. With an initial 19,000 language processing units, and expansion plans for up to 200,000, the center aims to support AI systems across the Middle East, Africa, and India. Groq is betting on this to challenge Nvidia’s dominance.

  • Perplexity AI's Ad Play: Taking on Google’s $300bn Empire

    • Perplexity AI is shaking up the digital ad world, negotiating with brands like Nike and Marriott to launch “sponsored questions.” Unlike Google’s link-based auction ads, Perplexity’s model will feature AI-generated answers approved by advertisers. With CPM rates significantly lower than Google's, it’s an enticing option for premium brands. However, Perplexity’s success hinges on scaling up – its 250 million queries in July still pale in comparison to Google’s reach.

  • Lionsgate Taps AI Magic with Runway

    • In a pioneering partnership, Lionsgate and Runway are developing a customized AI model based on Lionsgate's vast film catalog. The future of film may soon have an AI co-director.

  • Salesforce Ventures Goes All-In on AI

    • It is ramping up its AI game with a new $500 million fund, bringing total AI investments to $1 billion in just 18 months. Backing innovators like Anthropic and Hugging Face, Salesforce aims to drive market-shifting AI advances while prioritizing trust and responsibility. The future of enterprise AI is, so far, very well-funded. (And we haven’t even mentioned OpenAI’s oversubscribed gazillion-dollar round.)


The freshest research papers, categorized for your convenience

Our Top: New Models

  • Vision-Language and Multimodal Models
    Introduces models enhancing the interaction between visual and textual inputs.

    • Read the paper: Qwen2-VL: Enhances vision-language model's perception of the world with improved resolution flexibility, surpassing previous models in various multimodal tasks, including video comprehension.

  • Code-Centric Language Models
    Focuses on models tailored for coding tasks, improving performance in code generation, reasoning, and completion.

    • Read the paper: Qwen2.5-Coder: Builds on previous models to advance code-related benchmarks, offering variants designed for real-world coding applications.

  • Speech and Dialogue Models
    Integrates speech and text to enable real-time conversational dynamics and improve latency in dialogue systems.

    • Read the paper: Moshi: Facilitates natural spoken dialogue with low latency, improving speech recognition and conversational dynamics for real-time dialogue systems.

Large Language Models and Optimization

  • Anthropic’s “Contextual Retrieval” Revolutionizes AI Knowledge enhances retrieval-augmented generation by integrating contextual embeddings, significantly improving accuracy in large-scale knowledge retrieval tasks like customer support and legal analysis. Read the paper

  • Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization proposes HyperCloning, a method to efficiently scale pre-trained small models into larger ones, speeding up training by 2-4x without losing accuracy. Read the paper

  • Schrödinger's Memory: Large Language Models examines LLM memory mechanisms, arguing they function like Schrödinger's cat paradox—existing only when queried—and compares this to human memory. Read the paper

  • RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval introduces an approach to speed up long-context LLM inference by retrieving key-value vectors, achieving latency reductions while maintaining accuracy. Read the paper

  • Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models creates a retrieval model that adapts to natural language prompts, enhancing performance in both retrieval and instruction tasks. Read the paper

Model Safety, Misleading Outputs, and Self-Correction

  • Language Models Learn to Mislead Humans via RLHF identifies a risk in RLHF-trained models where they unintentionally mislead humans by fabricating convincing but false evidence, highlighting the limitations of current probing methods. Read the paper

  • Training Language Models to Self-Correct via Reinforcement Learning introduces SCoRe, a reinforcement learning approach that improves LLM self-correction in complex tasks like math and programming, achieving state-of-the-art results. Read the paper

  • Jailbreaking Large Language Models with Symbolic Mathematics demonstrates how symbolic mathematics can be used to bypass LLM safety mechanisms, revealing vulnerabilities in existing safeguards. Read the paper

Mathematical and Symbolic Reasoning

  • To CoT or Not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning evaluates Chain-of-Thought prompting, concluding that it primarily improves performance in math and symbolic reasoning tasks, but offers limited gains in commonsense or knowledge tasks. Read the paper

  • Implicit Neural Representations with Fourier Kolmogorov-Arnold Networks presents FKAN, a model improving frequency representation in implicit neural networks, excelling in tasks like 3D occupancy volume representation and image processing. Read the paper

Personalization and Multimodal Learning

  • LLMs + Persona-Plug = Personalized LLMs introduces PPlug, a model that personalizes LLM outputs by integrating user preferences through lightweight plug-in modules, significantly enhancing task performance. Read the paper

  • NVLM: Open Frontier-Class Multimodal LLMs unveils NVLM 1.0, a multimodal model excelling in both text and vision tasks by leveraging hybrid architectures and dynamic high-resolution tagging, rivaling proprietary models. Read the paper

Model Efficiency and Training Techniques

  • GRIN: GRadient-INformed MoE introduces a Mixture-of-Experts model that optimizes parallelism and expert routing, achieving better efficiency and performance in tasks like math and reasoning. Read the paper

  • Single-Layer Learnable Activation for Implicit Neural Representation (SL2A-INR) presents a novel architecture for implicit neural representations, enhancing the ability to capture high-frequency details in tasks like 3D reconstruction. Read the paper

Cognition and Understanding

  • Human-like Affective Cognition in Foundation Models evaluates how foundation models like GPT-4 and Claude understand human emotions, comparing their responses to human judgments across psychological scenarios, finding high alignment. Read the paper

  • Measuring Human and AI Values based on Generative Psychometrics with Large Language Models introduces a method called Generative Psychometrics for Values (GPV) to measure human and AI values, showing its effectiveness in improving value alignment and AI safety. Read the paper


Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!
