FOD#57: On the Measure of Intelligence

Chollet's ARC and a $1,000,000+ competition to create an AI that can adapt to novelty and solve simple reasoning problems

Next Week in Turing Post:

  • Wednesday, AI 101: a dive into memory optimization: we discuss YaFSDP and how it enhances the efficiency of distributed training;

  • Friday, Interview with Innovators: join us for a conversation with two of the main researchers behind Phi, Microsoft’s famous family of small language models.

If you like Turing Post, consider becoming a paid subscriber. You’ll immediately get full access to all our articles, investigations, and tech series →

“It is a testimony to the immaturity of our field that the question of what we mean when we talk about intelligence still doesn’t have a satisfying answer. What’s worse, very little attention has been devoted to rigorously defining it or benchmarking our progress towards it,” wrote François Chollet in his paper “On the Measure of Intelligence”. That was in November 2019.

In this elegant contemplation on what intelligence is, he highlighted two contrasting perspectives:

  • The Psychological Viewpoint: This approach, championed by figures like Alfred Binet, relies on psychometric tests like IQ to quantify cognitive abilities. It focuses on measuring existing skills within a limited scope.

  • The AI Viewpoint: Pioneered by Alan Turing, this perspective views intelligence as the ability to achieve goals across diverse environments, emphasizing adaptation and learning.

Chollet argues that both of these views are incomplete and proposes a new definition of intelligence based on skill-acquisition efficiency:

“The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.”
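
In the paper, this definition is made precise with tools from Algorithmic Information Theory. A simplified schematic of the idea, in our own notation rather than Chollet’s exact equations:

```latex
% Schematic only, in our notation -- not Chollet's exact AIT formulation.
% Skill_T: skill attained on task T;  GD_T: generalization difficulty of T;
% P: priors baked into the system;   E_T: experience consumed on T;
% omega_T: the value assigned to task T within the scope.
I \;\propto\; \sum_{T \,\in\, \text{scope}} \omega_T \cdot
    \frac{\mathrm{GD}_T \cdot \mathrm{Skill}_T}{P + E_T}
```

The punchline: reaching the same skill with fewer priors and less experience, on tasks that are harder to generalize to, counts as more intelligence.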

In the same paper, Chollet, a practitioner at heart, also introduced ARC (the Abstraction and Reasoning Corpus, now known as ARC-AGI), a benchmark designed to evaluate AI generalization capabilities.

Fast forward five years. ARC-AGI remains unbeaten. Conversations around AGI have shifted towards doomsday scenarios rather than defining intelligence. To refocus efforts and incentivize progress, Chollet established the ARC Prize 2024, a competition with a $1,100,000 prize pool. The goal? To encourage the development of AI systems that can solve novel reasoning tasks without extensive training data – a crucial step towards achieving AGI. This time, everyone is welcome to the Arc.
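
To make the challenge concrete: each ARC-AGI task is a small JSON file with a handful of demonstration pairs of colored grids plus one or more test inputs, and the solver must infer the transformation rule from the demonstrations alone. A minimal loader (the file name is just an example):

```python
import json

# Each ARC-AGI task has "train" demonstration pairs and "test" pairs.
# Grids are lists of lists of integers 0-9, each integer a color.
with open("task_0520fde7.json") as f:  # example task file
    task = json.load(f)

for pair in task["train"]:
    inp, out = pair["input"], pair["output"]
    print(f"demo: {len(inp)}x{len(inp[0])} grid -> {len(out)}x{len(out[0])} grid")

# The solver sees only the test input and must produce the output grid.
test_input = task["test"][0]["input"]
```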

News from The Usual Suspects ©

  • Microsoft delays Recall AI release and discontinues GPT Builder for Copilot Pro

    • Microsoft will delay the release of its Recall AI feature, which tracks computer usage, due to privacy concerns. Originally set for a broad release next week, it will now be available only to Windows Insider Program users for feedback before a wider rollout. The delay follows pointed criticism from security researchers over how Recall captures and stores snapshots of user activity.

    • On July 10, 2024, Microsoft will discontinue the GPT Builder for Copilot Pro customers and delete all privately created custom chatbots.

  • Mistral's Major Funding Milestone

    • Mistral, a barely one-year-old French AI company, has raised $645 million at a $6.2 billion valuation. A bastion of open-source AI, the company is taking on Anthropic and OpenAI while also receiving investment from Microsoft. Time to read our profile of them: Mistral AI's Bold Journey

  • Apple introduces Private Cloud Compute (PCC) and makes a no-pay deal with OpenAI

    • Apple always highlights its focus on security, and last week it introduced Private Cloud Compute (PCC), a cloud intelligence system designed for private AI processing. Built with custom Apple silicon and a hardened OS, PCC extends Apple’s device security model to the cloud.

    • Apple and OpenAI have formed a landmark agreement to integrate ChatGPT into Apple devices without direct monetary payment. How smart is that! Instead, Apple will promote OpenAI's technology across its products, valuing this exposure over cash. Apple is also exploring AI collaborations with Google and Anthropic, seeking to offer a range of chatbot options.

  • OpenAI's Executive Hires and Revenue Growth

    • OpenAI has announced key executive hires: former Nextdoor CEO Sarah Friar as CFO and Kevin Weil, previously of Twitter and Instagram, as Chief Product Officer.

    • OpenAI’s annualized revenue has doubled to $3.4 billion since late 2023, driven primarily by subscriptions and API access. OpenAI's valuation is around $86 billion. No matter how much drama they create, they also deliver things that work.

  • Samsung Showcases AI Innovations

    • Samsung steps up its foundry game, unveiling 2nm and 4nm process nodes and integrated Samsung AI Solutions. These innovations aim to rival TSMC with high-performance, low-power semiconductors.

  • Luma Dream Machine Underwhelms

    • Luma’s Dream Machine is so far not impressive. Though it’s free, it’s useless.

The freshest research papers, categorized for your convenience

Our top

  • Nemotron-4 340B – an open-source pipeline for generating synthetic data

    Researchers from NVIDIA released the Nemotron-4 340B model family, comprising Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. These models are open access under a permissive license and demonstrate competitive performance on various benchmarks. They are optimized for deployment on a single DGX H100 system with 8 GPUs. Over 98% of the data used for model alignment was synthetically generated, highlighting the models' effectiveness in creating high-quality synthetic data.

  • MAGPIE: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    Researchers from the University of Washington and the Allen Institute for AI developed MAGPIE, a method to generate large-scale alignment data for LLMs without prompt engineering or seed questions. The trick: feed an aligned LLM only the pre-query part of its chat template, and its auto-regressive nature makes it invent a plausible user instruction on its own (see the sketch after this list). With it, they generated 4 million instructions and filtered them down to 300K high-quality instances. MAGPIE-tuned models performed comparably to those fine-tuned with much larger datasets, showcasing the method's effectiveness in producing diverse and high-quality instruction data.

  • HelpSteer2: Open-source dataset for training top-performing reward models

    Researchers from NVIDIA present HelpSteer2, a high-quality, permissively licensed preference dataset (CC-BY-4.0) for training reward models that effectively guide LLMs in generating responses aligned with human preferences. Despite having only 10,000 response pairs, HelpSteer2 achieves state-of-the-art performance on RewardBench's primary dataset. The dataset enhances training efficiency and quality, allowing the creation of reward models that surpass both open-source and proprietary counterparts.

  • The Prompt Report: A Systematic Survey of Prompting Techniques

    A detailed 76-page review of prompting techniques used in generative AI models. Researchers from top AI labs developed a taxonomy comprising 58 text-only and 40 multimodal prompting techniques, alongside a vocabulary of 33 terms. Worth saving.
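
For the curious, here is the MAGPIE trick from the list above in a few lines: because an aligned chat model was trained on a fixed chat template, giving it only the pre-query part of that template makes it auto-regressively invent a plausible user instruction, which it can then answer in a second pass. A minimal sketch with Hugging Face transformers, using Llama-3-8B-Instruct for illustration (not the authors’ exact pipeline):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Pre-query template only: no seed question, no prompt engineering.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
ids = tok(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**ids, max_new_tokens=128, do_sample=True, temperature=1.0)
instruction = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

# Second pass: wrap the sampled instruction in the full chat template to
# get the paired response, then filter the pairs for quality at scale.
messages = [{"role": "user", "content": instruction}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```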

Benchmarks and Evaluation Frameworks

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery Develops a bilingual benchmark to evaluate LLMs in computer science across 26 subfields, showing the importance of improving CS-specific reasoning for LLM advancements.
Read the paper

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning Introduces a benchmark to assess LLMs' temporal reasoning abilities using synthetic datasets, revealing strengths and weaknesses in temporal semantics and arithmetic tasks.
Read the paper

LiveBench: A Challenging, Contamination-Free LLM Benchmark Eliminates test set contamination by using frequently updated questions from current sources, challenging LLMs in various tasks like math and coding with evolving monthly benchmarks.
Read the paper

CRAG: Comprehensive RAG Benchmark Evaluates Retrieval-Augmented Generation in question answering across five domains, highlighting the effectiveness and challenges of RAG methods in handling dynamic and complex facts.
Read the paper

WILDBENCH: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild Uses real-world queries to benchmark LLMs, systematically evaluating responses with high correlation to human judgments to guide improvements in real-world applications.
Read the paper

NATURAL PLAN: Benchmarking LLMs on Natural Language Planning Assesses LLMs' planning capabilities in tasks like Trip Planning and Calendar Scheduling, revealing significant performance drops with increased task complexity.
Read the paper

Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus Evaluates LLMs on subjective tasks through a collaborative framework, providing more consistent and robust rankings aligned with human judgments.
Read the paper

MCEVAL: Massively Multilingual Code Evaluation Supports code tasks across 40 programming languages, addressing limitations of existing benchmarks and highlighting the gap between open-source and closed-source models.
Read the paper

Hybrid and Specialized Models

TransNAR: Transformers Meet Neural Algorithmic Reasoners Integrates Transformers with Neural Algorithmic Reasoners using Graph Neural Networks, significantly improving algorithmic reasoning performance and generalization.
Read the paper

SAMBA: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling Combines selective State Space Models with Sliding Window Attention, efficiently handling sequences up to 1M tokens with improved memory recall and throughput.
Read the paper
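
As we read the paper, SAMBA’s core is an interleaving pattern: Mamba blocks compress arbitrarily long history into recurrent state, while sliding-window attention provides exact recall within a local window. A structural sketch in PyTorch (the Mamba and attention classes here are runnable stand-ins, not the authors’ code):

```python
import torch
import torch.nn as nn

class MambaLayer(nn.Module):      # stand-in for a selective SSM block
    def __init__(self, d): super().__init__(); self.proj = nn.Linear(d, d)
    def forward(self, x): return self.proj(x)

class SlidingWindowAttention(nn.Module):  # stand-in for windowed attention
    def __init__(self, d, window): super().__init__(); self.proj = nn.Linear(d, d)
    def forward(self, x): return self.proj(x)

class SambaBlock(nn.Module):
    """One SAMBA block: Mamba -> MLP -> sliding-window attention -> MLP."""
    def __init__(self, d, window=2048):
        super().__init__()
        self.mamba = MambaLayer(d)                    # long-range, recurrent state
        self.swa = SlidingWindowAttention(d, window)  # precise local recall
        self.mlp1 = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.mlp2 = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        x = x + self.mamba(x)
        x = x + self.mlp1(x)
        x = x + self.swa(x)
        return x + self.mlp2(x)

y = SambaBlock(256)(torch.randn(1, 16, 256))  # (batch, seq, d_model)
```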

TEXTGRAD: Automatic “Differentiation” via Text Optimizes complex AI systems using natural language feedback, enhancing zero-shot accuracy in various applications without requiring prompt tuning.
Read the paper
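
The idea is easiest to see as a loop: a critic LLM produces a natural-language “gradient” for an output, and an optimizer LLM applies it to rewrite the variable being tuned. A minimal sketch of that loop, with a hypothetical llm helper rather than the authors’ library API:

```python
def llm(prompt: str) -> str:
    # Hypothetical helper: send `prompt` to any chat model, return its text.
    raise NotImplementedError

question = "What are the first-line treatments for hypertension?"
instruction = "Answer the medical question concisely."   # variable being optimized

for step in range(3):
    answer = llm(f"{instruction}\n\nQuestion: {question}")
    # "Loss" and "gradient": a natural-language critique instead of a scalar.
    critique = llm(f"Critique this answer for accuracy and clarity:\n{answer}")
    # "Optimizer step": rewrite the variable so the critique no longer applies.
    instruction = llm(
        f"Improve this instruction given the critique.\n"
        f"Instruction: {instruction}\nCritique: {critique}"
    )
```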

Efficiency and Optimization

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization Enhances LLM efficiency by replacing multiplications with shift-and-add operations, significantly reducing memory and energy consumption.
Read the paper
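
The arithmetic behind the title, in a toy example: once a weight is quantized to signed powers of two, x * w collapses into bit shifts and additions, which are far cheaper in hardware than multiplies. (ShiftAddLLM’s actual reparameterization works per weight group with lookup tables; this only shows the primitive.)

```python
x = 183                    # an integer activation
w = 0.25                   # a weight that is exactly 2**-2

print(x * w)               # 45.75 -- costs a multiply
print(x >> 2)              # 45    -- same value (truncated) via one shift

# A general weight is approximated by a few signed powers of two, so one
# multiply becomes a handful of shifts and adds:
w_target = 0.57
w_approx = 2**-1 + 2**-4   # 0.5625, approximating 0.57
print((x >> 1) + (x >> 4)) # 102, close to x * w_target = 104.31
```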

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Uses a novel dReLU function to enhance activation sparsity, achieving efficiency gains and performance improvements in LLM inference.
Read the paper
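
As we understand the paper, dReLU swaps the SiLU gate of a standard SwiGLU feed-forward block for ReLU on both branches, so most hidden units are exactly zero and can be skipped at inference time. A sketch in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DReLUFFN(nn.Module):
    """Gated FFN with ReLU on both branches (standard SwiGLU uses SiLU)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        h = F.relu(self.gate(x)) * F.relu(self.up(x))  # mostly exact zeros
        return self.down(h)  # zero entries of h contribute nothing

ffn = DReLUFFN(512, 2048)
x = torch.randn(2, 8, 512)
h = F.relu(ffn.gate(x)) * F.relu(ffn.up(x))
print("hidden sparsity:", (h == 0).float().mean().item())  # ~0.75 at init
```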

PowerInfer-2: Fast Large Language Model Inference on a Smartphone Enables rapid LLM inference on smartphones by decomposing computations and minimizing I/O overhead, significantly speeding up performance.
Read the paper

Robustness and Reliability

Large Language Model Confidence Estimation via Black-Box Access Estimates LLM response confidence using prompt perturbations and logistic regression, outperforming existing methods and suggesting potential for a universal confidence model.
Read the paper
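
The recipe, as the paper describes it, needs only black-box access: perturb each prompt, turn the model’s answers into cheap consistency features, and fit a logistic regression that predicts correctness. A sketch with stand-in features and a stand-in model call, not the authors’ exact feature set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_model(prompt: str) -> str:
    raise NotImplementedError("any black-box LLM call goes here")

def features(prompt: str, paraphrases: list[str]) -> list[float]:
    base = ask_model(prompt)
    alts = [ask_model(p) for p in paraphrases]
    agreement = float(np.mean([a == base for a in alts]))  # answer stability
    return [agreement, float(len(base.split()))]           # plus a crude length signal

# Calibration: X holds feature rows and y holds 1/0 correctness labels for
# a labeled set; predict_proba then yields a confidence for new prompts.
clf = LogisticRegression()
# clf.fit(X, y); confidence = clf.predict_proba([features(p, paras)])[0, 1]
```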

Merging Improves Self-Critique Against Jailbreak Attacks Enhances LLM robustness by merging an external critic model to improve self-critique capabilities, reducing the success rate of adversarial attacks.
Read the paper

Embedding-COrrupted (ECO) Prompts: Large Language Model Unlearning via Embedding-Corrupted Prompts Ensures high unlearning quality with minimal performance impact by applying embedding corruptions during inference, scalable across various LLM sizes.
Read the paper
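
The mechanism in outline, as we read it: the LLM itself is never retrained; a lightweight classifier flags prompts that fall in the forget scope, and only those prompts have their token embeddings corrupted before the forward pass. A toy sketch (the classifier and the noise are placeholders for the learned components in the paper):

```python
import torch

def in_forget_scope(prompt: str) -> bool:
    # Toy stand-in for the paper's learned prompt classifier.
    return "harry potter" in prompt.lower()

def maybe_corrupt(prompt: str, prompt_embeds: torch.Tensor, sigma: float = 1.0):
    # Corrupt only flagged prompts, only at inference; weights stay untouched.
    if in_forget_scope(prompt):
        return prompt_embeds + sigma * torch.randn_like(prompt_embeds)
    return prompt_embeds
```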

Specialized Applications and Techniques

Simple and Effective Masked Diffusion Language Models Proposes a novel training method for masked diffusion language models, achieving state-of-the-art performance in text generation and DNA sequence modeling.
Read the paper

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus Introduces a large-scale multilingual and multimodal document corpus, significantly improving few-shot learning performance in various multilingual image-text tasks.
Read the paper

Discovering Preference Optimization Algorithms with and for Large Language Models Develops the DiscoPOP algorithm by iteratively prompting LLMs, blending logistic and exponential losses to achieve state-of-the-art performance across various tasks.
Read the paper
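
For a sense of what “blending logistic and exponential losses” means, here is a schematic in our own notation (not the discovered objective verbatim): with the beta-scaled log-ratio margin between chosen and rejected responses, as in DPO, a gate mixes the two classic losses:

```latex
% Schematic only. \rho: beta-scaled log-ratio margin (as in DPO);
% w(\rho): a gating weight that blends the two terms.
\mathcal{L}_{\text{blend}}(\rho) \;=\;
    w(\rho)\,\log\!\left(1 + e^{-\rho}\right)
    \;+\; \bigl(1 - w(\rho)\bigr)\, e^{-\rho}
```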

Estimating the Hallucination Rate of Generative AI Estimates hallucination rates in generative AI models by evaluating response log probabilities, demonstrating accuracy in synthetic and NLP tasks.
Read the paper

Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent “Middle” Enhancement Introduces CREAM to extend LLM context windows efficiently, significantly improving long-context performance without disrupting original capabilities.
Read the paper

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report Combines LLMs with Monte Carlo Tree Search to enhance mathematical reasoning, significantly boosting performance in solving complex problems.
Read the paper
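
The loop, in outline: tree nodes are candidate solutions; expansion asks the model to critique and rewrite the parent’s answer; the model also scores answers, and that self-evaluated reward drives UCB selection. A compact sketch with stand-in llm_refine / llm_score calls:

```python
import math, random

def llm_refine(answer: str) -> str:
    # Stand-in: ask the LLM to critique and rewrite `answer`.
    return answer + " *"

def llm_score(answer: str) -> float:
    # Stand-in: ask the LLM to grade the answer, e.g. on a 0-1 scale.
    return random.random()

class Node:
    def __init__(self, answer, parent=None):
        self.answer, self.parent = answer, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def search(root, iters=8):
    for _ in range(iters):
        node = root
        while node.children:                          # selection
            node = max(node.children, key=ucb)
        child = Node(llm_refine(node.answer), node)   # expansion = self-refine
        node.children.append(child)
        reward = llm_score(child.answer)              # LLM self-evaluation
        while child:                                  # backpropagation
            child.visits += 1
            child.value += reward
            child = child.parent
    return max(root.children, key=lambda n: n.value / n.visits)

best = search(Node("draft solution"))
```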

Cognitively Inspired Energy-Based World Models Mimics human cognitive processes using Energy-Based Models, showing superior scalability in computer vision tasks and promising results in NLP.
Read the paper

HUSKY: A Unified, Open-Source Language Agent for Multi-Step Reasoning Designs an open-source language agent for complex multi-step reasoning, outperforming prior agents and matching frontier models in mixed-tool reasoning tasks.
Read the paper

Leave a review!

Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!
