🌁#79: Sora and World Models – Bringing magic to muggles

Spatial Intelligence just got a boost! Plus, a concise coverage of the remarkably rich week in ML research and innovations

It should be illegal to ship this many updates and releases so close to the holidays, but here we are – two weeks before Christmas, with our hands full of news and research papers (thank you very much, OpenAI’s 12 days of shipping and a booming NeurIPS!). Let’s dive in: Sora, Genie 2 by Google DeepMind, and World Labs by Fei-Fei Li – it was truly a fascinating week. Be aware: this newsletter contains a lot of videos!

But first, a reminder: we are piecing together expert views on the trajectory of ML & AI for 2025. Send your thoughts on what you believe 2025 will bring to [email protected] – or just reply to this email.

Many, many thanks to those who have already shared their views.

Now, to the week’s hottest topics: Sora, Genie 2 and World Labs

It’s not exactly trivial to get access to Sora, and there are a couple of issues:

  • Lack of communication from the team: For example, OpenAI announced that Sora is included with ChatGPT Plus/Pro – it wasn’t for us. And nobody from the team could immediately clarify that. That’s frustrating. We had to buy an additional subscription.

  • A lot of demand created by their professional “12 days of Shipmas” hype-making. To the point that Sam Altman had to say, “Signups will be disabled on and off, and generations will be slow for a while.”

  • And, if you are in Europe or the UK – you simply can’t get access to Sora.

But.

If and when you finally get your hands on it – Sora is pretty magnificent. It’s actually quite incredible. Once again, OpenAI beats everyone on intuitive user experience, delivering sophisticated technology to every noob out there – in every sense, bringing magic to muggles.

One thing Sora doesn’t allow, no matter how hard you try, is generating a realistic depiction of an actual person, even historical figures. (In the video above, I attempted to create Alan Turing, of course!) Considering that competing models are likely to support this soon, it’s a disadvantage – but an understandable one, given the current legal battles around copyrights OpenAI is involved in.

As noted in the presentation: if you’re expecting Sora to produce a feature film for you, that’s not going to happen. But consider how far we’ve come. Just two years ago, text-to-image generation was clumsy at best – ah, the nostalgia of extra fingers! Now, we have the ability to create entire video clips with intuitive storyboards, allowing you to turn text into video, incorporate your own images, and refine the result into something surprisingly polished.

And even if the laws of physics are still suffering, the progress is enormous.

Now to the nerdy part: This exciting progress ties closely to the concept of spatial intelligence, which we use daily – whether it’s navigating a map, packing a suitcase, parking a car, or planning the steps of a complex recipe. Spatial intelligence aligns with the idea of “world models,” a term introduced by David Ha and Jürgen Schmidhuber in their 2018 paper World Models. Since then, the discussion and development have advanced considerably.
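The recipe from that 2018 paper is surprisingly compact: a vision component (V) compresses each observed frame into a small latent vector, a memory component (M) learns to predict the next latent given the current latent and an action, and a tiny controller (C) picks actions from the latent plus the memory’s hidden state – letting the agent “dream” rollouts entirely inside its own model. Here is a toy NumPy sketch of that V–M–C loop (randomly initialized weights, purely illustrative – all sizes and names are our own assumptions, not the paper’s actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, HIDDEN, ACTIONS, FRAME = 8, 16, 3, 64  # toy dimensions, chosen arbitrarily

# V: compress an observed frame into a latent vector z (stand-in for a VAE encoder)
W_enc = rng.normal(0, 0.1, (LATENT, FRAME))
def encode(frame):
    return np.tanh(W_enc @ frame)

# M: recurrent model predicting the next latent from (hidden state, z, action)
W_h = rng.normal(0, 0.1, (HIDDEN, HIDDEN + LATENT + ACTIONS))
W_pred = rng.normal(0, 0.1, (LATENT, HIDDEN))
def memory_step(z, action_onehot, h):
    h_new = np.tanh(W_h @ np.concatenate([h, z, action_onehot]))
    return h_new, W_pred @ h_new  # next hidden state, predicted next latent

# C: a tiny linear controller choosing an action from [z, h]
W_c = rng.normal(0, 0.1, (ACTIONS, LATENT + HIDDEN))
def act(z, h):
    return int(np.argmax(W_c @ np.concatenate([z, h])))

# Roll the model forward ("dreaming"): after one real observation,
# the agent plans entirely inside its own predicted latents.
h = np.zeros(HIDDEN)
z = encode(rng.normal(size=FRAME))  # a single real frame
for _ in range(5):
    a = act(z, h)
    onehot = np.eye(ACTIONS)[a]
    h, z = memory_step(z, onehot, h)  # imagined next state, no environment needed
```

Systems like Genie 2 scale this same idea up enormously – richer encoders, far larger sequence models, real user inputs instead of a toy controller – but the V–M–C decomposition is the conceptual ancestor.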

Two World Models from last week

Google DeepMind introduced Genie 2, a large-scale foundation world model capable of generating diverse, action-controllable 3D environments from a single image or text prompt. Trained on extensive video datasets, Genie 2 can simulate various scenarios, including object interactions, character animations, and physical effects like gravity and lighting. Users can interact with these generated worlds in real-time using standard inputs such as a keyboard and mouse.

This development represents a significant advancement in the creation of adaptable training grounds for AI, enabling rapid prototyping of interactive experiences and providing diverse environments for training and evaluating embodied agents.

Similarly, World Labs, co-founded by AI pioneer Fei-Fei Li, unveiled an AI system that generates interactive 3D scenes from a single image. This system allows users to explore AI-generated scenes directly in a web browser, with the ability to move within the environment and interact with various elements. The technology adapts to different art styles and scenes, bringing the physics of real life into the virtual space.

World Labs' approach focuses on creating large world models to perceive, generate, and interact with the 3D world, aiming to democratize the creation of virtual spaces and make the process faster and more accessible.

Diving into Genie 2 or World Labs’ system, you’ll discover they’re nothing short of revolutionary. These systems take the foundational principles of World Models and push them into uncharted territory, evolving into rich, interactive 3D environments.

This leap – from task-specific applications to versatile, immersive systems – demonstrates the transformative power of world models. Spatial intelligence marks a fundamental shift, breaking free from the "flat" screen paradigm to embrace the three-dimensional way our minds are naturally wired to think, explore, and interact.

The possibilities are truly thrilling.


AI in Practice – Rats welcome robot-rat

We are reading – Intel on our mind (is it really dying?)

  • Rene Haas highlighted Intel's struggle between vertical integration and a fabless model, citing high costs and innovation challenges. He mentioned attempting to encourage Intel to license Arm technology and acknowledged the strategic benefits of vertical integration amid rumors of Arm's interest in acquiring parts of Intel. 

  • Meanwhile, Ben Thompson argues that Intel’s decline stems from its inability to adapt to mobile and efficiency-first computing, allowing Arm and TSMC to dominate. He highlights missed opportunities, such as Intel’s refusal to embrace Arm manufacturing or prioritize power efficiency. While Pat Gelsinger’s foundry plan aimed to address these issues, it was too late to reverse Intel’s losses in AI and profitability. Thompson suggests that Intel’s revival hinges on government-backed AI initiatives, positioning it as a vital domestic foundry for U.S. technological sovereignty.

  • Semianalysis attributes Intel's decline to decades of leadership failures, poor board decisions, and a loss of cultural and technical leadership. Firing CEO Pat Gelsinger and prioritizing financial engineering over innovation worsened the situation. Intel's delays in advanced nodes allowed competitors like TSMC and AMD to dominate. Arm-based architectures and hyperscaler custom chips further erode its market. Intel Foundry Services is seen as its last chance for relevance, requiring massive investment and government support to secure U.S. semiconductor independence. The article advocates divesting non-core businesses and focusing on revitalizing the foundry as Intel's lifeline.

Top Research – System Cards, Tech Reports, and Surveys:

  • From 01.ai – Yi-Lightning Technical Report →read it here

  • This technical report introduces O1-CODER, an attempt to replicate OpenAI’s o1 model with a focus on coding tasks →read the paper


Models

  • Efficient Track Anything from Meta AI builds on Segment Anything Model 2 (SAM 2), developing EfficientTAM for real-time video object tracking on resource-constrained devices with high accuracy and efficiency →read the paper

  • Amazon Nova Foundation Models for understanding and creative tasks, focusing on scalability, safety, multilingual support, and cost-efficiency →read the paper

  • PaliGemma 2 from Google DeepMind advances transfer learning with Vision-Language Models optimized for tasks like OCR, molecular structure recognition, and music score transcription →read the paper

  • NVILA by Nvidia reduces training and inference costs while maintaining high accuracy for tasks like medical imaging and robotic navigation →read the paper

You can find the rest of the curated research at the end of the newsletter.

News from The Usual Suspects ©

  • Google got a wow reaction from both Elon Musk and Sam Altman

  • Hugging Face: Visualizing with style

  • Microsoft is seeing the big picture

    Microsoft's new Copilot Vision brings real-time insights to the Edge browser for Pro users. Aimed at enterprise decision-makers, it turns data into actionable visuals with the click of a button. Microsoft continues weaving AI deeper into everyday workflows.

  • OpenAI levels up with ChatGPT Pro and Reinforcement Fine-Tuning Research Program
    OpenAI introduces ChatGPT Pro, offering unlimited access to all models for $200/month, including the powerful o1 pro mode. It also expanded its Reinforcement Fine-Tuning (RFT) Research Program, enabling developers and ML engineers to create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.

  • AWS Reinvents AI again
    AWS drops the mic with cutting-edge AI updates at re:Invent 2024. Highlights include Multi-Agent Orchestration on Bedrock, the Nova AI Model Family, and Prompt Caching for big savings. Enterprises like Moody's are already reaping the benefits of AI-first workflows.

  • Salesforce measures AI’s pulse
    Salesforce's Agentforce platform is delivering on its promise with soaring adoption KPIs. Enterprise AI agents are automating workflows, driving real ROI, and making humans feel slightly less indispensable.

  • Canada gets cooler with AI
    Cohere and CoreWeave are teaming up to build a cutting-edge data center in Canada. The collaboration promises to accelerate AI research while keeping the Great White North on the innovation map.

More interesting research papers from last week

Vision-Language Model Enhancements

  • Discriminative Fine-tuning of LVLMs
    Improve LVLMs by fine-tuning with contrastive and autoregressive losses, enhancing image-text discrimination and efficiency. Read the paper

  • Florence-VL
    Enhance multimodal understanding using a generative vision encoder with depth-breadth fusion, excelling in OCR and visual tasks. Read the paper

  • VLsI
    Optimize smaller vision-language models using verbalized intermediate layers for efficiency and improved task performance. Read the paper

Datasets for LLMs and Physics Simulations

  • FineWeb2
    Hugging Face democratizes AI research with FineWeb2, a high-quality 15T token dataset for diverse pretraining needs. Read the paper

  • The Well
    Support physics-informed machine learning with diverse, high-resolution numerical simulations across domains. Read the paper

Model Optimization and Fine-Tuning

  • Weighted-Reward Preference Optimization
    Fuse capabilities of heterogeneous LLMs efficiently without requiring aligned vocabularies. Read the paper

  • TinyFusion
    Reduce diffusion transformer size and costs with adaptive pruning and distillation methods. Read the paper

  • Aim
    Optimize multi-modal inference by pruning and merging redundant tokens for efficiency. Read the paper

Sparse and Multilingual Training

  • Monet
    Enable scalable and interpretable sparse mixture-of-expert models, specializing in language and domain knowledge. Read the paper

  • Marco-LLM
    Boost multilingual performance, particularly for low-resource languages, using diverse, large-scale training. Read the paper

Task-Specific Innovations and Scaling

  • Establishing Task Scaling Laws
    Predict task-specific LLM performance efficiently using compute-reduced "ladder models." Read the paper

  • Exploring Proportional Analogies
    Assess LLM reasoning on analogies with targeted knowledge-enhanced prompting for improved accuracy. Read the paper

Multi-Agent and Collaborative Training

  • MALT
    Improve LLM reasoning by assigning collaborative roles in multi-agent setups for better task outcomes. Read the paper

  • Free Process Rewards Without Process Labels
    Train process reward models efficiently using outcome labels instead of intermediate annotations. Read the paper

RAG and OCR Challenges

  • OCR Hinders RAG
    Analyze OCR-induced noise in retrieval-augmented generation and improve robustness using combined data inputs. Read the paper


Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!
