🌁#79: Sora and World Models – Bringing magic to muggles
Spatial Intelligence just got a boost! Plus, a concise coverage of the remarkably rich week in ML research and innovations
It should be illegal to ship this many updates and releases so close to the holidays, but here we are – two weeks before Christmas, with our hands full of news and research papers (thank you, OpenAI’s 12 days of shipping and a booming NeurIPS, very much!). Let’s dive in: Sora, Genie 2 by Google DeepMind, and World Labs by Fei-Fei Li – it was truly a fascinating week. Be aware: there are a lot of videos in this newsletter!
But first, a reminder: we are piecing together expert views on the trajectory of ML & AI for 2025. Send your thoughts on what you believe 2025 will bring to [email protected] – or just reply to this email.
Many, many thanks to those who have already shared their views.
Now, to the week’s hottest topics: Sora, Genie 2 and World Labs
It’s not exactly trivial to get access to Sora, and there are a couple of issues:
Lack of communication from the team: For example, OpenAI announced that Sora is included with ChatGPT Plus/Pro – it wasn’t for us. And nobody from the team could immediately clarify that. That’s frustrating. We had to buy an additional subscription.
A lot of demand created by their professional “12 days of Shipmas” hype-making. To the point that Sam Altman had to say, “Signups will be disabled on and off, and generations will be slow for a while.”
And, if you are in Europe or the UK – you simply can’t get access to Sora.
But.
If and when you finally get your hands on it – Sora is pretty magnificent. It’s actually quite incredible. Once again, OpenAI beats everyone with an intuitive user experience, delivering sophisticated technology to every noob out there. In every sense of the phrase, bringing magic to muggles.
One thing Sora doesn’t allow, no matter how hard you try, is generating a realistic depiction of an actual person, even historical figures. (In the video above, I attempted to create Alan Turing, of course!) Considering that competing models are likely to support this soon, it’s a disadvantage – but an understandable one, given the current legal battles around copyrights OpenAI is involved in.
As noted in the presentation: if you’re expecting Sora to produce a feature film for you, that’s not going to happen. But consider how far we’ve come. Just two years ago, text-to-image generation was clumsy at best – ah, the nostalgia of extra fingers! Now, we have the ability to create entire video clips with intuitive storyboards, allowing you to turn text into video, incorporate your own images, and refine the result into something surprisingly polished.
And even if the laws of physics still suffer, the progress is enormous.
Now to the nerdy part: This exciting progress ties closely to the concept of spatial intelligence, which we use daily – whether it’s navigating a map, packing a suitcase, parking a car, or planning the steps of a complex recipe. Spatial intelligence aligns with the idea of “world models,” a term introduced by David Ha and Jürgen Schmidhuber in their 2018 paper World Models. Since then, the discussion and development have advanced considerably.
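The original World Models paper factors the agent into three parts: a vision model (V) that compresses each frame into a latent code, a memory model (M) that predicts how that latent evolves given an action, and a small controller (C) that picks actions from the latent and memory state. A minimal sketch of that control loop is below – the components are stubbed with random linear layers purely for illustration; the actual paper uses a VAE, an MDN-RNN, and a CMA-ES-trained controller:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, HIDDEN, ACTIONS = 32, 64, 3

# V: compress an observation (a flattened 64x64 frame) into a latent z.
W_v = rng.normal(scale=0.01, size=(LATENT, 64 * 64))
def vision(frame):
    return np.tanh(W_v @ frame)

# M: update a recurrent state from (z, action, previous state) -- a stand-in
# for the MDN-RNN that predicts the next latent in the original paper.
W_m = rng.normal(scale=0.01, size=(HIDDEN, LATENT + ACTIONS + HIDDEN))
def memory(z, action, h):
    a = np.eye(ACTIONS)[action]          # one-hot encode the action
    return np.tanh(W_m @ np.concatenate([z, a, h]))

# C: a small linear policy over the latent and the memory state.
W_c = rng.normal(scale=0.01, size=(ACTIONS, LATENT + HIDDEN))
def controller(z, h):
    return int(np.argmax(W_c @ np.concatenate([z, h])))

# One rollout step: observe -> encode -> act -> update the world model's memory.
h = np.zeros(HIDDEN)
frame = rng.random(64 * 64)
z = vision(frame)
action = controller(z, h)
h = memory(z, action, h)
```

The key design choice survives even in this toy form: the controller never sees raw pixels, only the compressed latent plus the memory's belief about the world – which is what lets the real system "dream" rollouts entirely inside M.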
Two World Models from last week
Google DeepMind introduced Genie 2, a large-scale foundation world model capable of generating diverse, action-controllable 3D environments from a single image or text prompt. Trained on extensive video datasets, Genie 2 can simulate various scenarios, including object interactions, character animations, and physical effects like gravity and lighting. Users can interact with these generated worlds in real-time using standard inputs such as a keyboard and mouse.
This development represents a significant advancement in the creation of adaptable training grounds for AI, enabling rapid prototyping of interactive experiences and providing diverse environments for training and evaluating embodied agents.
Similarly, World Labs, co-founded by AI pioneer Fei-Fei Li, unveiled an AI system that generates interactive 3D scenes from a single image. This system allows users to explore AI-generated scenes directly in a web browser, with the ability to move within the environment and interact with various elements. The technology adapts to different art styles and scenes, bringing the physics of real life into the virtual space.
World Labs' approach focuses on creating large world models to perceive, generate, and interact with the 3D world, aiming to democratize the creation of virtual spaces and make the process faster and more accessible.
Diving into Genie 2 or World Labs’ system, you’ll discover they’re nothing short of revolutionary. These systems take the foundational principles of World Models and push them into uncharted territory, evolving into rich, interactive 3D environments.
This leap – from task-specific applications to versatile, immersive systems – demonstrates the transformative power of world models. Spatial intelligence marks a fundamental shift, breaking free from the "flat" screen paradigm to embrace the three-dimensional way our minds are naturally wired to think, explore, and interact.
The possibilities are truly thrilling.
If you like Turing Post, consider becoming a paid subscriber or sharing this digest with a friend. It helps us keep Monday digests free →
Twitter library
Not a subscriber yet? Subscribe to receive our digests and articles:
AI in Practice – Rats welcome robot-rat
To add to that: Almost 10% Of South Korea's Workforce Is Now A Robot
We are reading – Intel on our mind (is it really dying?)
Rene Haas highlighted Intel's struggle between vertical integration and a fabless model, citing high costs and innovation challenges. He mentioned attempting to encourage Intel to license Arm technology and acknowledged the strategic benefits of vertical integration amid rumors of Arm's interest in acquiring parts of Intel.
Meanwhile, Ben Thompson argues that Intel’s decline stems from its inability to adapt to mobile and efficiency-first computing, allowing ARM and TSMC to dominate. He highlights missed opportunities, such as Intel’s refusal to embrace ARM manufacturing or prioritize power efficiency. While Pat Gelsinger’s foundry plan aimed to address these issues, it was too late to reverse Intel’s losses in AI and profitability. Thompson suggests that Intel’s revival hinges on government-backed AI initiatives, positioning it as a vital domestic foundry for U.S. technological sovereignty.
Semianalysis attributes Intel's decline to decades of leadership failures, poor board decisions, and a loss of cultural and technical leadership. Firing CEO Pat Gelsinger and prioritizing financial engineering over innovation worsened the situation. Intel's delays in advanced nodes allowed competitors like TSMC and AMD to dominate. ARM-based architectures and hyperscaler custom chips further erode its market. Intel Foundry Services is seen as its last chance for relevance, requiring massive investment and government support to secure U.S. semiconductor independence. The article advocates divesting non-core businesses and focusing on revitalizing the foundry as Intel's lifeline.
Top Research – System Cards, Tech reports and Surveys:
OpenAI o1 System Card → read it here
Here's the spiciest detail from the new o1 system card:
— Simon Willison (@simonw)
6:22 PM • Dec 5, 2024
From 01.ai – Yi-Lightning Technical Report → read it here
This technical report introduces O1-CODER, an attempt to replicate OpenAI’s o1 model with a focus on coding tasks → read the paper
Also, this:
Reading about scaling laws recently I came by the interesting point:
Focus on a balance between models' size and performance is more important than aiming for larger models. @Tsinghua_Uni and ModelBest Inc propose the idea of “capacity density” to measure how efficiently a model… x.com/i/web/status/1…
— Ksenia Se (@Kseniase_)
12:23 AM • Dec 9, 2024
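The intuition behind “capacity density” can be sketched in a few lines: fit a scaling law on reference models mapping parameter count to benchmark score, invert it to get the “effective” size a given score corresponds to, and divide by the model’s actual size. Everything numeric below is invented for illustration – the authors’ actual fitting procedure and constants differ:

```python
# Toy scaling law: score = A - B * N**(-ALPHA), with N in billions of
# parameters. Constants are illustrative, not fitted to real benchmarks.
A, B, ALPHA = 0.9, 0.8, 0.3

def predicted_score(n_params_b):
    """Score a reference model of this size 'should' reach under the law."""
    return A - B * n_params_b ** (-ALPHA)

def effective_params(score):
    """Invert the law: the reference size needed to reach this score."""
    return (B / (A - score)) ** (1 / ALPHA)

def capacity_density(score, actual_params_b):
    """Effective size over actual size; > 1 means unusually efficient."""
    return effective_params(score) / actual_params_b

# A 2B model that matches the predicted score of a 4B reference has
# capacity density ~2.
print(round(capacity_density(predicted_score(4.0), 2.0), 2))  # -> 2.0
```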
Models
As we continue to explore new post-training techniques, today we're releasing Llama 3.3 — a new open source model that delivers leading performance and quality across text-based use cases such as synthetic data generation at a fraction of the inference cost. x.com/i/web/status/1…
— AI at Meta (@AIatMeta)
5:01 PM • Dec 6, 2024
Efficient Track Anything from Meta AI builds on Segment Anything Model 2 (SAM 2), developing EfficientTAM for real-time video object tracking on resource-constrained devices with high accuracy and efficiency → read the paper
Amazon Nova – a family of foundation models for understanding and creative tasks, focusing on scalability, safety, multilingual support, and cost-efficiency → read the paper
PaliGemma 2 from Google DeepMind advances transfer learning with Vision-Language Models optimized for tasks like OCR, molecular structure recognition, and music score transcription → read the paper
NVILA by Nvidia reduces training and inference costs while maintaining high accuracy for tasks like medical imaging and robotic navigation → read the paper
You can find the rest of the curated research at the end of the newsletter.
News from The Usual Suspects ©
Google got a wow reaction from both Elon Musk and Sam Altman
We see Willow as an important step in our journey to build a useful quantum computer with practical applications in areas like drug discovery, fusion energy, battery design + more. Details here:
— Sundar Pichai (@sundarpichai)
5:06 PM • Dec 9, 2024
Hugging Face: Visualizing with style
Four new visualisations of the rise of open-source AI models in 2024 added!
- explore how tasks have been growing
- how likes connect models together
- the geography of model creators and followers
— Thomas Wolf (@Thom_Wolf)
8:22 AM • Dec 9, 2024
Microsoft is seeing the big picture
Microsoft's new Copilot Vision brings real-time insights to Edge browser for Pro users. Aimed at enterprise decision-makers, it turns data into actionable visuals with the click of a button. Microsoft continues weaving AI deeper into everyday workflows.
OpenAI levels up with ChatGPT Pro and Reinforcement Fine-Tuning Research Program
OpenAI introduces ChatGPT Pro, offering unlimited access to all models for $200/month, including the powerful o1 pro mode, and expands its RFT Program to enable developers and ML engineers to create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.

AWS Reinvents AI again
AWS drops the mic with cutting-edge AI updates at re:Invent 2024. Highlights include Multi-Agent Orchestration on Bedrock, the Nova AI Model Family, and Prompt Caching for big savings. Enterprises like Moody's are already reaping the benefits of AI-first workflows.

Salesforce measures AI’s pulse
Salesforce's Agentforce platform is delivering on its promise with soaring adoption KPIs. Enterprise AI agents are automating workflows, driving real ROI, and making humans feel slightly less indispensable.

Canada gets cooler with AI
Cohere and CoreWeave are teaming up to build a cutting-edge data center in Canada. The collaboration promises to accelerate AI research while keeping the Great White North on the innovation map.
More interesting research papers from last week
Vision-Language Model Enhancements
Discriminative Fine-tuning of LVLMs
Improve LVLMs by fine-tuning with contrastive and autoregressive losses, enhancing image-text discrimination and efficiency. Read the paper
Florence-VL
Enhance multimodal understanding using a generative vision encoder with depth-breadth fusion, excelling in OCR and visual tasks. Read the paper
VLsI
Optimize smaller vision-language models using verbalized intermediate layers for efficiency and improved task performance. Read the paper
Datasets for LLMs and Physics Simulations
FineWeb2
Hugging Face democratizes AI research with FineWeb2, a high-quality 15T token dataset for diverse pretraining needs. Read the paper
The Well
Support physics-informed machine learning with diverse, high-resolution numerical simulations across domains. Read the paper
Model Optimization and Fine-Tuning
Weighted-Reward Preference Optimization
Fuse capabilities of heterogeneous LLMs efficiently without requiring aligned vocabularies. Read the paper
TinyFusion
Reduce diffusion transformer size and costs with adaptive pruning and distillation methods. Read the paper
Aim
Optimize multi-modal inference by pruning and merging redundant tokens for efficiency. Read the paper
Sparse and Multilingual Training
Monet
Enable scalable and interpretable sparse mixture-of-expert models, specializing in language and domain knowledge. Read the paper
Marco-LLM
Boost multilingual performance, particularly for low-resource languages, using diverse, large-scale training. Read the paper
Task-Specific Innovations and Scaling
Establishing Task Scaling Laws
Predict task-specific LLM performance efficiently using compute-reduced "ladder models." Read the paper
Exploring Proportional Analogies
Assess LLM reasoning on analogies with targeted knowledge-enhanced prompting for improved accuracy. Read the paper
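The "ladder" idea above – train a series of cheap runs at increasing compute, fit a power law, and extrapolate to the target scale – can be sketched in a few lines. The constants and data points below are synthetic, invented for illustration; the paper fits task-specific curves, not a single loss law:

```python
import numpy as np

# Synthetic "ladder" of small runs following a hidden power law
# loss = a * C**(-b). The constants are invented for illustration.
true_a, true_b = 200.0, 0.1
compute = np.array([1e18, 1e19, 1e20])   # training FLOPs per rung
loss = true_a * compute ** (-true_b)     # observed loss at each rung

# Fit log10(loss) = intercept + slope * log10(C) by least squares.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

def predict_loss(target_compute):
    """Extrapolate the fitted law to a larger target run."""
    return 10 ** (intercept + slope * np.log10(target_compute))

# Predict the loss of a 100x larger run before paying for it.
print(round(float(predict_loss(1e22)), 2))  # -> 1.26
```

The point of the ladder is that the fit costs a tiny fraction of the target run's compute, so you can rank candidate recipes before committing to the expensive model.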
Multi-Agent and Collaborative Training
MALT
Improve LLM reasoning by assigning collaborative roles in multi-agent setups for better task outcomes. Read the paper
Free Process Rewards Without Process Labels
Train process reward models efficiently using outcome labels instead of intermediate annotations. Read the paper
RAG and OCR Challenges
OCR Hinders RAG
Analyze OCR-induced noise in retrieval-augmented generation and improve robustness using combined data inputs. Read the paper
Leave a review!
Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!