🦸🏻#4: From Butler to J.A.R.V.I.S. to Butler Again?
A history of Jarvis and a deep dive into the recent π0 model from Physical Intelligence
Uff, this chapter won't be conventional, as we're setting aside the usual history of agents. Originally, we had that chapter prepared, but a couple of recent developments, Google's Project Jarvis and the π0 model from Physical Intelligence, gave us a fresh angle. So, today's agentic episode will dive into the iconic Just A Rather Very Intelligent System, aka J.A.R.V.I.S.: a brilliant (and positive!) example of Capable AI, but also… a butler.
In todayās episode, weāll explore:
Oh, J.A.R.V.I.S. (while everyone has forgotten about H.O.M.E.R.)
Zuckerbergās Jarvis and Googleās Project Jarvis
Getting back to being a butler: models for embodiment
Overview of π0, a versatile robotic model
Let's dive in!
The idea of J.A.R.V.I.S., and the vision of a tech-driven superhero, deeply inspired my colleague Alyona, so much so that she chose to pursue engineering, focusing on aircraft control systems and autopilot programming. She wanted to learn the advanced tech needed to create intelligent systems like J.A.R.V.I.S. Alas, this was back in 2015, and personal assistant technology was far from ready to meet that dream.
J.A.R.V.I.S. made its cinematic debut in 2008's Iron Man, voiced by Paul Bettany, but if we rewind to the original Marvel comics, Jarvis wasn't AI at all. He was Edwin Jarvis, a loyal, very human butler who managed Tony Stark's home and kept the Avengers' mansion in order, something many of us would happily delegate to AI today.
Then, in 2008, Marvel decided to merge the idea of Edwin Jarvis with a lesser-known comic character: H.O.M.E.R. (Heuristically Operative Matrix Emulation Rostrum), an early, prototype-style digital assistant. While H.O.M.E.R. didn't have J.A.R.V.I.S.'s interactive flair, it laid the groundwork for Stark's evolving AI systems by handling data processing and basic operations. Practical but limited, H.O.M.E.R. was a task-oriented system that worked on strictly functional levels, without the learning and interaction that J.A.R.V.I.S. would later embody.
Basically, H.O.M.E.R. is what we have now.
J.A.R.V.I.S. is what we're still trying to build.
But we're also looking forward to AI becoming our butler, or simply a physical helper, right?
Zuckerbergās Jarvis and Googleās Project Jarvis
Meta AI
Not that many people remember it, but in 2016, Facebook CEO Mark Zuckerberg embarked on a personal project to develop an AI assistant for his home, inspired by the same fictional J.A.R.V.I.S. from the Iron Man series. His AI, also named Jarvis, was designed to control various household functions, including lighting, temperature, music, and security systems. It could recognize guests at the door, manage appliances, and even entertain his daughter, Max. Zuckerberg used programming languages like Python, PHP, and Objective-C, employing AI techniques like natural language processing and facial recognition to build the system. He spent about 100 hours (!) building Jarvis that year. Today, with so many coding assistants, that time would be reduced to… a day? If you're still wondering whether AI can be helpful, that's the right comparison to keep in mind. Overall, his blog post is a fascinating read in retrospect.
Mark Zuckerberg's post from 2016
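Just for flavor, here's a toy sketch of the kind of command routing such a home assistant performs. Everything in it is hypothetical: Zuckerberg's actual system used custom NLP and his home's own device APIs, which were never released.

```python
# Toy command router in the spirit of Zuckerberg's Jarvis. The device
# handlers are hypothetical stand-ins; the real system talked to actual
# home-automation APIs and used custom NLP rather than keyword matching.

def set_lights(on: bool) -> str:
    # Stand-in for a real smart-lighting API call
    return f"lights {'on' if on else 'off'}"

def set_temperature(degrees: int) -> str:
    # Stand-in for a real thermostat API call
    return f"thermostat set to {degrees}"

def handle_command(text: str) -> str:
    """Route a plain-text command to the matching device handler."""
    text = text.lower()
    if "light" in text:
        return set_lights("on" in text.split())
    if "temperature" in text or "thermostat" in text:
        digits = [w for w in text.split() if w.isdigit()]
        return set_temperature(int(digits[0])) if digits else "what temperature?"
    return "sorry, I don't know that one yet"

print(handle_command("turn the lights on"))         # -> lights on
print(handle_command("set the temperature to 68"))  # -> thermostat set to 68
```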
In 2023, that Jarvis became an inspiration and basis for Meta AI agents.
Googleās Project Jarvis
The most recent Jarvis-inspired project is Google's upcoming AI agent (also called Jarvis), designed to handle consumer tasks by interacting directly with a user's web browser, primarily Chrome. Unlike other digital AI agents that are more business- or code-oriented, Google's Jarvis simplifies everyday web-based tasks for consumers, such as booking flights, purchasing items, and managing returns. Powered by Google's Gemini LLM, Jarvis interprets browser screenshots to navigate pages and perform actions like clicking buttons and entering text. According to The Information, it currently operates at a slower pace as it carefully processes each action. Initial releases are expected to reach a limited group for testing its usability and privacy handling, particularly since it requires access to sensitive information to complete tasks.
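To make that loop concrete, here is a minimal, hypothetical sketch of the screenshot-to-action cycle described above. The `query_vlm` function and its response format are assumptions (Google has not published Jarvis's internals); only the `pyautogui` calls come from a real library.

```python
# Hypothetical sketch of a screenshot -> model -> action loop like the one
# attributed to Project Jarvis. `query_vlm` is a placeholder, not a real API.
import io
import time
import pyautogui  # real library: screenshots, mouse, and keyboard control

def query_vlm(screenshot_png: bytes, goal: str) -> dict:
    """Placeholder: send the screenshot plus the user's goal to a
    vision-language model and get back one structured action, e.g.
    {"type": "click", "x": 640, "y": 480} or {"type": "done"}."""
    raise NotImplementedError("wire up a multimodal model here")

def run_agent(goal: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        shot = pyautogui.screenshot()  # capture the current browser view
        buf = io.BytesIO()
        shot.save(buf, format="PNG")
        action = query_vlm(buf.getvalue(), goal)
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.typewrite(action["text"], interval=0.05)
        elif action["type"] == "done":
            return
        time.sleep(1.0)  # deliberate pacing, as the article notes
```

A real agent would also need guardrails around sensitive actions (payments, logins), which is exactly the privacy concern the limited release is meant to test.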
Getting Back to Being a Butler: Robot Models
With so many digital AI agents currently swarming the space, many people are beginning to wonder why AI handles all the creative work (writing poetry, drawing) but still can't fold laundry or wash dishes. That's why a tweet from Sergey Levine caught my eye:
Really excited to share what I've been working on with my colleagues at Physical Intelligence! We've developed a prototype robotic foundation model that can fold laundry, assemble a box, bus a table, and many other things. We've written a paper and blog post about it. 🧵👇
Sergey Levine (@svlevine)
5:32 PM • Oct 31, 2024
π0, a versatile robot model, is designed to take on a range of physical tasks using visual and language inputs, making it closer to what we might imagine as a real-world Jarvis: the actual butler.
What I found fascinating is that, unlike task-specific robots, π0 is developed as a foundation model with the versatility to handle varied, unstructured activities.
Overview of π0: A Vision-Language-Action Flow Model for General Robot Control
Image Credit: The original paper
Core Idea
The paper "π0: A Vision-Language-Action Flow Model for General Robot Control" (Black et al.) introduces π0 (pi-zero), a generalist robot model that integrates vision, language, and action capabilities for executing complex, multi-step tasks across various robotic platforms. Using a vision-language model (VLM) as a foundational backbone, π0 enables diverse robots (single-arm, dual-arm, and mobile manipulators) to perform dexterous tasks like folding laundry and building boxes.
Technical Approach
π0 is built on a pre-trained vision-language model and enhanced with a "flow matching" technique for action control, facilitating precise, real-time movements. Leveraging a cross-embodiment dataset, π0 is trained on over 10,000 hours of robot data from multiple platforms and environments. This comprehensive training enables π0 to generalize across varied tasks with robustness and adaptability.
(editor's note: the "flow matching" technique is becoming more popular and promises interesting results. We will cover it in more depth in the future.)
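In the meantime, here is a minimal sketch of a conditional flow-matching loss in its common straight-line (rectified-flow) form. The tensor shapes, the `model` signature, and how the flow time `tau` is sampled are assumptions made for illustration; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

def flow_matching_loss(model: nn.Module,
                       obs_emb: torch.Tensor,   # (batch, emb_dim) from the VLM backbone
                       actions: torch.Tensor    # (batch, horizon, action_dim) action chunk
                       ) -> torch.Tensor:
    """Conditional flow matching along a straight noise-to-data path."""
    eps = torch.randn_like(actions)              # pure-noise endpoint of the path
    tau = torch.rand(actions.shape[0], 1, 1)     # random flow time in [0, 1)
    noisy = (1 - tau) * eps + tau * actions      # point on the noise -> data line
    target = actions - eps                       # constant velocity of that line
    pred = model(noisy, tau.flatten(), obs_emb)  # model predicts the velocity
    return ((pred - target) ** 2).mean()
```

At inference time, the learned velocity field is integrated from pure noise toward an action chunk in a handful of Euler steps, which is what makes precise, high-rate control feasible.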
Pre-Training & Fine-Tuning Phases
π0's training is divided into two phases (see the sketch after this list):
Pre-Training: π0 learns generalized knowledge from a broad data pool, adapting to varied robot configurations and environmental conditions.
Post-Training: Fine-tuning focuses on specific high-quality task data, enhancing π0's ability to handle intricate, sequential operations like stacking items, setting tables, and responding to language instructions in real time.
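As a schematic of that two-phase recipe (reusing the `flow_matching_loss` sketch above): the datasets, step counts, and learning rates below are illustrative placeholders, not the paper's settings.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, steps: int, lr: float) -> None:
    """One training phase: run `steps` gradient updates on (obs_emb, actions) batches."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    done = 0
    while done < steps:
        for obs_emb, actions in DataLoader(dataset, batch_size=32, shuffle=True):
            loss = flow_matching_loss(model, obs_emb, actions)  # sketch from above
            opt.zero_grad()
            loss.backward()
            opt.step()
            done += 1
            if done >= steps:
                return

# Illustrative usage (model and datasets are placeholders):
# train(model, broad_cross_embodiment_data, steps=200_000, lr=1e-4)  # pre-training
# train(model, curated_task_data, steps=20_000, lr=1e-5)             # post-training
```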
Evaluation & Findings
π0 demonstrates strong zero-shot performance and improved accuracy on complex tasks after fine-tuning, outperforming previous models in dexterity and generalization. Tasks tested include both simple (e.g., item stacking) and complex, multi-stage tasks (e.g., folding laundry and packing groceries). π0's architecture enables a significant leap in robotic versatility and adaptability, with substantial performance boosts due to VLM initialization and flow matching.
The study emphasizes the need for large, diverse datasets in pre-training and the importance of understanding data composition. Expanding to domains like autonomous driving and legged locomotion will challenge and enhance π0's generalist capabilities.
Key Contributions
The π0 model serves as a versatile robot policy, showcasing the potential of general-purpose robot foundation models to handle diverse and complex tasks. By integrating flow matching with a vision-language model (VLM), it enables the precise, continuous action generation crucial for high-frequency, dexterous tasks. Extensive pre-training on large-scale, diverse data has been shown to significantly enhance the model's adaptability and robustness across varied environments and scenarios, making it a promising foundation for real-world robotic control.
Conclusion
In this episode, we explored two significant advancements bringing us closer to realizing J.A.R.V.I.S.: not only as a highly capable digital assistant (AI agent) but also as an embodied butler that tackles mundane tasks and household chores. While much discussion centers around digital agents, sometimes what we truly need is someone to fold our laundry and wash our dishes. π0 marks a step toward universal robot foundation models capable of handling a wide range of real-world, dexterous tasks. Its framework combines foundational VLM principles with tailored, robot-specific training, offering insights for both the future of robot learning and practical applications across diverse industries. Ultimately, π0 brings us one step closer to a future where intelligent, adaptable robots seamlessly support our daily lives, both in the digital and physical realms. Alyona will finally be able to build her own Jarvis.
How did you like it?
Thank you for reading! Share this article with three friends and get a 1-month subscription free!