
šŸ¦øšŸ»#4: From Butler to J.A.R.V.I.S to Butler Again?

History of Jarvis and a deep dive into the recent π0 model from Physical Intelligence

Uff, this chapter won't be conventional, as we're setting aside the usual history of agents. Originally, we had that chapter prepared, but a couple of recent developments – Google's Project Jarvis and the π0 model from Physical Intelligence – gave us a fresh angle. So, today's agentic episode will dive into the iconic Just A Rather Very Intelligent System, aka J.A.R.V.I.S. – a brilliant (and positive!) example of Capable AI, but also… a butler.

In today's episode, we'll explore:

  • Oh, J.A.R.V.I.S. (while everyone forgot about H.O.M.E.R.)

  • Zuckerberg's Jarvis and Google's Project Jarvis

  • Getting back to being a butler: models for embodiment

  • Overview of π0, a versatile robotic model

Let's dive in!

The idea of J.A.R.V.I.S. – and the vision of a tech-driven superhero – deeply inspired my colleague Alyona, so much so that she chose to pursue engineering, focusing on aircraft control systems and autopilot programming. She wanted to learn the advanced tech needed to create intelligent systems like J.A.R.V.I.S. Alas, this was back in 2015, and personal assistant technology was far from ready to meet that dream.

J.A.R.V.I.S. made its cinematic debut in 2008's Iron Man, voiced by Paul Bettany, but if we rewind to the original Marvel comics, Jarvis wasn't AI at all. He was Edwin Jarvis, a loyal, very human butler who managed Tony Stark's home and kept the Avengers' mansion in order – something many of us would happily delegate to AI today.

Then, in 2008, Marvel decided to merge the idea of Edwin Jarvis with a lesser-known comic character: H.O.M.E.R. (Heuristically Operative Matrix Emulation Rostrum), an early, prototype-style digital assistant. While H.O.M.E.R. didn't have J.A.R.V.I.S.'s interactive flair, it laid the groundwork for Stark's evolving AI systems by handling data processing and basic operations. Practical but limited, H.O.M.E.R. was a task-oriented system that worked on strictly functional levels – without the learning and interaction that J.A.R.V.I.S. would later embody.

Basically, H.O.M.E.R. is what we have now.

J.A.R.V.I.S. is what we're still trying to build.

But we're also looking forward to AI becoming our butler, or simply a physical helper, right?

Zuckerberg's Jarvis and Google's Project Jarvis

Meta AI

Not that many people remember it, but in 2016, Facebook CEO Mark Zuckerberg embarked on a personal project to develop an AI assistant for his home, inspired by the same fictional J.A.R.V.I.S. from the Iron Man series. His AI, also named Jarvis, was designed to control various household functions, including lighting, temperature, music, and security systems. It could recognize guests at the door, manage appliances, and even entertain his daughter, Max. Zuckerberg used programming languages like Python, PHP, and Objective-C, employing AI techniques like natural language processing and facial recognition to build the system. He spent about 100 hours (!) building Jarvis that year. Today, with so many coding assistants, that time would be reduced to… a day? If you're still wondering whether AI can be helpful, that's the right comparison to keep in mind. Overall, his blog post is a fascinating read in retrospect.

Mark Zuckerberg's post from 2016

In 2023, that Jarvis became an inspiration and the basis for Meta AI agents.

Google's Project Jarvis

The most recent Jarvis-inspired project is Google's upcoming AI agent (also called Jarvis), designed to handle consumer tasks by interacting directly with a user's web browser, primarily Chrome. Unlike other digital AI agents that are more business- or code-oriented, Google's Jarvis simplifies everyday web-based tasks for consumers, such as booking flights, purchasing items, and managing returns. Powered by Google's Gemini LLM, Jarvis interprets browser screenshots to navigate pages and perform actions like clicking buttons and entering text. According to The Information, it currently operates at a slower pace as it carefully processes each action. Initial releases are expected to reach a limited group for testing its usability and privacy handling, particularly since it requires access to sensitive information to complete tasks.
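None of Jarvis's internals are public, but the loop described above – screenshot in, single UI action out – is easy to picture. Below is a minimal, hypothetical sketch in Python; every name here (browser, model, next_action) is invented for illustration and is not a Google API:

```python
# Hypothetical sketch of a screenshot-driven browser agent loop.
# The browser and model objects stand in for whatever real interfaces
# such an agent would use; nothing below is an actual Google API.
import time
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "click", "type", or "done"
    x: int = 0      # screen coordinates for clicks
    y: int = 0
    text: str = ""  # text to enter for "type" actions

def run_browser_agent(goal: str, browser, model, max_steps: int = 30) -> bool:
    """Repeatedly screenshot the page, ask a vision-language model for
    the next UI action, and execute it until the task is done."""
    for _ in range(max_steps):
        screenshot = browser.screenshot()             # raw pixels of the current page
        action = model.next_action(goal, screenshot)  # VLM picks exactly one action
        if action.kind == "done":
            return True
        if action.kind == "click":
            browser.click(action.x, action.y)
        elif action.kind == "type":
            browser.type_text(action.text)
        time.sleep(1.0)  # deliberate pacing between actions
    return False
```

The hard engineering lives inside next_action – grounding the model's intent in pixel coordinates – which is presumably why such agents currently act slowly and carefully, one step at a time.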

Getting Back to Being a Butler: Robot Models

With so many digital AI agents currently swarming the space, many people are beginning to wonder why AI handles all the creative work (writing poetry, drawing) but still can't fold laundry or wash dishes. That's why a tweet from Sergey Levine, co-founder of Physical Intelligence, caught my eye.


π0, a versatile robot model, is designed to take on a range of physical tasks using visual, language, and proprioceptive inputs, making it closer to what we might imagine as a real-world Jarvis, the actual butler.

What I found fascinating is that, unlike task-specific robots, π0 is developed as a foundation model with the versatility to handle varied, unstructured activities.

Overview of π0: A Vision-Language-Action Flow Model for General Robot Control

Image Credit: The original paper

Core Idea

The paper π0: A Vision-Language-Action Flow Model for General Robot Control (Black et al.) introduces π0 (pi-zero), a generalist robot model that integrates vision, language, and action capabilities for executing complex, multi-step tasks across various robotic platforms. Using a vision-language model (VLM) as a foundational backbone, π0 enables diverse robots – single-arm, dual-arm, and mobile manipulators – to perform dexterous tasks like folding laundry and building boxes.

Technical Approach

π0 is built on a pre-trained vision-language model and enhanced with a "flow matching" technique for action control, facilitating precise, real-time movements. Leveraging a cross-embodiment dataset, π0 is trained on over 10,000 hours of robot data from multiple platforms and environments. This comprehensive training enables π0 to generalize across varied tasks with robustness and adaptability.

(Editor's note: the "flow matching" technique is gaining popularity and promises interesting results. We will cover it in more depth in the future.)
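To make the idea concrete, here is a minimal, self-contained sketch of conditional flow matching for action generation, assuming PyTorch. This is the generic recipe, not π0's exact architecture: a small network learns the velocity field that transports Gaussian noise into actions, conditioned on an observation embedding (in π0, that conditioning would come from the VLM backbone). All dimensions and layer sizes below are made up for illustration:

```python
import torch
import torch.nn as nn

ACTION_DIM, OBS_DIM, HIDDEN = 32, 512, 256  # illustrative sizes only

class VelocityNet(nn.Module):
    """Predicts the velocity of the noise-to-action flow at time t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACTION_DIM + OBS_DIM + 1, HIDDEN), nn.GELU(),
            nn.Linear(HIDDEN, HIDDEN), nn.GELU(),
            nn.Linear(HIDDEN, ACTION_DIM),
        )

    def forward(self, x_t, t, obs):
        return self.net(torch.cat([x_t, obs, t], dim=-1))

def flow_matching_loss(model, actions, obs):
    """Regress the constant velocity (action - noise) along the straight
    path x_t = (1 - t) * noise + t * action, for random t in [0, 1]."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * noise + t * actions
    target = actions - noise
    return ((model(x_t, t, obs) - target) ** 2).mean()

@torch.no_grad()
def sample_actions(model, obs, steps: int = 10):
    """Integrate the learned ODE from noise to an action via Euler steps."""
    x = torch.randn(obs.shape[0], ACTION_DIM)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        x = x + dt * model(x, t, obs)
    return x
```

At inference, a handful of Euler integration steps turn noise into a concrete action, which is part of what makes the approach attractive for the high-frequency, dexterous control the paper targets.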

Pre-Training & Fine-Tuning Phases

π0's training is divided into two phases, with a minimal code sketch after the list:

  • Pre-Training: π0 learns generalized knowledge from a broad data pool, adapting to varied robot configurations and environmental conditions.

  • Post-Training: Fine-tuning focuses on specific high-quality task data, enhancing π0's ability to handle intricate, sequential operations like stacking items, setting tables, and responding to language instructions in real time.
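As promised above, here is a hedged sketch of that two-phase recipe, reusing VelocityNet and flow_matching_loss from the previous snippet. The loader names, learning rates, and step counts are illustrative, not values from the paper:

```python
import torch

def train(model, loader, lr: float, steps: int):
    """Generic training loop shared by both phases."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, (actions, obs) in zip(range(steps), loader):
        loss = flow_matching_loss(model, actions, obs)  # from the sketch above
        opt.zero_grad()
        loss.backward()
        opt.step()

model = VelocityNet()

# Phase 1: pre-train on the broad cross-embodiment mixture so the model
# absorbs general visuomotor knowledge (loader names are hypothetical).
# train(model, broad_mixture_loader, lr=1e-4, steps=1_000_000)

# Phase 2: post-train (fine-tune) on a smaller, curated set of high-quality
# demonstrations for a target task, typically with a lower learning rate.
# train(model, curated_task_loader, lr=1e-5, steps=50_000)
```

The design point is simply that both phases share one objective and one loop; what changes between them is the data distribution and, typically, the learning rate.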

Evaluation & Findings

π0 demonstrates strong zero-shot performance and improved accuracy on complex tasks after fine-tuning, outperforming previous models in dexterity and generalization. Tasks tested include both simple ones (e.g., item stacking) and complex, multi-stage ones (e.g., folding laundry and packing groceries). π0's architecture enables a significant leap in robotic versatility and adaptability, with substantial performance boosts due to VLM initialization and flow matching.

The study emphasizes the need for large, diverse datasets in pre-training and the importance of understanding data composition. Expanding to domains like autonomous driving and legged locomotion will challenge and enhance π0's generalist capabilities.

Key Contributions

The π0 model serves as a versatile robot policy, showcasing the potential for general-purpose robot foundation models capable of handling diverse and complex tasks. By integrating flow matching with a vision-language model (VLM), it enables precise, continuous action generation crucial for high-frequency, dexterous tasks. Extensive pre-training on large-scale, diverse data has been shown to significantly enhance the model's adaptability and robustness across various environments and scenarios, making it a promising foundation for real-world robotic control.

Conclusion

In this episode, we explored two significant advancements bringing us closer to realizing J.A.R.V.I.S. – not only as a highly capable digital assistant (AI agent) but also as an embodied butler that tackles mundane tasks and household chores. While much discussion centers around digital agents, sometimes what we truly need is someone to fold our laundry and wash our dishes. π0 marks a step toward universal robot foundation models capable of handling a wide range of real-world, dexterous tasks. Its framework combines foundational VLM principles with tailored, robot-specific training, offering insights for both the future of robot learning and practical applications across diverse industries. Ultimately, π0 brings us one step closer to a future where intelligent, adaptable robots seamlessly support our daily lives, both in the digital and physical realms. Alyona will finally be able to build her own Jarvis.
