Topic 4: What is JEPA?

We discuss the Joint Embedding Predictive Architecture (JEPA), how it differs from Transformers, and provide a list of models based on JEPA.

Introduction

Current AI architectures like Transformers are powerful and have achieved impressive results, including generalization to previously unseen data and emergent abilities as the models are scaled. At the same time, they are still constrained compared to humans and animals, who don’t need to see millions of data points before drawing the right conclusions or learning new skills like speaking. Examples include crows that can solve puzzles as well as five-year-old children, orcas whose sophisticated hunting methods often involve brutal, coordinated attacks, and elephants’ cooperative abilities.

The Moravec paradox highlights that tasks that are difficult for humans, such as computation, are simple for computers because they can be described and modeled easily, whereas perception and sensory processing, which come naturally to humans, are challenging for machines to master. Simply scaling models and providing them with more data might not be a viable solution: some argue that this approach will not lead to qualitatively different results that would let AI models reach a new level of reasoning or world perception. Therefore, alternative methods must be explored to enable AI to attain human-level intelligence. Yann LeCun, one of the “Godfathers of AI,” insists that JEPA is the first step.

In today’s episode, we will cover:

  • What are the limitations of LLMs?

  • So what’s the potential solution?

  • How does JEPA work?

  • What can one build on JEPA?

  • I-JEPA – JEPA for Images

  • MC-JEPA - Multitasking JEPA

  • V-JEPA – JEPA for Video

  • Generalizing JEPA

  • Bonus: All resources in one place

Yann LeCun has always been rational about the latest models: he has exposed their limitations and educated the public that the fear of AGI and of AI taking over humanity is unreasonable. In February 2022, he proposed his vision of how AI could achieve human-level reasoning, with the Joint Embedding Predictive Architecture (JEPA) at its core. Let’s figure out what it is!

What are the limitations of LLMs?

Yann LeCun has also given several talks presenting his vision for objective-driven AI: talk 1 (March 28, 2024) and talk 2 (September 9, 2023). There, he extensively discussed the limitations of large language models (LLMs):

  • LLMs have no common sense: LLMs have limited knowledge of the underlying reality and make strange mistakes called “hallucinations.” This paper showed that LLMs are good at formal linguistic competence (knowledge of linguistic rules and patterns), while their performance on functional linguistic competence (understanding and using language in the world) remains unstable.

  • LLMs have no memory and can’t plan their answers: the PlanBench benchmark demonstrated this.

So what’s the potential solution?

To propose new ideas, it often helps to return to the roots and the fundamental disciplines. For the task of building intelligent AI, one needs to revisit cognitive science, psychology, and neuroscience alongside the engineering sciences. This is the strategy the creators of AI took in the 1960s. LeCun did the same in the 2020s and identified the key ingredients for success, which we discuss below.

World models

The fundamental part of LeCun’s vision is the concept of “world models,” which are internal representations of how the world functions. He argues that giving a model this kind of context about the world around it could improve its results.

“The idea that humans, animals, and intelligent systems use world models goes back many decades in psychology and in fields of engineering such as control and robotics.”

Yann LeCun

Self-supervised learning

Another important aspect is self-supervised learning (SSL), akin to how babies learn about the world by observing it. Models like GPT, BERT, LLaMA, and other foundation models are based on SSL and have changed the way we use machine learning.

Abstract representations

Apart from SSL, the model also needs to understand what should be captured by its sensors and what should not. In other words, the model needs to separate the relevant information from the irrelevant in each state it observes. The human eye, for example, is perfectly wired for that: what may seem to be a limitation in fact allows us to extract the essence.

The invisible gorilla study, published in 1999, is the most famous example of a phenomenon called “inattentional blindness”: when we pay close attention to one thing, we often fail to notice other things, even if they are obvious. This is just one example of how our eyes function; scientists have also shown that our eyes need some time to refocus, just like the camera on your smartphone.

Using this analogy, Yann LeCun proposed that a model should use abstract representations* of images rather than comparing raw pixels.

*Abstract representations simplify complex information into a form that is more manageable and meaningful for specific tasks or analyses. By focusing on the essential aspects and ignoring the less important details, these representations help systems (whether human or machine) to process information more efficiently and effectively.

Architecture – Objective-Driven AI

LeCun proposes a modular, configurable architecture for autonomous intelligence, emphasizing the development of self-supervised learning methods to enable AI to learn these world models without extensive labeled data.

Here’s a detailed view of the components of the system architecture for autonomous intelligence (a minimal code sketch of how these modules might fit together follows the list):

  • Configurator: Acts as the executive control center of the AI system by dynamically configuring other components of the system based on the specific task or context. For instance, it adjusts the parameters of the perception, world model, and actor modules to optimize performance for the given task.

  • Perception module: Captures and interprets sensory data from various sensors to estimate the current state of the world. This component is a basis for all higher-level processing and decision-making.

  • World model module: Predicts future states of the environment and fills in missing information. It acts as a simulator, using current and past data to forecast future conditions and possible scenarios. This component is the key for AI to perform hypothetical reasoning and planning, essential for navigating complex, dynamic environments.

  • Cost module: Evaluates the potential consequences of actions in terms of predefined costs associated with a given state or action. It has two submodules:

    • Intrinsic cost: Hard-wired, calculating immediate discomfort or risk

    • Critic: Trainable, estimating future costs based on current actions

  • Actor module: Decides and proposes specific actions based on the predictions and evaluations provided by other components of the architecture. It computes optimal action sequences that minimize the predicted costs, often using methods akin to those in optimal control theory.

  • Short-term memory: Keeps track of the immediate history of the system’s interactions with the environment. It stores recent data on the world state, actions taken, and the associated costs, allowing the system to reference this information in real-time decision-making.
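
To make this modular design more concrete, here is a minimal, hypothetical sketch of how these components might be wired together in code. The class and method names (Perception, WorldModel, CostModule, Actor, plan, and so on) are illustrative assumptions of ours, not an official implementation.

```python
# Hypothetical sketch of the objective-driven architecture.
# Module names and interfaces are illustrative assumptions, not an official API.
# A Configurator (not shown) would reconfigure these modules per task, and a
# short-term memory would store recent states, actions, and costs.

class Perception:
    def estimate_state(self, observation):
        # Turn raw sensory input into an estimate of the current world state.
        ...

class WorldModel:
    def predict(self, state, action):
        # Predict the next (abstract) world state given a candidate action.
        ...

class CostModule:
    def __init__(self, intrinsic_cost, critic):
        self.intrinsic_cost = intrinsic_cost  # hard-wired: immediate discomfort or risk
        self.critic = critic                  # trainable: estimate of future cost

    def evaluate(self, state):
        return self.intrinsic_cost(state) + self.critic(state)

class Actor:
    def plan(self, state, world_model, cost, candidate_action_sequences, horizon=5):
        # Pick the action sequence whose predicted rollout minimizes total cost,
        # in the spirit of optimal control / model-predictive control.
        best_actions, best_cost = None, float("inf")
        for actions in candidate_action_sequences:
            s, total = state, 0.0
            for a in actions[:horizon]:
                s = world_model.predict(s, a)
                total += cost.evaluate(s)
            if total < best_cost:
                best_actions, best_cost = actions, total
        return best_actions
```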

How does Joint Embedding Predictive Architecture (JEPA) work?

Joint Embedding Predictive Architecture (JEPA) is a central element in the pursuit of developing AI that can understand and interact with the world as humans do. It encapsulates the key elements we mentioned above. JEPA allows the system to handle uncertainty and ignore irrelevant details while maintaining essential information for making predictions.

It works based on these elements:

  • Inputs: JEPA takes pairs of related inputs. For example, in a video, x could be the current frame and y the next frame.

  • Encoders: They transform the inputs, x and y, into abstract representations (sx and sy) which capture only essential features of the inputs and omit irrelevant details.

  • Predictor module: It is trained to predict the abstract representation of the next frame, sy, based on the abstract representation of the current frame, sx.

JEPA handles uncertainty in predictions in one of two ways:

  • During the encoding phase, when the encoder drops irrelevant information. For example, the encoder checks which features of the input data are too uncertain or noisy and decides not to include these in the abstract representation.

  • After encoding, via the latent variable z, which represents information present in sy but not observable in sx. To handle uncertainty, z is varied across a predefined set of values, each representing a different hypothetical scenario or aspect of the future state y that might not be directly observable from x. By altering z, the predictor can simulate how changes in unseen factors could influence the upcoming state.

Interestingly, several JEPAs could be combined into a multistep/recurrent JEPA or stacked into a Hierarchical JEPA that could be used to perform predictions at several levels of abstraction and several time scales.
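
To make the data flow concrete, below is a minimal PyTorch-style sketch of a single JEPA training step. It uses simplifying assumptions of our own: tiny MLP encoders, a latent z sampled from a Gaussian and concatenated to the predictor input, and a plain L2 loss in representation space. It is not an official implementation; it only illustrates the encoder/predictor/latent-variable structure described above.

```python
import torch
import torch.nn as nn

dim, z_dim = 128, 16  # representation and latent sizes (arbitrary toy values)

# Two encoders map raw inputs to abstract representations sx and sy.
encoder_x = nn.Sequential(nn.Linear(784, dim), nn.ReLU(), nn.Linear(dim, dim))
encoder_y = nn.Sequential(nn.Linear(784, dim), nn.ReLU(), nn.Linear(dim, dim))
# The predictor maps (sx, z) to a guess of sy.
predictor = nn.Sequential(nn.Linear(dim + z_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

opt = torch.optim.Adam(list(encoder_x.parameters()) + list(predictor.parameters()), lr=1e-3)

def training_step(x, y):
    """x: current observation, y: next observation (e.g. consecutive video frames)."""
    sx = encoder_x(x)                    # abstract representation of x
    with torch.no_grad():
        sy = encoder_y(y)                # target representation of y (no gradient here)
    z = torch.randn(x.shape[0], z_dim)   # latent variable: factors of y not visible in x
    sy_hat = predictor(torch.cat([sx, z], dim=-1))
    loss = ((sy_hat - sy) ** 2).mean()   # prediction error in representation space, not pixels
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random "frames" flattened to 784-dim vectors:
x, y = torch.randn(32, 784), torch.randn(32, 784)
print(training_step(x, y))
```

Note that a real system also needs a mechanism to prevent representation collapse (everything mapping to the same embedding), for example an EMA-updated target encoder or a regularization term; this sketch omits those details.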

What can one build on JEPA?

Following the proposed JEPA architecture, Meta AI researchers, with Yann LeCun as a co-author, have published several specialized models. What are they?

Image-based Joint-Embedding Predictive Architecture (I-JEPA) – JEPA for Images

I-JEPA, proposed in June 2023, was the first model based on JEPA.

I-JEPA is a non-generative, self-supervised learning framework designed for processing images. It works by masking parts of the images and then trying to predict those masked parts:

  • Masking: The image is divided into numerous patches. Some of these patches, referred to as "target blocks," are masked (hidden) so that the model doesn’t have information about them

  • Context sampling: A portion of the image, called the "context block," is left unmasked. This part is used by the context encoder to understand the visible aspects of the image.

  • Prediction: The predictor then tries to predict the hidden parts (target blocks) based only on what it can see in the context block.

  • Iteration: This process repeats, updating the model’s parameters to reduce the difference between the predicted and actual representations of the target blocks.

I-JEPA consists of three parts, each of which is a Vision Transformer (ViT):

  • Context encoder: Processes parts of the image that are visible, known as the "context block"

  • Predictor: Uses the output from the context encoder to predict what the masked (hidden) parts of the image look like

  • Target encoder: Generates representations from the target blocks (hidden parts) that the model uses to learn and make predictions about hidden parts of the image.

The overall goal of I-JEPA is to train the predictor to accurately predict the representations of the hidden image parts from the visible context. This self-supervised learning process allows the model to learn powerful image representations without relying on explicit labels.
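
The following toy sketch illustrates the context/target split and the EMA-updated target encoder that I-JEPA uses. To keep it short, it replaces the Vision Transformers with simple linear layers, pools the context into a single vector, and picks the target patches by hand; these are simplifications of ours, not the paper's actual code.

```python
import copy
import torch
import torch.nn as nn

# Toy setup: an "image" is 16 patches of 48 dimensions each (a stand-in for ViT tokens).
num_patches, patch_dim, dim = 16, 48, 64

context_encoder = nn.Linear(patch_dim, dim)
target_encoder = copy.deepcopy(context_encoder)      # updated by EMA, not by gradients
for p in target_encoder.parameters():
    p.requires_grad = False
predictor = nn.Linear(dim, dim)
opt = torch.optim.Adam(list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def ijepa_step(patches, target_idx, ema=0.996):
    """patches: (batch, num_patches, patch_dim); target_idx: indices of masked target patches."""
    context_idx = [i for i in range(num_patches) if i not in target_idx]

    s_ctx = context_encoder(patches[:, context_idx])    # encode the visible context block
    with torch.no_grad():
        s_tgt = target_encoder(patches[:, target_idx])  # representations the model must predict

    # Predict target-patch representations from the pooled context representation.
    pred = predictor(s_ctx.mean(dim=1, keepdim=True)).expand_as(s_tgt)
    loss = ((pred - s_tgt) ** 2).mean()                 # compare in representation space

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Slowly move the target encoder toward the context encoder (exponential moving average).
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()

patches = torch.randn(8, num_patches, patch_dim)
print(ijepa_step(patches, target_idx=[3, 4, 5, 9]))
```

In the actual model, the predictor is also told which positions it should reconstruct; the pooled-context shortcut above is only there to keep the example small.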

MC-JEPA (Motion-Content Joint-Embedding Predictive Architecture) – Multitasking JEPA

MC-JEPA is another JEPA variation, designed to simultaneously interpret both the dynamic elements (motion) and the static details (content) of video data using a shared encoder. It was proposed just a month after I-JEPA, in July 2023.

The result is a more comprehensive and robust visual representation model that can be used in real-world computer vision applications like autonomous driving, video surveillance, and activity recognition.

Video-based Joint-Embedding Predictive Architecture (V-JEPA) – JEPA for Video

V-JEPA is designed to enhance AI's understanding of video content, a direction marked as important future work after the initial I-JEPA publication.

V-JEPA consists of two main components:

  • Encoder: Transforms input video frames into a high-dimensional space where similar features are closer together. The encoder captures essential visual cues from the video.

  • Predictor: Takes the encoded features of one part of the video and predicts the features of another part. This prediction is based on learning the temporal and spatial transformations within the video, aiding in understanding motion and changes over time.

V-JEPA's design allows it to learn from videos in a way that mimics some aspects of human learning – observing and predicting the visual world without needing explicit annotations. The model's ability to generalize from unsupervised video data to diverse visual tasks makes it a powerful tool for advancing how machines understand and interact with dynamic visual environments.
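
As a small illustration of that last point, here is a hypothetical sketch of reusing a frozen, pretrained video encoder for a downstream classification task through a lightweight probe. The encoder below is just a placeholder linear layer standing in for the real (much larger) video encoder, and the probe design (a single linear layer over mean-pooled clip features) is our own simplification.

```python
import torch
import torch.nn as nn

# Assumed toy shapes: clips of 8 frames, each flattened to 512 dimensions.
frames, frame_dim, feat_dim, num_classes = 8, 512, 256, 10

video_encoder = nn.Linear(frame_dim, feat_dim)   # placeholder for a pretrained video encoder
for p in video_encoder.parameters():
    p.requires_grad = False                      # features stay frozen downstream

probe = nn.Linear(feat_dim, num_classes)         # only this small head is trained
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(clips, labels):
    """clips: (batch, frames, frame_dim); labels: (batch,) class indices."""
    with torch.no_grad():
        feats = video_encoder(clips).mean(dim=1)  # pool per-frame features over time
    logits = probe(feats)
    loss = loss_fn(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

clips = torch.randn(4, frames, frame_dim)
labels = torch.randint(0, num_classes, (4,))
print(probe_step(clips, labels))
```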

Generalizing JEPA

The latest paper, published in March 2024, "Learning and Leveraging World Models in Visual Representation Learning," introduces the concept of Image World Models (IWM) and explores how the JEPA architecture can be generalized from masking to a broader set of corruptions, such as color jitter and blur applied to the input images.

The study explores two types of world models:

  • Invariant models: Recognize and maintain stable, unchanged features across different scenarios

  • Equivariant models: Adapt to changes in the input data, preserving the relationships and transformations that occur

The research found that by leveraging these world models, systems can more accurately predict and adapt to visual changes, resulting in more resilient and adaptable models. This method challenges traditional AI approaches and provides a new way to improve the effectiveness of machine learning models without requiring direct supervision.
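
To give a flavor of the equivariant idea, the sketch below conditions a predictor on the parameters of the applied corruption (here, a made-up two-number description such as blur strength and color-jitter amount) so that it can predict the representation of the clean image from the representation of the corrupted one. The encoder, the parameterization of the corruption, and the loss are simplified assumptions of ours rather than the paper's implementation.

```python
import torch
import torch.nn as nn

dim, aug_dim = 128, 2   # aug_dim: e.g. (blur strength, color-jitter amount), an assumed encoding

encoder = nn.Sequential(nn.Linear(784, dim), nn.ReLU(), nn.Linear(dim, dim))
# Equivariant world model: the predictor is told which corruption was applied.
predictor = nn.Sequential(nn.Linear(dim + aug_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def iwm_step(clean, corrupted, aug_params):
    """clean/corrupted: (batch, 784) flattened images; aug_params: (batch, aug_dim)."""
    with torch.no_grad():
        s_clean = encoder(clean)                  # target: representation of the clean image
    s_corr = encoder(corrupted)                   # representation of the corrupted image
    pred = predictor(torch.cat([s_corr, aug_params], dim=-1))
    loss = ((pred - s_clean) ** 2).mean()         # undo the corruption in representation space
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

clean = torch.randn(16, 784)
aug_params = torch.rand(16, aug_dim)
corrupted = clean + 0.1 * torch.randn_like(clean)  # toy stand-in for blur / color jitter
print(iwm_step(clean, corrupted, aug_params))
```

An invariant model, by contrast, would train the representations of the clean and corrupted images to match without any knowledge of which corruption was applied.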

Bonus: All resources in one place

Original models

Yann LeCun talks:

JEPA-inspired models

We have also compiled a list of related models inspired by the JEPA architecture, grouped by application domain:

Audio and Speech Applications
  1. A-JEPA: Focused on audio data using masked-modeling principles for improving contextual semantic understanding in audio and speech classification tasks.

  2. Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning: Analyzes masking strategies and sample durations in self-supervised audio representation learning.

Visual and Spatial Data Applications
  1. S-JEA: Enhances visual representation learning through hierarchical semantic representations in stacked joint embedding architectures.

  2. DMT-JEPA: Targets image modeling with a focus on local semantic understanding, applicable to classification, object detection, and segmentation.

  3. JEP-KD: Aligns visual speech recognition models with audio features, improving performance in visual speech recognition.

  4. Point-JEPA: Applied to point cloud data, enhancing efficiency and representation learning in spatial datasets.

  5. Signal-JEPA: Focuses on EEG signal processing, improving cross-dataset transfer and classification in EEG analysis.

Graph and Dynamic Data Applications
  1. Graph-JEPA: First joint-embedding architecture for graphs, using hyperbolic coordinate prediction for subgraph representation.

  2. ST-JEMA: Enhances learning of dynamic functional connectivity from fMRI data, focusing on high-level semantic representations.

Time-Series and Remote Sensing Applications
  1. LaT-PFN: Combines time-series forecasting with joint embedding architecture, leveraging related series for robust in-context learning.

  2. Time-Series JEPA: Optimizes remote control over limited-capacity networks through spatio-temporal correlations in sensor data.

  3. Predicting Gradient is Better: Utilizes self-supervised learning for SAR ATR, leveraging gradient features for automatic target recognition.

Evaluation and Methodological Studies
  1. LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures: Introduces a metric for evaluating representations in joint-embedding self-supervised learning architectures, focusing on linear probing performance.


Thank you for reading! Share this article with three friends and get a 1-month subscription free! 🤍
