TL;DR: From I-JEPA to ThinkJEPA, the JEPA family has evolved from image representation learning into video, audio, 3D, robotics, causal reasoning, and world modeling. This map traces 14 major JEPA milestones and shows how the architecture is moving toward predictive, embodied AI.

This article is a map of that evolution: which models appeared, what each one added, and how the architecture moved from static perception toward predictive AI systems.

Recent papers such as V-JEPA 2.1, LeWorldModel, and ThinkJEPA make this trajectory much clearer. They show that JEPA is no longer one model or one research direction. It has become a growing family of approaches for learning abstract representations, predicting future states, and supporting planning in more dynamic environments.

This is why we reconstructed the full timeline. Instead of explaining JEPA from scratch, this guide follows its model family: the early image-based breakthroughs, the move into video and audio, the expansion into 3D and point clouds, and the newer branches focused on action, causality, robotics, and world models.

If you are new to the basics, JEPA stands for Joint Embedding Predictive Architecture, a self-supervised learning framework proposed by Yann LeCun. It learns by predicting target embeddings of masked or future inputs in latent space, rather than reconstructing raw pixels, tokens, or signals. For a full beginner explanation, start with our basic JEPA guide.
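To make the idea of latent-space prediction concrete, here is a minimal sketch of a JEPA-style loss. It is an illustration under loud assumptions, not the implementation from any JEPA paper: random linear maps stand in for the context encoder, target encoder, and predictor, and the names `jepa_loss`, `matvec`, and `rand_matrix` are hypothetical. The point to notice is that the error is measured between embeddings, and the masked patches themselves are never reconstructed.

```python
import random

def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def rand_matrix(rows, cols, rng):
    """Random weights standing in for a trained network (illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def jepa_loss(patches, masked_idx, context_enc, target_enc, predictor):
    """Average squared error between predicted and target embeddings."""
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    encoded = [matvec(context_enc, p) for p in visible]
    dim = len(encoded[0])
    # Toy pooling: average the visible-patch embeddings into one context vector.
    context = [sum(e[d] for e in encoded) / len(encoded) for d in range(dim)]
    loss = 0.0
    for i in masked_idx:
        # Target embedding of the hidden patch; in practice no gradient flows here.
        target = matvec(target_enc, patches[i])
        # The predictor sees the context plus the position it must predict.
        pred = matvec(predictor, context + [float(i)])
        loss += sum((a - b) ** 2 for a, b in zip(pred, target)) / dim
    return loss / len(masked_idx)

rng = random.Random(0)
patches = [[rng.random() for _ in range(4)] for _ in range(6)]  # 6 patches, 4 values each
context_enc = rand_matrix(3, 4, rng)  # patch (4) -> embedding (3)
target_enc = rand_matrix(3, 4, rng)   # patch (4) -> embedding (3)
predictor = rand_matrix(3, 4, rng)    # context (3) + position (1) -> embedding (3)
print(jepa_loss(patches, [1, 4], context_enc, target_enc, predictor))
```

A real model would replace the linear maps with deep networks and train the context encoder and predictor to drive this loss down.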

Here, the focus is different: the evolution of JEPA models and the typology of the JEPA family. So let’s walk through 14 of the most important milestones and see how the field moved from representation learning toward world modeling.

The JEPA model family: from representation learning to world models

JEPA / H-JEPA

This is the conceptual root of the family. Yann LeCun’s framework defines JEPA as prediction in representation space, and H-JEPA adds the idea of hierarchical, multi-timescale world modeling and planning.

I-JEPA

The first major concrete success. I-JEPA showed that JEPA could learn semantic image representations without hand-crafted augmentations, and that the approach scaled well with Vision Transformers and large datasets. This is the point where JEPA became a serious practical recipe, competing with masked modeling and contrastive SSL approaches.

MC-JEPA

MC-JEPA is more of an exploratory step than a core milestone. It attempts to jointly learn motion and content features in a shared encoder, helping illustrate early efforts to extend JEPA from static images toward dynamic understanding.

V-JEPA

One of the central pillars of the JEPA story: the leap from image-based to video-based latent prediction. V-JEPA showed that predictive feature learning can scale to large video datasets and learn strong motion and appearance representations without relying on reconstruction or contrastive objectives.

Audio-JEPA

Audio-JEPA showed that JEPA is not only for vision but can generalize across modalities. It extends the approach to audio spectrograms using latent prediction and time-frequency-aware masking, showing strong performance on audio and speech tasks.
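As a rough illustration of what masking along both spectrogram axes can look like, the sketch below hides one contiguous frequency band and one contiguous time band on a grid of spectrogram patches. This is only one plausible reading of "time-frequency-aware masking", not the exact strategy from the Audio-JEPA paper. The function name `time_frequency_mask` and the band widths are assumptions made for the example.

```python
import random

def time_frequency_mask(n_freq, n_time, freq_width, time_width, rng):
    """Return a grid where True marks a masked spectrogram patch."""
    f0 = rng.randrange(n_freq - freq_width + 1)  # first masked frequency row
    t0 = rng.randrange(n_time - time_width + 1)  # first masked time column
    mask = [[False] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            in_freq_band = f0 <= f < f0 + freq_width
            in_time_band = t0 <= t < t0 + time_width
            mask[f][t] = in_freq_band or in_time_band
    return mask

# 8 frequency bins by 10 time frames; '#' marks masked patches.
mask = time_frequency_mask(8, 10, freq_width=2, time_width=3, rng=random.Random(0))
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

The masked patches would serve as prediction targets in latent space, while the unmasked ones form the context.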

Point-JEPA

Point-JEPA is one of the key 3D branches. It adapts JEPA specifically to point cloud data, avoids raw-space reconstruction, and shows that JEPA can work efficiently on geometric representations.

3D-JEPA

3D-JEPA broadens the 3D story beyond point clouds into more general 3D representation learning, treating JEPA as a framework for 3D semantics as a whole.

ACT-JEPA

The clearest bridge from JEPA to action and policy learning. ACT-JEPA jointly predicts action sequences and latent observation sequences, showing improved world-model quality and better task performance. This is where JEPA starts to look like a full control architecture.
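The joint objective can be pictured as two error terms added together, one for actions and one for latent observations. The sketch below assumes a mean-squared-error form and a hypothetical `latent_weight` parameter. Neither is taken from the ACT-JEPA paper, so treat it as an illustration of joint prediction rather than the published loss.

```python
def mean_squared_error(pred, target):
    """Average squared difference between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def joint_loss(pred_actions, true_actions, pred_latents, target_latents,
               latent_weight=1.0):
    """Combine an action-prediction term with a latent-prediction term."""
    action_term = mean_squared_error(pred_actions, true_actions)
    latent_term = mean_squared_error(pred_latents, target_latents)
    # latent_weight (hypothetical) trades off the two objectives.
    return action_term + latent_weight * latent_term

print(joint_loss([1.0, 2.0], [1.0, 0.0], [0.0], [1.0], latent_weight=0.5))
```

Training on both terms at once pushes a shared representation to be useful for acting and for anticipating what will be observed next.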

V-JEPA 2

V-JEPA 2 is the point where JEPA becomes an explicit world model for understanding, prediction, and planning. It demonstrates zero-shot robotic planning with visual subgoals in unseen environments.
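Planning toward a visual subgoal can be sketched as a search over action sequences in latent space: roll each candidate sequence through a predictive model, then keep the one whose predicted final state lands closest to the embedding of the goal. The example below is a toy under stated assumptions. `predict_next` is a made-up additive dynamics function standing in for a learned predictor, and the search is plain random shooting chosen for brevity, not the planner used in the V-JEPA 2 work.

```python
import random

def predict_next(latent, action):
    """Hypothetical latent dynamics: the action shifts the latent state."""
    return [z + a for z, a in zip(latent, action)]

def rollout(latent, actions):
    """Apply a sequence of actions and return the final predicted latent."""
    for action in actions:
        latent = predict_next(latent, action)
    return latent

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def plan(start, goal, horizon, n_candidates, rng):
    """Random shooting: sample action sequences, keep the one ending nearest the goal."""
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [[rng.uniform(-1.0, 1.0) for _ in start] for _ in range(horizon)]
        cost = l1_distance(rollout(start, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start = [0.0, 0.0]   # embedding of the current observation (toy values)
goal = [1.5, -0.5]   # embedding of the subgoal image (toy values)
actions, cost = plan(start, goal, horizon=3, n_candidates=200, rng=random.Random(0))
print(len(actions), cost)
```

Because the cost is computed entirely between embeddings, the planner never needs to render or reconstruct future frames.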

LeJEPA

This is the theory-and-training cleanup layer. LeJEPA introduces a simpler and more stable objective (SIGReg), argues for isotropic embedding structure, removes many heuristics, and emphasizes scalability and efficiency. It helps make JEPA more principled and easier to train.

Causal-JEPA

A conceptual extension that pushes JEPA toward object-centric and causal reasoning. By introducing object-level masking, it encourages learning more structured and causally meaningful representations, with improvements in reasoning and planning efficiency.

V-JEPA 2.1

If V-JEPA 2 is the world-model milestone, this is the representation-quality upgrade. V-JEPA 2.1 extends the V-JEPA 2 line with dense predictive losses, improved self-supervision, and better feature quality across images and videos, while also improving robotics and dense understanding benchmarks.

LeWorldModel

LeWorldModel presents a clean, end-to-end JEPA-style world model trained from raw pixels with a minimal objective, reducing training complexity and enabling faster planning than heavier foundation-model-based pipelines.

ThinkJEPA

ThinkJEPA represents a forward-looking direction: combining JEPA world models with a semantic “thinking” pathway derived from vision-language models. It targets long-horizon reasoning and planning, going beyond local prediction.


FAQ

What is the difference between JEPA and VL-JEPA?

JEPA is the broader family of architectures for learning predictive representations in latent space. VL-JEPA is a vision-language version of that idea, designed to connect visual inputs with language-related representations. In simple terms, JEPA is the general framework; VL-JEPA applies the framework to vision-language learning.

What is the difference between JEPA and V-JEPA?

JEPA is the general framework for predicting embeddings of missing or future inputs. V-JEPA is a video-focused version of JEPA that learns from video data and predicts visual representations over time. V-JEPA is especially important because it moves JEPA from static perception toward dynamic world modeling.

Is V-JEPA open source?

Yes, Meta released V-JEPA research artifacts publicly, including code and model checkpoints. For the newest V-JEPA versions, availability may vary, so readers should check the official Meta AI or FAIR repository linked in the article.

Why are there so many JEPA variants?

Different JEPA variants adapt the same core idea to different data types and goals, including images, video, audio, 3D scenes, point clouds, robotics, action prediction, causal reasoning, and world modeling. Together, they show how JEPA has become a broader family of predictive AI architectures.

Why does the JEPA model family matter?

The JEPA family matters because it explores an alternative path to intelligence: learning structured representations of the world through prediction in latent space. This direction is especially relevant for world models, robotics, embodied AI, planning, and systems that need more than surface-level pattern matching.
