All JEPA Models: 14 Milestones From I-JEPA to ThinkJEPA

TL;DR: From I-JEPA to ThinkJEPA, the JEPA family has evolved from image representation learning into video, audio, 3D, robotics, causal reasoning, and world modeling. This map traces 14 major JEPA milestones and shows how this architecture is evolving toward predictive, embodied AI.

This article is a map of that evolution: which models appeared, what each one added, and how the architecture moved from static perception toward predictive AI systems.

I-JEPA, V-JEPA, LeJEPA, V-JEPA 2, and ThinkJEPA mark key steps along that path. What started as Yann LeCun’s idea of prediction in embedding space has become a broader research path for building AI systems that understand what matters in the world, not just what pixels or tokens look like.

Recent papers such as V-JEPA 2.1, LeWorldModel, and ThinkJEPA make this trajectory much clearer. JEPA is no longer one model or one research direction. It has become a growing family of approaches for learning abstract representations, predicting future states, and supporting planning in more dynamic environments.

This is why we reconstructed the full timeline. Instead of explaining JEPA from scratch, this guide follows its model family: the early image-based breakthroughs, the move into video and audio, the expansion into 3D and point clouds, and the newer branches focused on action, causality, robotics, and world models.

If you are new to the basics, JEPA stands for Joint Embedding Predictive Architecture, a self-supervised learning framework proposed by Yann LeCun. It learns by predicting target embeddings of masked or future inputs in latent space, rather than reconstructing raw pixels, tokens, or signals. For a full beginner explanation, start with What Is JEPA? Joint Embedding Predictive Architecture.
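
To make the core idea concrete, here is a minimal sketch of a JEPA-style objective in plain Python. It is an illustration under stated assumptions, not the implementation of any specific paper: the encoders and predictor are toy linear maps, and the names `linear`, `jepa_loss`, and `ema_update` are hypothetical. The moving-average target encoder is a common choice in JEPA-style training, not something this article specifies. The point to notice is that the loss compares embeddings, never raw inputs.

```python
import random

def linear(matrix, vector):
    """Apply a toy linear layer: one output value per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def jepa_loss(context, target, context_encoder, target_encoder, predictor):
    """Mean squared error between predicted and actual target embeddings.

    The comparison happens in latent space; the raw target is never reconstructed.
    """
    context_embedding = linear(context_encoder, context)
    target_embedding = linear(target_encoder, target)  # treated as a fixed target
    predicted = linear(predictor, context_embedding)
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target_embedding)]
    return sum(squared_errors) / len(squared_errors)

def ema_update(target_encoder, context_encoder, momentum=0.99):
    """Assumption: the target encoder slowly tracks the context encoder."""
    return [
        [momentum * t + (1 - momentum) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(target_encoder, context_encoder)
    ]

rng = random.Random(0)
INPUT_DIM, EMBED_DIM = 8, 4  # illustrative sizes only

context_encoder = [[rng.gauss(0, 1) for _ in range(INPUT_DIM)] for _ in range(EMBED_DIM)]
target_encoder = [row[:] for row in context_encoder]  # starts as a copy
predictor = [[1.0 if i == j else 0.0 for j in range(EMBED_DIM)] for i in range(EMBED_DIM)]

visible_patch = [rng.gauss(0, 1) for _ in range(INPUT_DIM)]  # what the model sees
masked_patch = [rng.gauss(0, 1) for _ in range(INPUT_DIM)]   # what it must predict

loss = jepa_loss(visible_patch, masked_patch, context_encoder, target_encoder, predictor)
print(f"latent prediction loss: {loss:.4f}")
```

In a real system the encoders would be deep networks trained by gradient descent on this loss. A reconstruction-based method would instead compare its output against `masked_patch` itself.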

Here, the focus is different: the evolution of JEPA models and the typology of the JEPA family. So let’s walk through 14 of the most important milestones and see how the field moved from representation learning toward world modeling.

What Is the JEPA Model Family?

JEPA / H-JEPA

This is the conceptual root and starting point of the family. Yann LeCun’s framework defines JEPA as prediction in representation space, and H-JEPA adds the crucial idea of hierarchical, multi-timescale world modeling and planning. → Read more

I-JEPA

I-JEPA is an image-based JEPA model that learns semantic visual representations by predicting masked image regions in embedding space.

It is the first major concrete success. I-JEPA showed that JEPA could learn semantic image representations without hand-crafted augmentations, and that the approach scaled well with Vision Transformers and large datasets. This is the point where JEPA became a serious practical recipe, competing with masked modeling and contrastive SSL approaches. → Read more

MC-JEPA

MC-JEPA is a motion-content JEPA model designed to learn both dynamic motion features and static content features from video.

It is more of an exploratory step than a core milestone. It attempts to jointly learn motion and content features in a shared encoder, helping illustrate early efforts to extend JEPA from static images toward dynamic understanding. → Read more

V-JEPA

V-JEPA is a video-based JEPA model that learns spatio-temporal representations by predicting masked video features in latent space.

V-JEPA is one of the central pillars of the JEPA story: the leap from images to video-based latent prediction. It showed that predictive feature learning can scale to large video datasets and learn strong motion and appearance representations without relying on reconstruction or contrastive objectives. → Read more

Audio-JEPA

Audio-JEPA is an audio-focused JEPA model that applies latent predictive learning to spectrograms and speech-related representations.

Audio-JEPA showed that JEPA is not only for vision but is modality-general. It extends the approach to audio spectrograms using latent prediction and time-frequency-aware masking, showing strong performance on audio and speech tasks. → Read more

Point-JEPA

Point-JEPA is a 3D point-cloud JEPA model that learns geometric representations without relying on raw-space reconstruction.

Point-JEPA is one of the key 3D branches. It adapts JEPA specifically to point cloud data, avoids raw-space reconstruction, and shows that JEPA can work efficiently on geometric representations. → Read more

3D-JEPA

3D-JEPA is a JEPA-style approach for learning broader 3D scene and object representations in latent space.

3D-JEPA broadens the 3D story beyond point clouds into more general 3D representation learning, positioning JEPA as a framework for learning scene-level and object-level 3D semantics. → Read more

ACT-JEPA

ACT-JEPA is an action-focused JEPA model that connects latent world modeling with action sequence prediction and policy learning.

ACT-JEPA is the clearest bridge from JEPA to action and policy learning. It jointly predicts action sequences and latent observation sequences, showing improved world-model quality and better task performance. This is where JEPA starts to look like a full control architecture. → Read more

V-JEPA 2

V-JEPA 2 is a world-model version of V-JEPA designed for understanding, prediction, and planning in physical environments.

It is the point where JEPA becomes an explicit world model for understanding, prediction, and planning. It demonstrates zero-shot robotic planning with visual subgoals in unseen environments. This is a major milestone in the family. → Read more
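
Planning toward a visual subgoal in latent space can be sketched as follows. This is a toy illustration, not V-JEPA 2's actual planner, which this article does not describe: it swaps in simple random shooting, uses a hypothetical additive dynamics model in place of a learned predictor, and the names `predict_next`, `rollout`, and `plan` are assumptions for this example. The idea it shows is scoring candidate action sequences by how close the predicted final embedding lands to the goal embedding.

```python
import random

def predict_next(latent, action):
    """Stand-in for a learned latent predictor (toy additive dynamics)."""
    return [z + a for z, a in zip(latent, action)]

def rollout(latent, actions):
    """Apply the predictor step by step and return the final latent state."""
    for action in actions:
        latent = predict_next(latent, action)
    return latent

def distance(a, b):
    """L1 distance between two embeddings."""
    return sum(abs(x - y) for x, y in zip(a, b))

def plan(start, goal, horizon=3, num_candidates=200, seed=0):
    """Random shooting: sample action sequences, keep the one ending nearest the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [[rng.uniform(-1, 1) for _ in start] for _ in range(horizon)]
        cost = distance(rollout(start, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start_latent = [0.0, 0.0]   # embedding of the current observation (toy values)
goal_latent = [1.5, -1.0]   # embedding of the subgoal image (toy values)

actions, cost = plan(start_latent, goal_latent)
print(f"best final distance to goal: {cost:.3f}")
```

Because the search runs entirely over embeddings, no future frames need to be generated. In practice a controller would typically execute only the first action and then replan from the new observation.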

LeJEPA

LeJEPA is a theoretically grounded JEPA variant that simplifies training with SIGReg and focuses on stable, scalable representation learning.

This is the theory-and-training cleanup layer. LeJEPA introduces a simpler and more stable objective (SIGReg), argues for isotropic embedding structure, removes many heuristics, and emphasizes scalability and efficiency. It helps make JEPA more principled and easier to train. → Read more in our article
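
The preference for isotropic embeddings can be illustrated with a small sketch. This is not the SIGReg objective itself: as a loudly labeled stand-in, it projects embeddings onto random directions and penalizes each projection for deviating from zero mean and unit variance. The function names and the moment-matching penalty are assumptions made for this example.

```python
import math
import random

def random_unit_direction(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    vector = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def isotropy_penalty(embeddings, num_directions=16, seed=0):
    """Stand-in regularizer (NOT SIGReg): 1D projections should have mean 0, variance 1."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        direction = random_unit_direction(dim, rng)
        projections = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
        mean = sum(projections) / len(projections)
        variance = sum((p - mean) ** 2 for p in projections) / len(projections)
        total += mean ** 2 + (variance - 1.0) ** 2
    return total / num_directions

rng = random.Random(1)
isotropic = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
collapsed = [[0.5] * 8 for _ in range(2000)]  # every embedding identical

print(f"isotropic penalty: {isotropy_penalty(isotropic):.4f}")
print(f"collapsed penalty: {isotropy_penalty(collapsed):.4f}")
```

Collapsed embeddings have zero variance along every direction, so they score far worse than the isotropic sample. That is the failure mode a regularizer of this kind is meant to discourage.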

Causal-JEPA

Causal-JEPA is a JEPA extension that pushes latent prediction toward object-centric and causal reasoning.

Causal-JEPA is a conceptual extension of the family. By introducing object-level masking, it encourages the model to learn more structured and causally meaningful representations, with improvements in reasoning and planning efficiency. → Read more

V-JEPA 2.1

V-JEPA 2.1 is a representation-quality upgrade to V-JEPA 2, improving dense predictive features across images, video, and robotics benchmarks.

If V-JEPA 2 is the world-model milestone, this is the representation-quality upgrade. V-JEPA 2.1 extends the V-JEPA 2 line with dense predictive losses, improved self-supervision, and better feature quality across images and videos, while also improving robotics and dense understanding benchmarks. → Read more

LeWorldModel

LeWorldModel is an end-to-end JEPA-style world model trained from raw pixels with a minimal predictive objective.

LeWorldModel presents a clean recipe: its minimal objective reduces training complexity and enables faster planning compared to heavier foundation-model-based pipelines. → Read more

ThinkJEPA

ThinkJEPA is a forward-looking JEPA model that combines world modeling with a semantic reasoning pathway for long-horizon planning.

ThinkJEPA pairs a JEPA world model with a semantic “thinking” pathway derived from vision-language models. It targets long-horizon reasoning and planning, going beyond local prediction. → Read more

FAQ

What is the JEPA model family?

The JEPA model family is a group of AI models based on Joint Embedding Predictive Architecture. They all share the same core idea: predict useful representations in latent space instead of reconstructing raw pixels, tokens, audio, or other low-level signals.

What is the difference between JEPA and VL-JEPA?

JEPA is the broader architecture family for learning predictive representations in latent space. VL-JEPA is a vision-language version of JEPA that connects visual inputs with language-related representations. In simple terms, JEPA is the general framework; VL-JEPA applies it to vision-language learning.

What is the difference between JEPA and V-JEPA?

JEPA is the general framework for predicting embeddings of missing or future inputs. V-JEPA is the video-focused version that learns from video data and predicts visual representations over time. V-JEPA matters because it moves JEPA from static perception toward dynamic world modeling.

Is V-JEPA open source?

Yes, Meta released V-JEPA research artifacts publicly, including code and model checkpoints. For newer versions such as V-JEPA 2 or V-JEPA 2.1, availability may vary, so readers should check the official Meta AI or FAIR repository linked in the article.

Why are there so many JEPA variants?

Different JEPA variants adapt the same core idea to different data types and goals, including images, video, audio, 3D scenes, point clouds, robotics, action prediction, causal reasoning, and world modeling. Together, they show how JEPA became a broader family of predictive AI architectures.
