World Models will push the frontier for LLMs
Large Language Models are trained with the next-token prediction objective: the model's predicted distribution over the vocabulary must match the distribution of tokens in the training data. This objective is highly effective for language because text is already a discrete and compressed representation of world knowledge.
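To make the objective concrete, here is a minimal numpy sketch of next-token prediction: the model scores every vocabulary token, and training minimizes the cross-entropy, i.e. the negative log-probability assigned to the token that actually comes next. The tiny vocabulary and the logit values are made up for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

def next_token_loss(logits, target_index):
    # Cross-entropy: negative log-probability of the true next token.
    probs = softmax(logits)
    return -np.log(probs[target_index])

logits = np.array([2.0, 0.5, 0.1, -1.0])  # model scores for each vocab token
loss = next_token_loss(logits, target_index=0)  # suppose "the" comes next
```

The loss is smallest when the model concentrates probability on the observed next token, which is what pushes the model's distribution toward that of the dataset.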
Since the middle of 2024¹, there has been a push for multimodality support in Large Language Models. GPT-4o was the first commercialized multimodal LLM; it accepts images, audio, and video as well as text.
Today, LLMs are more capable and (allegedly) much larger than GPT-4o. This scaling up in model size raises questions about the compute efficiency of both training and inference. World Models might push the frontier by enabling more efficient, more "compressed" models.
However, there are multiple approaches to using World Models to build Multimodal Large Language Models (MLLMs). My favorite is Latent World Models: the model encodes multiple modalities into a single embedding space (through fusion), which then feeds a decoder head that translates it into language space.
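A hedged numpy sketch of that frontend, not any specific model's implementation: each modality gets its own encoder into a shared latent space, the latents are fused, and a decoder head maps the fused latent to language-space logits. All shapes, the linear "encoders", and the average-pooling fusion are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D_LATENT, VOCAB = 16, 100  # toy sizes, chosen for illustration

W_img = rng.normal(size=(32, D_LATENT))     # stand-in image encoder (one linear map)
W_txt = rng.normal(size=(8, D_LATENT))      # stand-in text encoder
W_dec = rng.normal(size=(D_LATENT, VOCAB))  # decoder head into language space

def encode_fuse(image_feats, text_feats):
    z_img = image_feats @ W_img   # project each modality into the shared space
    z_txt = text_feats @ W_txt
    return (z_img + z_txt) / 2    # simple average fusion of the two latents

def decode(z):
    return z @ W_dec              # logits over the language vocabulary

z = encode_fuse(rng.normal(size=32), rng.normal(size=8))
logits = decode(z)
```

The point of the design is that everything upstream of `decode` lives in one modality-agnostic latent space; language is just one readout head among possible others.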
This approach was tested in depth in the paper "VL-JEPA: Joint Embedding Predictive Architecture for Vision-language" (Chen et al., 2026). They trained the multimodal encoder with the classic JEPA recipe, which includes multiple heuristics (these can be removed by using "LeJEPA"²). They did not use the cross-entropy loss at all (it may be better to include it for stability) and still obtained a 1.2 billion parameter Vision-Language model that outperforms 8 billion parameter models from 2024 and 2025. It is not SOTA, but it shows another way to do things.
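For contrast with the cross-entropy example above, here is a sketch of the core JEPA idea: a predictor regresses the *embedding* of the target view, so the loss is a plain distance in latent space, with no vocabulary and no softmax. (In the classic recipe the target encoder is typically an EMA copy whose gradients are stopped, one of the heuristics the text mentions; names and shapes here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(1)

D = 16  # latent dimension, illustrative

def jepa_loss(predicted_latent, target_latent):
    # Mean squared error in latent space; no tokens involved.
    return float(np.mean((predicted_latent - target_latent) ** 2))

pred = rng.normal(size=D)    # predictor output computed from the context view
target = rng.normal(size=D)  # target-encoder output (treated as a constant)
loss = jepa_loss(pred, target)
```

Because nothing anchors the latent space by itself, this objective can collapse to trivial solutions, which is exactly why JEPA-style training leans on heuristics (or, in LeJEPA, an explicit regularizer) to keep the embeddings informative.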
I see a future where Large Language Models are powered by World Model backbones that process everything in latent space and decode it afterwards with a classic decoder head. Such LLMs would be better at reasoning about physical space and understanding what is physically possible and impossible. They could even learn new "worlds" (datasets); they are not limited to physics, or even to videos and images: audio and speech, time series, temperature, touch, and more.
The future is exciting!
1. Release of GPT-4o by OpenAI in May 2024.
2. We can remove the heuristics of JEPA by using SIGReg, which normalizes the latent vectors by forcing them to fit a specific geometry. More details in the paper.