Token 1.24: Understanding Multimodal Models
and what's so special about them in 2024
Introduction
At the end of 2023, many experts predicted that 2024 would become the year of multimodal models. For example, Sara Hooker, VP of Research at Cohere, told us, 'Multimodal will become a ubiquitous term.'
But if we sift through our memories, we'll recall that VentureBeat had already published an article in 2021 titled 'Multimodal models are fast becoming a reality – consequences be damned.' Why didn't it become a reality then, and why is 'multimodal' now indeed on its way to becoming a ubiquitous term?
Let's take a look:
Microsoft's NUWA generating video from text in 2021
OpenAI's Sora generating video from text in 2024
And though something weird is happening with the man's limbs in the first video by Sora, one thing is obvious: multimodal models have become a reality, and from now on they will only become more realistic. In this article, we explore the latest developments in multimodal models, how the field has evolved, and its future directions. As a bonus, we'll list the latest multimodal models with links to comprehensive surveys on the topic.
What does multimodal mean?
Language models vs. Multimodal models
Single modality → multimodality: what's the point of this transition?
Path to Multimodal Foundation Models
Multimodal Fusion: how to truly combine different data types?
Real-world examples of existing models
Helpful surveys and resources
Future directions
What does multimodal mean?
In simple terms, it means a model that can work with two or more types of data, or 'modalities,' and can process and fuse the information coming from a combination of them. The primary modalities are:
Text: Written text or code represented as text
Vision: Images and videos (which are essentially sequences of images)
Audio: Speech, music, or any ambient sounds
Sensory data: Information from sensors like temperature, motion, pressure, and other measurements relevant to the specific use case
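To make 'processing and fusing' modalities concrete, here is a minimal sketch of one common approach, late fusion, in which each modality is embedded by its own encoder and the embeddings are concatenated before a joint prediction head. The dimensions, layer choices, and class name below are illustrative assumptions, not taken from any particular model in this article:

```python
# A minimal late-fusion sketch (illustrative, not a specific published model):
# each modality is encoded separately, the embeddings are projected into a
# shared space, concatenated, and passed through a joint prediction head.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden=256, num_classes=10):
        super().__init__()
        # Project each modality into a shared hidden space
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        # The joint head operates on the fused (concatenated) representation
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, num_classes))

    def forward(self, text_emb, image_emb):
        fused = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1)
        return self.head(fused)

# In practice, text_emb and image_emb would come from pretrained unimodal
# encoders (e.g., a text transformer and a vision backbone); the random
# tensors here just stand in for them.
model = LateFusionClassifier()
text_emb = torch.randn(4, 768)   # batch of 4 text embeddings
image_emb = torch.randn(4, 512)  # batch of 4 image embeddings
logits = model(text_emb, image_emb)  # shape: (4, 10)
```

Late fusion is only one design point; many recent foundation models instead inject one modality's embeddings directly into another modality's sequence, which the fusion section below discusses.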
In the context of foundation models, multimodal usually means combining textual and visual data. However, there are other combinations of modalities to explore that could enable progress on higher-order skills illustrated in the image below.
An illustration of different data sources and foundation models' skills. Image credit: https://crfm.stanford.edu/assets/report.pdf
When you start reading research papers, you encounter a myriad of terms related to multimodal foundation models. Let's sort them out →
The rest of this article is available to our Premium users only. To learn more about multimodal models and be able to use them in your work, please →
Thank you for reading! Please feel free to share with your friends and colleagues and receive one month of Premium subscription for free.