Multimodal AI

[Image: The robotic underwater vehicle Orpheus venturing into uncharted areas of the deep ocean (Credit: Marine Imaging Technologies, LLC/Woods Hole Oceanographic Institution, via BBC)]

- Overview

The field of artificial intelligence (AI) has made tremendous progress over the past decade. While traditional AI models have primarily focused on analyzing a single type of data, current techniques such as deep learning (DL), machine learning (ML), natural language processing (NLP), and generative AI (GenAI) take a broader approach to processing data.

To this end, developers and data scientists have come up with a variety of techniques to enhance large-scale data processing. One such technology is multimodal AI. This revolutionary technology integrates information from disparate sources to better understand the data at hand, allowing organizations to unlock new insights and support a wider range of applications.

To understand what multimodal AI is, you first need to understand the concept of modality. Modality, in its simplest form, refers to how something happens or is experienced. From this perspective, anything that involves multiple modalities can be described as multimodal.

Multimodal models are ML models that can process information from different modalities, including images, videos, and text. For example, Google's multimodal model Gemini can receive a photo of a plate of cookies and produce a written recipe in response, or vice versa.
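To make the Gemini example concrete, a call to the model through Google's google-generativeai Python SDK might look like the sketch below. The model name, file name, and API key are illustrative placeholders, and SDK details change across versions, so treat this as an approximation rather than current reference code.

```python
import google.generativeai as genai
import PIL.Image

# Configure the SDK with your own API key (placeholder shown here).
genai.configure(api_key="YOUR_API_KEY")

# Model name is illustrative; available Gemini models vary over time.
model = genai.GenerativeModel("gemini-1.5-flash")

# A single request can mix modalities: here, an image plus a text prompt.
cookies_photo = PIL.Image.open("cookies.jpg")
response = model.generate_content(
    [cookies_photo, "Write a recipe for the cookies in this photo."]
)
print(response.text)
```

The key point is that the image and the text travel through the same model in one request, rather than through two separate single-modality systems.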


- Multimodal: AI's New Frontier

Multimodal AI, a cutting-edge technology in AI, enhances data processing by integrating information from diverse sources like images, videos, and text. This approach allows AI models to understand data more comprehensively and supports a wider range of applications. 

A key aspect of multimodal AI is its ability to handle different "modalities" – ways that information is experienced or perceived – allowing for richer, more nuanced analysis. 

Key characteristics: 

  • Traditional AI Limitations: Traditional AI models often focused on analyzing data from a single source or modality (e.g., text or images).
  • Multimodal AI's Advantage: Multimodal AI overcomes this limitation by combining information from various modalities. For example, it can analyze an image of a dish and generate a written recipe, or vice versa.
  • Concept of Modality: Modality, in the context of AI, refers to how information is presented or perceived. Examples include visual data (images, videos), textual data, and audio data.
  • Multimodal Models: These are machine learning (ML) models designed to process and understand data from multiple modalities, allowing for more holistic and insightful analysis, according to Google Cloud.
  • Examples: Google's Gemini is a notable example of a multimodal model that can process text, images, and potentially other modalities to perform various tasks.

 

In practice, generative AI (GenAI) tools use different strategies for different types of data when building large models: the complex neural networks that organize vast amounts of information.

For example, models that draw on textual sources split the text into individual tokens, usually words. Each token is assigned an “embedding” or “vector”: a numerical representation of how and where the token is used relative to other tokens.

Collectively, these vectors form a mathematical representation of the token’s meaning. An image model, on the other hand, might use pixels as its tokens for embedding, and an audio model might use sound frequencies.
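The sketch below illustrates the tokenize-then-embed idea in Python. The tiny vocabulary, the 8-dimensional vectors, and the random initialization are all stand-ins for illustration; in a real model the embedding table is learned during training and covers tens of thousands of tokens.

```python
import numpy as np

# Toy vocabulary mapping each token (word) to a row in the embedding table.
vocab = {"the": 0, "tree": 1, "oak": 2, "leaves": 3, "rustle": 4}

# Randomly initialized embedding table: one vector per token.
# In a trained model these values are learned, not random.
rng = np.random.default_rng(seed=0)
embedding_dim = 8
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(text: str) -> np.ndarray:
    """Tokenize on whitespace and look up one vector per token."""
    tokens = text.lower().split()
    return np.stack([embedding_table[vocab[t]] for t in tokens])

vectors = embed("the oak tree")
print(vectors.shape)  # (3, 8): one 8-dimensional vector per token
```

An image model would follow the same pattern with patches of pixels in place of words, and an audio model with slices of the frequency spectrum.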

A multimodal AI model typically relies on several unimodal ones, “almost stringing together” the various contributing models. Combining them requires techniques to align the elements of each unimodal model, in a process called fusion.

For example, the word “tree”, an image of an oak tree, and audio in the form of rustling leaves might be fused in this way. This allows the model to create a multifaceted description of reality. 
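Below is a minimal sketch of one common fusion strategy: concatenating the outputs of separate unimodal encoders and projecting them into a shared space. The class name, the input dimensions, and the random input vectors are hypothetical stand-ins for illustration, not the architecture of any particular production model.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Fuse pre-computed unimodal embeddings by concatenation + projection.

    Assumes each unimodal encoder (text, image, audio) has already produced
    a fixed-size vector; the dimensions below are arbitrary for illustration.
    """

    def __init__(self, text_dim=768, image_dim=512, audio_dim=256, shared_dim=512):
        super().__init__()
        self.project = nn.Linear(text_dim + image_dim + audio_dim, shared_dim)

    def forward(self, text_vec, image_vec, audio_vec):
        fused = torch.cat([text_vec, image_vec, audio_vec], dim=-1)
        return self.project(fused)  # one joint representation of all modalities

# Example: the word "tree", an oak-tree image, and rustling-leaves audio,
# each already encoded by its own unimodal model (random stand-ins here).
fusion = LateFusion()
joint = fusion(torch.randn(1, 768), torch.randn(1, 512), torch.randn(1, 256))
print(joint.shape)  # torch.Size([1, 512])
```

Concatenation followed by a learned projection is only one option; other fusion techniques align modalities earlier in the network or use attention to weigh one modality against another.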

 

[More to come ...]
