# Multimodal AI: What ChatGPT and Google Gemini Can Now Do

**URL:** https://www.singlegrain.com/blog/ms/multimodal-ai/  
**Published:** 2024-04-04  
**Updated:** 2024-09-04  
**Author:** Eric Siu  
**Summary:** Updated July 2024 Brace yourself\. The next stage of AI is being ushered in – it’s multimodal AI\. Multimodal AI is a significant step towards more intelligent and versatile AI\.\.\.  

---

_Updated July 2024_

Brace yourself. The next stage of AI is being ushered in – it’s multimodal AI.

Multimodal AI is a significant step towards more intelligent and versatile AI systems that are capable of understanding and interacting with the world in a more human-like manner.

In this post, we’re going to give a breakdown of the new functionality that you can take advantage of in ChatGPT and Google Gemini, specifically focusing on the interconnectivity between these tools and image observation.

[Get My Free Marketing Plan](javascript:;)


### [**TABLE OF CONTENTS:**](javascript:;)

- **[What Is Multimodal AI?](#h.tckv5b9d99dk)**
- **[A Glimpse into ChatGPT's Multimodal Capabilities](#a-glimpse-into-chatgpts-multimodal-capabilities)**
- **[Google Gemini’s Integrations](#h.6jly5d164rbd)**
- **[Image Interpretation](#h.r1k97lfuzcco)**
- **[The Road Ahead for Multimodal AI](#the-road-ahead-for-multimodal-ai)**
- **[Recommended Video](#recommended-video)**





## What Is Multimodal AI?

**Multimodal AI** is a type of artificial intelligence that can understand and generate multiple forms of data inputs, such as text, images and sound, simultaneously.

And it’s as big of a deal as it sounds.

Multimodal AI systems are trained on large datasets of multimodal data, which allows them to learn the relationships between different modalities and how to fuse them together effectively. Once trained, [these systems can be used for a variety of tasks](https://www.singlegrain.com/blog/ms/marketing-ai/), including:

- **Image captioning:** Generating text descriptions of images.
- **Text-to-image generation:** Generating images from text descriptions.
- **Video understanding:** Summarizing the content of videos, answering questions about videos, and [detecting objects and events in videos](https://www.singlegrain.com/blog/ms/ai-generated-video/).
- **Human-computer interaction:** Enabling more natural and intuitive communication between humans and computers.
- **Robotics:** Helping robots better understand and interact with the real world.

This evolution offers substantial potential, especially when it comes to real-world applications.

## A Glimpse into ChatGPT’s Multimodal Capabilities

[ChatGPT’s](https://www.singlegrain.com/blog/ms/chatgpt-prompts-for-marketing/) multimodal capabilities allow it to interact with users in a more natural and intuitive way. It can now see, hear and speak, which means that users can provide input and receive responses in a variety of ways.

Here are some specific examples of [ChatGPT’s multimodal capabilities](https://openai.com/blog/chatgpt-can-now-see-hear-and-speak):

- **Image input:** Users can upload images to ChatGPT as prompts, and the chatbot will generate responses based on what it sees. For example, you could upload a photo of a recipe and ask ChatGPT to generate a list of ingredients or instructions. We’ll expand on this shortly.
- **Voice input:** People can also use voice [prompts to interact with ChatGPT](https://www.singlegrain.com/blog/ms/marketing-strategies/). This can be useful for hands-free tasks, such as asking ChatGPT to play a song while driving.
- **Voice output:** ChatGPT can also generate responses in one of five different natural-sounding voices. This means that users can have a more normal and conversational experience with the chatbot.
- **DALL-E integration:** ChatGPT Plus and Enterprise users can now generate images from text descriptions directly within the ChatGPT interface with the DALL-E GPT, like this one (“Generate an image of a human chatting with an AI robot”):

![DALL·E-generated image of woman conversing with an AI robot](https://www.singlegrain.com/wp-content/uploads/2023/10/DALL·E-2023-10-25-14.24.29-Photo-of-an-elderly-Asian-woman-in-a-park-holding-a-conversation-with-a-humanoid-AI-robot-that-has-a-lifelike-face-and-gentle-expressions.-The-backgr.png)

- **As of April 3, 2024**, you can now edit your DALL-E images right in ChatGPT:

> You can now edit DALL·E images in ChatGPT across web, iOS, and Android. [pic.twitter.com/AJvHh5ftKB](https://t.co/AJvHh5ftKB)
> 
> — OpenAI (@OpenAI) [April 3, 2024](https://twitter.com/OpenAI/status/1775569161759985737?ref_src=twsrc%5Etfw)

![Edit DALL-E images in ChatGPT](https://www.singlegrain.com/wp-content/uploads/2024/04/Edit-DALL-E-images-in-ChatGPT-1280x911.png)

- Plus you can quickly choose among several image styles:

[![OpenAI DALL·E GPT tweet](https://www.singlegrain.com/wp-content/uploads/2023/10/OpenAI-DALL·E-GPT.png)](https://twitter.com/OpenAI/status/1775569163257332169)

## Google Gemini’s Integrations

While ChatGPT is making waves with its multimodal approach, [Google Gemini](https://www.singlegrain.com/blog/ms/google-bard/) is emerging as a strong contender in the AI sphere.

Many users have noted its proficiency, even going as far as to say that Gemini[ surpasses ChatGPT](https://www.singlegrain.com/blog/ms/chatgpt-alternative/) in certain areas. The argument in favor of Gemini often centers on the freshness of its data.

ChatGPT, despite its upcoming versions, relies on slightly outdated data sets (its current knowledge base cuts off at September 2021), which affects its relevancy in up-to-date and evolving topics.

[Google Gemini boasts integrations](https://www.singlegrain.com/blog/a/google-bard-youtube-integration/) with various data sources, such as:

- Google Flights
- Google Maps
- Google Hotels
- the broader Google Workspace
- and now YouTube

That’s just a handful of the product integrations Google Gemini is capable of. Also, because it does not have a knowledge cut-off date, it can access information through Google Search, which means it can communicate more dynamically with tools like Maps and Hotels, providing (almost) real-time updates on queries related to those topics.

![Image1](https://www.singlegrain.com/wp-content/uploads/2023/10/image1-20.png)A simple query, like seeking insights about a YouTube influencer, can yield detailed results about the channels they operate, their primary content themes, and much more.

The difference in utility between ChatGPT and Google Gemini is evident, with each having its unique strengths. Some users lean towards Geminifor certain tasks, while ChatGPT remains the go-to for others. The competition between the two ensures that AI tools will continually evolve, offering users enhanced capabilities.

## Image Interpretation

Both Google Gemini and ChatGPT use multimodal AI to [describe photos](https://www.wired.com/story/chatgpt-plus-image-feature-openai/) by combining their knowledge of language and images:

![Screenshot of chatgbt anayzing photo of plug](https://media.wired.com/photos/651759def8a5c34a7770661d/master/w_1600%2Cc_limit/chatgbt-gear-image_6487327-(1).jpg)

This is helpful for marketers because it allows them to generate more accurate and informative descriptions of their products and services.

For example, you could use Gemini or ChatGPT to generate a description of a new clothing item that would be more likely to capture the attention of potential customers. Or, you could use these models to generate descriptions of your products in different languages, which could help you reach a wider audience.

Here are some specific ways that marketers can use Gemini and ChatGPT to describe photos:

- **Generate product descriptions:** This can help marketers to increase sales and improve the customer experience.
- **Create marketing campaigns:** A marketer could use these models to generate different ad copy for different social media platforms based on the graphics or images provided.
- **Improve SEO:** Gemini and ChatGPT can be used to generate descriptions of photos that are optimized for search engines. This can help marketers improve the ranking of their websites in search results.

## The Road Ahead for Multimodal AI

The rapid advancements in AI tools like [ChatGPT](https://www.singlegrain.com/blog/ms/chatgpt-marketing/) and Google Gemini are undoubtedly exciting. However, a note of caution: these tools are still in their developmental phase. Expecting flawless operation might lead to disappointment. Over the next couple years, these tools will likely become more refined and accurate –  and inaccuracies will still persist.

The key to harnessing the power of these AI tools lies in the synergy between [human and machine](https://www.singlegrain.com/blog/ms/ai-vs-human/). Relying solely on AI might not yield the best results. But combined with human judgment and expertise, these tools can become a formidable asset.

As always, with technology evolving at breakneck speeds, staying updated on these tools will ensure that users are always ahead of the curve.

### **If you’re ready to level up your brand with AI tools, Single Grain’s [AI experts](https://www.singlegrain.com/ai-transformation-services/) can help!👇**

[Get My Free Marketing Plan](javascript:;)

## Recommended Video

 ![Video thumbnail](https://i.ytimg.com/vi/Qd_EsiSpMvs/maxresdefault.jpg)

 

 





_For more insights and lessons about marketing, check out our [Marketing School podcast](https://www.youtube.com/@MarketingSchoolPod) on YouTube._

_Additional content contributed by [Sam Pak](https://www.linkedin.com/in/samuelepak/)._
