Introduction to Aya Vision

Introducing Aya Vision - a state-of-the-art open-weights multimodal multilingual model.

In this notebook, we will explore the capabilities of Aya Vision, which takes text and image inputs and generates text responses.

This tutorial will provide a walkthrough of the various use cases that you can build with Aya Vision. By the end of this notebook, you will have a solid understanding of how to use Aya Vision for a wide range of applications.

The list of possible use cases with multimodal models is endless, but this notebook will cover the following:

  • Setup
  • Question answering
  • Multilingual multimodal understanding
  • Captioning
  • Recognizing text
  • Classification
  • Comparing multiple images
  • Conclusion

Setup

First, install the Cohere Python SDK and create a client.

PYTHON
%pip install cohere -q
PYTHON
import cohere
import base64

co = cohere.ClientV2(
    "COHERE_API_KEY"
)  # Get your free API key here: https://dashboard.cohere.com/api-keys
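Hardcoding the key in a notebook is convenient but easy to leak. As an optional, hedged alternative (the environment variable name `COHERE_API_KEY` is our assumption, not something the notebook mandates), you can read the key from the environment instead:

```python
import os

# Read the API key from the environment rather than pasting it into the notebook.
api_key = os.environ.get("COHERE_API_KEY", "")
if not api_key:
    print("COHERE_API_KEY is not set; API calls below will fail.")
# The client would then be created with: co = cohere.ClientV2(api_key)
```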

Next, let’s set up a function to generate text responses, given an image and a message. It uses the Cohere API via the Chat endpoint to call the Aya Vision model.

To pass an image to the API, supply a Base64-encoded data URL as the image_url entry in the messages parameter. To convert an image into its Base64-encoded form, we can use the base64 library, as in the example below.

PYTHON
# Define the model
model = "c4ai-aya-vision-32b"

def generate_text(image_path, message):
    """
    Generate text responses from the Aya Vision model based on an image and a text prompt.

    Args:
        image_path (str): Path to the image file
        message (str): Text prompt to send with the image

    Returns:
        None: Prints the model's response
    """

    # Encode the image as a Base64 data URL
    with open(image_path, "rb") as img_file:
        base64_image_url = f"data:image/jpeg;base64,{base64.b64encode(img_file.read()).decode('utf-8')}"

    # Make an API call to the Cohere Chat endpoint, passing the user message and image
    response = co.chat(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": message},
                    {"type": "image_url", "image_url": {"url": base64_image_url}},
                ],
            }
        ],
    )

    # Print the response
    print(response.message.content[0].text)
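The function above hardcodes `image/jpeg` in the data URL. As a small optional refinement (the helper name is our own; this is a sketch, not part of the original walkthrough), you can guess the MIME type from the file extension so that PNG or WebP files are labeled correctly:

```python
import base64
import mimetypes

def image_to_data_url(image_path):
    """Build a Base64 data URL for an image, guessing the MIME type
    from the file extension instead of assuming JPEG."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "image/jpeg"  # fall back to JPEG for unknown extensions
    with open(image_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The inline encoding in `generate_text` could then be replaced with a call to this helper.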

Let’s also set up a function to render images on this notebook as we go through the use cases.

Note: the images used in this notebook can be downloaded here

PYTHON
from IPython.display import Image, display

def render_image(image_path):
    """
    Display an image in the notebook with a fixed width.

    Args:
        image_path (str): Path to the image file to display
    """
    display(Image(filename=image_path, width=400))

Question answering

One of the more common use cases is question answering. Here, the model is used to answer questions based on the content of an image.

By providing an image and a relevant question, the model can analyze the visual content and generate a text response. This is particularly useful in scenarios where visual context is important, such as identifying objects, understanding scenes, or providing descriptions.

PYTHON
image_path = "image1.jpg"
render_image(image_path)
PYTHON
message = "Where is this art style from and what is this dish typically used for?"
generate_text(image_path, message)

The art style on this dish is typical of traditional Moroccan or North African pottery. It's characterized by intricate geometric patterns, bold colors, and a mix of stylized floral and abstract designs.

This type of dish is often used as a spice container or for serving small portions of food. In Moroccan cuisine, similar dishes are commonly used to hold spices like cumin, cinnamon, or paprika, or to serve condiments and appetizers.

The design and craftsmanship suggest this piece is likely handmade, which is a common practice in Moroccan pottery. The vibrant colors and detailed patterns make it not just a functional item but also a decorative piece that adds to the aesthetic of a dining table or kitchen.

Multilingual multimodal understanding

Aya Vision can process and respond to prompts in multiple languages, demonstrating its multilingual capabilities. This feature allows users to interact with the model in their preferred language, making it accessible to a global audience. The model can analyze images and provide relevant responses based on the visual content, regardless of the language used in the query.

Here is an example in Persian, asking whether the item in the image is a suitable gift for a three-year-old:

PYTHON
image_path = "image2.jpg"
render_image(image_path)
PYTHON
message = "آیا این یک هدیه مناسب برای یک کودک 3 ساله است؟"
generate_text(image_path, message)

بله، این یک هدیه مناسب برای یک کودک سه ساله است. این مجموعه لگو دوپلوی "پل آهنی و مسیر قطار" به طور خاص برای کودکان دو تا چهار ساله طراحی شده است. قطعات بزرگ و رنگارنگ آن برای دست‌های کوچک راحت است و به کودکان کمک می‌کند تا مهارت‌های حرکتی ظریف خود را توسعه دهند. این مجموعه همچنین خلاقیت و بازی تخیلی را تشویق می‌کند، زیرا کودکان می‌توانند با قطعات مختلف برای ساختن پل و مسیر قطار بازی کنند. علاوه بر این، لگو دوپلو به دلیل ایمنی و سازگاری با کودکان خردسال شناخته شده است، که آن را به انتخابی ایده‌آل برای هدیه دادن به کودکان سه ساله تبدیل می‌کند.

(English translation: Yes, this is a suitable gift for a three-year-old child. This LEGO DUPLO "Train Bridge and Tracks" set is designed specifically for children aged two to four. Its large, colorful pieces are easy for small hands and help children develop fine motor skills. The set also encourages creativity and imaginative play, as children can use the different pieces to build the bridge and train track. In addition, LEGO DUPLO is known for its safety and suitability for young children, making it an ideal gift choice for three-year-olds.)

And here’s an example in Indonesian, asking which Indonesian national figure the quote in the image is from:

PYTHON
image_path = "image3.jpg"
render_image(image_path)
PYTHON
message = "Gambar ini berisikan kutipan dari tokoh nasional di Indonesia, siapakah tokoh itu?"
generate_text(image_path, message)

Gambar ini berisikan kutipan dari Soekarno, salah satu tokoh nasional Indonesia yang terkenal. Ia adalah Presiden pertama Indonesia dan dikenal sebagai salah satu pemimpin pergerakan kemerdekaan Indonesia. Kutipan dalam gambar tersebut mencerminkan pemikiran dan visi Soekarno tentang pembangunan bangsa dan pentingnya kontribusi generasi muda dalam menciptakan masa depan yang lebih baik.

(English translation: This image contains a quote from Soekarno, one of Indonesia's famous national figures. He was Indonesia's first President and is known as one of the leaders of the Indonesian independence movement. The quote in the image reflects Soekarno's thinking and vision about nation-building and the importance of the younger generation's contribution to creating a better future.)

Captioning

Instead of asking specific questions, we can also get the model to describe an image as a whole, whether as a detailed description or a simple caption.

This can be particularly useful for creating alt text for accessibility, generating descriptions for image databases, social media content creation, and others.

PYTHON
image_path = "image4.jpg"
render_image(image_path)
PYTHON
message = "Describe this image in detail."

generate_text(image_path, message)

In the heart of a vibrant amusement park, a magnificent and whimsical dragon sculpture emerges from the water, its scales shimmering in hues of red, green, and gold. The dragon's head, adorned with sharp teeth and piercing yellow eyes, rises above the surface, while its body coils gracefully beneath the waves. Surrounding the dragon are colorful LEGO-like structures, including a bridge with intricate blue and purple patterns and a tower that reaches towards the sky. The water, a striking shade of turquoise, is contained by a wooden fence, and beyond the fence, lush green trees provide a natural backdrop. The scene is set against a cloudy sky, adding a touch of drama to this fantastical display.

Recognizing text

The model can recognize and extract text from images, which is useful for reading signs, documents, or other text-based content in photographs. This capability enables applications that can answer questions about text content.

PYTHON
image_path = "image5.jpg"
render_image(image_path)
PYTHON
message = "How many bread rolls do I get?"

generate_text(image_path, message)

You get 6 bread rolls in the pack.

Classification

Classification allows the model to categorize images into predefined classes or labels. This is useful for organizing visual content, filtering images, or extracting structured information from visual data.

PYTHON
image_path1 = "image6.jpg"
image_path2 = "image7.jpg"
render_image(image_path1)
render_image(image_path2)
PYTHON
message = "Please classify this image as one of these dish types: japanese, malaysian, turkish, or other. Respond in the following format: dish_type: <the_dish_type>."

images = [
    image_path1,  # turkish
    image_path2,  # japanese
]

for item in images:
    generate_text(item, message)
    print("-" * 30)

dish_type: turkish
------------------------------
dish_type: japanese
------------------------------
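Because the prompt constrains the response to a fixed `dish_type: <label>` format, the output can be parsed programmatically. As a hedged sketch (the helper name and regex are our own, assuming the model follows the requested format):

```python
import re

def parse_dish_type(response_text):
    """Extract the label from a response shaped like 'dish_type: <label>'.

    Returns the lowercased label, or None if the pattern is absent
    (e.g. when the model ignores the requested format).
    """
    match = re.search(r"dish_type:\s*(\w+)", response_text)
    return match.group(1).lower() if match else None
```

For example, `parse_dish_type("dish_type: turkish")` returns `"turkish"`, which can then feed a downstream pipeline such as a content filter or a database field.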

Comparing multiple images

This section demonstrates how to analyze and compare multiple images simultaneously. The API allows passing more than one image in a single call, enabling the model to perform comparative analysis between different visual inputs.

PYTHON
image_path1 = "image6.jpg"
image_path2 = "image7.jpg"
render_image(image_path1)
render_image(image_path2)
PYTHON
1message = "Compare these two dishes."
2
3with open(image_path1, "rb") as img_file1:
4 base64_image_url1 = f"data:image/jpeg;base64,{base64.b64encode(img_file1.read()).decode('utf-8')}"
5
6with open(image_path2, "rb") as img_file2:
7 base64_image_url2 = f"data:image/jpeg;base64,{base64.b64encode(img_file2.read()).decode('utf-8')}"
8
9response = co.chat(
10 model=model,
11 messages=[
12 {
13 "role": "user",
14 "content": [
15 {"type": "text", "text": message},
16 {"type": "image_url", "image_url": {"url": base64_image_url1}},
17 {"type": "image_url", "image_url": {"url":base64_image_url2}}
18 ],
19 }
20 ],
21)
22
23print(response.message.content[0].text)
1The first dish is a Japanese-style bento box containing a variety of items such as sushi rolls, tempura shrimp, grilled salmon, rice, and vegetables. It is served in a clear plastic container with individual compartments for each food item. The second dish is a Turkish-style meal featuring baklava, a sweet pastry made with layers of phyllo dough, nuts, and honey. It is accompanied by a small bowl of cream and a red flag with a gold emblem. The baklava is presented on a black plate, while the bento box is placed on a tray with a red and gold napkin. Both dishes offer a unique culinary experience, with the Japanese bento box providing a balanced meal with a mix of proteins, carbohydrates, and vegetables, and the Turkish baklava offering a rich, sweet dessert.

Conclusion

In this notebook, we’ve explored the capabilities of the Aya Vision model through various examples.

The Aya Vision model shows impressive capabilities in understanding visual content and providing detailed, contextual responses. This makes it suitable for a wide range of applications including content analysis, accessibility features, educational tools, and more.

The API’s flexibility in handling different types of queries and multiple images simultaneously makes it a powerful tool if you are looking to integrate advanced computer vision capabilities into your applications.
