Introduction to Embeddings at Cohere
Embeddings are a way to represent the meaning of text, images, or other information as a list of numbers. Using a simple comparison function, we can then calculate a similarity score between two embeddings to figure out whether two pieces of information are about similar things. Common use cases for embeddings include semantic search, clustering, and classification.
In the example below we use the embed-v4.0 model to generate embeddings for three phrases and compare them using a similarity function. The two similar phrases receive a high similarity score, and the embeddings for the two unrelated phrases receive a low similarity score:
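(A minimal sketch of that comparison, assuming the Python SDK’s `ClientV2` and cosine similarity via NumPy; the `float_` attribute name follows the current Python SDK, so adjust if your version differs.)

```python
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

phrases = [
    "I love going to the beach",
    "The seaside is my favourite place to relax",
    "Quarterly revenue rose by 12%",
]

res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
    texts=phrases,
)
embeddings = np.array(res.embeddings.float_)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for similar meanings, lower for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings[0], embeddings[1]))  # similar phrases -> high score
print(cosine_similarity(embeddings[0], embeddings[2]))  # unrelated phrases -> low score
```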
The `input_type` parameter
Cohere embeddings are optimized for different types of inputs.
- When using embeddings for semantic search, the search query should be embedded by setting `input_type="search_query"`.
- When using embeddings for semantic search, the text passages that are being searched over should be embedded with `input_type="search_document"` (see the sketch after this list).
- When using embeddings for classification and clustering tasks, you can set `input_type` to either `classification` or `clustering` to optimize the embeddings appropriately.
- When `input_type='image'` is used with `embed-v3.0`, the expected input to be embedded is an image instead of text. If you use `input_type='image'` with `embed-v4.0`, it will default to `search_document`. We recommend using `search_document` when working with `embed-v4.0`.
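For example, here is a minimal sketch of the two search settings, assuming the Python SDK’s `ClientV2` (the documents and query are illustrative):

```python
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

documents = [
    "Embeddings map text to vectors so similar meanings land close together.",
    "Bananas are an excellent source of potassium.",
]

# Passages being searched over: input_type="search_document"
doc_res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
    texts=documents,
)
doc_embs = np.array(doc_res.embeddings.float_)

# The query itself: input_type="search_query"
query_res = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
    texts=["How do embeddings represent meaning?"],
)
query_emb = np.array(query_res.embeddings.float_[0])

# Rank documents by cosine similarity to the query
scores = doc_embs @ query_emb / (
    np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
)
print(documents[int(np.argmax(scores))])  # the embeddings-related passage wins
```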
Multilingual Support
embed-v4.0 is a best-in-class multilingual model with support for over 100 languages, including Korean, Japanese, Arabic, Chinese, Spanish, and French.
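For instance, embeddings of a sentence and its translation land close together; here is a minimal sketch with an illustrative English/French pair:

```python
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# An English sentence and its French translation
res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
    texts=["The weather is lovely today", "Il fait très beau aujourd'hui"],
)
en, fr = (np.array(v) for v in res.embeddings.float_)
print(np.dot(en, fr) / (np.linalg.norm(en) * np.linalg.norm(fr)))  # high cross-lingual score
```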
Image Embeddings
The Cohere Embedding platform supports image embeddings for `embed-v4.0` and the `embed-v3.0` family. There are two ways to access this functionality:

- Pass `image` to the `input_type` parameter. Here are the steps:
  - Pass `image` to the `input_type` parameter
  - Pass your image URL to the `images` parameter
- Pass your image URL to the new `inputs` parameter. Here are the steps:
  - Pass in an input list of `dicts` with the key `content`
  - `content` contains a list of `dicts` with the keys `type` and `image`
When using the `images` parameter, the following restrictions exist:
- If `input_type='image'`, the `texts` field must be empty.
- The original image file type must be in a `png`, `jpeg`, `webp`, or `gif` format and can be up to 5 MB in size.
- The image must be base64 encoded and sent as a Data URL to the `images` parameter.
- Our API currently does not support batch image embeddings for `embed-v3.0` models. For `embed-v4.0`, however, you can submit up to 96 images.
When using the `inputs` parameter, the following restrictions exist (note these restrictions apply to `embed-v4.0`):

- The maximum payload size is 20 MB
- All images larger than 2,458,624 pixels will be downsampled to 2,458,624 pixels
- All images smaller than 3,136 pixels (56x56) will be upsampled to 3,136 pixels
- `input_type` must be set to one of the following: `search_query`, `search_document`, `classification`, or `clustering`
Here’s a code sample using the `inputs` parameter:
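(A minimal sketch assuming the Python SDK’s `ClientV2`; `slide.png` is a placeholder file, and the content keys follow the format described above.)

```python
import base64

import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Encode a local image as a base64 Data URL
with open("slide.png", "rb") as f:
    image_data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

res = co.embed(
    model="embed-v4.0",
    input_type="search_document",  # must be one of the four values listed above
    embedding_types=["float"],
    inputs=[
        # Each input is a dict with the key "content";
        # "content" is a list of dicts with the keys "type" and "image"
        {"content": [{"type": "image", "image": image_data_url}]},
    ],
)
print(len(res.embeddings.float_[0]))
```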
Here’s a code sample using the `images` parameter:
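(Again a minimal sketch; `embed-english-v3.0` stands in for any model in the `embed-v3.0` family, and `image.png` is a placeholder.)

```python
import base64

import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Base64-encode the image and wrap it as a Data URL
with open("image.png", "rb") as f:
    image_data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

res = co.embed(
    model="embed-english-v3.0",  # any embed-v3.0-family model
    input_type="image",          # texts must be empty in this mode
    embedding_types=["float"],
    images=[image_data_url],
)
print(len(res.embeddings.float_[0]))
```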
Support for Mixed Content Embeddings
`embed-v4.0` supports text and content-rich images such as figures, slide decks, and document screenshots (e.g., screenshots of PDF pages). This eliminates the need for complex text extraction or ETL pipelines. Unlike our previous `embed-v3.0` model family, `embed-v4.0` is capable of processing images and text together; the input can be a single image that contains both text and visual content, or a combination of text and images that you’d like to compress into a single vector representation.
Here’s a code sample illustrating how `embed-v4.0` can be used to embed fused image and text inputs:
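(A minimal sketch under the same assumptions as the `inputs` example above; the caption text and `slide.png` are placeholders.)

```python
import base64

import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

with open("slide.png", "rb") as f:  # placeholder: e.g. a screenshot of a PDF page
    image_data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

# The text and the image are fused into a single vector representation
res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
    inputs=[
        {
            "content": [
                {"type": "text", "text": "Q3 results slide: revenue and margin trends"},
                {"type": "image", "image": image_data_url},
            ]
        }
    ],
)
print(len(res.embeddings.float_[0]))  # one embedding for the combined input
```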

Matryoshka Embeddings
Matryoshka learning creates embeddings with coarse-to-fine representations within a single vector; `embed-v4.0` supports multiple output dimensions with the following values: `[256, 512, 1024, 1536]`. To access this, specify the `output_dimension` parameter when creating the embeddings.
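For example, here is a minimal sketch requesting 512-dimensional vectors, assuming the Python SDK’s `ClientV2`:

```python
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
    output_dimension=512,  # one of 256, 512, 1024, 1536
    texts=["Matryoshka embeddings pack coarse-to-fine detail into one vector"],
)
print(len(res.embeddings.float_[0]))  # 512
```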
Compression Levels
The Cohere embeddings platform supports compression. The Embed API features an `embedding_types` parameter which allows the user to specify various ways of compressing the output.
The following embedding types are supported:
`float`, `int8`, `uint8`, `binary`, and `ubinary`
We recommend being explicit about the embedding type(s). To specify an embedding type, pass one of the types from the list above as a list containing a single string:
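(A minimal sketch, using `int8` as the example type.)

```python
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["int8"],  # a single type, passed as a list containing one string
    texts=["Compressed embeddings save storage"],
)
print(res.embeddings.int8[0][:5])  # int8 values in [-128, 127]
```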
You can specify multiple embedding types in a single call. For example, the following call will return both `int8` and `float` embeddings:
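(Same assumptions as above.)

```python
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["int8", "float"],
    texts=["Compressed embeddings save storage"],
)
print(len(res.embeddings.int8[0]))    # the int8 variant
print(len(res.embeddings.float_[0]))  # the float variant of the same text
```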
A Note on Bits and Bytes
When doing binary compression, there’s a subtlety worth pointing out: because Cohere packs bits into bytes under the hood, the actual length of the vector changes. This means that a binary vector of 1024 dimensions becomes 1024 / 8 = 128 bytes, which can be confusing if you run `len(embeddings)`. The code below shows how to unpack the bytes back into bits, which is useful if you’re using a vector database that does not accept bytes for binary vectors:
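(A minimal sketch using NumPy’s `unpackbits`; `output_dimension=1024` is set to match the example above, and `ubinary` is chosen so the packed bytes are already unsigned.)

```python
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

res = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["ubinary"],  # unsigned, byte-packed binary embeddings
    output_dimension=1024,
    texts=["Binary embeddings are compact"],
)

packed = np.array(res.embeddings.ubinary[0], dtype=np.uint8)
print(len(packed))  # 128 bytes (1024 / 8)

bits = np.unpackbits(packed)  # back to 1024 individual 0/1 values
print(len(bits))    # 1024
```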