Introduction to Embeddings at Cohere


Embeddings are a way to represent the meaning of texts, images, or other information as a list of numbers. Using a simple comparison function, we can then calculate a similarity score for two embeddings to figure out whether two pieces of information are about similar things. Common use cases for embeddings include semantic search, clustering, and classification.

In the example below, we use the embed-v4.0 model to generate embeddings for three phrases and compare them using a similarity function. The two similar phrases have a high similarity score, while the two unrelated phrases have a low similarity score:

PYTHON
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Get the embeddings
phrases = ["i love soup", "soup is my favorite", "london is far away"]

model = "embed-v4.0"
input_type = "search_query"

res = co.embed(
    texts=phrases,
    model=model,
    input_type=input_type,
    output_dimension=1024,
    embedding_types=["float"],
)

(soup1, soup2, london) = res.embeddings.float


# Compare them with cosine similarity
def calculate_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


print(
    f"For the following sentences:\n1: {phrases[0]}\n2: {phrases[1]}\nThe similarity score is: {calculate_similarity(soup1, soup2):.2f}\n"
)
print(
    f"For the following sentences:\n1: {phrases[0]}\n2: {phrases[2]}\nThe similarity score is: {calculate_similarity(soup1, london):.2f}"
)

The input_type parameter

Cohere embeddings are optimized for different types of inputs.

  • When using embeddings for semantic search, the search query should be embedded with input_type="search_query" (a sketch of the full flow follows this list).
  • When using embeddings for semantic search, the text passages being searched over should be embedded with input_type="search_document".
  • When using embeddings for classification and clustering tasks, set input_type to "classification" or "clustering" to optimize the embeddings appropriately.
  • When input_type="image" is used with embed-v3.0, the expected input is an image instead of text. If you use input_type="image" with embed-v4.0, it will default to search_document. We recommend using search_document when working with embed-v4.0.
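
As a minimal sketch of the full semantic search flow (our own illustration: the document list and query are made up, and the cosine similarity helper is the same one used in the first example), the documents are embedded with input_type="search_document" and the query with input_type="search_query":

PYTHON
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

documents = ["i love soup", "soup is my favorite", "london is far away"]

# Documents being searched over use input_type="search_document"
doc_res = co.embed(
    texts=documents,
    model="embed-v4.0",
    input_type="search_document",
    output_dimension=1024,
    embedding_types=["float"],
)

# The query itself uses input_type="search_query"
query_res = co.embed(
    texts=["what is your favorite food?"],
    model="embed-v4.0",
    input_type="search_query",
    output_dimension=1024,
    embedding_types=["float"],
)


def calculate_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Rank the documents by similarity to the query
query_emb = query_res.embeddings.float[0]
scores = [
    calculate_similarity(query_emb, doc_emb)
    for doc_emb in doc_res.embeddings.float
]
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{score:.2f} {doc}")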

Multilingual Support

embed-v4.0 is a best-in-class multilingual model with support for over 100 languages, including Korean, Japanese, Arabic, Chinese, Spanish, and French.

PYTHON
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

texts = [
    "Hello from Cohere!",
    "مرحبًا من كوهير!",
    "Hallo von Cohere!",
    "Bonjour de Cohere!",
    "¡Hola desde Cohere!",
    "Olá do Cohere!",
    "Ciao da Cohere!",
    "您好,来自 Cohere!",
    "कोहेरे से नमस्ते!",
]

response = co.embed(
    model="embed-v4.0",
    texts=texts,
    input_type="classification",
    output_dimension=1024,
    embedding_types=["float"],
)

embeddings = response.embeddings.float  # All text embeddings
print(embeddings[0][:5])  # Print embeddings for the first text
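
Since all of these embeddings live in the same vector space, a phrase and its translation should land close together. A quick sketch (our own illustration, reusing the embeddings list from above with the same cosine similarity measure as in the first example):

PYTHON
import numpy as np

# Compare the English greeting with its Arabic translation
english = np.array(embeddings[0])
arabic = np.array(embeddings[1])

score = np.dot(english, arabic) / (
    np.linalg.norm(english) * np.linalg.norm(arabic)
)
print(f"English vs. Arabic similarity: {score:.2f}")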

Image Embeddings

The Cohere Embedding platform supports image embeddings for embed-v4.0 and the embed-v3.0 family. There are two ways to access this functionality:

  • Use the images parameter. Here are the steps:
    • Pass image to the input_type parameter
    • Pass your image, base64 encoded as a Data URL, to the images parameter
  • Use the inputs parameter. Here are the steps:
    • Pass in an input list of dicts with the key content
    • content contains a list of dicts with the keys type and image_url

Be aware that image embedding via the images parameter has the following restrictions:

  • If input_type="image", the texts field must be empty.
  • The original image file must be in a png, jpeg, webp, or gif format and can be up to 5 MB in size.
  • The image must be base64 encoded and sent as a Data URL to the images parameter.
  • Our API currently does not support batch image embeddings for embed-v3.0 models. For embed-v4.0, however, you can submit up to 96 images.

When using the inputs parameter, the following restrictions exist (note that these restrictions apply to embed-v4.0):

  • The maximum size of the payload is 20 MB
  • All images larger than 2,458,624 pixels will be downsampled to 2,458,624 pixels
  • All images smaller than 200,704 pixels will be upsampled to 200,704 pixels
  • input_type must be set to one of the following:
    • search_query
    • search_document
    • classification
    • clustering

Here’s a code sample using the inputs parameter:

PYTHON
import cohere
from PIL import Image
from io import BytesIO
import base64

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# The model accepts input in base64 as a Data URL


def image_to_base64_data_url(image_path):
    # Open the image file
    with Image.open(image_path) as img:
        image_format = img.format.lower()
        buffered = BytesIO()
        img.save(buffered, format=img.format)
        # Encode the image data in base64
        img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")

    # Create the Data URL with the inferred image type
    data_url = f"data:image/{image_format};base64,{img_base64}"
    return data_url


base64_url = image_to_base64_data_url("<PATH_TO_IMAGE>")

image_input = {
    "content": [
        {"type": "image_url", "image_url": {"url": base64_url}}
    ]
}

res = co.embed(
    model="embed-v4.0",
    embedding_types=["float"],
    input_type="search_document",
    inputs=[image_input],
    output_dimension=1024,
)

res.embeddings.float

Here’s a code sample using the images parameter:

PYTHON
import cohere
from PIL import Image
from io import BytesIO
import base64

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# The model accepts input in base64 as a Data URL


def image_to_base64_data_url(image_path):
    # Open the image file
    with Image.open(image_path) as img:
        # Create a BytesIO object to hold the image data in memory
        buffered = BytesIO()
        # Save the image as PNG to the BytesIO object
        img.save(buffered, format="PNG")
        # Encode the image data in base64
        img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")

    # Create the Data URL; the image was re-encoded as PNG above,
    # so the image/png type is always correct
    data_url = f"data:image/png;base64,{img_base64}"
    return data_url


processed_image = image_to_base64_data_url("<PATH_TO_IMAGE>")

res = co.embed(
    images=[processed_image],
    model="embed-v4.0",
    embedding_types=["float"],
    input_type="image",
)

res.embeddings.float

Matryoshka Embeddings

Matryoshka representation learning creates embeddings with a coarse-to-fine representation within a single vector; embed-v4.0 supports multiple output dimensions with the following values: [256, 512, 1024, 1536]. To access this, specify the output_dimension parameter when creating the embeddings.

PYTHON
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

texts = ["hello"]

response = co.embed(
    model="embed-v4.0",
    texts=texts,
    output_dimension=1024,
    input_type="classification",
    embedding_types=["float"],
).embeddings

# Print out the embeddings
response.float  # returns a vector that is 1024 dimensions
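
Because the supported sizes all come from the same Matryoshka-trained model, requesting a smaller vector only requires changing output_dimension. A quick sketch (reusing the co client above; the loop is our own illustration) confirming the returned lengths:

PYTHON
# Request each supported output dimension and check the vector length
for dim in [256, 512, 1024, 1536]:
    emb = co.embed(
        model="embed-v4.0",
        texts=["hello"],
        output_dimension=dim,
        input_type="classification",
        embedding_types=["float"],
    ).embeddings.float[0]
    print(f"output_dimension={dim} -> vector of length {len(emb)}")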

Support for Fused and Mixed Modalities

embed-v4.0 supports text and content-rich images such as figures, slide decks, and document screenshots (i.e., screenshots of PDF pages). This eliminates the need for complex text extraction or ETL pipelines. Unlike our previous embed-v3.0 model family, embed-v4.0 is capable of processing both images and texts together; the input can be either an image that contains both text and visual content, or a combination of text and images that you'd like to compress into a single vector representation.

Here’s a code sample illustrating how embed-v4.0 could be used to work with fused images and texts like the following:

[Image: fused image and texts]

PYTHON
import cohere
import base64

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Embed an image and a text together
with open("./content/finn.jpeg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

# Format as a Data URL
data_url = f"data:image/jpeg;base64,{encoded_string}"

example_doc = [
    {"type": "text", "text": "This is a Scottish Fold Cat"},
    {"type": "image_url", "image_url": {"url": data_url}},
]  # This is where we're fusing text and images.

res = co.embed(
    model="embed-v4.0",
    inputs=[{"content": example_doc}],
    input_type="search_document",
    embedding_types=["float"],
    output_dimension=1024,
).embeddings.float_

# This will return a list of length 1 with the text and image in a combined embedding

res

Compression Levels

The Cohere embeddings platform supports compression. The Embed API features an embedding_types parameter which allows the user to specify various ways of compressing the output.

The following embedding types are supported:

  • float
  • int8
  • uint8
  • binary
  • ubinary

We recommend being explicit about the embedding type(s). To specify an embedding type, pass one of the types from the list above as a list containing a string:

PYTHON
res = co.embed(
    texts=["hello_world"],
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["int8"],
)

You can specify multiple embedding types in a single call. For example, the following call will return both int8 and float embeddings:

PYTHON
res = co.embed(
    texts=phrases,
    model="embed-v4.0",
    input_type=input_type,
    embedding_types=["int8", "float"],
)

res.embeddings.int8  # This contains your int8 embeddings
res.embeddings.float  # This contains your float embeddings

A Note on Bits and Bytes

When doing binary compression, there's a subtlety worth pointing out: because Cohere packs bits into bytes under the hood, the actual length of the vector changes. This means that a 1024-dimensional binary embedding becomes 1024/8 = 128 bytes, which can be confusing if you run len(embeddings). The code below shows how to unpack the bytes, which is useful if your vector database does not accept packed bytes for binary vectors:

PYTHON
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

res = co.embed(
    model="embed-v4.0",
    texts=["hello"],
    input_type="search_document",
    embedding_types=["ubinary"],
    output_dimension=1024,
)
print(
    f"Embed v4 Binary at 1024 dimensions results in length {len(res.embeddings.ubinary[0])}"
)

# Unpack the packed bytes back into 1024 individual bits
query_emb_bin = np.asarray(res.embeddings.ubinary[0], dtype="uint8")
query_emb_unpacked = np.unpackbits(query_emb_bin, axis=-1).astype("int")
# Map bits {0, 1} to {-1, 1}
query_emb_unpacked = 2 * query_emb_unpacked - 1
print(
    f"Embed v4 Binary at 1024 unpacked will have dimensions: {len(query_emb_unpacked)}"
)
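
If your vector database accepts packed bytes directly, you can also skip unpacking and compare ubinary embeddings with Hamming distance. The sketch below is our own illustration (not part of the Cohere API): it XORs the packed byte arrays and counts the differing bits:

PYTHON
def hamming_distance(a, b):
    # XOR the packed bytes, then count the bits that differ
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


query_emb_bin = np.asarray(res.embeddings.ubinary[0], dtype="uint8")
# Distance from an embedding to itself is 0; with two different embeddings,
# a lower distance means the vectors are more similar
print(hamming_distance(query_emb_bin, query_emb_bin))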