Introduction to Embeddings at Cohere

Embeddings are a way to represent the meaning of text as a list of numbers. Using a simple comparison function, we can then calculate a similarity score for two embeddings to figure out whether two texts are talking about similar things. Common use cases for embeddings include semantic search, clustering, and classification.

In the example below, we use the embed-english-v3.0 model to generate embeddings for three phrases and compare them using a similarity function. The embeddings for the two similar phrases have a high similarity score, while the embeddings for the two unrelated phrases have a low similarity score:

PYTHON
import cohere
import numpy as np

co = cohere.Client(api_key="YOUR_API_KEY")

# get the embeddings
phrases = ["i love soup", "soup is my favorite", "london is far away"]

model = "embed-english-v3.0"
input_type = "search_query"

res = co.embed(
    texts=phrases,
    model=model,
    input_type=input_type,
    embedding_types=["float"],
)

(soup1, soup2, london) = res.embeddings.float


# compare them
def calculate_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


calculate_similarity(soup1, soup2)  # 0.85 - very similar!
calculate_similarity(soup1, london)  # 0.16 - not similar!

The input_type parameter

Cohere embeddings are optimized for different types of inputs, which you specify with the input_type parameter.

  • When using embeddings for semantic search, the search query should be embedded with input_type="search_query".
  • When using embeddings for semantic search, the text passages being searched over should be embedded with input_type="search_document" (see the sketch after this list).
  • When using embeddings for classification and clustering tasks, set input_type to either "classification" or "clustering" to optimize the embeddings appropriately.
  • When input_type="image", the expected input to be embedded is an image instead of text.
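
As a rough illustration of the search_query / search_document pairing, the sketch below embeds a few made-up documents with input_type="search_document", embeds a query with input_type="search_query", and ranks the documents by cosine similarity. The documents, query, and cosine helper are illustrative placeholders, not part of the API:

PYTHON
import cohere
import numpy as np

co = cohere.Client(api_key="YOUR_API_KEY")
model = "embed-english-v3.0"

# A few made-up documents to search over
documents = [
    "Soup is best eaten while it is still hot.",
    "London is the capital of the United Kingdom.",
    "Tomato soup is easy to make at home.",
]

# Embed the documents with input_type="search_document"
doc_embeddings = co.embed(
    texts=documents,
    model=model,
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

# Embed the query with input_type="search_query"
query_embedding = co.embed(
    texts=["how do I make soup?"],
    model=model,
    input_type="search_query",
    embedding_types=["float"],
).embeddings.float[0]


# Rank the documents by cosine similarity to the query
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
print(documents[int(np.argmax(scores))])  # most likely one of the soup documents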

Multilingual Support

In addition to embed-english-v3.0, we offer a best-in-class multilingual model, embed-multilingual-v3.0, with support for over 100 languages, including Chinese, Spanish, and French. This model can be used with the Embed API, just like its English counterpart:

PYTHON
import cohere

co = cohere.Client(api_key="<YOUR API KEY>")

texts = [
    "Hello from Cohere!",
    "مرحبًا من كوهير!",
    "Hallo von Cohere!",
    "Bonjour de Cohere!",
    "¡Hola desde Cohere!",
    "Olá do Cohere!",
    "Ciao da Cohere!",
    "您好,来自 Cohere!",
    "कोहेरे से नमस्ते!",
]

response = co.embed(
    model="embed-multilingual-v3.0",
    texts=texts,
    input_type="classification",
    embedding_types=["float"],
)

embeddings = response.embeddings.float  # All text embeddings
print(embeddings[0][:5])  # Print embeddings for the first text
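
Because the multilingual model places text from different languages in a shared embedding space, translations of the same sentence should land close together. Below is a minimal follow-up sketch, reusing the embeddings list from the snippet above and the cosine similarity helper from the first example (the exact score is illustrative, not guaranteed):

PYTHON
import numpy as np


def calculate_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# embeddings[0] is "Hello from Cohere!" and embeddings[3] is its French translation
calculate_similarity(embeddings[0], embeddings[3])  # expect a relatively high score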

Image Embeddings

The Cohere embedding platform supports image embeddings for the entire embed-v3.0 family. This functionality can be used with the following steps:

  • Pass image to the input_type parameter (as discussed above).
  • Pass your image, encoded as a base64 Data URL, to the new images parameter.

Be aware that image embedding has the following restrictions:

  • If input_type='image', the texts field must be empty.
  • The original image file type must be png or jpeg.
  • The image must be base64 encoded and sent as a Data URL to the images parameter.
  • Our API currently does not support batch image embeddings.

PYTHON
import cohere
from PIL import Image
from io import BytesIO
import base64

co = cohere.Client(api_key="<YOUR API KEY>")

# The model accepts input in base64 as a Data URL


def image_to_base64_data_url(image_path):
    # Open the image file
    with Image.open(image_path) as img:
        # Create a BytesIO object to hold the image data in memory
        buffered = BytesIO()
        # Save the image as PNG to the BytesIO object
        img.save(buffered, format="PNG")
        # Encode the image data in base64
        img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")

    # Create the Data URL; this assumes the original image file type was png
    data_url = f"data:image/png;base64,{img_base64}"
    return data_url


processed_image = image_to_base64_data_url("<PATH_TO_IMAGE>")

ret = co.embed(
    images=[processed_image],
    model="embed-english-v3.0",
    embedding_types=["float"],
    input_type="image",
)

ret.embeddings.float
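
The returned image embedding can be scored against other embeddings in the usual way. As a rough sketch (assuming you want to retrieve images with a text query, and reusing the co client and ret response from the snippet above), you could embed the query text and compare it to the image with cosine similarity; the query text below is a placeholder:

PYTHON
import numpy as np

image_embedding = ret.embeddings.float[0]

# Embed a text query against which to score the image (query text is illustrative)
query_embedding = co.embed(
    texts=["a photo of a bowl of soup"],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"],
).embeddings.float[0]

# Cosine similarity between the text query and the image
score = np.dot(query_embedding, image_embedding) / (
    np.linalg.norm(query_embedding) * np.linalg.norm(image_embedding)
)
print(score)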

Compression Levels

The Cohere embeddings platform supports compression. The Embed API features an embedding_types parameter which allows the user to specify various ways of compressing the output.

The following embedding types are supported:

  • float
  • int8
  • uint8
  • binary
  • ubinary

The parameter defaults to float, so if you pass no argument you’ll get back float embeddings:

PYTHON
ret = co.embed(texts=phrases, model=model, input_type=input_type)

ret.embeddings  # This contains the float embeddings

However, we recommend being explicit about the embedding type(s). To specify an embedding type, pass one of the types from the list above as a list containing a string:

PYTHON
ret = co.embed(
    texts=phrases,
    model=model,
    input_type=input_type,
    embedding_types=["int8"],
)

ret.embeddings.int8  # This contains your int8 embeddings
ret.embeddings.float  # This will be empty
ret.embeddings.uint8  # This will be empty
ret.embeddings.ubinary  # This will be empty
ret.embeddings.binary  # This will be empty

Finally, you can also pass several embedding types as a list, in which case the response will contain each of the requested types:

PYTHON
ret = co.embed(
    texts=phrases,
    model=model,
    input_type=input_type,
    embedding_types=["int8", "float"],
)

ret.embeddings.int8  # This contains your int8 embeddings
ret.embeddings.float  # This contains your float embeddings
ret.embeddings.uint8  # This will be empty
ret.embeddings.ubinary  # This will be empty
ret.embeddings.binary  # This will be empty
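
Compressed embeddings can still be scored with the same dot-product style arithmetic; with int8 it is prudent to cast to a wider integer type before accumulating so the sum doesn't overflow. Below is a minimal sketch using numpy, not an official utility, and the raw score is on a different scale from float cosine similarity:

PYTHON
import numpy as np

# ret comes from the embedding_types=["int8", "float"] call above
int8_embeddings = np.asarray(ret.embeddings.int8, dtype=np.int8)

# Cast to int32 before the dot product so the accumulation doesn't overflow
a = int8_embeddings[0].astype(np.int32)
b = int8_embeddings[1].astype(np.int32)

score = int(np.dot(a, b))  # higher means more similar; scale differs from float cosine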