Introduction to Embeddings at Cohere
Embeddings are a way to represent the meaning of text as a list of numbers. Using a simple comparison function, we can then calculate a similarity score for two embeddings to figure out whether two texts are talking about similar things. Common use-cases for embeddings include semantic search, clustering, and classification.
In the example below we use the embed-english-v3.0 model to generate embeddings for 3 phrases and compare them using a similarity function. The two similar phrases have a high similarity score, and the embeddings for two unrelated phrases have a low similarity score:
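The comparison step on its own can be sketched in plain Python. This is a minimal sketch that assumes cosine similarity as the comparison function. The three short vectors are made-up stand-ins for real embeddings, not output from embed-english-v3.0, and the names cosine_similarity, phrase_a, phrase_b, and phrase_c are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors (assumption: invented for illustration only;
# real embeddings have many more dimensions).
phrase_a = [0.9, 0.1, 0.0, 0.2]  # stands in for the first similar phrase
phrase_b = [0.8, 0.2, 0.1, 0.3]  # stands in for the second similar phrase
phrase_c = [0.0, 0.9, 0.8, 0.1]  # stands in for the unrelated phrase

similar_score = cosine_similarity(phrase_a, phrase_b)
unrelated_score = cosine_similarity(phrase_a, phrase_c)

print(f"similar phrases:   {similar_score:.3f}")
print(f"unrelated phrases: {unrelated_score:.3f}")
```

With real embeddings, the same function would be applied to the vectors returned by the Embed API.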
The input_type parameter
Cohere embeddings are optimized for different types of inputs.
- When using embeddings for semantic search, the search query should be embedded by setting input_type="search_query".
- When using embeddings for semantic search, the text passages that are being searched over should be embedded with input_type="search_document".
- When using embeddings for classification and clustering tasks, you can set input_type to either "classification" or "clustering" to optimize the embeddings appropriately.
- When input_type='image', the expected input to be embedded is an image instead of text.
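For semantic search, the two search-related input types are used together: one call for the query and one for the passages. The sketch below assumes that co is a Cohere client object (for example one created with the Python SDK) whose embed method accepts the keyword arguments texts, model, and input_type. The helper name embed_for_search is hypothetical.

```python
def embed_for_search(co, query, documents, model="embed-english-v3.0"):
    """Embed one search query and the passages it will be compared against.

    Assumption: `co` is a client object exposing
    `embed(texts=..., model=..., input_type=...)`.
    """
    # The query gets the query-specific input type...
    query_response = co.embed(
        texts=[query],
        model=model,
        input_type="search_query",
    )
    # ...and the passages being searched over get the document-specific one.
    document_response = co.embed(
        texts=list(documents),
        model=model,
        input_type="search_document",
    )
    return query_response, document_response
```

The helper returns both raw responses unchanged, so how the vectors are read out of them is left to the caller.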
Multilingual Support
In addition to embed-english-v3.0, we offer a best-in-class multilingual model, embed-multilingual-v3.0, with support for over 100 languages, including Chinese, Spanish, and French. This model can be used with the Embed API, just like its English counterpart:
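Switching to the multilingual model only changes the model name. As a minimal sketch, the hypothetical helper below builds the keyword arguments for such a request. The sample texts and the default input_type are illustrative choices, and passing the result to a client as co.embed(**kwargs) is an assumption about how the client is called.

```python
def multilingual_embed_kwargs(texts, input_type="classification"):
    """Build keyword arguments for an embed request to the multilingual model."""
    return {
        "texts": list(texts),
        "model": "embed-multilingual-v3.0",  # the only change from the English setup
        "input_type": input_type,
    }


# Texts in several languages can go into the same request.
kwargs = multilingual_embed_kwargs(["Hello", "Hola", "Bonjour", "你好"])
# A client call would then look like: co.embed(**kwargs)
```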
Image Embeddings
The Cohere embedding platform supports image embeddings for the entire embed-v3.0 family. This functionality can be used with the following steps:
- Pass image to the input_type parameter (as discussed above).
- Pass your image URL to the new images parameter.
Be aware that image embedding has the following restrictions:
- If input_type='image', the texts field must be empty.
- The original image file type must be png or jpeg.
- The image must be base64 encoded and sent as a Data URL to the images parameter.
- Our API currently does not support batch image embeddings.
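The restrictions above can be turned into a small preparation step. The sketch below uses only the Python standard library to base64 encode raw image bytes into a Data URL, then builds request arguments that carry a single image and no texts. The helper names image_to_data_url and image_embed_kwargs are hypothetical, and the default model name is an illustrative choice from the embed-v3.0 family.

```python
import base64

# Only png and jpeg originals are accepted, per the restrictions above.
ALLOWED_MIME_TYPES = {"image/png", "image/jpeg"}


def image_to_data_url(image_bytes, mime_type):
    """Base64 encode raw image bytes and wrap them in a Data URL."""
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"unsupported image type: {mime_type}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_embed_kwargs(image_bytes, mime_type, model="embed-english-v3.0"):
    """Build keyword arguments for an image embed request.

    The result deliberately has no "texts" key and exactly one image,
    since batch image embeddings are not supported.
    """
    return {
        "images": [image_to_data_url(image_bytes, mime_type)],
        "model": model,
        "input_type": "image",
    }


# In practice the bytes would come from a file, e.g. Path("photo.png").read_bytes().
example_kwargs = image_embed_kwargs(b"\x89PNG\r\n\x1a\n", "image/png")
```

A client call would then pass these arguments on, for example as co.embed(**example_kwargs), assuming the client accepts the images parameter described above.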
Compression Levels
The Cohere embeddings platform supports compression. The Embed API features an embedding_types parameter which allows the user to specify various ways of compressing the output.
The following embedding types are supported:
- float
- int8
- uint8
- binary
- ubinary
The parameter defaults to float, so if you pass in no argument you’ll get back float embeddings:
However, we recommend being explicit about the embedding type(s). To specify an embedding type, pass one of the types from the list above in as a list containing a string:
Finally, you can also pass several embedding types in as a list, in which case the endpoint will return a dictionary with each requested type available:
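Both the single-type and the multi-type case use the same call shape: the types always go in as a list of strings. The sketch below assumes that co is a Cohere client object whose embed method accepts an embedding_types keyword argument alongside texts, model, and input_type. The helper name embed_with_types and its defaults are hypothetical.

```python
SUPPORTED_EMBEDDING_TYPES = ("float", "int8", "uint8", "binary", "ubinary")


def embed_with_types(
    co,
    texts,
    embedding_types=("float",),
    model="embed-english-v3.0",
    input_type="search_document",
):
    """Request embeddings in one or more explicitly named formats.

    Assumption: `co` is a client object exposing
    `embed(texts=..., model=..., input_type=..., embedding_types=...)`.
    """
    if isinstance(embedding_types, str):
        # Guard against passing "float" instead of ["float"].
        raise TypeError("embedding_types must be a list of strings, not a string")
    unknown = [t for t in embedding_types if t not in SUPPORTED_EMBEDDING_TYPES]
    if unknown:
        raise ValueError(f"unsupported embedding types: {unknown}")
    return co.embed(
        texts=list(texts),
        model=model,
        input_type=input_type,
        embedding_types=list(embedding_types),
    )


# One type:      embed_with_types(co, ["hello"], embedding_types=["int8"])
# Several types: embed_with_types(co, ["hello"], embedding_types=["float", "int8"])
```

Validating the names locally is a design choice of this sketch: a misspelled type fails before any request is sent.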