# Introduction to Embeddings at Cohere
![Turning text into embeddings.](https://files.readme.io/e8ea586-Embeddings_Visual_1_1.png)
Embeddings are a way to represent the meaning of text as a list of numbers. Using a simple comparison function, we can then calculate a similarity score for two embeddings to figure out whether two texts are talking about similar things. Common use-cases for embeddings include semantic search, clustering, and classification.
In the example below we use the `embed-english-v3.0` model to generate embeddings for three phrases and compare them using a similarity function. The two phrases about soup get a high similarity score, while the embeddings for two unrelated phrases get a low similarity score:
```python
import cohere
import numpy as np

co = cohere.Client(api_key="YOUR_API_KEY")

# get the embeddings
phrases = ["i love soup", "soup is my favorite", "london is far away"]
model = "embed-english-v3.0"
input_type = "search_query"

res = co.embed(texts=phrases,
               model=model,
               input_type=input_type,
               embedding_types=['float'])
(soup1, soup2, london) = res.embeddings.float

# compare them
def calculate_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

calculate_similarity(soup1, soup2)   # 0.85 - very similar!
calculate_similarity(soup1, london)  # 0.16 - not similar!
```
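Because the score is a cosine, it is bounded between -1 and 1: vectors pointing in the same direction score 1, and orthogonal vectors score 0. A quick self-contained check on toy vectors (no API call needed) makes this concrete:

```python
import numpy as np

def calculate_similarity(a, b):
    # cosine similarity: dot product normalized by both vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# [2, 4] is just [1, 2] scaled, so the directions are identical
print(calculate_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0

# orthogonal vectors share no direction at all
print(calculate_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that cosine similarity ignores vector length, which is why scaling a vector does not change its score against anything else.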
## The `input_type` parameter

Cohere embeddings are optimized for different types of inputs. For example, when using embeddings for semantic search, the search query should be embedded with `input_type="search_query"`, whereas the text passages being searched over should be embedded with `input_type="search_document"`. You can find more details and a code snippet in the Semantic Search guide. Similarly, the input type can be set to `classification` (example) or `clustering` to optimize the embeddings for those use cases.
## Multilingual Support
In addition to `embed-english-v3.0`, we offer a best-in-class multilingual model, `embed-multilingual-v3.0`, with support for over 100 languages, including Chinese, Spanish, and French. This model can be used with the Embed API, just like its English counterpart:
```python
import cohere

co = cohere.Client(api_key="<YOUR API KEY>")

texts = [
    'Hello from Cohere!', 'مرحبًا من كوهير!', 'Hallo von Cohere!',
    'Bonjour de Cohere!', '¡Hola desde Cohere!', 'Olá do Cohere!',
    'Ciao da Cohere!', '您好,来自 Cohere!', 'कोहेरे से नमस्ते!'
]

response = co.embed(
    model='embed-multilingual-v3.0',
    texts=texts,
    input_type='classification',
    embedding_types=['float'])

embeddings = response.embeddings.float  # All text embeddings
print(embeddings[0][:5])  # Print embeddings for the first text
```
## Compression Levels
The Cohere embeddings platform now supports compression. The Embed API features an `embedding_types` parameter which allows the user to specify various ways of compressing the output. The following embedding types are supported:
- `float`
- `int8`
- `uint8`
- `binary`
- `ubinary`
The parameter defaults to `float`, so if you pass no argument you'll get back `float` embeddings:
```python
ret = co.embed(texts=phrases,
               model=model,
               input_type=input_type)

ret.embeddings  # This contains the float embeddings
```
However, we recommend being explicit about the embedding type(s). To specify an embedding type, pass one of the types from the list above as a list containing a single string:
```python
ret = co.embed(texts=phrases,
               model=model,
               input_type=input_type,
               embedding_types=['int8'])

ret.embeddings.int8     # This contains your int8 embeddings
ret.embeddings.float    # This will be empty
ret.embeddings.uint8    # This will be empty
ret.embeddings.ubinary  # This will be empty
ret.embeddings.binary   # This will be empty
```
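To get a feel for why the compressed types matter, here is a back-of-the-envelope sketch of storage footprints. This is an illustration only, not Cohere's actual quantization scheme, and the 1024-dimension figure is an assumption for the v3 models (adjust it for your model):

```python
import numpy as np

dim = 1024      # assumed embedding dimensionality (illustrative)
n_docs = 10_000  # hypothetical corpus size

# float32 stores 4 bytes per dimension
float_bytes = n_docs * dim * np.dtype(np.float32).itemsize
# int8 stores 1 byte per dimension
int8_bytes = n_docs * dim * np.dtype(np.int8).itemsize
# ubinary stores 1 bit per dimension, packed 8 to a byte
ubinary_bytes = n_docs * (dim // 8)

print(float_bytes // int8_bytes)     # 4  -> int8 is 4x smaller
print(float_bytes // ubinary_bytes)  # 32 -> ubinary is 32x smaller
```

At large corpus sizes these ratios translate directly into index-memory and cost savings, which is the main motivation for requesting a compressed type.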
Finally, you can also request several embedding types at once by passing them in as a list, in which case the response will contain all of the requested types:
```python
ret = co.embed(texts=phrases,
               model=model,
               input_type=input_type,
               embedding_types=['int8', 'float'])

ret.embeddings.int8     # This contains your int8 embeddings
ret.embeddings.float    # This contains your float embeddings
ret.embeddings.uint8    # This will be empty
ret.embeddings.ubinary  # This will be empty
ret.embeddings.binary   # This will be empty
```
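For the packed binary types, a common way to compare two embeddings (a conventional choice for bit vectors, not something mandated by the Embed API) is Hamming distance: XOR the packed bytes and count the differing bits. A minimal sketch on toy data:

```python
import numpy as np

def hamming_distance(a, b):
    # XOR the packed uint8 bytes, then count the set bits
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# two toy 16-bit "ubinary" embeddings, packed 8 bits per uint8 byte
emb_a = np.array([0b10110100, 0b01100011], dtype=np.uint8)
emb_b = np.array([0b10110000, 0b01100001], dtype=np.uint8)

print(hamming_distance(emb_a, emb_b))  # 2 (one differing bit per byte)
```

Lower distances mean more similar embeddings, so nearest-neighbor search over binary embeddings ranks by ascending Hamming distance rather than descending cosine similarity.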