Semantic Search with Embeddings

This section shows how to use the Embed endpoint to perform semantic search.

Lexical search, the more traditional approach, is good at finding keyword matches but struggles to capture the context or meaning of a piece of text. Semantic search addresses this limitation by matching texts on meaning rather than on shared keywords.
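
To make the limitation of lexical search concrete, here is a minimal sketch that counts shared keywords between a query and two of the documents used later in this section. The stopword list and the helper names (content_words, keyword_overlap) are hypothetical and chosen only for illustration; real lexical search engines use more elaborate scoring.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"how", "to", "with", "my", "we", "and", "be", "you", "will", "an", "a", "the"}


def content_words(text: str) -> set[str]:
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


def keyword_overlap(query: str, document: str) -> int:
    """Count the content words that the query and the document have in common."""
    return len(content_words(query) & content_words(document))


documents = [
    "Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.",
    "Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!",
]
query = "How to connect with my teammates?"

overlaps = [keyword_overlap(query, doc) for doc in documents]
print(overlaps)  # [0, 0]
```

Neither document shares a content word with the query ("team" is not "teammates"), so a purely keyword-based match has nothing to rank on, even though both documents are relevant to the question.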

PYTHON
import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")  # Get your free API key: https://dashboard.cohere.com/api-keys

The Embed endpoint takes in texts as input and returns embeddings as output.

For semantic search, there are two kinds of text we need to turn into embeddings.

  • The list of documents to search from.
  • The query that will be used to search the documents.

Step 1: Embed the documents

We call the Embed endpoint using co.embed() and pass the required arguments:

  • texts: The list of texts
  • model: Here we choose embed-english-v3.0, which generates embeddings of size 1024
  • input_type: We choose search_document to ensure the model treats these as the documents for search
  • embedding_types: We choose float to get a float array as the output
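
When the document list is large, the texts usually have to be sent in several requests. The sketch below splits a list into batches. The cap of 96 texts per call is an assumption based on the API reference at the time of writing, and batch_texts is a hypothetical helper, not part of the Cohere SDK; check the current limit before relying on it.

```python
from typing import Iterator

# Assumption: maximum number of texts per embed request. Verify against the API reference.
MAX_TEXTS_PER_CALL = 96


def batch_texts(texts: list[str], batch_size: int = MAX_TEXTS_PER_CALL) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


all_texts = [f"doc {i}" for i in range(200)]
batches = list(batch_texts(all_texts))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [96, 96, 8]
```

Each batch would then be passed as the texts argument of co.embed(), and the returned embedding lists concatenated in order so that they stay aligned with the original documents.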

Step 2: Embed the query

Next, we define a query and embed it. We choose search_query as the input_type to ensure the model treats this text as the query (rather than as a document) for search.
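
Because documents and queries must be embedded with different input_type values, it can help to wrap the two calls in small helpers so the values cannot be mixed up. The sketch below is one possible arrangement, not the SDK's own API: embed_documents, embed_query, and FakeClient are hypothetical names, and FakeClient is only a stand-in that returns zero vectors so the example runs without an API key. In real use you would pass the cohere.ClientV2 instance instead.

```python
from types import SimpleNamespace

MODEL = "embed-english-v3.0"


def embed_documents(client, texts):
    """Embed the texts to be searched, always with input_type='search_document'."""
    response = client.embed(
        texts=texts,
        model=MODEL,
        input_type="search_document",
        embedding_types=["float"],
    )
    return response.embeddings.float


def embed_query(client, query):
    """Embed a single query with input_type='search_query' and return one vector."""
    response = client.embed(
        texts=[query],
        model=MODEL,
        input_type="search_query",
        embedding_types=["float"],
    )
    # The endpoint returns one embedding per input text; we sent exactly one.
    return response.embeddings.float[0]


class FakeClient:
    """Offline stand-in for the real client (assumption: same keyword arguments)."""

    def __init__(self):
        self.calls = []

    def embed(self, *, texts, model, input_type, embedding_types):
        self.calls.append(input_type)
        vectors = [[0.0, 0.0, 0.0] for _ in texts]  # placeholder vectors, not real embeddings
        return SimpleNamespace(embeddings=SimpleNamespace(float=vectors))


client = FakeClient()
doc_vectors = embed_documents(client, ["first document", "second document"])
query_vector = embed_query(client, "a question")
print(client.calls)  # ['search_document', 'search_query']
```

Note that embed_query returns a single vector, whereas the code in this section keeps the query embedding as a one-element list and indexes the score matrix with [0].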

Step 3: Return the most similar documents

Finally, we calculate similarity scores between the query embedding and each document embedding, sort the documents by score, and display the top N most similar ones. Here, we use the numpy library and compute similarity as a dot product.

PYTHON
### STEP 1: Embed the documents

# Define the documents
documents = [
    "Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.",
    "Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee.",
    "Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!",
    "Working Hours Flexibility: We prioritize work-life balance. While our core hours are 9 AM to 5 PM, we offer flexibility to adjust as needed.",
]

# Embed the documents
doc_emb = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "How to connect with my teammates?"

# Embed the query
query_emb = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"]
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 2
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx+1}")
    print(f"Document: {documents[docs_idx]}\n")

Rank: 1
Document: Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!

Rank: 2
Document: Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.
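
To see what the Step 3 expressions compute, here is a sketch of the same ranking in plain Python: one dot product per document, then the indices of the highest scores. The three-dimensional vectors are made up for illustration (real embed-english-v3.0 embeddings have 1024 dimensions), and dot and top_n_indices are hypothetical helper names.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_n_indices(query_vec, doc_vecs, top_n):
    """Score every document against the query and return the best indices plus all scores."""
    scores = [dot(query_vec, doc_vec) for doc_vec in doc_vecs]
    # Sort document indices by descending score, then keep the first top_n.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_n], scores


# Toy vectors, invented for this example; all have unit length.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]

ranked, scores = top_n_indices(query_vec, doc_vecs, top_n=2)
print(ranked)  # [2, 0]
```

The document at index 2 points in nearly the same direction as the query, so it scores highest. Keep in mind that a dot product equals cosine similarity only when the vectors have unit length, as the toy vectors here do.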

Content quality measure with Embed v3

A standard text embedding model is optimized only for topic similarity between a query and candidate documents. But in many real-world applications, the candidates contain redundant information of varying content quality.

For instance, consider the user query “COVID-19 Symptoms” and the candidate document “COVID-19 has many symptoms”. This document offers little useful information. However, with a typical embedding model, it will rank high in the search results because it is highly similar to the query.

The Embed v3 model is trained to capture both content quality and topic similarity. Through this approach, a search system can extract richer information from documents and is robust against noise.

In the example below, given the query “COVID-19 Symptoms”, the document with the highest quality (“COVID-19 symptoms can include: a high temperature or shivering…”) is ranked first.

Another document (“COVID-19 has many symptoms”) is arguably closer to the query in wording, yet it is ranked lower because it carries little information.

This demonstrates how Embed v3 helps to surface high-quality documents for a given query.

PYTHON
### STEP 1: Embed the documents

documents = [
    "COVID-19 has many symptoms.",
    "COVID-19 symptoms are bad.",
    "COVID-19 symptoms are not nice",
    "COVID-19 symptoms are bad. 5G capabilities include more expansive service coverage, a higher number of available connections, and lower power consumption.",
    "COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.",
    "COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more",
    "Dementia has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
    "COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
]

# Embed the documents
doc_emb = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "COVID-19 Symptoms"

# Embed the query
query_emb = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"]
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 5
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx+1}")
    print(f"Document: {documents[docs_idx]}\n")

Rank: 1
Document: COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more

Rank: 2
Document: COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.

Rank: 3
Document: COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.

Rank: 4
Document: COVID-19 has many symptoms.

Rank: 5
Document: COVID-19 symptoms are not nice

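Multilingual semantic search

The Embed endpoint also supports multilingual semantic search via the embed-multilingual-... models. This means you can perform semantic search on texts in different languages.

Specifically, a single model handles both multilingual search (query and documents in the same non-English language) and cross-lingual search (query and documents in different languages). The example below is cross-lingual: an English query is matched against French documents.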

PYTHON
### STEP 1: Embed the documents

documents = [
    "Remboursement des frais de voyage : Gérez facilement vos frais de voyage en les soumettant via notre outil financier. Les approbations sont rapides et simples.",
    "Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.",
    "Avantages pour la santé et le bien-être : Nous nous soucions de votre bien-être et proposons des adhésions à des salles de sport, des cours de yoga sur site et une assurance santé complète.",
    "Fréquence des évaluations de performance : Nous organisons des bilans informels tous les trimestres et des évaluations formelles deux fois par an.",
]

# Embed the documents
# The documents are in French, so we use a multilingual model rather than the English-only one
doc_emb = co.embed(
    texts=documents,
    model="embed-multilingual-v3.0",
    input_type="search_document",
    embedding_types=["float"]
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "What's your remote-working policy?"

# Embed the query with the same model that was used for the documents
query_emb = co.embed(
    texts=[query],
    model="embed-multilingual-v3.0",
    input_type="search_query",
    embedding_types=["float"]
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 1
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx+1}")
    print(f"Document: {documents[docs_idx]}\n")

Rank: 1
Document: Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.