Semantic Search with Embeddings

This section provides examples of how to use the Embed endpoint to perform semantic search.

Semantic search addresses a limitation of the more traditional lexical search, which is good at finding keyword matches but struggles to capture the context or meaning of a piece of text.
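
To make that limitation concrete, here is a minimal sketch of a purely lexical scorer that counts the distinct words shared by a query and a document. The `tokenize` and `keyword_overlap` helpers and the two sample documents are illustrative assumptions, not part of the Cohere SDK.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_overlap(query: str, document: str) -> int:
    """Count the distinct words shared by the query and the document."""
    return len(tokenize(query) & tokenize(document))

query = "How to connect with my teammates?"
documents = [
    "Team-building activities: monthly outings and weekly game nights.",
    "How to connect my laptop to the office printer.",
]

# Print the lexical score next to each document
for doc in documents:
    print(keyword_overlap(query, doc), doc)
```

The printer document shares several words with the query, while the team-building document shares none, so a keyword-only ranking would favor the less relevant text. Comparing embeddings is meant to close this gap.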

PYTHON

import cohere
import numpy as np

# The examples below pass `inputs` and `output_dimension`, which belong to the V2 client
co = cohere.ClientV2(
    api_key="YOUR_API_KEY"
)  # Get your free API key: https://dashboard.cohere.com/api-keys

The Embed endpoint takes in texts as input and returns embeddings as output.

For semantic search, there are two kinds of text we need to turn into embeddings:

The list of documents to search from.
The query that will be used to search the documents.

Step 1: Embed the documents

We call the Embed endpoint using co.embed() and pass the following arguments:

inputs: The list of inputs, with each document's text wrapped in a content object
model: Here we choose embed-v4.0
output_dimension: We choose 1024 as the length of each embedding vector
input_type: We choose search_document to ensure the model treats these as the documents for search
embedding_types: We choose float to get a float array as the output

Step 2: Embed the query

Next, we add and embed a query. We choose search_query as the input_type to ensure the model treats this as the query (instead of documents) for search.

Step 3: Return the most similar documents

Next, we calculate similarity scores between the query embedding and the document embeddings, sort them, and display the top N most similar documents. Here, we use the numpy library to compute similarity as a dot product.
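
As a toy illustration of this scoring step, the sketch below ranks three made-up 3-dimensional document vectors against a query vector. The vectors and the `top_n_indices` helper are assumptions for illustration only; real embeddings come from the Embed endpoint and have many more dimensions.

```python
import numpy as np

def top_n_indices(query_emb, doc_emb, top_n):
    """Return the indices of the top_n documents by dot-product score, plus all scores."""
    query_vec = np.asarray(query_emb, dtype=float)  # shape: (dim,)
    doc_matrix = np.asarray(doc_emb, dtype=float)   # shape: (num_docs, dim)
    scores = doc_matrix @ query_vec                 # one score per document
    order = np.argsort(-scores)[:top_n]             # highest score first
    return order.tolist(), scores

# Made-up vectors, not real embeddings
doc_emb = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_emb = [0.8, 0.6, 0.0]

order, scores = top_n_indices(query_emb, doc_emb, top_n=2)
print(order)
print(scores)
```

These toy vectors all have unit length, so the dot product here equals cosine similarity. For vectors that are not normalized, dividing each score by the product of the two vector norms gives cosine similarity instead.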

PYTHON

### STEP 1: Embed the documents

# List of documents
documents = [
    "Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.",
    "Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee.",
    "Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!",
    "Working Hours Flexibility: We prioritize work-life balance. While our core hours are 9 AM to 5 PM, we offer flexibility to adjust as needed.",
]

# Constructing the embed_input object
embed_input = [
    {"content": [{"type": "text", "text": doc}]} for doc in documents
]

# Embed the documents
doc_emb = co.embed(
    inputs=embed_input,
    model="embed-v4.0",
    output_dimension=1024,
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "How to connect with my teammates?"

query_input = [{"content": [{"type": "text", "text": query}]}]

# Embed the query
query_emb = co.embed(
    inputs=query_input,
    model="embed-v4.0",
    input_type="search_query",
    output_dimension=1024,
    embedding_types=["float"],
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 2
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx+1}")
    print(f"Document: {documents[docs_idx]}\n")

Here’s an example output:

Rank: 1
Document: Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!
Rank: 2
Document: Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.

Content quality measure with Embed v4

A standard text embedding model is optimized only for topic similarity between a query and candidate documents. But in many real-world applications, you have redundant information with varying content quality.

For instance, consider a user query of “COVID-19 Symptoms” and compare it to a candidate document, “COVID-19 has many symptoms”. This document does not offer high-quality, rich information. However, with a typical embedding model, it will appear high in the search results because it is highly similar to the query.

The Embed v4 model is trained to capture both content quality and topic similarity. Through this approach, a search system can extract richer information from documents and is robust against noise.

In the example below, given a query (“COVID-19 Symptoms”), the document with the highest quality (“COVID-19 symptoms can include: a high temperature or shivering…”) is ranked first.

Another document (“COVID-19 has many symptoms”) is arguably more similar to the query based on what information it contains, yet it is ranked lower as it doesn’t contain that much information.

This demonstrates how Embed v4 helps to surface high-quality documents for a given query.

PYTHON

### STEP 1: Embed the documents

documents = [
    "COVID-19 has many symptoms.",
    "COVID-19 symptoms are bad.",
    "COVID-19 symptoms are not nice",
    "COVID-19 symptoms are bad. 5G capabilities include more expansive service coverage, a higher number of available connections, and lower power consumption.",
    "COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.",
    "COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more",
    "Dementia has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
    "COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
]

# Constructing the embed_input object
embed_input = [
    {"content": [{"type": "text", "text": doc}]} for doc in documents
]

# Embed the documents
doc_emb = co.embed(
    inputs=embed_input,
    model="embed-v4.0",
    output_dimension=1024,
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "COVID-19 Symptoms"

query_input = [{"content": [{"type": "text", "text": query}]}]

# Embed the query
query_emb = co.embed(
    inputs=query_input,
    model="embed-v4.0",
    input_type="search_query",
    output_dimension=1024,
    embedding_types=["float"],
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 5
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx+1}")
    print(f"Document: {documents[docs_idx]}\n")

Here’s a sample output:

Rank: 1
Document: COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more
Rank: 2
Document: COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.
Rank: 3
Document: COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.
Rank: 4
Document: COVID-19 has many symptoms.
Rank: 5
Document: COVID-19 symptoms are not nice

Multilingual semantic search

The Embed endpoint also supports multilingual semantic search via embed-v4.0 and previous embed-multilingual-... models. This means you can perform semantic search on texts in different languages.

Specifically, you can do both multilingual and cross-lingual searches using one single model.

PYTHON

### STEP 1: Embed the documents

documents = [
    "Remboursement des frais de voyage : Gérez facilement vos frais de voyage en les soumettant via notre outil financier. Les approbations sont rapides et simples.",
    "Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.",
    "Avantages pour la santé et le bien-être : Nous nous soucions de votre bien-être et proposons des adhésions à des salles de sport, des cours de yoga sur site et une assurance santé complète.",
    "Fréquence des évaluations de performance : Nous organisons des bilans informels tous les trimestres et des évaluations formelles deux fois par an.",
]

# Constructing the embed_input object
embed_input = [
    {"content": [{"type": "text", "text": doc}]} for doc in documents
]

# Embed the documents
doc_emb = co.embed(
    inputs=embed_input,
    model="embed-v4.0",
    output_dimension=1024,
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "What's your remote-working policy?"

query_input = [{"content": [{"type": "text", "text": query}]}]

# Embed the query
query_emb = co.embed(
    inputs=query_input,
    model="embed-v4.0",
    input_type="search_query",
    output_dimension=1024,
    embedding_types=["float"],
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 4
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx+1}")
    print(f"Document: {documents[docs_idx]}\n")

Here’s a sample output:

Rank: 1
Document: Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.
Rank: 2
Document: Avantages pour la santé et le bien-être : Nous nous soucions de votre bien-être et proposons des adhésions à des salles de sport, des cours de yoga sur site et une assurance santé complète.
Rank: 3
Document: Fréquence des évaluations de performance : Nous organisons des bilans informels tous les trimestres et des évaluations formelles deux fois par an.
Rank: 4
Document: Remboursement des frais de voyage : Gérez facilement vos frais de voyage en les soumettant via notre outil financier. Les approbations sont rapides et simples.
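
The three examples so far repeat the same embed, score, and sort steps, so they can be folded into one helper. The sketch below is one possible way to do that; `semantic_search` is a hypothetical function name, and it assumes the client exposes `embed()` with the same arguments and response shape used in the examples above.

```python
import numpy as np

def semantic_search(client, documents, query, top_n=3,
                    model="embed-v4.0", output_dimension=1024):
    """Embed documents and query via client.embed, then rank documents by dot product."""

    def embed(texts, input_type):
        # Wrap each text in the content structure used in the examples above
        inputs = [{"content": [{"type": "text", "text": t}]} for t in texts]
        response = client.embed(
            inputs=inputs,
            model=model,
            input_type=input_type,
            output_dimension=output_dimension,
            embedding_types=["float"],
        )
        return np.array(response.embeddings.float)

    doc_emb = embed(documents, "search_document")  # shape: (num_docs, dim)
    query_emb = embed([query], "search_query")[0]  # shape: (dim,)
    scores = doc_emb @ query_emb                   # one score per document
    order = np.argsort(-scores)[:top_n]            # highest score first
    return [(documents[i], float(scores[i])) for i in order]
```

With the client created earlier, a call such as `semantic_search(co, documents, "What's your remote-working policy?", top_n=4)` should reproduce the search above and also return each document's score. Because the function only depends on the `embed()` method, it can be exercised with a stub client before any API calls are made.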

Multimodal PDF search

Handling PDF files, which often contain a mix of text, images, and layout information, presents a challenge for traditional embedding methods. This usually requires a multimodal generative model to pre-process the documents into a format that is suitable for the embedding model. These intermediate text representations can lose critical information; for example, the structure and precise content of tables or complex layouts might not be accurately rendered.

Embed v4 solves this problem as it is designed to natively understand mixed-modality inputs. Embed v4 can directly process the PDF content, including text and images, in a single step. It generates a unified embedding that captures the semantic meaning derived from both the textual and visual elements.

Here’s an example of how to use the Embed endpoint to perform multimodal PDF search.

First, import the required libraries.

PYTHON

from pdf2image import convert_from_path
from io import BytesIO
import base64
import chromadb
import cohere

Next, turn a PDF file into a list of images, with one image per page. Then format these images into the content structure expected by the Embed endpoint.

PYTHON

pdf_path = "PDF_FILE_PATH"  # https://github.com/cohere-ai/cohere-developer-experience/raw/main/notebooks/guide/embed-v4-pdf-search/data/Samsung_Home_Theatre_HW-N950_ZA_FullManual_02_ENG_180809_2.pdf
pages = convert_from_path(pdf_path, dpi=200)

input_array = []
for page in pages:
    buffer = BytesIO()
    page.save(buffer, format="PNG")
    base64_str = base64.b64encode(buffer.getvalue()).decode("utf-8")
    base64_image = f"data:image/png;base64,{base64_str}"
    page_entry = {
        "content": [
            {"type": "text", "text": f"{pdf_path}"},
            {"type": "image_url", "image_url": {"url": base64_image}},
        ]
    }
    input_array.append(page_entry)
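
The formatting step can also be checked on its own, without rendering a PDF. The sketch below isolates it in a hypothetical `build_page_entry` helper that takes raw image bytes; the placeholder bytes and the `manual.pdf` label are assumptions used only to show the resulting structure.

```python
import base64

def build_page_entry(image_bytes, label, mime_type="image/png"):
    """Wrap raw image bytes and a text label in one Embed input entry."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "content": [
            {"type": "text", "text": label},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]
    }

# Placeholder bytes stand in for a real PNG page render
entry = build_page_entry(b"not-a-real-png", "manual.pdf")
url = entry["content"][1]["image_url"]["url"]
print(url.split(",", 1)[0])  # the data URL header that precedes the base64 payload
```

Decoding the part of the URL after the comma with `base64.b64decode` returns the original bytes, which is a quick way to confirm the encoding is lossless.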

Next, generate the embeddings for these pages and store them in a vector database (in this example, we use Chroma).

PYTHON

# Generate the document embeddings
embeddings = []
for i in range(0, len(input_array)):
    res = co.embed(
        model="embed-v4.0",
        input_type="search_document",
        embedding_types=["float"],
        inputs=[input_array[i]],
    ).embeddings.float[0]
    embeddings.append(res)

# Store the embeddings in a vector database
ids = []
for i in range(0, len(input_array)):
    ids.append(str(i))

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("pdf_pages")
collection.add(
    embeddings=embeddings,
    ids=ids,
)

Finally, provide a query and run a search over the documents. This will return a list of sorted IDs representing the most similar pages to the query.

PYTHON

query = "Do the speakers come with an optical cable?"

# Generate the query embedding
query_embeddings = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
    texts=[query],
).embeddings.float[0]

# Search the vector database
results = collection.query(
    query_embeddings=[query_embeddings],
    n_results=5,  # Define the top_k value
)

# Print the id of the top-ranked page
print(results["ids"][0][0])

22
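
Because the IDs were stored as stringified, zero-based positions in the list of page images, each returned ID can be converted back into a page index. The sketch below shows one way to do this; `ranked_page_indices` is a hypothetical helper, and the sample result uses made-up IDs and distances shaped like the value returned by `collection.query()`.

```python
def ranked_page_indices(results):
    """Convert a Chroma query result into (page_index, distance) pairs for the first query."""
    ids = results["ids"][0]
    distances = results.get("distances")
    dists = distances[0] if distances else [None] * len(ids)
    return [(int(page_id), dist) for page_id, dist in zip(ids, dists)]

# Hypothetical result with made-up values, for illustration only
sample_results = {
    "ids": [["22", "5", "40"]],
    "distances": [[0.41, 0.55, 0.61]],
}

for page_index, distance in ranked_page_indices(sample_results):
    print(f"page index {page_index} (page {page_index + 1} of the PDF), distance {distance}")
```

With the real `results` object, `pages[page_index]` then gives the matching image from the earlier `convert_from_path` call. Chroma reports distances by default, so smaller values indicate closer matches.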

The top-ranked page is shown below:

For a more complete example of multimodal PDF search, see the cookbook version.