Semantic Search with Embeddings

This section provides examples of how to use the Embed endpoint to perform semantic search.

Semantic search addresses a limitation of the more traditional approach of lexical search, which is great at finding keyword matches but struggles to capture the context or meaning of a piece of text.

PYTHON
import cohere
import numpy as np

co = cohere.ClientV2(
    api_key="YOUR_API_KEY"
)  # Get your free API key: https://dashboard.cohere.com/api-keys

The Embed endpoint takes in texts as input and returns embeddings as output.

For semantic search, there are two types of text we need to turn into embeddings:

  • The list of documents to search over.
  • The query that will be used to search the documents.

Step 1: Embed the documents

We call the Embed endpoint using co.embed() and pass the required arguments:

  • inputs: The list of documents, each wrapped in the content structure the endpoint expects
  • model: Here we choose embed-v4.0
  • input_type: We choose search_document to ensure the model treats these as the documents to be searched
  • embedding_types: We choose float to get a float array as the output

Step 2: Embed the query

Next, we add and embed a query. We choose search_query as the input_type to ensure the model treats this as the query (instead of documents) for search.

Step 3: Return the most similar documents

Next, we calculate similarity scores between the query embedding and each document embedding, sort them, and display the top N most similar documents. Here, we use the numpy library to compute similarity as a dot product.

PYTHON
### STEP 1: Embed the documents

# Define the documents
documents = [
    "Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.",
    "Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee.",
    "Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!",
    "Working Hours Flexibility: We prioritize work-life balance. While our core hours are 9 AM to 5 PM, we offer flexibility to adjust as needed.",
]

# Construct the embed_input object
embed_input = [
    {"content": [{"type": "text", "text": doc}]} for doc in documents
]

# Embed the documents
doc_emb = co.embed(
    inputs=embed_input,
    model="embed-v4.0",
    output_dimension=1024,
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "How to connect with my teammates?"

query_input = [{"content": [{"type": "text", "text": query}]}]

# Embed the query
query_emb = co.embed(
    inputs=query_input,
    model="embed-v4.0",
    input_type="search_query",
    output_dimension=1024,
    embedding_types=["float"],
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 2
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx + 1}")
    print(f"Document: {documents[docs_idx]}\n")

Here’s an example output:

Rank: 1
Document: Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!
Rank: 2
Document: Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.
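The scoring-and-sorting logic in step 3 is independent of the Cohere API, so it can be isolated and tested on its own. Here is a minimal sketch with a helper function of our own naming (not part of the SDK), run on toy 2-dimensional embeddings:

```python
import numpy as np

def top_n_search(query_emb, doc_emb, top_n=2):
    """Rank documents by dot-product similarity to a single query embedding.

    query_emb: list containing one query embedding
    doc_emb: list of document embeddings
    Returns the indices of the top_n most similar documents, best first.
    """
    scores = np.dot(query_emb, np.transpose(doc_emb))[0]
    return np.argsort(-scores)[:top_n].tolist()

# Toy example: the query points in the same direction as document 1
query = [[0.0, 1.0]]
docs = [[1.0, 0.0], [0.1, 0.9], [0.5, 0.5]]
print(top_n_search(query, docs))  # → [1, 2]
```

With real embeddings, `query_emb` and `doc_emb` would be the outputs of the two `co.embed()` calls above.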

Content quality measure with Embed v4

A standard text embedding model is optimized only for topic similarity between a query and candidate documents. But in many real-world applications, you have redundant information with varying content quality.

For instance, consider a user query of “COVID-19 Symptoms” and compare it to the candidate document “COVID-19 has many symptoms”. This document does not offer rich, high-quality information. However, a typical embedding model will rank it high in search results because it is highly similar to the query.

The Embed v4 model is trained to capture both content quality and topic similarity. Through this approach, a search system can extract richer information from documents and is robust against noise.

In the example below, given the query (“COVID-19 Symptoms”), the document with the highest quality (“COVID-19 symptoms can include: a high temperature or shivering…”) is ranked first.

Another document (“COVID-19 has many symptoms”) is arguably just as similar to the query on the surface, yet it is ranked lower because it doesn’t contain much information.

This demonstrates how Embed v4 helps to surface high-quality documents for a given query.

PYTHON
### STEP 1: Embed the documents

documents = [
    "COVID-19 has many symptoms.",
    "COVID-19 symptoms are bad.",
    "COVID-19 symptoms are not nice",
    "COVID-19 symptoms are bad. 5G capabilities include more expansive service coverage, a higher number of available connections, and lower power consumption.",
    "COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.",
    "COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more",
    "Dementia has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
    "COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
]

# Construct the embed_input object
embed_input = [
    {"content": [{"type": "text", "text": doc}]} for doc in documents
]

# Embed the documents
doc_emb = co.embed(
    inputs=embed_input,
    model="embed-v4.0",
    output_dimension=1024,
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "COVID-19 Symptoms"

query_input = [{"content": [{"type": "text", "text": query}]}]

# Embed the query
query_emb = co.embed(
    inputs=query_input,
    model="embed-v4.0",
    input_type="search_query",
    output_dimension=1024,
    embedding_types=["float"],
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 5
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx + 1}")
    print(f"Document: {documents[docs_idx]}\n")

Here’s a sample output:

Rank: 1
Document: COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more
Rank: 2
Document: COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.
Rank: 3
Document: COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.
Rank: 4
Document: COVID-19 has many symptoms.
Rank: 5
Document: COVID-19 symptoms are not nice
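The examples so far score documents with a raw dot product. When embeddings are unit-normalized, the dot product equals cosine similarity; if you are unsure whether your vectors are normalized, normalizing explicitly gives a scale-invariant score. A minimal sketch (the helper name is ours, not part of the SDK):

```python
import numpy as np

def cosine_scores(query_emb, doc_emb):
    """Cosine similarity between one query embedding and each document embedding."""
    q = np.asarray(query_emb, dtype=float).ravel()
    d = np.asarray(doc_emb, dtype=float)
    q = q / np.linalg.norm(q)                       # normalize the query
    d = d / np.linalg.norm(d, axis=1, keepdims=True)  # normalize each document row
    return d @ q

# Toy example: document 0 points in the same direction as the query
scores = cosine_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(scores)  # document 0 scores 1.0; the orthogonal document 1 scores 0.0
```

The same `argsort` ranking shown above can then be applied to these scores unchanged.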

Multilingual semantic search

The Embed endpoint also supports multilingual semantic search via embed-v4.0 and the previous embed-multilingual-... models. This means you can perform semantic search on texts in different languages.

Specifically, you can do both multilingual and cross-lingual searches using one single model.

PYTHON
### STEP 1: Embed the documents

documents = [
    "Remboursement des frais de voyage : Gérez facilement vos frais de voyage en les soumettant via notre outil financier. Les approbations sont rapides et simples.",
    "Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.",
    "Avantages pour la santé et le bien-être : Nous nous soucions de votre bien-être et proposons des adhésions à des salles de sport, des cours de yoga sur site et une assurance santé complète.",
    "Fréquence des évaluations de performance : Nous organisons des bilans informels tous les trimestres et des évaluations formelles deux fois par an.",
]

# Construct the embed_input object
embed_input = [
    {"content": [{"type": "text", "text": doc}]} for doc in documents
]

# Embed the documents
doc_emb = co.embed(
    inputs=embed_input,
    model="embed-v4.0",
    output_dimension=1024,
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "What's your remote-working policy?"

query_input = [{"content": [{"type": "text", "text": query}]}]

# Embed the query
query_emb = co.embed(
    inputs=query_input,
    model="embed-v4.0",
    input_type="search_query",
    output_dimension=1024,
    embedding_types=["float"],
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 4
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
    print(f"Rank: {idx + 1}")
    print(f"Document: {documents[docs_idx]}\n")

Here’s a sample output:

Rank: 1
Document: Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.
Rank: 2
Document: Avantages pour la santé et le bien-être : Nous nous soucions de votre bien-être et proposons des adhésions à des salles de sport, des cours de yoga sur site et une assurance santé complète.
Rank: 3
Document: Fréquence des évaluations de performance : Nous organisons des bilans informels tous les trimestres et des évaluations formelles deux fois par an.
Rank: 4
Document: Remboursement des frais de voyage : Gérez facilement vos frais de voyage en les soumettant via notre outil financier. Les approbations sont rapides et simples.

Multimodal PDF search

Handling PDF files, which often contain a mix of text, images, and layout information, is a challenge for traditional embedding methods, which usually require a multimodal generative model to pre-process the documents into a format suitable for the embedding model. These intermediate text representations can lose critical information; for example, the structure and precise content of tables or complex layouts might not be accurately rendered.

Embed v4 solves this problem because it is designed to natively understand mixed-modality inputs. It can directly process PDF content, including text and images, in a single step, generating a unified embedding that captures the semantic meaning of both the textual and visual elements.

Here’s an example of how to use the Embed endpoint to perform multimodal PDF search.

First, import the required libraries.

PYTHON
from pdf2image import convert_from_path
from io import BytesIO
import base64
import chromadb
import cohere

Next, turn a PDF file into a list of images, with one image per page. Then format these images into the content structure expected by the Embed endpoint.

PYTHON
pdf_path = "PDF_FILE_PATH"  # https://github.com/cohere-ai/cohere-developer-experience/raw/main/notebooks/guide/embed-v4-pdf-search/data/Samsung_Home_Theatre_HW-N950_ZA_FullManual_02_ENG_180809_2.pdf
pages = convert_from_path(pdf_path, dpi=200)

input_array = []
for page in pages:
    buffer = BytesIO()
    page.save(buffer, format="PNG")
    base64_str = base64.b64encode(buffer.getvalue()).decode("utf-8")
    base64_image = f"data:image/png;base64,{base64_str}"
    page_entry = {
        "content": [
            {"type": "text", "text": f"{pdf_path}"},
            {"type": "image_url", "image_url": {"url": base64_image}},
        ]
    }
    input_array.append(page_entry)

Next, generate the embeddings for these pages and store them in a vector database (in this example, we use Chroma).

PYTHON
# Generate the document embeddings
embeddings = []
for i in range(0, len(input_array)):
    res = co.embed(
        model="embed-v4.0",
        input_type="search_document",
        embedding_types=["float"],
        inputs=[input_array[i]],
    ).embeddings.float[0]
    embeddings.append(res)

# Store the embeddings in a vector database
ids = []
for i in range(0, len(input_array)):
    ids.append(str(i))

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("pdf_pages")
collection.add(
    embeddings=embeddings,
    ids=ids,
)

Finally, provide a query and run a search over the documents. This will return a list of sorted IDs representing the most similar pages to the query.

PYTHON
query = "Do the speakers come with an optical cable?"

# Generate the query embedding
query_embeddings = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
    texts=[query],
).embeddings.float[0]

# Search the vector database
results = collection.query(
    query_embeddings=[query_embeddings],
    n_results=5,  # Define the top_k value
)

# Print the id of the top-ranked page
print(results["ids"][0][0])

Here's a sample output:

122
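Chroma returns IDs as strings, and since the loop above assigned IDs by page index, the top result maps straight back into the pages list. A small sketch, using a mocked result structure and a helper name of our own:

```python
def top_page_index(results):
    """Extract the best-matching page index from a Chroma-style query result.

    results["ids"] is a list (one entry per query) of ranked ID lists.
    """
    return int(results["ids"][0][0])

# Mocked Chroma-style result, for illustration only:
mock_results = {"ids": [["12", "3", "7"]]}
print(top_page_index(mock_results))  # → 12
```

With the real `results` object, `pages[top_page_index(results)]` gives the winning PIL image, which you can save or display.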

The top-ranked page is shown below:

For a more complete example of multimodal PDF search, see the cookbook version.
