
In this tutorial, we’ll explore semantic search using Cohere’s Embed model on Azure AI Foundry.

Semantic search enables search systems to capture the meaning and context of search queries, going beyond simple keyword matching to find relevant results based on semantic similarity.

With the Embed model, you can do this across languages. This is particularly powerful for multilingual applications where the same meaning can be expressed in different languages.

In this tutorial, we’ll cover:

  • Setting up the Cohere client
  • Embedding text data
  • Building a search index
  • Performing semantic search queries

We’ll use Cohere’s Embed model deployed on Azure to demonstrate these capabilities and help you understand how to effectively implement semantic search in your applications.

Setup

First, you will need to deploy the Embed model on Azure via Azure AI Foundry. The deployment will create a serverless API with pay-as-you-go, token-based billing. You can find more information on how to deploy models in the Azure documentation.

In the example below, we are deploying the Embed Multilingual v3 model.

Once the model is deployed, you can access it via Cohere’s Python SDK. Let’s now install the Cohere SDK and set up our client.

To create a client, you need to provide the API key and the model’s base URL for the Azure endpoint. You can get this information from the Azure AI Foundry platform where you deployed the model.

PYTHON
# ! pip install cohere hnswlib

import pandas as pd
import hnswlib
import re
import cohere

co = cohere.Client(
    api_key="AZURE_API_KEY_EMBED",
    base_url="AZURE_ENDPOINT_EMBED",  # example: "https://cohere-embed-v3-multilingual-xyz.eastus.models.ai.azure.com/"
)
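Hardcoding credentials works for a quick test, but you may prefer to read them from the environment. The sketch below assumes the key and endpoint are stored in environment variables named `AZURE_API_KEY_EMBED` and `AZURE_ENDPOINT_EMBED`. These names simply reuse the placeholders above and are an assumption, as is the `load_azure_settings` helper, which is not part of the Cohere SDK.

```python
import os


def load_azure_settings(env=os.environ):
    """Read the Azure key and endpoint from a mapping (hypothetical helper)."""
    # Variable names are assumptions; use whatever names you exported.
    names = ("AZURE_API_KEY_EMBED", "AZURE_ENDPOINT_EMBED")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"api_key": env[names[0]], "base_url": env[names[1]]}


# Usage, assuming both variables are set:
# co = cohere.Client(**load_azure_settings())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the endpoint.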

Download dataset

For this example, we’ll be using MultiFIN - an open-source dataset of financial article headlines in 15 different languages (including English, Turkish, Danish, Spanish, Polish, Greek, Finnish, Hebrew, Japanese, Hungarian, Norwegian, Russian, Italian, Icelandic, and Swedish).

We’ve prepared a CSV version of the MultiFIN dataset that includes an additional column containing English translations. While we won’t use these translations for the model itself, they’ll help us understand the results when we encounter headlines in Danish or Spanish. We’ll load this CSV file into a pandas dataframe.

PYTHON
url = "https://raw.githubusercontent.com/cohere-ai/cohere-aws/main/notebooks/bedrock/multiFIN_train.csv"
df = pd.read_csv(url)

# Inspect dataset
df.head(5)

Pre-Process Dataset

For this example, we’ll work with a subset focusing on English, Spanish, and Danish content.

We’ll perform several pre-processing steps: removing any duplicate entries, filtering to keep only our three target languages, and selecting the 80 longest articles as our working dataset.

PYTHON
# Ensure there is no duplicated text in the headers
def remove_duplicates(text):
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

df['text'] = df['text'].apply(remove_duplicates)

# Keep only selected languages (copy to avoid pandas' SettingWithCopyWarning below)
languages = ['English', 'Spanish', 'Danish']
df = df.loc[df['lang'].isin(languages)].copy()

# Pick the top 80 longest articles
df['text_length'] = df['text'].str.len()
df.sort_values(by=['text_length'], ascending=False, inplace=True)
top_80_df = df[:80]

# Language distribution
top_80_df['lang'].value_counts()

lang
Spanish 33
English 29
Danish 18
Name: count, dtype: int64
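To see what the `remove_duplicates` helper above does, here is a small sketch that applies the same regular expression to two made-up headlines. The sample strings are assumptions for illustration and are not rows from MultiFIN.

```python
import re


def remove_duplicates(text):
    # Same pattern as above: keep the first copy of a repeated phrase
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)


# Made-up headlines for illustration only
repeated = "Annual report 2020 - Annual report 2020"
unique = "Quarterly earnings beat expectations"

print(remove_duplicates(repeated))
print(remove_duplicates(unique))
```

The first call collapses the repeated phrase into a single copy, while the second headline has no repeated phrase and is returned unchanged.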

Embed and index documents

Let’s embed our documents and store the embeddings. These embeddings are high-dimensional vectors (1,024 dimensions) that capture the semantic meaning of each document. We’ll use Cohere’s embed-multilingual-v3.0 model that we have defined in the client setup.

The v3.0 embedding models require us to specify an input_type parameter that indicates what we’re embedding. For semantic search, we use search_document for the documents we want to search through, and search_query for the search queries we’ll make later.

We’ll also keep track of each document’s language and translation to provide richer search results.

Finally, we’ll build a search index with the hnswlib vector library to store these embeddings efficiently, enabling faster document searches.

PYTHON
# Embed documents
docs = top_80_df['text'].to_list()
docs_lang = top_80_df['lang'].to_list()
translated_docs = top_80_df['translation'].to_list()  # for reference when returning non-English results
doc_embs = co.embed(
    texts=docs,
    input_type='search_document'
).embeddings

# Create a search index
index = hnswlib.Index(space='ip', dim=1024)
index.init_index(max_elements=len(doc_embs), ef_construction=512, M=64)
index.add_items(doc_embs, list(range(len(doc_embs))))
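hnswlib builds an approximate nearest-neighbor index. With only 80 documents, an exact brute-force search over inner products is also feasible and can serve as a sanity check on the index. The sketch below uses made-up 3-dimensional vectors rather than real 1,024-dimensional embeddings, and `top_k_by_inner_product` is a hypothetical helper, not part of hnswlib or the Cohere SDK.

```python
def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(query_emb, doc_embs, k=3):
    """Return indices of the k documents with the largest inner product."""
    scores = [inner_product(query_emb, emb) for emb in doc_embs]
    ranked = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy vectors for illustration only
toy_doc_embs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_emb = [0.8, 0.6, 0.0]

print(top_k_by_inner_product(toy_query_emb, toy_doc_embs, k=3))
```

In hnswlib, the `'ip'` space reports distance as 1 minus the inner product, so the documents with the largest inner products are the ones with the smallest distances.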

Send Query and Retrieve Documents

Next, we build a function that takes a query as input, embeds it, and finds the three documents that are the most similar to the query.

PYTHON
# Retrieval of the 3 closest docs to the query
def retrieval(query):
    # Embed query and retrieve results
    query_emb = co.embed(
        texts=[query],
        input_type="search_query"
    ).embeddings
    doc_ids = index.knn_query(query_emb, k=3)[0][0]  # we will retrieve 3 closest neighbors

    # Print and append results
    print(f"QUERY: {query.upper()} \n")
    retrieved_docs, translated_retrieved_docs = [], []

    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(docs[doc_id])
        translated_retrieved_docs.append(translated_docs[doc_id])

        # Print results
        print(f"ORIGINAL ({docs_lang[doc_id]}): {docs[doc_id]}")
        if docs_lang[doc_id] != "English":
            print(f"TRANSLATION: {translated_docs[doc_id]} \n----")
        else:
            print("----")
    print("END OF RESULTS \n\n")
    return retrieved_docs, translated_retrieved_docs

Let’s now try to query the index with a couple of examples, one each in English and Danish.

PYTHON
queries = [
    "Can data science help meet sustainability goals?",  # English example
    "Hvor kan jeg finde den seneste danske boligplan?"  # Danish example - "Where can I find the latest Danish property plan?"
]

for query in queries:
    retrieval(query)

QUERY: CAN DATA SCIENCE HELP MEET SUSTAINABILITY GOALS?

ORIGINAL (English): Using AI to better manage the environment could reduce greenhouse gas emissions, boost global GDP by up to 38m jobs by 2030
----
ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals improves, but has a long way to go to meet and drive targets.
----
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but businesses remain on starting blocks for integration and progress
----
END OF RESULTS


QUERY: HVOR KAN JEG FINDE DEN SENESTE DANSKE BOLIGPLAN?

ORIGINAL (Danish): Nyt fra CFOdirect: Ny PP&E-guide, FAQs om den nye leasingstandard, podcast om udfordringerne ved implementering af leasingstandarden og meget mere
TRANSLATION: New from CFOdirect: New PP&E guide, FAQs on the new leasing standard, podcast on the challenges of implementing the leasing standard and much more
----
ORIGINAL (Danish): Lovforslag fremlagt om rentefri lån, udskudt frist for lønsumsafgift, førtidig udbetaling af skattekredit og loft på indestående på skattekontoen
TRANSLATION: Bills presented on interest -free loans, deferred deadline for payroll tax, early payment of tax credit and ceiling on the balance in the tax account
----
ORIGINAL (Danish): Nyt fra CFOdirect: Shareholder-spørgsmål til ledelsen, SEC cybersikkerhedsguide, den amerikanske skattereform og meget mere
TRANSLATION: New from CFOdirect: Shareholder questions for management, the SEC cybersecurity guide, US tax reform and more
----
END OF RESULTS

With the first example, notice how the retrieval system surfaced documents that are similar in meaning rather than in wording. For instance, it returned a document about AI in response to a query about data science. Keyword-based search, which relies on shared terms, would not capture this connection.
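To make the contrast with keyword matching concrete, the sketch below counts the word tokens the English query shares with the first two retrieved headlines. `shared_terms` is a hypothetical helper using a deliberately naive tokenizer, so treat it as an illustration rather than a real keyword search engine.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def shared_terms(query, document):
    """Return the word tokens that appear in both texts."""
    return tokenize(query) & tokenize(document)


query = "Can data science help meet sustainability goals?"
first_result = (
    "Using AI to better manage the environment could reduce greenhouse gas "
    "emissions, boost global GDP by up to 38m jobs by 2030"
)
second_result = (
    "Quality of business reporting on the Sustainable Development Goals "
    "improves, but has a long way to go to meet and drive targets."
)

print(shared_terms(query, first_result))   # no shared terms
print(shared_terms(query, second_result))  # shares "goals" and "meet"
```

The top result has no word in common with the query, so a purely term-based match would have nothing to score it on, yet the embedding-based search ranked it first.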

The second example demonstrates the multilingual nature of the model: you can use the same model across different languages. The model can also perform cross-lingual search, as the first retrieved document shows. That Danish headline contains “PP&E guide”, where PP&E is an English term that stands for “property, plant, and equipment”.

Summary

In this tutorial, we learned about:

  • How to set up the Cohere client to use the Embed model deployed on Azure AI Foundry
  • How to embed text data
  • How to build a search index
  • How to perform multilingual semantic search

In the next tutorial, we’ll explore how to use the Rerank model for reranking search results.
