Semantic Search with Embeddings
This section provides examples of how to use the Embed endpoint to perform semantic search.
Semantic search addresses a shortcoming of the more traditional approach of lexical search, which is great at finding keyword matches but struggles to capture the context or meaning of a piece of text.
The Embed endpoint takes in texts as input and returns embeddings as output.
For semantic search, there are two types of text we need to turn into embeddings.
- The list of documents to search from.
- The query that will be used to search the documents.
Step 1: Embed the documents
We call the Embed endpoint using co.embed() and pass the required arguments:
- texts: The list of texts
- model: Here we choose embed-english-v3.0, which generates embeddings of size 1024
- input_type: We choose search_document to ensure the model treats these as the documents for search
- embedding_types: We choose float to get a float array as the output
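A minimal sketch of this step, assuming the Cohere Python SDK; the API key, variable names, and document texts are placeholders, and the exact attribute for reading the float embeddings can vary between SDK versions:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

# A small set of illustrative documents to search over
documents = [
    "Joining Slack channels: use the invite link shared by your team.",
    "Finding building directions: maps are posted at every entrance.",
    "Requesting time off: submit a request through the HR portal.",
]

doc_response = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"],
)

# Float embeddings for each document (attribute naming, e.g. float vs. float_,
# may differ depending on the SDK version)
doc_embeddings = doc_response.embeddings.float
```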
Step 2: Embed the query
Next, we add and embed a query. We choose search_query as the input_type to ensure the model treats this as the query (instead of documents) for search.
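Continuing the sketch above, the query text is a placeholder; the only change from Step 1 is the input_type:

```python
query = "How do I find my way around the office?"  # placeholder query

query_response = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"],
)

# A single query, so take the first (and only) embedding
query_embedding = query_response.embeddings.float[0]
```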
Step 3: Return the most similar documents
Next, we calculate similarity scores between the query embedding and the document embeddings, sort them, and display the top N most similar documents. Here, we use the numpy library and compute similarity with a dot product.
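A sketch of the ranking step, reusing the doc_embeddings, query_embedding, and documents variables from the previous steps; the helper name and top-N value are illustrative:

```python
import numpy as np

def top_n_documents(query_emb, doc_embs, docs, n=2):
    """Rank documents by dot-product similarity to the query embedding."""
    scores = np.dot(np.array(doc_embs), np.array(query_emb))
    ranked = np.argsort(-scores)  # indices sorted by descending similarity
    return [(docs[i], float(scores[i])) for i in ranked[:n]]

for doc, score in top_n_documents(query_embedding, doc_embeddings, documents):
    print(f"{score:.4f}  {doc}")
```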
Content quality measure with Embed v3
A standard text embedding model is optimized only for topic similarity between a query and candidate documents. But in many real-world applications, you have redundant information with varying content quality.
For instance, consider a user query of “COVID-19 Symptoms” and compare it to the candidate document “COVID-19 has many symptoms”. This document does not offer high-quality, rich information. However, with a typical embedding model, it will rank highly in the search results because it is highly similar to the query.
The Embed v3 model is trained to capture both content quality and topic similarity. Through this approach, a search system can extract richer information from documents and is robust against noise.
In the example below, given the query (“COVID-19 Symptoms”), the document with the highest quality (“COVID-19 symptoms can include: a high temperature or shivering…”) is ranked first.
Another document (“COVID-19 has many symptoms”) is arguably more similar to the query in wording, yet it is ranked lower because it doesn’t contain as much information.
This demonstrates how Embed v3 helps to surface high-quality documents for a given query.
Multilingual semantic search
The Embed endpoint also supports multilingual semantic search via the embed-multilingual-... models, which means you can perform semantic search on texts in different languages. Specifically, you can do both multilingual and cross-lingual searches using a single model.
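As a hedged sketch of a cross-lingual search, reusing the client and numpy import from the earlier steps and using embed-multilingual-v3.0 as one example of the multilingual models (the documents and the French query are placeholders):

```python
# English documents, French query: one multilingual model handles both sides
ml_documents = [
    "The office closes at 6pm on weekdays.",
    "Lunch is served in the cafeteria from noon.",
]

ml_doc_embeddings = co.embed(
    texts=ml_documents,
    model="embed-multilingual-v3.0",
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

ml_query_embedding = co.embed(
    texts=["À quelle heure ferme le bureau ?"],  # "What time does the office close?"
    model="embed-multilingual-v3.0",
    input_type="search_query",
    embedding_types=["float"],
).embeddings.float[0]

scores = np.dot(np.array(ml_doc_embeddings), np.array(ml_query_embedding))
print(ml_documents[int(np.argmax(scores))])  # prints the highest-scoring document
```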