Elasticsearch and Cohere
Elasticsearch has all the tools developers need to build next-generation search experiences with generative AI, and it supports native integration with Cohere through the Elasticsearch inference API.
Use Elastic if you’d like to:
- Build with a vector database
- Deploy multiple ML models
- Perform text, vector, and hybrid search
- Search with filters, facets, and aggregations
- Apply document- and field-level security
- Run on-prem, in the cloud, or serverless (preview)
This guide uses a dataset of Wikipedia articles to set up a pipeline for semantic search. It will cover:
- Creating an Elastic inference processor using Cohere embeddings
- Creating an Elasticsearch index with embeddings
- Performing hybrid search on the Elasticsearch index and reranking results
- Performing basic RAG
To see the full code sample, refer to this notebook. You can also find an integration guide here.
Prerequisites
This tutorial assumes you have the following:
- An Elastic Cloud account, available with a free trial
- A Cohere production API Key. Get your API Key at this link if you don’t have one
- Python 3.7 or higher
Note: While this tutorial integrates Cohere with an Elastic Cloud serverless project, you can also integrate with your self-managed Elasticsearch deployment or Elastic Cloud deployment by switching from the serverless client to the general language client.
Create an Elastic Serverless deployment
If you don’t have an Elastic Cloud deployment, sign up here for a free trial and request access to Elastic Serverless.
Install the required packages
Install and import the required Python packages:
- elasticsearch_serverless
- cohere: ensure you are on version 5.2.5 or later

To install the packages, use the following code:
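A minimal sketch of the install step, assuming the PyPI distribution is named elasticsearch-serverless (the import name uses an underscore):

```bash
pip install elasticsearch-serverless "cohere>=5.2.5"
```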
After the installation has finished, find your endpoint URL and create your API key in the Serverless dashboard.
Import the required packages
Next, we import the modules we need. 🔐 NOTE: getpass enables us to securely prompt the user for credentials without echoing them to the terminal or hard-coding them in the script.
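A minimal import block for the rest of this guide; it assumes the serverless client exposes the same Elasticsearch class and helpers module as the standard Python client (helpers is used later for bulk indexing):

```python
from getpass import getpass

import cohere
from elasticsearch_serverless import Elasticsearch, helpers
```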
Create an Elasticsearch client
Now we can instantiate the Python Elasticsearch client.
First we prompt the user for their endpoint URL and encoded API key. Then we create the client as an instance of the Elasticsearch class.
When creating your Elastic Serverless API key, make sure to turn on Control security privileges and edit the cluster privileges to specify "cluster": ["all"].
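Here’s a sketch of the client setup; the prompt strings are illustrative:

```python
# Prompt for the Serverless endpoint URL and encoded API key
# without echoing them to the terminal
ELASTIC_ENDPOINT = getpass("Elastic endpoint URL: ")
ELASTIC_API_KEY = getpass("Elastic encoded API key: ")

client = Elasticsearch(ELASTIC_ENDPOINT, api_key=ELASTIC_API_KEY)

# Confirm the client is connected
print(client.info())
```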
Build a Hybrid Search Index with Cohere and Elasticsearch
Create an inference endpoint
One of the biggest pain points of building a vector search index is computing embeddings for a large corpus of data. Fortunately, Elastic offers inference endpoints that can be used in ingest pipelines to automatically compute embeddings when bulk indexing operations are performed.
To set up an inference pipeline for ingestion, we first must create an inference endpoint that uses Cohere embeddings. You’ll need a Cohere API key for this, which you can find in your Cohere account under the API keys section.
We will create an inference endpoint that uses embed-english-v3.0 and int8 or byte compression to save on storage.
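Here’s a sketch of that call; the endpoint ID cohere_embeddings is our choice, and the exact Python helper may vary by client version (the underlying REST call is PUT _inference/text_embedding/cohere_embeddings):

```python
COHERE_API_KEY = getpass("Cohere API key: ")

client.inference.put(
    task_type="text_embedding",
    inference_id="cohere_embeddings",
    body={
        "service": "cohere",
        "service_settings": {
            "api_key": COHERE_API_KEY,
            "model_id": "embed-english-v3.0",
            # byte (int8) compression to save on storage
            "embedding_type": "byte",
        },
    },
)
```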
Create the Index
The mapping of the destination index (the index that will contain the embeddings the model generates from your input text) must be created. The destination index must have a field with the semantic_text field type to index the output of the Cohere model.
Let’s create an index named cohere-wiki-embeddings with the mappings we need:
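Here’s a sketch of the index creation; the text and text_semantic field names, and the other dataset fields, are illustrative assumptions based on the Wikipedia dataset:

```python
client.indices.create(
    index="cohere-wiki-embeddings",
    mappings={
        "properties": {
            # semantic_text field whose embeddings are generated by the
            # cohere_embeddings inference endpoint
            "text_semantic": {
                "type": "semantic_text",
                "inference_id": "cohere_embeddings",
            },
            # raw article text, copied into the semantic_text field
            "text": {"type": "text", "copy_to": "text_semantic"},
            "wiki_id": {"type": "integer"},
            "url": {"type": "text"},
            "views": {"type": "float"},
            "title": {"type": "text"},
            "paragraph_id": {"type": "integer"},
            "id": {"type": "integer"},
        }
    },
)
```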
Let’s note a few important parameters from that API call:
- semantic_text: a field type that automatically generates embeddings for text content using an inference endpoint
- inference_id: specifies the ID of the inference endpoint to be used. In this example, the inference endpoint ID is set to cohere_embeddings
- copy_to: specifies the output field that contains the inference results (here, text is copied into the text_semantic field)
Insert Documents
Let’s insert our example wiki dataset. You need a production Cohere API key to complete this step; otherwise, the document ingest will time out due to API request rate limits.
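Here’s a sketch of the bulk ingest, assuming wiki_data is a list of dicts parsed from the Wikipedia dataset (one object per paragraph, with a text field matching the mapping above):

```python
def generate_actions(docs, index_name="cohere-wiki-embeddings"):
    # Wrap each document in a bulk action targeting our index
    for doc in docs:
        yield {"_index": index_name, "_source": doc}

# Embeddings are computed at index time by the cohere_embeddings
# inference endpoint via the semantic_text mapping
helpers.bulk(client, generate_actions(wiki_data))
print("Done indexing documents into the cohere-wiki-embeddings index!")
```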
Semantic Search
After the dataset has been enriched with the embeddings, you can query the data using the semantic query provided by Elasticsearch. semantic_text in Elasticsearch simplifies semantic search significantly. Learn more about how semantic text in Elasticsearch allows you to focus on your model and results instead of on the technical details.
Here’s a sketch of what that might look like; the query string is just an example, and the field names follow the mapping above:
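```python
query = "When were the semi-finals of the youth world cup played?"  # example query

response = client.search(
    index="cohere-wiki-embeddings",
    size=5,
    # The semantic query embeds the query text with the same inference
    # endpoint and matches it against the semantic_text field
    query={"semantic": {"field": "text_semantic", "query": query}},
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```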
Hybrid Search
After the dataset has been enriched with the embeddings, you can query the data using hybrid search.
Pass a semantic query, and provide the query text and the model you have used to create the embeddings.
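As a sketch, one way to combine the two is a bool query that blends a BM25 match with the semantic query (field names follow the mapping above):

```python
query = "When were the semi-finals of the youth world cup played?"  # example query

response = client.search(
    index="cohere-wiki-embeddings",
    size=10,
    query={
        "bool": {
            # Lexical BM25 match on the raw text
            "must": {"match": {"text": query}},
            # Semantic match rewards hits that are close in embedding space
            "should": {"semantic": {"field": "text_semantic", "query": query}},
        }
    },
)
```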
Ranking
In order to effectively combine the results from our vector and BM25 retrieval, we can use Cohere’s Rerank 3 model through the inference API to provide a final, more precise, semantic reranking of our results.
First, create an inference endpoint with your Cohere API key. Make sure to specify a name for your endpoint, and the model_id of one of the rerank models. In this example we will use Rerank 3.
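Here’s a sketch; cohere_rerank is our name for the endpoint, and rerank-english-v3.0 is assumed as the Rerank 3 model ID:

```python
client.inference.put(
    task_type="rerank",
    inference_id="cohere_rerank",
    body={
        "service": "cohere",
        "service_settings": {
            "api_key": COHERE_API_KEY,
            "model_id": "rerank-english-v3.0",
        },
        # Return only the ten most relevant documents
        "task_settings": {"top_n": 10},
    },
)
```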
You can now rerank your results using that inference endpoint. Here we will pass in the query we used for retrieval, along with the documents we just retrieved using hybrid search.
The inference service will respond with a list of documents in descending order of relevance. Each document has a corresponding index (reflecting the order the documents were in when they were sent to the inference endpoint), and if the return_documents task setting is True, the document texts will be included as well.
In this case we will set return_documents to False and reconstruct the input documents based on the indices returned in the response.
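A sketch of the rerank call, reusing the hybrid-search response from above (the response shape here is an assumption and may vary by version):

```python
# The document texts, in the order hybrid search returned them
raw_documents = [hit["_source"]["text"] for hit in response["hits"]["hits"]]

rerank_response = client.inference.inference(
    inference_id="cohere_rerank",
    body={
        "query": query,
        "input": raw_documents,
        # Skip returning the texts; we reconstruct them below
        "task_settings": {"return_documents": False},
    },
)

# Rebuild the documents in relevance order from the returned indices
ranked_documents = [
    {"text": raw_documents[result["index"]]}
    for result in rerank_response["rerank"]
]
```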
Retrieval augmented generation
Now that we have ranked our results, we can easily turn this into a RAG system with Cohere’s Chat API. Pass in the retrieved documents along with the query, and see the grounded response using Cohere’s newest generative model, Command R+.
First, we will create the Cohere client.
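A one-liner, assuming the COHERE_API_KEY collected earlier:

```python
co = cohere.Client(COHERE_API_KEY)
```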
Next, we can easily get a grounded generation with citations from the Cohere Chat API. We simply pass in the user query and documents retrieved from Elastic to the API, and print out our grounded response.
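Here’s a sketch; the user query is an example, and the documents are the reranked results from the previous step:

```python
query = "What is the biggest pain point of building a vector search index?"  # example

chat_response = co.chat(
    message=query,
    documents=ranked_documents,  # grounds the response in the retrieved passages
    model="command-r-plus",
)

print(chat_response.text)
```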
And there you have it! A quick and easy implementation of hybrid search and RAG with Cohere and Elastic.