End-to-end RAG using Elasticsearch and Cohere
Learn how to use the Inference API for semantic search and use Cohere’s APIs for RAG.
For this example, you will need:
- An Elastic Serverless account through Elastic Cloud, available with a free trial
- A Cohere account with a production API key
- Python 3.7 or higher
Note: While this tutorial integrates Cohere with an Elastic Cloud Serverless project, you can also integrate with a self-managed Elasticsearch deployment or an Elastic Cloud hosted deployment by using the standard Elasticsearch Python client instead of the Serverless client.
If you don’t have an Elastic Cloud deployment, sign up here for a free trial and request access to Elastic Serverless.
To get started, we’ll need to connect to our Elastic Serverless deployment using the Python client.
First we need to `pip install` the following packages: `elasticsearch_serverless` and `cohere`.
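In a notebook cell, that looks like:

```python
# Install the Serverless Elasticsearch client and the Cohere SDK
!pip install -qU elasticsearch_serverless cohere
```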
After installing, find your endpoint URL and create your API key in the Serverless dashboard.
Next, we need to import the modules we need. 🔐 NOTE: `getpass` enables us to securely prompt for credentials without echoing them to the terminal or hard-coding them in the notebook.
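A minimal import cell, assuming the packages installed above:

```python
from getpass import getpass  # hidden credential prompts

import cohere
from elasticsearch_serverless import Elasticsearch, helpers
```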
Now we can instantiate the Python Elasticsearch client. First we prompt the user for their endpoint and encoded API key. Then we create a `client` object that instantiates an instance of the `Elasticsearch` class.
When creating your Elastic Serverless API key, make sure to turn on Control security privileges and edit the cluster privileges to specify `"cluster": ["all"]`.
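Here is a sketch of that setup; the variable names are illustrative:

```python
ELASTICSEARCH_ENDPOINT = getpass("Elastic Serverless endpoint: ")
ELASTIC_API_KEY = getpass("Elastic encoded API key: ")

# Client for our Serverless project
client = Elasticsearch(
    ELASTICSEARCH_ENDPOINT,
    api_key=ELASTIC_API_KEY,
)
```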
Confirm that the client has connected with this test:
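For example:

```python
# Should print project metadata if the connection and credentials are valid
print(client.info())
```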
Create the inference endpoint
Let’s create the inference endpoint by using the Create inference API.
You’ll need a Cohere API key for this, which you can find in your Cohere account under the API keys section. A production key is required to complete the steps in this notebook, as Cohere’s free trial API usage is limited.
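A sketch of the endpoint creation, assuming Cohere’s `embed-english-v3.0` embedding model; note that older Python clients expose this call as `client.inference.put_model` with a `body` parameter instead:

```python
COHERE_API_KEY = getpass("Cohere production API key: ")

client.inference.put(
    task_type="text_embedding",
    inference_id="cohere_embeddings",
    inference_config={
        "service": "cohere",
        "service_settings": {
            "api_key": COHERE_API_KEY,
            "model_id": "embed-english-v3.0",  # assumed embedding model
        },
    },
)
```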
Create an ingest pipeline with an inference processor
Create an ingest pipeline with an inference processor by using the `put_pipeline` method. Reference the inference endpoint created above as the `model_id` to infer against the data that is being ingested in the pipeline.
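For example (the pipeline name and field names are assumptions carried through the rest of this walkthrough):

```python
client.ingest.put_pipeline(
    id="cohere_embeddings_pipeline",
    description="Generate Cohere embeddings at ingest time",
    processors=[
        {
            "inference": {
                "model_id": "cohere_embeddings",
                "input_output": {
                    "input_field": "text",
                    "output_field": "text_embedding",
                },
            }
        }
    ],
)
```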
Let’s note a few important parameters from that API call:
- `inference`: A processor that performs inference using a machine learning model.
- `model_id`: Specifies the ID of the inference endpoint to be used. In this example, the model ID is set to `cohere_embeddings`.
- `input_output`: Specifies input and output fields.
- `input_field`: Field name from which the `dense_vector` representation is created.
- `output_field`: Field name which contains inference results.
Create index
The mapping of the destination index (the index that contains the embeddings that the model will create based on your input text) must be created. The destination index must have a field with the `dense_vector` field type to index the output of the Cohere model.
Let’s create an index named `cohere-wiki-embeddings` with the mappings we need.
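A sketch of that index creation; the 1024-dimension size matches `embed-english-v3.0`, and the field names match the pipeline above:

```python
client.indices.create(
    index="cohere-wiki-embeddings",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "text": {"type": "text"},
            "text_embedding": {
                "type": "dense_vector",
                "dims": 1024,  # embed-english-v3.0 output dimension
                "similarity": "cosine",
            },
        }
    },
)
```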
Insert documents
Let’s insert our example wiki dataset. You need a production Cohere account to complete this step; otherwise the document ingest will time out due to API request rate limits.
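A sketch of a bulk ingest that routes every document through the pipeline; the JSONL path is a placeholder for wherever you saved the wiki dataset, and the `title`/`text` fields are assumptions:

```python
import json

# Placeholder path: point this at your copy of the example wiki dataset
with open("cohere-wiki-embeddings.jsonl") as f:
    docs = [json.loads(line) for line in f]

helpers.bulk(
    client,
    (
        {
            "_index": "cohere-wiki-embeddings",
            "_source": {"title": doc.get("title", ""), "text": doc["text"]},
        }
        for doc in docs
    ),
    pipeline="cohere_embeddings_pipeline",  # runs the inference processor per doc
)
```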
Hybrid search
After the dataset has been enriched with the embeddings, you can query the data using hybrid search.
Pass a `query_vector_builder` to the k-nearest neighbor (kNN) vector search API, and provide the query text and the model you used to create the embeddings.
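A sketch of a hybrid query combining BM25 and kNN retrieval; the query string is just an example:

```python
query = "When were the semi-finals of the 2022 FIFA World Cup played?"

response = client.search(
    index="cohere-wiki-embeddings",
    size=10,
    # Lexical (BM25) side of the hybrid search
    query={"multi_match": {"query": query, "fields": ["text", "title"]}},
    # Vector side: the query text is embedded at search time by the endpoint
    knn={
        "field": "text_embedding",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "cohere_embeddings",
                "model_text": query,
            }
        },
        "k": 10,
        "num_candidates": 50,
    },
)

documents = [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```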
Ranking
To combine the results from our vector and BM25 retrieval effectively, we can use Cohere’s Rerank 3 model through the Inference API to provide a final, more precise semantic reranking of our results.
First, create an inference endpoint with your Cohere API key. Make sure to specify a name for your endpoint and the `model_id` of one of the rerank models. In this example we will use Rerank 3.
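A sketch, assuming the `rerank-english-v3.0` model ID and the same client-version caveat as the embedding endpoint:

```python
client.inference.put(
    task_type="rerank",
    inference_id="cohere_rerank",
    inference_config={
        "service": "cohere",
        "service_settings": {
            "api_key": COHERE_API_KEY,
            "model_id": "rerank-english-v3.0",  # Rerank 3
        },
        "task_settings": {
            "top_n": 10,
            "return_documents": False,  # we re-join texts by index below
        },
    },
)
```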
You can now rerank your results using that inference endpoint. Here we will pass in the query we used for retrieval, along with the documents we just retrieved using hybrid search.
The inference service will respond with a list of documents in descending order of relevance. Each document has a corresponding index (reflecting the order in which the documents were sent to the inference endpoint), and if the `return_documents` task setting is True, the document texts will be included as well.
In this case we set `return_documents` to False and reconstruct the input documents based on the index returned in the response.
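A sketch of that call; `client.inference.inference` is the method name in recent Python clients, and the `rerank` response key is an assumption about the response shape:

```python
rerank_response = client.inference.inference(
    inference_id="cohere_rerank",
    query=query,
    input=documents,
)

# Re-join the original texts using the indices returned by the reranker
ranked_documents = [
    documents[entry["index"]] for entry in rerank_response["rerank"]
]
```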
Now that we have ranked our results, we can easily turn this into a RAG system with Cohere’s Chat API. Pass in the retrieved documents along with the query, and see the grounded response using Cohere’s newest generative model, Command R+.
First, we will create the Cohere client.
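A minimal sketch, reusing the Cohere API key from earlier:

```python
co = cohere.Client(api_key=COHERE_API_KEY)
```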
Next, we can get a grounded generation with citations from the Cohere Chat API. We simply pass the user query and the documents retrieved from Elastic to the API, and print out our grounded response.
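A sketch of the chat call; wrapping each reranked text in a `{"text": ...}` dict matches the documents format of Cohere’s v1 Chat API:

```python
chat_response = co.chat(
    model="command-r-plus",
    message=query,
    documents=[{"text": doc} for doc in ranked_documents],
)

print(chat_response.text)       # grounded answer
print(chat_response.citations)  # spans tying the answer back to the documents
```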
And there you have it! A quick and easy implementation of hybrid search and RAG with Cohere and Elastic.