End-to-end example of RAG with Chat, Embed, and Rerank
This section expands on the basic RAG usage by demonstrating a more complete example that includes:
- Retrieval and reranking of documents (via the Embed and Rerank endpoints).
- Building RAG for chatbots (involving multi-turn conversations).
Setup
First, import the Cohere library and create a client. If you are using the Cohere platform, authenticate with your Cohere API key; for a private deployment, point the client at your deployment URL instead.
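A minimal sketch of the setup, assuming the Python SDK's v2 client (`cohere.ClientV2`); the API key and deployment URL are placeholders for your own values:

```python
import cohere

# Cohere platform: authenticate with your API key
co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Private deployment: point the client at your deployment URL instead
# co = cohere.ClientV2(api_key="", base_url="YOUR_DEPLOYMENT_URL")
```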
Step 1: Generating search queries
Next, we create a search query generation tool, which uses the model to turn a user message into one or more search queries.
We pass a user query, which in this example asks about how to get to know the team.
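One way to implement this, sketched here assuming the v2 Python SDK's tool use feature; the tool name `internet_search`, its schema, and the model choice are illustrative, not prescribed by this guide:

```python
import json

# A single tool whose argument schema asks the model for a list of search queries
query_gen_tool = [{
    "type": "function",
    "function": {
        "name": "internet_search",
        "description": "Returns relevant document snippets for a textual query",
        "parameters": {
            "type": "object",
            "properties": {
                "queries": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "A list of queries to search with",
                }
            },
            "required": ["queries"],
        },
    },
}]

def generate_search_queries(message: str) -> list[str]:
    # Ask the model to call the tool; the tool call arguments contain the queries
    response = co.chat(
        model="command-r-plus-08-2024",
        messages=[{"role": "user", "content": message}],
        tools=query_gen_tool,
    )
    queries = []
    if response.message.tool_calls:
        for tc in response.message.tool_calls:
            queries.extend(json.loads(tc.function.arguments)["queries"])
    return queries

queries = generate_search_queries("How to get to know my teammates?")
print(queries)
```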
Step 2: Fetching relevant documents
Retrieval with Embed
Given the search query, we need a way to retrieve the most relevant documents from a large collection of documents.
This is where we can leverage text embeddings through the Embed endpoint.
The Embed endpoint enables semantic search, which lets us compare the semantic meaning of the documents and the query. It solves a problem with the more traditional approach of lexical search, which is great at finding keyword matches but struggles to capture the context or meaning of a piece of text.
The Embed endpoint takes in texts as input and returns embeddings as output.
First, we need to embed the documents to search from. We call the Embed endpoint using co.embed() and pass the following arguments:
- `model`: Here we choose `embed-english-v3.0`, which generates embeddings of size 1024
- `input_type`: We choose `search_document` to ensure the model treats these as the documents (instead of the query) for search
- `texts`: The list of texts (the FAQs)
- `embedding_types`: We choose `float` as the embedding type
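A sketch of the document-embedding step, assuming the same v2 client and a small illustrative `faqs` list (any document collection works the same way):

```python
# A toy collection of FAQ documents to search over (illustrative)
faqs = [
    "Reimbursing Travel Expenses: Submit your expense report within 30 days.",
    "Health and Wellness Benefits: Benefits include health insurance and gym memberships.",
    "Joining Slack Channels: Ask your onboarding buddy to add you to the team's Slack channels.",
    "Working Hours: Flexible, with core hours from 10am to 3pm.",
    "Side Projects Policy: Allowed with manager approval.",
    "Performance Reviews: Conducted twice a year.",
]

# Embed the documents
doc_emb = co.embed(
    model="embed-english-v3.0",
    input_type="search_document",
    texts=faqs,
    embedding_types=["float"],
).embeddings.float
```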
Next, we embed the user query (here, the search query generated in Step 1). We choose `search_query` as the `input_type` in the Embed endpoint call. This ensures the model treats this as the query (instead of the documents) for search.
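Continuing the sketch, we embed the generated query (this assumes the `queries` list from Step 1):

```python
# Assume one generated query for simplicity (see the note below on multiple queries)
query = queries[0]

# Embed the query, marking it as a query via input_type
query_emb = co.embed(
    model="embed-english-v3.0",
    input_type="search_query",
    texts=[query],
    embedding_types=["float"],
).embeddings.float
```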
Now, we want to search for the most relevant documents to the query. For this, we use the numpy library to compute the similarity between each query-document pair using the dot product approach. Each query-document pair returns a score representing how similar the pair is. We then sort these scores in descending order and select the most similar pairs; here we take the top 5 (an arbitrary choice; you can pick any number).
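A sketch of the dot-product search, reusing the `query_emb` and `doc_emb` embeddings from the calls above:

```python
import numpy as np

# Compute dot-product similarity between the query and every document
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort in descending order and keep the indices of the top 5 documents
top_5_idx = np.argsort(-scores)[:5]
retrieved_docs = [faqs[i] for i in top_5_idx]

for i in top_5_idx:
    print(f"{scores[i]:.4f}  {faqs[i]}")
```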
Here, we show the most relevant documents with their similarity scores.
For simplicity, this example assumes only one generated query. In practical implementations, multiple queries may be generated; in those scenarios, you will need to perform retrieval for each query.
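A minimal sketch of that multi-query scenario, reusing the embeddings above and deduplicating the combined results (the `retrieve` helper is illustrative, not part of the API):

```python
def retrieve(query: str, top_k: int = 5) -> list[str]:
    # Embed the query and score it against the precomputed document embeddings
    q_emb = co.embed(
        model="embed-english-v3.0",
        input_type="search_query",
        texts=[query],
        embedding_types=["float"],
    ).embeddings.float
    scores = np.dot(q_emb, np.transpose(doc_emb))[0]
    return [faqs[i] for i in np.argsort(-scores)[:top_k]]

# Retrieve for each generated query, then deduplicate while preserving order
all_docs = list(dict.fromkeys(doc for q in queries for doc in retrieve(q)))
```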
Reranking with Rerank
Reranking can further boost the results from semantic or lexical search. The Rerank endpoint takes a list of search results and reorders them according to their relevance to a query. This requires just a single line of code to implement.
We call the endpoint using co.rerank() and pass the following arguments:
- `query`: The user query
- `documents`: The list of documents we get from the semantic search results
- `top_n`: The number of top reranked documents to select
- `model`: We choose Rerank English 3
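A sketch of the rerank call, assuming `rerank-english-v3.0` as the model ID for Rerank English 3 and the `retrieved_docs` list from the semantic search step:

```python
# Rerank the retrieved documents against the query and keep the top 2
rerank_results = co.rerank(
    query=query,
    documents=retrieved_docs,
    top_n=2,
    model="rerank-english-v3.0",
)

# Collect the reranked documents, most relevant first
reranked_docs = [retrieved_docs[r.index] for r in rerank_results.results]

for r in rerank_results.results:
    print(f"{r.relevance_score:.4f}  {retrieved_docs[r.index]}")
```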
Looking at the results, we see that since the query is about getting to know the team, the document that talks about joining Slack channels is now ranked higher (1st) compared to earlier (3rd).
Here we select `top_n` to be 2; these are the documents we will pass on for response generation.
Step 3: Generating the response
Finally, we call the Chat endpoint, passing the retrieved documents. This tells the model to run in RAG mode and ground its response in these documents.
The response and citations are then generated based on the query and the retrieved documents.
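A closing sketch, assuming the v2 Chat API's `documents` parameter and the reranked documents from Step 2; the model choice is illustrative:

```python
# Generate a grounded response from the reranked documents
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": "How to get to know my teammates?"}],
    documents=[{"data": {"text": doc}} for doc in reranked_docs],
)

# The response text
print(response.message.content[0].text)

# The citations, each pointing back to the supporting document(s)
if response.message.citations:
    for citation in response.message.citations:
        print(citation.start, citation.end, citation.text, citation.sources)
```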