RAG with Cohere
The Chat endpoint provides comprehensive support for various text generation use cases, including retrieval-augmented generation (RAG).
While LLMs are good at maintaining the context of a conversation and generating responses, they are prone to hallucination: they can include factually incorrect or incomplete information in their responses.
RAG enables a model to access and utilize supplementary information from external documents, thereby improving the accuracy of its responses.
When using RAG with the Chat endpoint, these responses are backed by fine-grained citations linking to the source documents. This makes the responses easily verifiable.
In this tutorial, you’ll learn about:
- Basic RAG
- Search query generation
- Retrieval with Embed
- Reranking with Rerank
- Response and citation generation
You’ll learn these by building an onboarding assistant for new hires.
Setup
To get started, first we need to install the `cohere` library and create a Cohere client.
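A minimal setup sketch is shown below; the API key placeholder and the choice of the `cohere.Client` class are assumptions, so adapt them to your SDK version.

```python
# Install the SDK first: pip install cohere
import cohere

# Replace with your own API key (placeholder value is an assumption)
co = cohere.Client("YOUR_API_KEY")
```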
Basic RAG
To see how RAG works, let’s define the documents that the application has access to. We’ll use a short list of documents consisting of internal FAQs about the fictitious company Co1t (in production, such collections typically contain far more documents).
In this example, each document is a dictionary with one field, `text`. But we can define any number of fields, depending on the nature of the documents. For example, emails could contain `title` and `text` fields.
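As a sketch, the documents might look like this; the FAQ contents below are illustrative placeholders, not the tutorial's actual data.

```python
# Hypothetical internal FAQs for the fictitious company Co1t
documents = [
    {"text": "Joining Slack channels: Connect with your team by joining our company Slack channels."},
    {"text": "Working from home: You can work from home up to three days a week."},
    {"text": "Health and wellness benefits: We cover gym memberships and offer weekly yoga classes."},
    {"text": "Travel expenses: Submit travel expenses through the finance tool for reimbursement."},
]
```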
To use these documents, we pass them to the `documents` parameter in the Chat endpoint call. This tells the model to run in RAG mode and use these documents in its response.
Let’s create a query asking about the company’s support for personal well-being. This information is unlikely to be part of the model’s training data, so it will need to use the external documents.
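Here is a minimal sketch of the call, assuming the v1 Chat API; the query text and the model name are assumed choices.

```python
query = "Are there health and wellness benefits?"  # hypothetical query

# Passing `documents` switches the Chat endpoint into RAG mode
response = co.chat(
    message=query,
    model="command-r",  # assumed model choice
    documents=documents,
)
print(response.text)
```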
RAG introduces additional objects in the Chat response. Here we display two:
- `citations`: indicate the specific text spans from the retrieved documents on which the response is grounded.
- `documents`: the IDs of the documents referenced in the citations.
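Assuming the response object from the sketch above, both can be inspected directly:

```python
# Inspect the fine-grained citations and the documents they reference
print(response.citations)
print(response.documents)
```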
Search query generation
The previous example showed how to get started with RAG, and in particular, the augmented generation portion of RAG. But as its name implies, RAG consists of other steps, such as retrieval.
In a basic RAG application, the steps involved are:
- Transforming the user message into search queries
- Retrieving relevant documents for a given search query
- Generating the response and citations
Let’s now look at the first step—search query generation. The chatbot needs to generate an optimal set of search queries to use for retrieval.
The Chat endpoint has a feature that handles this for us automatically. This is done by adding the `search_queries_only=True` parameter to the Chat endpoint call.
It will generate a list of search queries based on the user message; depending on the message, this can be one or more queries.
In the example below, the resulting queries break down the user message into two separate queries.
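A sketch of this case, with a hypothetical user message that bundles two questions:

```python
# A message containing two implicit questions (hypothetical example)
response = co.chat(
    message="How to stay connected with the company, and do you organize team events?",
    search_queries_only=True,
)
print([q.text for q in response.search_queries])
```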
And in the example below, the model decides that one query is sufficient.
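And a sketch of the single-query case, again with a hypothetical message:

```python
# A message that maps naturally to a single search query (hypothetical example)
response = co.chat(
    message="How flexible are the working hours?",
    search_queries_only=True,
)
print([q.text for q in response.search_queries])
```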
Retrieval with Embed
Given the search query, we need a way to retrieve the most relevant documents from a large collection of documents.
This is where we can leverage text embeddings through the Embed endpoint. It enables semantic search, which lets us compare the semantic meaning of the documents and the query. It solves the problem faced by the more traditional approach of lexical search, which is great at finding keyword matches but struggles to capture the context or meaning of a piece of text.
The Embed endpoint takes in texts as input and returns embeddings as output.
First, we need to embed the documents to search from. We call the Embed endpoint using `co.embed()` and pass the following arguments:
- `model`: Here we choose `embed-english-v3.0`, which generates embeddings of size 1024
- `input_type`: We choose `search_document` to ensure the model treats these as the documents (instead of the query) for search
- `texts`: The list of texts (the FAQs)
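A sketch of the document-embedding step, reusing the `documents` list from earlier:

```python
# Extract the raw texts and embed them as searchable documents
doc_texts = [doc["text"] for doc in documents]

doc_emb = co.embed(
    model="embed-english-v3.0",
    input_type="search_document",
    texts=doc_texts,
).embeddings
```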
Next, we add a query, which asks about how to get to know the team.
We choose `search_query` as the `input_type` to ensure the model treats this as the query (instead of the documents) for search.
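A sketch of the query-embedding step; the query text is a hypothetical example:

```python
# Embed the user query with the matching input_type
query = "How do I get to know my teammates?"  # hypothetical query

query_emb = co.embed(
    model="embed-english-v3.0",
    input_type="search_query",
    texts=[query],
).embeddings
```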
Now, we want to search for the most relevant documents to the query. For this, we make use of the `numpy` library to compute the similarity between each query-document pair using the dot product approach.
Each query-document pair gets a score representing how similar the pair is. We then sort these scores in descending order and select the most similar pairs; here we choose the top 5 (an arbitrary choice; you can choose any number).
Here, we show the most relevant documents with their similarity scores.
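A minimal sketch of the dot-product search, assuming the `query_emb`, `doc_emb`, and `doc_texts` variables from the previous snippets:

```python
import numpy as np

# Dot-product similarity between the query and every document
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Indices of the top 5 most similar documents, highest score first
top_5_idx = np.argsort(-scores)[:5]

for rank, idx in enumerate(top_5_idx, start=1):
    print(f"Rank {rank} (score {scores[idx]:.3f}): {doc_texts[idx]}")
```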
Further reading:
- Embed endpoint API reference
- Documentation on the Embed endpoint
- Documentation on the models available on the Embed endpoint
Reranking with Rerank
Reranking can further boost the results from semantic or lexical search. The Rerank endpoint takes a list of search results and reranks them according to their relevance to a query. This requires just a single line of code to implement.
We call the endpoint using `co.rerank()` and pass the following arguments:
- `query`: The user query
- `documents`: The list of documents we get from the semantic search results
- `top_n`: The number of top reranked documents to select
- `model`: We choose Rerank English 3
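A sketch of the call, assuming the top-5 results from the previous step; the identifier `rerank-english-v3.0` is an assumed API name for Rerank English 3.

```python
# The documents surfaced by semantic search
retrieved = [doc_texts[idx] for idx in top_5_idx]

# Rerank them against the query and keep the top 2
results = co.rerank(
    query=query,
    documents=retrieved,
    top_n=2,
    model="rerank-english-v3.0",  # assumed identifier for Rerank English 3
)

for r in results.results:
    print(f"Score {r.relevance_score:.3f}: {retrieved[r.index]}")
```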
Looking at the results, we see that, given a query about getting to know the team, the document about joining Slack channels is now ranked higher (1st) compared to earlier (3rd).
Here we set `top_n` to 2; these are the documents we will pass on for response generation.
Further reading:
- Rerank endpoint API reference
- Documentation on Rerank
- Documentation on Rerank fine-tuning
- Documentation on Rerank best practices
Response and citation generation
Finally, we reach the step we saw in the earlier Basic RAG section. Here, the response is generated based on the query and the retrieved documents.
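A sketch that ties the previous steps together, rebuilding the reranked results into the dictionary format the `documents` parameter expects; the model name is an assumed choice.

```python
# Keep only the top reranked documents, in the format Chat expects
reranked_docs = [{"text": retrieved[r.index]} for r in results.results]

# Generate a grounded response with citations
response = co.chat(
    message=query,
    model="command-r",  # assumed model choice
    documents=reranked_docs,
)
print(response.text)
```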
RAG introduces additional objects in the Chat response. Here we display two:
- `citations`: indicate the specific spans of text from the retrieved documents on which the response is grounded.
- `documents`: the IDs of the documents being referenced in the citations.
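As a sketch, each citation can be printed alongside the document IDs it is grounded on (the attribute names assume the v1 response object):

```python
# Print each cited span of text with the IDs of its source documents
for cite in response.citations:
    print(f"'{cite.text}' -> {cite.document_ids}")
```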
Conclusion
In this tutorial, you learned about:
- How to get started with RAG
- How to generate search queries
- How to perform retrieval with Embed
- How to perform reranking with Rerank
- How to generate responses and citations
RAG is great for building applications that can answer questions by grounding the response in external documents. But you can unlock the ability to not just answer questions, but also automate tasks. This can be done using a technique called tool use.
In Part 7, you will learn how to leverage tool use to automate tasks and workflows.