Retrieval Augmented Generation (RAG)
Large Language Models (LLMs) excel at generating text and maintaining conversational context in chat applications. However, LLMs can sometimes hallucinate - producing responses that are factually incorrect. This is particularly important to mitigate in enterprise environments where organizations work with proprietary information that wasn’t part of the model’s training data.
Retrieval-augmented generation (RAG) addresses this limitation by enabling LLMs to incorporate external knowledge sources into their response generation process. By grounding responses in retrieved facts, RAG significantly reduces hallucinations and improves the accuracy and reliability of the model’s outputs.
In this tutorial, we’ll cover:
- Setting up the Cohere client
- Building a RAG application by combining retrieval and chat capabilities
- Managing chat history and maintaining conversational context
- Handling direct responses vs responses requiring retrieval
- Generating citations for retrieved information
In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.
We’ll use Cohere’s Command, Embed, and Rerank models deployed on Azure.
Setup
First, you will need to deploy the Command, Embed, and Rerank models on Azure via Azure AI Foundry. The deployment will create a serverless API with pay-as-you-go, token-based billing. You can find more information on how to deploy models in the Azure documentation.
Once the model is deployed, you can access it via Cohere’s Python SDK. Let’s now install the Cohere SDK and set up our client.
To create a client, you need to provide the API key and the model’s base URL for the Azure endpoint. You can get this information from the Azure AI Foundry platform where you deployed the model.
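Here’s a minimal sketch of the setup, assuming one serverless endpoint per deployed model; the placeholder key and endpoint values below are illustrative and should be replaced with the values shown in Azure AI Foundry.

```python
# pip install cohere
import cohere

# Placeholder values - replace with the API keys and endpoint URLs from
# Azure AI Foundry. Each deployed model (Command, Embed, Rerank) has its
# own serverless endpoint, so we create one client per model.
co_chat = cohere.Client(api_key="AZURE_API_KEY_CHAT", base_url="AZURE_ENDPOINT_CHAT")
co_embed = cohere.Client(api_key="AZURE_API_KEY_EMBED", base_url="AZURE_ENDPOINT_EMBED")
co_rerank = cohere.Client(api_key="AZURE_API_KEY_RERANK", base_url="AZURE_ENDPOINT_RERANK")
```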
A quick example
Let’s begin with a simple example to explore how RAG works.
The foundation of RAG is having a set of documents for the LLM to reference. Below, we’ll work with a small collection of basic documents. While RAG systems usually involve retrieving relevant documents based on the user’s query (which we’ll explore later), for now we’ll keep it simple and use this entire small set of documents as context for the LLM.
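For instance, a toy document set might look like the following (apart from the penguin fact quoted later, the snippets here are illustrative):

```python
# A small, hard-coded document set to pass directly to the model.
documents = [
    {"title": "Tall penguins", "snippet": "Emperor penguins are the tallest."},
    {"title": "Penguin habitats", "snippet": "Emperor penguins only live in Antarctica."},
    {"title": "What are animals?", "snippet": "Animals are different from plants."},
]
```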
We have seen how to use the Chat endpoint in the text generation chapter. To use the RAG feature, we simply need to add one additional parameter, documents, to the endpoint call. These are the documents we want to provide as context for the model to use in its response.
Let’s see how the model responds to the question “What are the tallest living penguins?”
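Here’s a sketch of the call, reusing the co_chat client and the documents list from above:

```python
message = "What are the tallest living penguins?"

# Pass the documents alongside the message; the model grounds its answer in them.
response = co_chat.chat(
    message=message,
    documents=documents,
)

print("RESPONSE:\n", response.text)
if response.citations:
    print("\nCITATIONS:")
    for citation in response.citations:
        print(citation)
```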
The model leverages the provided documents as context for its response. Specifically, when mentioning that Emperor penguins are the tallest species, it references doc_0, the document which states that “Emperor penguins are the tallest.”
A more comprehensive example
Now that we’ve covered a basic RAG implementation, let’s look at a more comprehensive example of RAG that includes:
- Creating a retrieval system that converts documents into text embeddings and stores them in an index
- Building a query generation system that transforms user messages into optimized search queries
- Implementing a chat interface to handle LLM interactions with users
- Designing a response generation system capable of handling various query types
First, let’s import the necessary libraries for this project. This includes hnswlib for the vector library and unstructured for chunking the documents (more details on these later).
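A sketch of the imports, assuming the dependencies are installed with pip install cohere hnswlib unstructured:

```python
import hnswlib                                            # approximate nearest-neighbour vector index
from unstructured.partition.html import partition_html   # fetch and parse each documentation page
from unstructured.chunking.title import chunk_by_title   # split parsed elements into chunks by section title
```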
Define documents
Next, we’ll define the documents we’ll use for RAG. We’ll use a few pages from the Cohere documentation that discuss prompt engineering. Each entry is identified by its title and URL.
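For example, the list could look like the following (the specific titles and URLs here are illustrative placeholders for the prompt engineering pages):

```python
# Each raw document is identified by its title and URL.
raw_documents = [
    {
        "title": "Crafting Effective Prompts",
        "url": "https://docs.cohere.com/docs/crafting-effective-prompts",
    },
    {
        "title": "Advanced Prompt Engineering Techniques",
        "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques",
    },
]
```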
Create vectorstore
The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.
It includes a few methods:
- load_and_chunk: Loads the raw documents from the URLs and breaks them into smaller chunks
- embed: Generates embeddings of the chunked documents
- index: Indexes the document chunk embeddings to ensure efficient similarity search during retrieval
- retrieve: Uses semantic search to retrieve relevant document chunks from the index, given a query. It involves two steps: first, dense retrieval from the index via the Embed endpoint, and second, a reranking via the Rerank endpoint to boost the search results further.
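Below is a condensed sketch of such a class. The batch size, top-k values, and HNSW index settings are assumptions for illustration, not values from the original tutorial.

```python
class Vectorstore:
    def __init__(self, raw_documents):
        self.raw_documents = raw_documents
        self.docs = []          # chunked documents: {"title", "text", "url"}
        self.docs_embs = []     # embeddings of the chunks
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()

    def load_and_chunk(self):
        # Fetch each page and split it into chunks grouped by section title.
        for raw_doc in self.raw_documents:
            elements = partition_html(url=raw_doc["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {"title": raw_doc["title"], "text": str(chunk), "url": raw_doc["url"]}
                )

    def embed(self):
        # Embed the chunks in batches via the Embed endpoint.
        batch_size = 90
        for i in range(0, len(self.docs), batch_size):
            batch = self.docs[i : i + batch_size]
            texts = [doc["text"] for doc in batch]
            response = co_embed.embed(texts=texts, input_type="search_document")
            self.docs_embs.extend(response.embeddings)

    def index(self):
        # Build an HNSW index over the chunk embeddings.
        self.idx = hnswlib.Index(space="ip", dim=len(self.docs_embs[0]))
        self.idx.init_index(max_elements=len(self.docs_embs), ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

    def retrieve(self, query):
        # Step 1: dense retrieval - embed the query and search the index.
        query_emb = co_embed.embed(texts=[query], input_type="search_query").embeddings
        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Step 2: rerank the retrieved chunks to boost relevance.
        docs_to_rerank = [self.docs[d]["text"] for d in doc_ids]
        rerank_results = co_rerank.rerank(
            query=query, documents=docs_to_rerank, top_n=self.rerank_top_k
        )
        return [self.docs[doc_ids[r.index]] for r in rerank_results.results]
```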
Process documents
With the Vectorstore set up, we can process the documents, which will involve chunking, embedding, and indexing.
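Using the raw_documents list and the Vectorstore sketch from above, this is a single step:

```python
# Instantiating the Vectorstore triggers chunking, embedding, and indexing.
vectorstore = Vectorstore(raw_documents)
```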
We can test if the retrieval is working by entering a search query.
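For example (the query text here is our own):

```python
# Return the top reranked chunks for a sample query.
vectorstore.retrieve("Prompting by giving examples")
```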
Run chatbot
We can now run the chatbot. For this, we create a run_chatbot function that accepts the user message and the history of the conversation, if any.
Here’s what happens inside the function (a condensed sketch follows this list):
- For each user message, we use the Chat endpoint’s search query generation feature to turn the user message into one or more queries that are optimized for retrieval. The endpoint can even return no query, meaning a user message can be responded to directly without retrieval. This is done by calling the Chat endpoint with the search_queries_only parameter set to True.
- If no search query is generated, we call the Chat endpoint to generate a response directly. If there is at least one, we call the retrieve method from the Vectorstore instance to retrieve the most relevant documents for each query.
- Finally, all the results from all queries are appended to a list and passed to the Chat endpoint for response generation.
- We print the response, together with the citations and the list of document chunks cited, for easy reference.
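Here’s a condensed sketch of the function, assuming the Chat endpoint’s search_queries_only, documents, and chat_history parameters described above:

```python
def run_chatbot(message, chat_history=None):
    if chat_history is None:
        chat_history = []

    # Step 1: generate search queries (if any) for the user message.
    response = co_chat.chat(
        message=message,
        chat_history=chat_history,
        search_queries_only=True,
    )

    if response.search_queries:
        # Step 2: retrieve document chunks for each generated query.
        documents = []
        for query in response.search_queries:
            documents.extend(vectorstore.retrieve(query.text))

        # Step 3: generate a grounded response with citations.
        response = co_chat.chat(
            message=message,
            documents=documents,
            chat_history=chat_history,
        )
    else:
        # No retrieval needed: respond directly.
        response = co_chat.chat(message=message, chat_history=chat_history)

    # Step 4: print the response, citations, and cited document chunks.
    print("RESPONSE:\n", response.text)
    if response.citations:
        print("\nCITATIONS:")
        for citation in response.citations:
            print(citation)
        print("\nCITED DOCUMENTS:")
        for document in response.documents:
            print(document)

    return response.chat_history
```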
Here is a sample conversation consisting of a few turns.
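A multi-turn conversation like the one discussed below could be run by threading the chat history between calls, for example:

```python
# Each call returns the updated chat history, which we pass to the next turn.
chat_history = run_chatbot("Hello, I have a question")
chat_history = run_chatbot(
    "What's the difference between zero-shot and few-shot prompting", chat_history
)
chat_history = run_chatbot("How would the latter help?", chat_history)
chat_history = run_chatbot("What do you know about 5G networks?", chat_history)
```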
There are a few observations worth pointing out:
- Direct response: For user messages that don’t require retrieval (“Hello, I have a question”), the chatbot responds directly without retrieving any documents.
- Citation generation: For responses that do require retrieval (“What’s the difference between zero-shot and few-shot prompting”), the endpoint returns the response together with the citations. These are fine-grained citations, which means they refer to specific spans of the generated text.
- State management: The endpoint maintains the state of the conversation via the chat_history parameter, for example, by correctly responding to a vague user message such as “How would the latter help?”
- Response synthesis: The model can decide that none of the retrieved documents provide the necessary information to answer a user message. For example, when asked the question, “What do you know about 5G networks”, the chatbot retrieves external information from the index. However, it doesn’t use any of that information in its response, as none of it is relevant to the question.
Conclusion
In this tutorial, we learned about:
- How to set up the Cohere client to use the Command model deployed on Azure AI Foundry for chat
- How to build a RAG application by combining retrieval and chat capabilities
- How to manage chat history and maintain conversational context
- How to handle direct responses vs responses requiring retrieval
- How citations are automatically generated for retrieved information
In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.