Retrieval augmented generation (RAG) - Cohere on Azure AI Foundry
Retrieval augmented generation (RAG) - Cohere on Azure AI Foundry
Retrieval augmented generation (RAG) - Cohere on Azure AI Foundry
Large Language Models (LLMs) excel at generating text and maintaining conversational context in chat applications. However, LLMs can sometimes hallucinate - producing responses that are factually incorrect. This is particularly important to mitigate in enterprise environments where organizations work with proprietary information that wasn’t part of the model’s training data.
Retrieval-augmented generation (RAG) addresses this limitation by enabling LLMs to incorporate external knowledge sources into their response generation process. By grounding responses in retrieved facts, RAG significantly reduces hallucinations and improves the accuracy and reliability of the model’s outputs.
In this tutorial, we’ll cover:
In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.
We’ll use Cohere’s Command, Embed, and Rerank models deployed on Azure.
First, you will need to deploy the Command, Embed, and Rerank models on Azure via Azure AI Foundry. The deployment will create a serverless API with pay-as-you-go token based billing. You can find more information on how to deploy models in the Azure documentation.
Once the model is deployed, you can access it via Cohere’s Python SDK. Let’s now install the Cohere SDK and set up our client.
To create a client, you need to provide the API key and the model’s base URL for the Azure endpoint. You can get these information from the Azure AI Foundry platform where you deployed the model.
Let’s begin with a simple example to explore how RAG works.
The foundation of RAG is having a set of documents for the LLM to reference. Below, we’ll work with a small collection of basic documents. While RAG systems usually involve retrieving relevant documents based on the user’s query (which we’ll explore later), for now we’ll keep it simple and use this entire small set of documents as context for the LLM.
We have seen how to use the Chat endpoint in the text generation chapter. To use the RAG feature, we simply need to add one additional parameter, documents, to the endpoint call. These are the documents we want to provide as the context for the model to use in its response.
Let’s see how the model responds to the question “What are the tallest living penguins?”
The model leverages the provided documents as context for its response. Specifically, when mentioning that Emperor penguins are the tallest species, it references doc_0 - the document which states that “Emperor penguins are the tallest.”
Now that we’ve covered a basic RAG implementation, let’s look at a more comprehensive example of RAG that includes:
First, let’s import the necessary libraries for this project. This includes hnswlib for the vector library and unstructured for chunking the documents (more details on these later).
Next, we’ll define the documents we’ll use for RAG. We’ll use a few pages from the Cohere documentation that discuss prompt engineering. Each entry is identified by its title and URL.
The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.
It includes a few methods:
load_and_chunk: Loads the raw documents from the URL and breaks them into smaller chunksembed: Generates embeddings of the chunked documentsindex: Indexes the document chunk embeddings to ensure efficient similarity search during retrievalretrieve: Uses semantic search to retrieve relevant document chunks from the index, given a query. It involves two steps: first, dense retrieval from the index via the Embed endpoint, and second, a reranking via the Rerank endpoint to boost the search results further.With the Vectorstore set up, we can process the documents, which will involve chunking, embedding, and indexing.
We can test if the retrieval is working by entering a search query.
We can now run the chatbot. For this, we create a run_chatbot function that accepts the user message and the history of the conversation, if available.
Here is a sample conversation consisting of a few turns.
There are a few observations worth pointing out:
In this tutorial, we learned about:
In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.