Retrieval Augmented Generation (RAG)

Large Language Models (LLMs) excel at generating text and maintaining conversational context in chat applications. However, LLMs can sometimes hallucinate, producing responses that are factually incorrect. This is particularly important to mitigate in enterprise environments where organizations work with proprietary information that wasn’t part of the model’s training data.

Retrieval-augmented generation (RAG) addresses this limitation by enabling LLMs to incorporate external knowledge sources into their response generation process. By grounding responses in retrieved facts, RAG significantly reduces hallucinations and improves the accuracy and reliability of the model’s outputs.

In this tutorial, we’ll cover:

  • Setting up the Cohere client
  • Building a RAG application by combining retrieval and chat capabilities
  • Managing chat history and maintaining conversational context
  • Handling direct responses vs responses requiring retrieval
  • Generating citations for retrieved information

In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.

We’ll use Cohere’s Command, Embed, and Rerank models deployed on Azure.

Setup

First, you will need to deploy the Command, Embed, and Rerank models on Azure via Azure AI Foundry. The deployment will create a serverless API with pay-as-you-go, token-based billing. You can find more information on how to deploy models in the Azure documentation.

Once the models are deployed, you can access them via Cohere’s Python SDK. Let’s now install the Cohere SDK and set up our clients.

To create a client, you need to provide the API key and the model’s base URL for the Azure endpoint. You can find this information in the Azure AI Foundry platform where you deployed the model.

PYTHON
# %pip install cohere hnswlib unstructured

import cohere

co_chat = cohere.ClientV2(
    api_key="AZURE_API_KEY_CHAT",
    base_url="AZURE_ENDPOINT_CHAT",  # example: "https://cohere-command-r-plus-08-2024-xyz.eastus.models.ai.azure.com/"
)

co_embed = cohere.ClientV2(
    api_key="AZURE_API_KEY_EMBED",
    base_url="AZURE_ENDPOINT_EMBED",  # example: "https://cohere-embed-v3-multilingual-xyz.eastus.models.ai.azure.com/"
)

co_rerank = cohere.ClientV2(
    api_key="AZURE_API_KEY_RERANK",
    base_url="AZURE_ENDPOINT_RERANK",  # example: "https://cohere-rerank-v3-multilingual-xyz.eastus.models.ai.azure.com/"
)

A quick example

Let’s begin with a simple example to explore how RAG works.

The foundation of RAG is having a set of documents for the LLM to reference. Below, we’ll work with a small collection of basic documents. While RAG systems usually involve retrieving relevant documents based on the user’s query (which we’ll explore later), for now we’ll keep it simple and use this entire small set of documents as context for the LLM.

We have seen how to use the Chat endpoint in the text generation chapter. To use the RAG feature, we simply need to add one additional parameter, documents, to the endpoint call. These are the documents we want to provide as the context for the model to use in its response.

PYTHON
documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest.",
    },
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica.",
    },
    {
        "title": "What are animals?",
        "text": "Animals are different from plants.",
    },
]

Let’s see how the model responds to the question “What are the tallest living penguins?”

The model leverages the provided documents as context for its response. Specifically, when mentioning that Emperor penguins are the tallest species, it cites doc:0, the document which states that “Emperor penguins are the tallest.”

PYTHON
message = "What are the tallest living penguins?"

response = co_chat.chat(
    model="model",  # Pass a dummy string
    messages=[{"role": "user", "content": message}],
    documents=[{"data": doc} for doc in documents],
)

print("\nRESPONSE:\n")
print(response.message.content[0].text)

if response.message.citations:
    print("\nCITATIONS:\n")
    for citation in response.message.citations:
        print(citation)
RESPONSE:

The tallest living penguins are the Emperor penguins. They only live in Antarctica.

CITATIONS:

start=36 end=53 text='Emperor penguins.' sources=[DocumentSource(type='document', id='doc:0', document={'id': 'doc:0', 'text': 'Emperor penguins are the tallest.', 'title': 'Tall penguins'})] type=None
start=59 end=83 text='only live in Antarctica.' sources=[DocumentSource(type='document', id='doc:1', document={'id': 'doc:1', 'text': 'Emperor penguins only live in Antarctica.', 'title': 'Penguin habitats'})] type=None

A more comprehensive example

Now that we’ve covered a basic RAG implementation, let’s look at a more comprehensive example of RAG that includes:

  • Creating a retrieval system that converts documents into text embeddings and stores them in an index
  • Building a query generation system that transforms user messages into optimized search queries (a brief sketch of this step follows this list)
  • Implementing a chat interface to handle LLM interactions with users
  • Designing a response generation system capable of handling various query types
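
The query generation step deserves a quick illustration before we build the retrieval pieces. Below is a minimal sketch, assuming the co_chat client from the Setup section and the same dummy model string used throughout this tutorial; generate_search_query is a hypothetical helper introduced only for illustration, and the exact prompt wording is an assumption. The chatbot we build later passes the raw user message straight to the retriever, but a rewriting step like this could be slotted in just before retrieval.

PYTHON
# Hypothetical helper (not part of the tutorial's final code): ask the Command
# model to rewrite a conversational message into a concise search query.
def generate_search_query(message: str) -> str:
    response = co_chat.chat(
        model="model",  # Pass a dummy string
        messages=[
            {
                "role": "user",
                "content": (
                    "Rewrite the following message as a short search query. "
                    "Return only the query.\n\n" + message
                ),
            }
        ],
    )
    return response.message.content[0].text

The rewritten query would then be passed to the retriever in place of the raw user message.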

First, let’s import the necessary libraries for this project. This includes hnswlib for the vector library and unstructured for chunking the documents (more details on these later).

PYTHON
import uuid
import yaml
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

Define documents

Next, we’ll define the documents we’ll use for RAG. We’ll use a few pages from the Cohere documentation that discuss prompt engineering. Each entry is identified by its title and URL.

PYTHON
raw_documents = [
    {
        "title": "Crafting Effective Prompts",
        "url": "https://docs.cohere.com/docs/crafting-effective-prompts",
    },
    {
        "title": "Advanced Prompt Engineering Techniques",
        "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques",
    },
    {
        "title": "Prompt Truncation",
        "url": "https://docs.cohere.com/docs/prompt-truncation",
    },
    {
        "title": "Preambles",
        "url": "https://docs.cohere.com/docs/preambles",
    },
]

Create vectorstore

The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.

It includes a few methods:

  • load_and_chunk: Loads the raw documents from the URL and breaks them into smaller chunks
  • embed: Generates embeddings of the chunked documents
  • index: Indexes the document chunk embeddings to ensure efficient similarity search during retrieval
  • retrieve: Uses semantic search to retrieve relevant document chunks from the index, given a query. It involves two steps: first, dense retrieval from the index via the Embed endpoint, and second, a reranking via the Rerank endpoint to boost the search results further.

PYTHON
class Vectorstore:

    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()

    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "data": {
                            "title": raw_document["title"],
                            "text": str(chunk),
                            "url": raw_document["url"],
                        }
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["data"]["text"] for item in batch]
            docs_embs_batch = co_embed.embed(
                texts=texts,
                model="embed-multilingual-v3.0",
                input_type="search_document",
                embedding_types=["float"],
            ).embeddings.float
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(
            max_elements=self.docs_len, ef_construction=512, M=64
        )
        self.idx.add_items(
            self.docs_embs, list(range(len(self.docs_embs)))
        )

        print(
            f"Indexing complete with {self.idx.get_current_count()} document chunks."
        )

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval
        query_emb = co_embed.embed(
            texts=[query],
            model="embed-multilingual-v3.0",
            input_type="search_query",
            embedding_types=["float"],
        ).embeddings.float

        doc_ids = self.idx.knn_query(
            query_emb, k=self.retrieve_top_k
        )[0][0]

        # Reranking
        docs_to_rerank = [
            self.docs[doc_id]["data"] for doc_id in doc_ids
        ]
        yaml_docs = [
            yaml.dump(doc, sort_keys=False) for doc in docs_to_rerank
        ]
        rerank_results = co_rerank.rerank(
            query=query,
            documents=yaml_docs,
            model="model",  # Pass a dummy string
            top_n=self.rerank_top_k,
        )

        doc_ids_reranked = [
            doc_ids[result.index] for result in rerank_results.results
        ]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(self.docs[doc_id]["data"])

        return docs_retrieved

Process documents

With the Vectorstore set up, we can process the documents, which will involve chunking, embedding, and indexing.

PYTHON
# Create an instance of the Vectorstore class with the given sources
vectorstore = Vectorstore(raw_documents)
Loading documents...
Embedding document chunks...
Indexing document chunks...
Indexing complete with 137 document chunks.

We can test if the retrieval is working by entering a search query.

PYTHON
vectorstore.retrieve("Prompting by giving examples")
[{'title': 'Advanced Prompt Engineering Techniques',
  'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'},
 {'title': 'Crafting Effective Prompts',
  'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.',
  'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'},
 {'title': 'Advanced Prompt Engineering Techniques',
  'text': 'In addition to giving correct examples, including negative examples with a clear indication of why they are wrong can help the LLM learn to distinguish between correct and incorrect responses. Ordering the examples can also be important; if there are patterns that could be picked up on that are not relevant to the correctness of the question, the model may incorrectly pick up on those instead of the semantics of the question itself.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'}]

Run chatbot

We can now run the chatbot. For this, we create a run_chatbot function that accepts the user message and the history of the conversation, if available.

PYTHON
def run_chatbot(query, messages=None):
    if messages is None:
        messages = []

    messages.append({"role": "user", "content": query})

    # Retrieve document chunks and format
    documents = vectorstore.retrieve(query)
    documents_formatted = []
    for doc in documents:
        documents_formatted.append({"data": doc})

    # Use document chunks to respond
    response = co_chat.chat(
        model="model",  # Pass a dummy string
        messages=messages,
        documents=documents_formatted,
    )

    # Print the chatbot response, citations, and documents
    print("\nRESPONSE:\n")
    print(response.message.content[0].text)

    if response.message.citations:
        print("\nCITATIONS:\n")
        for citation in response.message.citations:
            print("-" * 20)
            print(
                "start:",
                citation.start,
                "end:",
                citation.end,
                "text:",
                citation.text,
            )
            print("SOURCES:")
            print(citation.sources)

    # Add assistant response to messages
    messages.append(
        {
            "role": "assistant",
            "content": response.message.content[0].text,
        }
    )

    return messages

Here is a sample conversation consisting of a few turns.

PYTHON
messages = run_chatbot("Hello, I have a question")
RESPONSE:

Hello there! How can I help you today?
PYTHON
messages = run_chatbot("How to provide examples in prompts", messages)
RESPONSE:

There are a few ways to provide examples in prompts.

One way is to provide a few relevant and diverse examples in the prompt. This can help steer the LLM towards a high-quality solution. Good examples condition the model to the expected response type and style.

Another way is to provide specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.

In addition to giving correct examples, including negative examples with a clear indication of why they are wrong can help the LLM learn to distinguish between correct and incorrect responses.

CITATIONS:

--------------------
start: 68 end: 126 text: provide a few relevant and diverse examples in the prompt.
SOURCES:
[DocumentSource(type='document', id='doc:0', document={'id': 'doc:0', 'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.', 'title': 'Advanced Prompt Engineering Techniques', 'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'})]
--------------------
start: 136 end: 187 text: help steer the LLM towards a high-quality solution.
SOURCES:
[DocumentSource(type='document', id='doc:0', document={'id': 'doc:0', 'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.', 'title': 'Advanced Prompt Engineering Techniques', 'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'})]
--------------------
start: 188 end: 262 text: Good examples condition the model to the expected response type and style.
SOURCES:
[DocumentSource(type='document', id='doc:0', document={'id': 'doc:0', 'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.', 'title': 'Advanced Prompt Engineering Techniques', 'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'})]
--------------------
start: 282 end: 321 text: provide specific examples to work from.
SOURCES:
[DocumentSource(type='document', id='doc:1', document={'id': 'doc:1', 'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.', 'title': 'Crafting Effective Prompts', 'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'})]
--------------------
start: 335 end: 485 text: instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.
SOURCES:
[DocumentSource(type='document', id='doc:1', document={'id': 'doc:1', 'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.', 'title': 'Crafting Effective Prompts', 'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'})]
--------------------
start: 527 end: 679 text: including negative examples with a clear indication of why they are wrong can help the LLM learn to distinguish between correct and incorrect responses.
SOURCES:
[DocumentSource(type='document', id='doc:2', document={'id': 'doc:2', 'text': 'In addition to giving correct examples, including negative examples with a clear indication of why they are wrong can help the LLM learn to distinguish between correct and incorrect responses. Ordering the examples can also be important; if there are patterns that could be picked up on that are not relevant to the correctness of the question, the model may incorrectly pick up on those instead of the semantics of the question itself.', 'title': 'Advanced Prompt Engineering Techniques', 'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'})]
PYTHON
messages = run_chatbot(
    "What do you know about 5G networks?", messages
)
RESPONSE:

I'm sorry, I could not find any information about 5G networks.
PYTHON
for message in messages:
    print(message, "\n")
{'role': 'user', 'content': 'Hello, I have a question'}

{'role': 'assistant', 'content': 'Hello! How can I help you today?'}

{'role': 'user', 'content': 'How to provide examples in prompts'}

{'role': 'assistant', 'content': 'There are a few ways to provide examples in prompts.\n\nOne way is to provide a few relevant and diverse examples in the prompt. This can help steer the LLM towards a high-quality solution. Good examples condition the model to the expected response type and style.\n\nAnother way is to provide specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.\n\nIn addition to giving correct examples, including negative examples with a clear indication of why they are wrong can help the LLM learn to distinguish between correct and incorrect responses.'}

{'role': 'user', 'content': 'What do you know about 5G networks?'}

{'role': 'assistant', 'content': "I'm sorry, I could not find any information about 5G networks."}

There are a few observations worth pointing out:

  • Direct response: For user messages that don’t require retrieval (“Hello, I have a question”), the chatbot responds directly without drawing on the retrieved documents.
  • Citation generation: For responses that do require retrieval (“How to provide examples in prompts”), the endpoint returns the response together with the citations. These are fine-grained citations, which means they refer to specific spans of the generated text. The short sketch after this list shows one way to render them inline.
  • Response synthesis: The model can decide if none of the retrieved documents provide the necessary information to answer a user message. For example, when asked the question, “What do you know about 5G networks”, the chatbot retrieves external information from the index. However, it doesn’t use any of the information in its response as none of it is relevant to the question.
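
To make the fine-grained citations easier to inspect, here is a minimal sketch of an inline renderer. It assumes the citation objects expose the start, end, and sources fields shown in the printed output above, and that each source carries an id; render_citations is a hypothetical helper introduced here, not part of the Cohere SDK.

PYTHON
# Hypothetical helper: append bracketed source IDs right after each cited span.
def render_citations(text: str, citations) -> str:
    # Process citations from the end of the text so earlier offsets stay valid
    for citation in sorted(citations, key=lambda c: c.end, reverse=True):
        source_ids = ", ".join(source.id for source in citation.sources)
        text = text[: citation.end] + f" [{source_ids}]" + text[citation.end :]
    return text

question = "How to provide examples in prompts"
response = co_chat.chat(
    model="model",  # Pass a dummy string
    messages=[{"role": "user", "content": question}],
    documents=[{"data": doc} for doc in vectorstore.retrieve(question)],
)
print(render_citations(response.message.content[0].text, response.message.citations or []))

With this, a sentence grounded in the few-shot prompting chunk should print with a marker such as [doc:0] immediately after it.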

Conclusion

In this tutorial, we learned about:

  • How to set up the Cohere client to use the Command model deployed on Azure AI Foundry for chat
  • How to build a RAG application by combining retrieval and chat capabilities
  • How to manage chat history and maintain conversational context
  • How to handle direct responses vs responses requiring retrieval
  • How citations are automatically generated for retrieved information

In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.
