Retrieval Augmented Generation (RAG)

Large Language Models (LLMs) excel at generating text and maintaining conversational context in chat applications. However, LLMs can sometimes hallucinate - producing responses that are factually incorrect. This is particularly important to mitigate in enterprise environments where organizations work with proprietary information that wasn’t part of the model’s training data.

Retrieval-augmented generation (RAG) addresses this limitation by enabling LLMs to incorporate external knowledge sources into their response generation process. By grounding responses in retrieved facts, RAG significantly reduces hallucinations and improves the accuracy and reliability of the model’s outputs.

In this tutorial, we’ll cover:

  • Setting up the Cohere client
  • Building a RAG application by combining retrieval and chat capabilities
  • Managing chat history and maintaining conversational context
  • Handling direct responses vs responses requiring retrieval
  • Generating citations for retrieved information

In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.

We’ll use Cohere’s Command, Embed, and Rerank models deployed on Azure.

Setup

First, you will need to deploy the Command, Embed, and Rerank models on Azure via Azure AI Foundry. The deployment will create a serverless API with pay-as-you-go, token-based billing. You can find more information on how to deploy models in the Azure documentation.

Once the models are deployed, you can access them via Cohere’s Python SDK. Let’s now install the Cohere SDK and set up our clients.

To create a client, you need to provide the API key and the model’s base URL for the Azure endpoint. You can get this information from the Azure AI Foundry platform where you deployed the model.

PYTHON
# ! pip install cohere hnswlib unstructured

import cohere

co_chat = cohere.Client(
    api_key="AZURE_API_KEY_CHAT",
    base_url="AZURE_ENDPOINT_CHAT",  # example: "https://cohere-command-r-plus-08-2024-xyz.eastus.models.ai.azure.com/"
)

co_embed = cohere.Client(
    api_key="AZURE_API_KEY_EMBED",
    base_url="AZURE_ENDPOINT_EMBED",  # example: "https://cohere-embed-v3-multilingual-xyz.eastus.models.ai.azure.com/"
)

co_rerank = cohere.Client(
    api_key="AZURE_API_KEY_RERANK",
    base_url="AZURE_ENDPOINT_RERANK",  # example: "https://cohere-rerank-v3-multilingual-xyz.eastus.models.ai.azure.com/"
)
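The keys and endpoints above are shown inline for clarity. If you prefer to keep credentials out of your code, one common pattern (our suggestion, not a requirement of the SDK) is to read them from environment variables:

PYTHON
import os
import cohere

# Assumed environment variable names; set them to the values from Azure AI Foundry.
co_chat = cohere.Client(
    api_key=os.environ["AZURE_API_KEY_CHAT"],
    base_url=os.environ["AZURE_ENDPOINT_CHAT"],
)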

A quick example

Let’s begin with a simple example to explore how RAG works.

The foundation of RAG is having a set of documents for the LLM to reference. Below, we’ll work with a small collection of basic documents. While RAG systems usually involve retrieving relevant documents based on the user’s query (which we’ll explore later), for now we’ll keep it simple and use this entire small set of documents as context for the LLM.

We have seen how to use the Chat endpoint in the text generation chapter. To use the RAG feature, we simply need to add one additional parameter, documents, to the endpoint call. These are the documents we want to provide as the context for the model to use in its response.

PYTHON
documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest.",
    },
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica.",
    },
    {
        "title": "What are animals?",
        "text": "Animals are different from plants.",
    },
]

Let’s see how the model responds to the question “What are the tallest living penguins?”

The model leverages the provided documents as context for its response. Specifically, when mentioning that Emperor penguins are the tallest species, it references doc_0 - the document which states that “Emperor penguins are the tallest.”

PYTHON
message = "What are the tallest living penguins?"

response = co_chat.chat(
    message=message,
    documents=documents
)

print("\nRESPONSE:\n")
print(response.text)

if response.citations:
    print("\nCITATIONS:\n")
    for citation in response.citations:
        print(citation)
RESPONSE:

The tallest living penguins are the Emperor penguins. They only live in Antarctica.

CITATIONS:

start=36 end=53 text='Emperor penguins.' document_ids=['doc_0']
start=72 end=83 text='Antarctica.' document_ids=['doc_1']
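Each citation points back to the provided documents by ID. As a quick illustration (a minimal sketch of our own, assuming the doc_<index> naming shown in the output above), you can map a citation's document_ids back to the documents list to see exactly which source text was cited:

PYTHON
# Minimal sketch (not part of the original example): map citation IDs back to
# the source documents, assuming IDs follow the doc_<index> convention above.
doc_lookup = {f"doc_{i}": doc for i, doc in enumerate(documents)}

for citation in response.citations:
    cited_texts = [doc_lookup[doc_id]["text"] for doc_id in citation.document_ids]
    print(f"'{citation.text}' -> {cited_texts}")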

A more comprehensive example

Now that we’ve covered a basic RAG implementation, let’s look at a more comprehensive example of RAG that includes:

  • Creating a retrieval system that converts documents into text embeddings and stores them in an index
  • Building a query generation system that transforms user messages into optimized search queries
  • Implementing a chat interface to handle LLM interactions with users
  • Designing a response generation system capable of handling various query types

First, let’s import the necessary libraries for this project. This includes hnswlib for the vector library and unstructured for chunking the documents (more details on these later).

PYTHON
import uuid
import yaml
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

Define documents

Next, we’ll define the documents we’ll use for RAG. We’ll use a few pages from the Cohere documentation that discuss prompt engineering. Each entry is identified by its title and URL.

PYTHON
raw_documents = [
    {
        "title": "Crafting Effective Prompts",
        "url": "https://docs.cohere.com/docs/crafting-effective-prompts",
    },
    {
        "title": "Advanced Prompt Engineering Techniques",
        "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques",
    },
    {
        "title": "Prompt Truncation",
        "url": "https://docs.cohere.com/docs/prompt-truncation",
    },
    {
        "title": "Preambles",
        "url": "https://docs.cohere.com/docs/preambles",
    },
]

Create vectorstore

The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.

It includes a few methods:

  • load_and_chunk: Loads the raw documents from the URL and breaks them into smaller chunks
  • embed: Generates embeddings of the chunked documents
  • index: Indexes the document chunk embeddings to ensure efficient similarity search during retrieval
  • retrieve: Uses semantic search to retrieve relevant document chunks from the index, given a query. It involves two steps: first, dense retrieval from the index via the Embed endpoint, and second, a reranking via the Rerank endpoint to boost the search results further.
PYTHON
class Vectorstore:

    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()

    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co_embed.embed(
                texts=texts,
                input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval
        query_emb = co_embed.embed(
            texts=[query],
            input_type="search_query"
        ).embeddings

        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking
        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]
        yaml_docs = [yaml.dump(doc, sort_keys=False) for doc in docs_to_rerank]
        rerank_results = co_rerank.rerank(
            query=query,
            documents=yaml_docs,
            top_n=self.rerank_top_k
        )
        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved

Process documents

With the Vectorstore set up, we can process the documents, which will involve chunking, embedding, and indexing.

PYTHON
# Create an instance of the Vectorstore class with the given sources
vectorstore = Vectorstore(raw_documents)
Loading documents...
Embedding document chunks...
Indexing document chunks...
Indexing complete with 137 document chunks.
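If you want to sanity-check what was ingested, you can peek at the chunk list directly (a minimal sketch of our own; the exact counts and text depend on the live pages at the time of ingestion):

PYTHON
# Minimal sketch (output depends on the live pages): inspect the ingested chunks.
print(f"Number of chunks: {len(vectorstore.docs)}")
print(vectorstore.docs[0]["title"])
print(vectorstore.docs[0]["text"][:200])  # first 200 characters of the first chunk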

We can test if the retrieval is working by entering a search query.

PYTHON
vectorstore.retrieve("Prompting by giving examples")
[{'title': 'Advanced Prompt Engineering Techniques',
  'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'},
 {'title': 'Crafting Effective Prompts',
  'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.',
  'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'},
 {'title': 'Advanced Prompt Engineering Techniques',
  'text': 'In addition to giving correct examples, including negative examples with a clear indication of why they are wrong can help the LLM learn to distinguish between correct and incorrect responses. Ordering the examples can also be important; if there are patterns that could be picked up on that are not relevant to the correctness of the question, the model may incorrectly pick up on those instead of the semantics of the question itself.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'}]

Run chatbot

We can now run the chatbot. For this, we create a run_chatbot function that accepts the user message and the history of the conversation, if any.

Here’s what happens inside the function:

  • For each user message, we use the Chat endpoint’s search query generation feature to turn the user message into one or more queries that are optimized for retrieval. The endpoint can even return no query, meaning a user message can be responded to directly without retrieval. This is done by calling the Chat endpoint with the search_queries_only parameter set to True (see the short sketch after this list).
  • If no search query is generated, we call the Chat endpoint to generate a response directly. If there is at least one, we call the retrieve method from the Vectorstore instance to retrieve the most relevant documents to each query.
  • Finally, all the results from all queries are appended to a list and passed to the Chat endpoint for response generation.
  • We print the response, together with the citations and the list of document chunks cited, for easy reference.
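Before looking at the full function, here is the query generation step in isolation. This is a minimal sketch of our own; the example messages are arbitrary and the behavior described in the comments is what we would expect, not captured output:

PYTHON
# Minimal sketch: inspect the search queries generated for a user message.
probe = co_chat.chat(
    message="How do I write better few-shot prompts?",
    search_queries_only=True,
)
print([q.text for q in probe.search_queries])  # one or more retrieval-optimized queries

probe = co_chat.chat(
    message="Thanks, that helps!",
    search_queries_only=True,
)
print(probe.search_queries)  # expected to be empty: respond directly, no retrieval

The full run_chatbot function below wraps this same call, adds the retrieval step, and generates the final response.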
PYTHON
def run_chatbot(message, chat_history=None):

    if chat_history is None:
        chat_history = []

    # Generate search queries, if any
    response = co_chat.chat(
        message=message,
        search_queries_only=True,
        chat_history=chat_history,
    )

    search_queries = []
    for query in response.search_queries:
        search_queries.append(query.text)

    # If there are search queries, retrieve the documents
    if search_queries:
        print("Retrieving information...", end="")

        # Retrieve document chunks for each query
        documents = []
        for query in search_queries:
            documents.extend(vectorstore.retrieve(query))

        # Use document chunks to respond
        response = co_chat.chat(
            message=message,
            documents=documents,
            chat_history=chat_history
        )

    else:
        response = co_chat.chat(
            message=message,
            chat_history=chat_history
        )

    # Print the chatbot response, citations, and documents
    print("\nRESPONSE:\n")
    print(response.text)

    if response.citations:
        print("\nCITATIONS:\n")
        for citation in response.citations:
            print(citation)
        print("\nDOCUMENTS:\n")
        for document in response.documents:
            print(document)

    chat_history = response.chat_history

    return chat_history

Here is a sample conversation consisting of a few turns.

PYTHON
1chat_history = run_chatbot("Hello, I have a question")
RESPONSE:

Hello there! How can I help you today?
PYTHON
1chat_history = run_chatbot("What's the difference between zero-shot and few-shot prompting", chat_history)
Retrieving information...
RESPONSE:

Zero-shot prompting is a technique where the model is asked to perform a task without any examples of the task being performed. Few-shot prompting, on the other hand, provides the model with a few relevant and diverse examples of the task being performed before asking the specific question to be answered. This can help steer the LLM toward a high-quality solution.

CITATIONS:

start=78 end=127 text='without any examples of the task being performed.' document_ids=['doc_0', 'doc_3']
start=193 end=254 text='few relevant and diverse examples of the task being performed' document_ids=['doc_0', 'doc_3']
start=321 end=366 text='steer the LLM toward a high-quality solution.' document_ids=['doc_0', 'doc_3']

DOCUMENTS:

{'id': 'doc_0', 'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.', 'title': 'Advanced Prompt Engineering Techniques', 'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'}
{'id': 'doc_3', 'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.', 'title': 'Advanced Prompt Engineering Techniques', 'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'}
PYTHON
1chat_history = run_chatbot("What do you know about 5G networks?", chat_history)
Retrieving information...
RESPONSE:

I'm sorry, I could not find any information about 5G networks.
PYTHON
1print("Chat history:")
2for c in chat_history:
3 print(c, "\n")
Chat history:
role='USER' message='Hello, I have a question' tool_calls=None

role='CHATBOT' message='Hello there! How can I help you today?' tool_calls=None

role='USER' message="What's the difference between zero-shot and few-shot prompting" tool_calls=None

role='CHATBOT' message='Zero-shot prompting is a technique where the model is asked to perform a task without any examples of the task being performed. Few-shot prompting, on the other hand, provides the model with a few relevant and diverse examples of the task being performed before asking the specific question to be answered. This can help steer the LLM toward a high-quality solution.' tool_calls=None

role='USER' message='What do you know about 5G networks?' tool_calls=None

role='CHATBOT' message="I'm sorry, I could not find any information about 5G networks." tool_calls=None

There are a few observations worth pointing out:

  • Direct response: For user messages that don’t require retrieval (“Hello, I have a question”), the chatbot responds directly without retrieving any documents.
  • Citation generation: For responses that do require retrieval (“What’s the difference between zero-shot and few-shot prompting”), the endpoint returns the response together with the citations. These are fine-grained citations, which means they refer to specific spans of the generated text.
  • State management: The endpoint maintains the state of the conversation via the chat_history parameter, so a vague follow-up message that refers back to earlier turns, such as “How would the latter help?”, can still be handled correctly (see the sketch after this list).
  • Response synthesis: The model can decide if none of the retrieved documents provide the necessary information to answer a user message. For example, when asked the question, “What do you know about 5G networks?”, the chatbot retrieves external information from the index. However, it doesn’t use any of the information in its response as none of it is relevant to the question.
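To see the state management in practice, you could continue the conversation with that kind of vague follow-up. This is a sketch only; the behavior noted in the comment is what we would expect, not captured output:

PYTHON
# Illustrative follow-up (expected behavior, not captured output): the
# chat_history passed in lets the model resolve "the latter" as few-shot prompting.
chat_history = run_chatbot("How would the latter help?", chat_history)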

Conclusion

In this tutorial, we learned about:

  • How to set up the Cohere client to use the Command model deployed on Azure AI Foundry for chat
  • How to build a RAG application by combining retrieval and chat capabilities
  • How to manage chat history and maintain conversational context
  • How to handle direct responses vs responses requiring retrieval
  • How citations are automatically generated for retrieved information

In the next tutorial, we’ll explore how to leverage Cohere’s tool use features to build agentic applications.
