Building RAG models with Cohere


The Chat endpoint provides comprehensive support for various text generation use cases, including retrieval-augmented generation (RAG).

While LLMs are good at maintaining the context of a conversation and generating responses, they are prone to hallucination and can include factually incorrect or incomplete information in their responses.

RAG enables a model to access and utilize supplementary information from external documents, thereby improving the accuracy of its responses.

When using RAG with the Chat endpoint, these responses are backed by fine-grained citations linking to the source documents. This makes the responses easily verifiable.

In this tutorial, you’ll learn about:

Basic RAG
Search query generation
Retrieval with Embed
Reranking with Rerank
Response and citation generation

You’ll learn these by building an onboarding assistant for new hires.

Setup

To get started, first we need to install the cohere library and create a Cohere client.

PYTHON

# pip install cohere

import cohere
import numpy as np
import json
from typing import List

# Get your free API key: https://dashboard.cohere.com/api-keys
co = cohere.ClientV2(api_key="COHERE_API_KEY")
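
Hard-coding the key is fine for a quick experiment. As an alternative, the sketch below reads the key from an environment variable. Both the helper name load_api_key and the variable name COHERE_API_KEY are assumptions made for this illustration, not something the tutorial prescribes. The returned string could then be passed as api_key when creating the client.

```python
import os


def load_api_key(var_name: str = "COHERE_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The helper and the default variable name are illustrative assumptions.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(
            f"Set the {var_name} environment variable before running."
        )
    return key
```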

Basic RAG

To see how RAG works, let’s define the documents that the application has access to. We’ll use a short list of documents consisting of internal FAQs about the fictitious company Co1t (in production, the document collection would typically be much larger).

In this example, each document is a data object with one field, text. But we can define any number of fields we want, depending on the nature of the documents. For example, emails could contain title and text fields.

PYTHON

documents = [
    {
        "data": {
            "text": "Reimbursing Travel Expenses: Easily manage your travel expenses by submitting them through our finance tool. Approvals are prompt and straightforward."
        }
    },
    {
        "data": {
            "text": "Working from Abroad: Working remotely from another country is possible. Simply coordinate with your manager and ensure your availability during core hours."
        }
    },
    {
        "data": {
            "text": "Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance."
        }
    },
]
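
The documents above use a single text field. As a sketch of the multi-field case mentioned earlier, here is what email documents with title and text fields might look like. The email contents are invented purely for illustration.

```python
# Hypothetical email documents. The field names "title" and "text" follow the
# example in the text; the contents are made up for this sketch.
email_documents = [
    {
        "data": {
            "title": "Welcome to Co1t",
            "text": "Your laptop and badge will be ready at the front desk on your first day.",
        }
    },
    {
        "data": {
            "title": "Expense tool access",
            "text": "You now have access to the finance tool for submitting travel expenses.",
        }
    },
]

# Every document keeps the same top-level shape: a dict with a "data" object.
for doc in email_documents:
    print(sorted(doc["data"].keys()))
```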

To call the Chat API with RAG, pass the following parameters at a minimum. This tells the model to run in RAG-mode and use these documents in its response.

model for the model ID
messages for the user’s query.
documents for defining the documents.

Let’s create a query asking about the company’s support for personal well-being. This information is not available to the model from the data it was trained on, so it will need to use external documents.

RAG introduces additional objects in the Chat response. One of them is citations, which contains details about:

specific text spans from the retrieved documents on which the response is grounded.
the documents referenced in the citations.

PYTHON

# Add the user query
query = "Are there health benefits?"

# Generate the response
response = co.chat(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": query}],
    documents=documents,
)

# Display the response
print(response.message.content[0].text)

# Display the citations and source documents
if response.message.citations:
    print("\nCITATIONS:")
    for citation in response.message.citations:
        print(citation, "\n")

Yes, we offer gym memberships, on-site yoga classes, and comprehensive health insurance.
CITATIONS:
start=14 end=88 text='gym memberships, on-site yoga classes, and comprehensive health insurance.' sources=[DocumentSource(type='document', id='doc:2', document={'id': 'doc:2', 'text': 'Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance.'})]
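
To make the start and end values concrete, the sketch below applies them to the response text as character offsets. It uses the values from the sample output above and represents the citation as a plain dict, which is an assumption made so the example runs without an API call. The annotate helper is hypothetical.

```python
from typing import Dict, List

# Values copied from the sample output above. A plain dict stands in for the
# SDK's citation object so that this sketch runs offline.
response_text = (
    "Yes, we offer gym memberships, on-site yoga classes, "
    "and comprehensive health insurance."
)
citations = [{"start": 14, "end": 88, "source_ids": ["doc:2"]}]


def annotate(text: str, cites: List[Dict]) -> str:
    """Insert a [source id] marker after each cited span."""
    # Work from the last span to the first so earlier offsets stay valid.
    for cite in sorted(cites, key=lambda c: c["start"], reverse=True):
        marker = "[" + ", ".join(cite["source_ids"]) + "]"
        text = text[: cite["end"]] + marker + text[cite["end"] :]
    return text


# Slicing the response with start/end recovers the cited span.
cited_span = response_text[citations[0]["start"] : citations[0]["end"]]
print(cited_span)
print(annotate(response_text, citations))
```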

Search query generation

The previous example showed how to get started with RAG, and in particular, the augmented generation portion of RAG. But as its name implies, RAG consists of other steps, such as retrieval.

In a basic RAG application, the steps involved are:

Transforming the user message into search queries
Retrieving relevant documents for a given search query
Generating the response and citations
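
The three steps above can be sketched as a small pipeline. This is a minimal outline, not a definitive implementation: answer_with_rag is a hypothetical name, and the three steps are passed in as functions so that the sketch runs with simple stand-ins instead of real API calls.

```python
from typing import Callable, Dict, List


def answer_with_rag(
    message: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[Dict]],
    generate: Callable[[str, List[Dict]], str],
) -> str:
    """Run the three steps in order, using caller-supplied functions."""
    # Step 1: turn the user message into zero or more search queries.
    queries = generate_queries(message)

    # Step 2: retrieve documents for every query, skipping duplicates.
    documents: List[Dict] = []
    for q in queries:
        for doc in retrieve(q):
            if doc not in documents:
                documents.append(doc)

    # Step 3: generate the response, grounded in the retrieved documents.
    return generate(message, documents)


# Stand-in implementations (assumptions for this sketch only).
faq = {"data": {"text": "Working Hours Flexibility: core hours are 9 AM to 5 PM."}}
reply = answer_with_rag(
    "How flexible are the working hours",
    generate_queries=lambda m: [m.lower()],
    retrieve=lambda q: [faq],
    generate=lambda m, docs: f"Answer grounded in {len(docs)} document(s)",
)
print(reply)
```

The sections that follow fill in each of the three stand-ins with a real call.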

Let’s now look at the first step—search query generation. The chatbot needs to generate an optimal set of search queries to use for retrieval.

There are different possible approaches to this. In this example, we’ll take a tool use approach.

Here, we build a tool that takes a user query and returns a list of relevant document snippets for that query. The tool can generate zero, one or multiple search queries depending on the user query.

PYTHON

def generate_search_queries(message: str) -> List[str]:

    # Define the query generation tool
    query_gen_tool = [
        {
            "type": "function",
            "function": {
                "name": "internet_search",
                "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "queries": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "a list of queries to search the internet with.",
                        }
                    },
                    "required": ["queries"],
                },
            },
        }
    ]

    # Define a system instruction to optimize search query generation
    instructions = "Write a search query that will find helpful information for answering the user's question accurately. If you need more than one search query, write a list of search queries. If you decide that a search is very unlikely to find information that would be useful in constructing a response to the user, you should instead directly answer."

    # Generate search queries (if any)
    search_queries = []

    res = co.chat(
        model="command-a-03-2025",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": message},
        ],
        tools=query_gen_tool,
    )

    if res.message.tool_calls:
        for tc in res.message.tool_calls:
            queries = json.loads(tc.function.arguments)["queries"]
            search_queries.extend(queries)

    return search_queries

In the example below, the tool breaks down the user message into two separate queries.

PYTHON

query = "How to stay connected with the company, and do you organize team events?"
queries_for_search = generate_search_queries(query)
print(queries_for_search)

['how to stay connected with the company', 'does the company organize team events']

And in the example below, the tool decides that one query is sufficient.

PYTHON

query = "How flexible are the working hours"
queries_for_search = generate_search_queries(query)
print(queries_for_search)

['how flexible are the working hours at the company']

And in the example below, the tool decides that no retrieval is needed to answer the query.

PYTHON

query = "What is 2 + 2"
queries_for_search = generate_search_queries(query)
print(queries_for_search)

[]
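
The parsing step inside generate_search_queries can be tried in isolation. The sketch below mimics the shape of the tool calls with SimpleNamespace objects, which are stand-ins and not the SDK's types, and covers both the multi-query case and the case where no tool call is returned. The extract_queries helper is hypothetical, and dropping repeated queries is an extra choice made here.

```python
import json
from types import SimpleNamespace
from typing import List


def extract_queries(tool_calls) -> List[str]:
    """Collect the "queries" argument of each tool call, in order, without repeats."""
    queries: List[str] = []
    # tool_calls may be None or empty when the model decides not to search.
    for tc in tool_calls or []:
        for q in json.loads(tc.function.arguments)["queries"]:
            if q not in queries:
                queries.append(q)
    return queries


# Stand-in for the tool calls on a response, built from the first example's output.
fake_calls = [
    SimpleNamespace(
        function=SimpleNamespace(
            arguments=json.dumps(
                {
                    "queries": [
                        "how to stay connected with the company",
                        "does the company organize team events",
                    ]
                }
            )
        )
    )
]

print(extract_queries(fake_calls))
print(extract_queries(None))
```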

Retrieval with Embed

Given the search query, we need a way to retrieve the most relevant documents from a large collection of documents.

This is where we can leverage text embeddings through the Embed endpoint. It enables semantic search, which lets us compare the semantic meaning of the documents and the query. It solves the problem faced by the more traditional approach of lexical search, which is great at finding keyword matches but struggles to capture the context or meaning of a piece of text.
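
To illustrate the limitation of lexical search, here is a rough sketch of keyword matching. It counts the words shared between a query and two of the FAQs used in this tutorial. The tokenizer and the small stopword list are simplifying assumptions, and real lexical search engines are more sophisticated than this.

```python
import re
from typing import Set

# A small, hand-picked stopword list (an assumption for this sketch only).
STOPWORDS = {"a", "and", "how", "my", "the", "to"}


def content_words(text: str) -> Set[str]:
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


sample_query = "How to get to know my teammates"
sample_docs = [
    "Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!",
    "Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.",
]

# Count the content words shared between the query and each document.
overlaps = [
    len(content_words(sample_query) & content_words(doc)) for doc in sample_docs
]
print(overlaps)
```

Neither document shares a content word with the query, even though both are relevant to it. This is the gap that embedding-based semantic search is meant to close.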

The Embed endpoint takes in texts as input and returns embeddings as output.

First, we need to embed the documents to search from. We call the Embed endpoint using co.embed() and pass the following arguments:

model: Here we choose embed-v4.0
input_type: We choose search_document to ensure the model treats these as the documents (instead of the query) for search
texts: The list of texts (the FAQs)

PYTHON

# Define the documents
faqs_long = [
    {
        "data": {
            "text": "Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged."
        }
    },
    {
        "data": {
            "text": "Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee."
        }
    },
    {
        "data": {
            "text": "Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!"
        }
    },
    {
        "data": {
            "text": "Working Hours Flexibility: We prioritize work-life balance. While our core hours are 9 AM to 5 PM, we offer flexibility to adjust as needed."
        }
    },
    {
        "data": {
            "text": "Side Projects Policy: We encourage you to pursue your passions. Just be mindful of any potential conflicts of interest with our business."
        }
    },
    {
        "data": {
            "text": "Reimbursing Travel Expenses: Easily manage your travel expenses by submitting them through our finance tool. Approvals are prompt and straightforward."
        }
    },
    {
        "data": {
            "text": "Working from Abroad: Working remotely from another country is possible. Simply coordinate with your manager and ensure your availability during core hours."
        }
    },
    {
        "data": {
            "text": "Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance."
        }
    },
    {
        "data": {
            "text": "Performance Reviews Frequency: We conduct informal check-ins every quarter and formal performance reviews twice a year."
        }
    },
    {
        "data": {
            "text": "Proposing New Ideas: Innovation is welcomed! Share your brilliant ideas at our weekly team meetings or directly with your team lead."
        }
    },
]

# Embed the documents
doc_emb = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    texts=[doc["data"]["text"] for doc in faqs_long],
    embedding_types=["float"],
).embeddings.float

Next, we add a query, which asks about how to get to know the team.

We choose search_query as the input_type to ensure the model treats this as the query (instead of the documents) for search.

PYTHON

# Add the user query
query = "How to get to know my teammates"

# Generate the search query
# Note: For simplicity, we are assuming only one query generated. For actual implementations, you will need to perform search for each query.
queries_for_search = generate_search_queries(query)[0]
print("Search query: ", queries_for_search)

# Embed the search query
query_emb = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    texts=[queries_for_search],
    embedding_types=["float"],
).embeddings.float

Search query:  how to get to know teammates

Now, we want to search for the documents most relevant to the query. For this, we use the numpy library to compute the similarity between each query-document pair using the dot product.

Each query-document pair returns a score, which represents how similar the pair is. We then sort these scores in descending order and select the most similar pairs. Here we keep the top 5, which is an arbitrary choice; you can choose any number.

Here, we show the most relevant documents with their similarity scores.

PYTHON

# Compute dot product similarity and display results
n = 5
scores = np.dot(query_emb, np.transpose(doc_emb))[0]
max_idx = np.argsort(-scores)[:n]

retrieved_documents = [faqs_long[item] for item in max_idx]

for rank, idx in enumerate(max_idx):
    print(f"Rank: {rank+1}")
    print(f"Score: {scores[idx]}")
    print(f"Document: {retrieved_documents[rank]}\n")

Rank: 1
Score: 0.34212792245283796
Document: {'data': {'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'}}
Rank: 2
Score: 0.2883222063024371
Document: {'data': {'text': 'Proposing New Ideas: Innovation is welcomed! Share your brilliant ideas at our weekly team meetings or directly with your team lead.'}}
Rank: 3
Score: 0.278128283997032
Document: {'data': {'text': 'Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.'}}
Rank: 4
Score: 0.19474858706643985
Document: {'data': {'text': "Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee."}}
Rank: 5
Score: 0.13713692506528824
Document: {'data': {'text': 'Side Projects Policy: We encourage you to pursue your passions. Just be mindful of any potential conflicts of interest with our business.'}}
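
The example above searches with a single query. When several search queries are generated, the search has to run once per query and the results have to be combined. The sketch below shows one possible way to do that, scoring each document by its best-matching query. That merging rule is an assumption of this sketch, not something the tutorial specifies. It uses tiny made-up vectors and plain Python in place of real embeddings, and retrieve_for_queries is a hypothetical helper.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_for_queries(
    query_embs: List[Sequence[float]],
    doc_embs: List[Sequence[float]],
    n: int,
) -> List[int]:
    """Return the indices of the top-n documents across all queries."""
    best: Dict[int, float] = {}
    for q in query_embs:
        for idx, d in enumerate(doc_embs):
            score = dot(q, d)
            # Keep the highest score a document achieves for any query.
            if idx not in best or score > best[idx]:
                best[idx] = score
    ranked = sorted(best, key=lambda i: best[i], reverse=True)
    return ranked[:n]


# Toy 2-dimensional vectors, invented for illustration.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_queries = [[0.9, 0.1], [0.2, 0.8]]
top_docs = retrieve_for_queries(toy_queries, toy_docs, n=2)
print(top_docs)
```

Because each document appears at most once in the result, a document that matches several queries is not passed to the model more than once.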

Reranking with Rerank

Reranking can further boost the results from semantic or lexical search. The Rerank endpoint takes a list of search results and reorders them by their relevance to a query. This requires just a single line of code to implement.

We call the endpoint using co.rerank() and pass the following arguments:

query: The user query
documents: The list of documents we get from the semantic search results
top_n: The top reranked documents to select
model: We choose rerank-english-v3.0

Looking at the results, we see that, given a query about getting to know the team, the document that talks about joining Slack channels is now ranked higher (1st) than before (3rd).

Here we set top_n to 2; these two documents are what we pass next for response generation.

PYTHON

# Rerank the documents
results = co.rerank(
    query=queries_for_search,
    documents=[doc["data"]["text"] for doc in retrieved_documents],
    top_n=2,
    model="rerank-english-v3.0",
)

# Display the reranking results
for idx, result in enumerate(results.results):
    print(f"Rank: {idx+1}")
    print(f"Score: {result.relevance_score}")
    print(f"Document: {retrieved_documents[result.index]}\n")

reranked_documents = [
    retrieved_documents[result.index] for result in results.results
]

Rank: 1
Score: 0.0020507434
Document: {'data': {'text': 'Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.'}}
Rank: 2
Score: 0.0014158706
Document: {'data': {'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'}}
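
Note that result.index is used above to look up retrieved_documents, the list that was passed to the endpoint, and not the full FAQ list. The sketch below traces the two reranked results back to their positions in faqs_long. It uses a stand-in dataclass instead of the SDK's result type and hard-codes the index values implied by the outputs shown in this tutorial.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeRerankResult:
    """Stand-in for one rerank result, with only the two fields used above."""

    index: int
    relevance_score: float


# Positions in faqs_long of the five retrieved documents, in the order shown
# in the semantic search output (Team-Building, Proposing New Ideas, Slack,
# Coffee, Side Projects).
max_idx: List[int] = [2, 9, 0, 1, 4]

# The two reranked results shown above: Slack channels, then Team-Building.
fake_results = [
    FakeRerankResult(index=2, relevance_score=0.0020507434),
    FakeRerankResult(index=0, relevance_score=0.0014158706),
]

# Map each result back to its position in the original FAQ list.
original_positions = [max_idx[r.index] for r in fake_results]
print(original_positions)
```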

Response and citation generation

Finally, we reach the step that we saw in the earlier “Basic RAG” section.

To call the Chat API with RAG, we pass the following parameters. This tells the model to run in RAG-mode and use these documents in its response.

model for the model ID
messages for the user’s query.
documents for defining the documents.

The response is then generated based on the query and the retrieved documents.

RAG introduces additional objects in the Chat response. One of them is citations, which contains details about:

specific text spans from the retrieved documents on which the response is grounded.
the documents referenced in the citations.

PYTHON

# Generate the response
response = co.chat(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": query}],
    documents=reranked_documents,
)

# Display the response
print(response.message.content[0].text)

# Display the citations and source documents
if response.message.citations:
    print("\nCITATIONS:")
    for citation in response.message.citations:
        print(citation, "\n")

You can get to know your teammates by joining relevant Slack channels and engaging in team-building activities. These activities include monthly outings and weekly game nights. You are also welcome to suggest new activity ideas.
CITATIONS:
start=38 end=69 text='joining relevant Slack channels' sources=[DocumentSource(type='document', id='doc:0', document={'id': 'doc:0', 'text': 'Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.'})] 
start=86 end=111 text='team-building activities.' sources=[DocumentSource(type='document', id='doc:1', document={'id': 'doc:1', 'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'})] 
start=137 end=176 text='monthly outings and weekly game nights.' sources=[DocumentSource(type='document', id='doc:1', document={'id': 'doc:1', 'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'})] 
start=201 end=228 text='suggest new activity ideas.' sources=[DocumentSource(type='document', id='doc:1', document={'id': 'doc:1', 'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'})]
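
With several citations, it can help to see which parts of the response each document supports. The sketch below groups the cited spans from the output above by source document id. The citations are written as plain dicts, an assumption made so that the example runs without an API call, and group_by_source is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

# Cited spans and source ids copied from the sample output above.
sample_citations = [
    {"text": "joining relevant Slack channels", "source_ids": ["doc:0"]},
    {"text": "team-building activities.", "source_ids": ["doc:1"]},
    {"text": "monthly outings and weekly game nights.", "source_ids": ["doc:1"]},
    {"text": "suggest new activity ideas.", "source_ids": ["doc:1"]},
]


def group_by_source(cites: List[Dict]) -> Dict[str, List[str]]:
    """Map each source document id to the response spans it supports."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for cite in cites:
        for source_id in cite["source_ids"]:
            grouped[source_id].append(cite["text"])
    return dict(grouped)


for source_id, spans in group_by_source(sample_citations).items():
    print(source_id, spans)
```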

Conclusion

In this tutorial, you learned about:

How to get started with RAG
How to generate search queries
How to perform retrieval with Embed
How to perform reranking with Rerank
How to generate responses and citations

RAG is great for building applications that answer questions by grounding the response in external documents. But you can go beyond answering questions and also automate tasks, using a technique called tool use.

In Part 7, you will learn how to leverage tool use to automate tasks and workflows.