Building RAG models with Cohere

The Chat endpoint provides comprehensive support for various text generation use cases, including retrieval-augmented generation (RAG).

While LLMs are good at maintaining the context of a conversation and generating responses, they are prone to hallucination and can include factually incorrect or incomplete information in their responses.

RAG enables a model to access and utilize supplementary information from external documents, thereby improving the accuracy of its responses.

When using RAG with the Chat endpoint, these responses are backed by fine-grained citations linking to the source documents. This makes the responses easily verifiable.

In this tutorial, you’ll learn about:

Basic RAG
Search query generation
Retrieval with Embed
Reranking with Rerank
Response and citation generation

You’ll learn these by building an onboarding assistant for new hires.

Setup

To get started, first we need to install the cohere library and create a Cohere client.

PYTHON

# pip install cohere numpy

import numpy as np
import cohere

# Get your API key: https://dashboard.cohere.com/api-keys
co = cohere.Client("COHERE_API_KEY")

Basic RAG

To see how RAG works, let’s define the documents that the application has access to. We’ll use a short list of documents consisting of internal FAQs about the fictitious company Co1t (in production, the document collection would typically be much larger).

In this example, each document is a dictionary with one field, text. But we can define any number of fields we want, depending on the nature of the documents. For example, emails could contain title and text fields.
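
As a rough illustration of multi-field documents, the sketch below splits a raw FAQ string into separate title and text fields. The helper name `to_document` and the "Title: body" convention are assumptions made for this example; they are not part of the Cohere SDK.

```python
def to_document(raw: str) -> dict:
    """Split a 'Title: body' string into a dict with title and text fields."""
    title, sep, body = raw.partition(": ")
    if not sep:
        # No separator found: keep everything in the text field
        return {"text": raw}
    return {"title": title, "text": body}


# Hypothetical raw inputs for illustration
raw_faqs = [
    "Working from Abroad: Working remotely from another country is possible.",
    "A note without a title",
]
documents = [to_document(raw) for raw in raw_faqs]
print(documents)
```

Whether separate fields help depends on the data; the examples in this tutorial keep everything in a single text field.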

PYTHON

# Define the documents
faqs_short = [
    {
        "text": "Reimbursing Travel Expenses: Easily manage your travel expenses by submitting them through our finance tool. Approvals are prompt and straightforward."
    },
    {
        "text": "Working from Abroad: Working remotely from another country is possible. Simply coordinate with your manager and ensure your availability during core hours."
    },
    {
        "text": "Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance."
    },
    {
        "text": "Performance Reviews Frequency: We conduct informal check-ins every quarter and formal performance reviews twice a year."
    },
]

To use these documents, we pass them to the documents parameter in the Chat endpoint call. This tells the model to run in RAG mode and use these documents in its response.

Let’s create a query asking about the company’s support for personal well-being. The model cannot answer this from the data it was trained on, so it will need to use the external documents.

RAG introduces additional objects in the Chat response. Here we display two:

citations: indicate the specific text spans from the retrieved documents on which the response is grounded.
documents: the documents referenced in the citations, each identified by an ID.

PYTHON

# Add the user query
query = "Are there fitness-related perks?"

# Generate the response
response = co.chat(
    message=query,
    model="command-a-03-2025",
    documents=faqs_short,
)

# Display the response
print(response.text)

# Display the citations and source documents
if response.citations:
    print("\nCITATIONS:")
    for citation in response.citations:
        print(citation)

    print("\nDOCUMENTS:")
    for document in response.documents:
        print(document)

Yes, we offer health and wellness benefits, including gym memberships, on-site yoga classes, and comprehensive health insurance.
CITATIONS:
start=14 end=42 text='health and wellness benefits' document_ids=['doc_2']
start=54 end=69 text='gym memberships' document_ids=['doc_2']
start=71 end=91 text='on-site yoga classes' document_ids=['doc_2']
start=97 end=128 text='comprehensive health insurance.' document_ids=['doc_2']
DOCUMENTS:
{'id': 'doc_2', 'text': 'Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance.'}
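
The start and end values in each citation are character offsets into the response text, so they can be used to mark up the answer. The following is a minimal sketch of that idea. It uses plain dictionaries as stand-ins for the SDK's citation objects, and the helper name `annotate` is hypothetical.

```python
def annotate(text: str, citations: list) -> str:
    """Append [doc_id] markers after each cited span of the response text."""
    # Work from the last span to the first so earlier offsets stay valid
    for c in sorted(citations, key=lambda c: c["start"], reverse=True):
        marker = "[" + ", ".join(c["document_ids"]) + "]"
        text = text[: c["end"]] + marker + text[c["end"]:]
    return text


response_text = (
    "Yes, we offer health and wellness benefits, including gym memberships, "
    "on-site yoga classes, and comprehensive health insurance."
)
# Stand-ins for the first two citations shown in the output above
citations = [
    {"start": 14, "end": 42, "document_ids": ["doc_2"]},
    {"start": 54, "end": 69, "document_ids": ["doc_2"]},
]

print(response_text[14:42])  # the span covered by the first citation
print(annotate(response_text, citations))
```

Inserting markers from the end of the text backwards avoids having to shift the remaining offsets after each insertion.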

Search query generation

The previous example showed how to get started with RAG, and in particular, the augmented generation portion of RAG. But as its name implies, RAG consists of other steps, such as retrieval.

In a basic RAG application, the steps involved are:

Transforming the user message into search queries
Retrieving relevant documents for a given search query
Generating the response and citations

Let’s now look at the first step—search query generation. The chatbot needs to generate an optimal set of search queries to use for retrieval.

The Chat endpoint has a feature that handles this for us automatically. This is done by adding the search_queries_only=True parameter to the Chat endpoint call.

It will generate a list of search queries based on a user message. Depending on the message, it can be one or more queries.

In the example below, the model breaks the user message down into two separate queries.

PYTHON

# Add the user query
query = "How to stay connected with the company and do you organize team events?"

# Generate the search queries
response = co.chat(message=query, search_queries_only=True)

queries = []
for r in response.search_queries:
    queries.append(r.text)

print(queries)

['staying connected with the company', 'team events']

And in the example below, the model decides that one query is sufficient.

PYTHON

# Add the user query
query = "How flexible are the working hours"

# Generate the search queries
response = co.chat(message=query, search_queries_only=True)

queries = []
for r in response.search_queries:
    queries.append(r.text)

print(queries)

['working hours flexibility']
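
When a message yields several queries, each query would normally be retrieved separately and the results combined. The sketch below shows one possible way to do that, keeping each document's best score. The helper names and the scores are made up for illustration, and the fallback assumes the list of generated queries can come back empty for some messages.

```python
def queries_or_fallback(queries, message):
    """Use the generated queries, or the raw message if none were generated."""
    return queries if queries else [message]


def merge_hits(hits_per_query):
    """Merge ranked (doc_index, score) lists, keeping each document's best score."""
    best = {}
    for hits in hits_per_query:
        for doc_index, score in hits:
            if doc_index not in best or score > best[doc_index]:
                best[doc_index] = score
    # Highest score first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical retrieval results for the two queries above
hits_connected = [(0, 0.41), (2, 0.30)]
hits_events = [(2, 0.52), (9, 0.22)]

print(merge_hits([hits_connected, hits_events]))
# [(2, 0.52), (0, 0.41), (9, 0.22)]
```

The rest of this tutorial keeps things simple and uses only the first generated query.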

Retrieval with Embed

Given the search query, we need a way to retrieve the most relevant documents from a large collection of documents.

This is where we can leverage text embeddings through the Embed endpoint. It enables semantic search, which lets us compare the semantic meaning of the documents and the query. It solves the problem faced by the more traditional approach of lexical search, which is great at finding keyword matches but struggles to capture the context or meaning of a piece of text.
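
To make the limitation of lexical search concrete, here is a toy keyword-overlap scorer. It is a deliberately simplistic stand-in for lexical search (real systems use methods such as BM25), and the function name `keyword_overlap` is hypothetical.

```python
import re


def keyword_overlap(query: str, document: str) -> int:
    """Count distinct lowercase words shared by the query and the document."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    doc_words = set(re.findall(r"[a-z]+", document.lower()))
    return len(query_words & doc_words)


doc = (
    "Team-Building Activities: We foster team spirit with monthly outings "
    "and weekly game nights."
)

print(keyword_overlap("team events", doc))                      # 1
print(keyword_overlap("How to get to know my teammates", doc))  # 0
```

The second query is clearly related to the document, yet it shares no exact words with it. This is the kind of gap that embedding-based search is meant to close.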

The Embed endpoint takes in texts as input and returns embeddings as output.

First, we need to embed the documents to search from. We call the Embed endpoint using co.embed() and pass the following arguments:

model: Here we choose embed-v4.0
input_type: We choose search_document to ensure the model treats these as the documents (instead of the query) for search
texts: The list of texts (the FAQs)

PYTHON

# Define the documents
faqs_long = [
    {
        "text": "Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged."
    },
    {
        "text": "Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee."
    },
    {
        "text": "Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!"
    },
    {
        "text": "Working Hours Flexibility: We prioritize work-life balance. While our core hours are 9 AM to 5 PM, we offer flexibility to adjust as needed."
    },
    {
        "text": "Side Projects Policy: We encourage you to pursue your passions. Just be mindful of any potential conflicts of interest with our business."
    },
    {
        "text": "Reimbursing Travel Expenses: Easily manage your travel expenses by submitting them through our finance tool. Approvals are prompt and straightforward."
    },
    {
        "text": "Working from Abroad: Working remotely from another country is possible. Simply coordinate with your manager and ensure your availability during core hours."
    },
    {
        "text": "Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance."
    },
    {
        "text": "Performance Reviews Frequency: We conduct informal check-ins every quarter and formal performance reviews twice a year."
    },
    {
        "text": "Proposing New Ideas: Innovation is welcomed! Share your brilliant ideas at our weekly team meetings or directly with your team lead."
    },
]

# Embed the documents
doc_emb = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    texts=[doc["text"] for doc in faqs_long],
).embeddings

Next, we add a query, which asks about how to get to know the team.

We choose search_query as the input_type to ensure the model treats this as the query (instead of the documents) for search.

PYTHON

# Add the user query
query = "How to get to know my teammates"

# Generate the search query
response = co.chat(message=query, search_queries_only=True)
query_optimized = response.search_queries[0].text

# Embed the search query
query_emb = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    texts=[query_optimized],
).embeddings

Now, we want to find the documents most relevant to the query. For this, we use the numpy library to compute the similarity between each query-document pair using the dot product.

Each query-document pair returns a score that represents how similar the pair is. We then sort these scores in descending order and select the top n most similar pairs. Here we choose n = 5, which is an arbitrary choice; you can choose any number.

Here, we show the most relevant documents with their similarity scores.

PYTHON

# Compute dot product similarity and display results
n = 5
scores = np.dot(query_emb, np.transpose(doc_emb))[0]
scores_sorted = sorted(
    enumerate(scores), key=lambda x: x[1], reverse=True
)[:n]

retrieved_documents = [faqs_long[item[0]] for item in scores_sorted]

for idx, item in enumerate(scores_sorted):
    print(f"Rank: {idx+1}")
    print(f"Score: {item[1]}")
    print(f"Document: {faqs_long[item[0]]}\n")

Rank: 1
Score: 0.32675385963873044
Document: {'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'}
Rank: 2
Score: 0.2683516879250747
Document: {'text': 'Proposing New Ideas: Innovation is welcomed! Share your brilliant ideas at our weekly team meetings or directly with your team lead.'}
Rank: 3
Score: 0.25784017142593213
Document: {'text': 'Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.'}
Rank: 4
Score: 0.18610347850687634
Document: {'text': "Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee."}
Rank: 5
Score: 0.12958686394309055
Document: {'text': 'Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance.'}
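
To see what the dot-product ranking does on small numbers, the sketch below repeats the same scoring with hand-made 2-D vectors and no external libraries. The vectors and helper names are invented for illustration; real embeddings have many more dimensions. A cosine-similarity helper is included for comparison, since the two measures agree on ranking only when all vectors have the same length.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def top_n(query_vec, doc_vecs, n):
    """Return (doc_index, score) pairs for the n highest dot-product scores."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Toy vectors standing in for embeddings
toy_query = [1.0, 0.0]
toy_docs = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

print(top_n(toy_query, toy_docs, n=2))
# [(1, 1.0), (0, 0.5)]
print(cosine(toy_query, toy_docs[0]))
```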

Reranking with Rerank

Reranking can further improve the results from semantic or lexical search. The Rerank endpoint takes a list of search results and reorders them by their relevance to a query. This requires just a single call to implement.

We call the endpoint using co.rerank() and pass the following arguments:

query: The user query
documents: The list of documents we get from the semantic search results
top_n: The top reranked documents to select
model: We choose rerank-english-v3.0

Looking at the results, we see that, given a query about getting to know the team, the document that talks about joining Slack channels is now ranked higher (1st) compared to earlier (3rd).

Here we select top_n to be 2, which will be the documents we will pass next for response generation.

PYTHON

# Rerank the documents
results = co.rerank(
    query=query_optimized,
    documents=retrieved_documents,
    top_n=2,
    model="rerank-english-v3.0",
)

# Display the reranking results
for idx, result in enumerate(results.results):
    print(f"Rank: {idx+1}")
    print(f"Score: {result.relevance_score}")
    print(f"Document: {retrieved_documents[result.index]}\n")

reranked_documents = [
    retrieved_documents[result.index] for result in results.results
]

Rank: 1
Score: 0.0040072887
Document: {'text': 'Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.'}
Rank: 2
Score: 0.0020829707
Document: {'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'}
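
Each rerank result carries an index that points back into the list of documents passed to the endpoint, which is how the code above recovers the document text. The sketch below isolates that mapping using stand-in result objects, so it runs without an API call. The helper `pick_reranked` and its optional `min_score` cutoff are hypothetical additions, not SDK features.

```python
from types import SimpleNamespace


def pick_reranked(documents, results, min_score=0.0):
    """Map rerank results back to documents, dropping scores below min_score."""
    return [
        documents[r.index]
        for r in results
        if r.relevance_score >= min_score
    ]


# Shortened stand-ins for the first three retrieved documents, in retrieval order
candidates = [
    {"text": "Team-Building Activities"},
    {"text": "Proposing New Ideas"},
    {"text": "Joining Slack Channels"},
]
# Stand-in results mirroring the scores shown above
mock_results = [
    SimpleNamespace(index=2, relevance_score=0.0040072887),
    SimpleNamespace(index=0, relevance_score=0.0020829707),
]

print(pick_reranked(candidates, mock_results))
print(pick_reranked(candidates, mock_results, min_score=0.003))
```

A score cutoff is a judgment call that depends on the data; a fixed top_n, as used in this tutorial, is the simpler option.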

Response and citation generation

Finally, we reach the step that we saw in the earlier Basic RAG section. Here, the response is generated based on the query and the retrieved documents.

RAG introduces additional objects in the Chat response. Here we display two:

citations: indicate the specific spans of text from the retrieved documents on which the response is grounded.
documents: the documents referenced in the citations, each identified by an ID.

PYTHON

# Generate the response
response = co.chat(
    message=query_optimized,
    model="command-a-03-2025",
    documents=reranked_documents,
)

# Display the response
print(response.text)

# Display the citations and source documents
if response.citations:
    print("\nCITATIONS:")
    for citation in response.citations:
        print(citation)

    print("\nDOCUMENTS:")
    for document in response.documents:
        print(document)

There are a few ways to get to know your teammates. You could join your company's Slack channels to stay informed and connected. You could also take part in team-building activities, such as outings and game nights.
CITATIONS:
start=62 end=96 text="join your company's Slack channels" document_ids=['doc_0']
start=100 end=128 text='stay informed and connected.' document_ids=['doc_0']
start=157 end=181 text='team-building activities' document_ids=['doc_1']
start=191 end=215 text='outings and game nights.' document_ids=['doc_1']
DOCUMENTS:
{'id': 'doc_0', 'text': 'Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.'}
{'id': 'doc_1', 'text': 'Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!'}
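
The four steps can be gathered into a single function. The following is a rough sketch rather than a reference implementation: it reuses the same calls and model names as the tutorial, while the function name `answer_with_rag`, its parameters, and the `StubClient` class are hypothetical. The stub exists only so the sketch can be dry-run without an API key; in practice you would pass the real client (co).

```python
from types import SimpleNamespace as NS


def answer_with_rag(co, message, documents, top_k=5, top_n=2):
    """Run query generation, retrieval, reranking, and generation in sequence."""
    # 1. Search query generation (fall back to the raw message if none returned)
    search = co.chat(message=message, search_queries_only=True)
    query = search.search_queries[0].text if search.search_queries else message

    # 2. Retrieval: embed documents and query, then rank by dot product
    texts = [doc["text"] for doc in documents]
    doc_emb = co.embed(
        model="embed-v4.0", input_type="search_document", texts=texts
    ).embeddings
    query_emb = co.embed(
        model="embed-v4.0", input_type="search_query", texts=[query]
    ).embeddings[0]
    scores = [sum(q * d for q, d in zip(query_emb, emb)) for emb in doc_emb]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    retrieved = [documents[i] for i in ranked[:top_k]]

    # 3. Reranking
    reranked = co.rerank(
        query=query, documents=retrieved, top_n=top_n, model="rerank-english-v3.0"
    )
    selected = [retrieved[r.index] for r in reranked.results]

    # 4. Response and citation generation
    return co.chat(message=query, model="command-a-03-2025", documents=selected)


class StubClient:
    """Offline stand-in for the Cohere client (hypothetical, dry run only)."""

    def chat(self, message, model=None, documents=None, search_queries_only=False):
        if search_queries_only:
            return NS(search_queries=[NS(text=message)])
        return NS(text="stub answer", documents=documents)

    def embed(self, model, input_type, texts):
        if input_type == "search_query":
            return NS(embeddings=[[1.0, 0.0]])
        # Fixed 2-D vectors, one per input text
        return NS(embeddings=[[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]][: len(texts)])

    def rerank(self, query, documents, top_n, model):
        keep = min(top_n, len(documents))
        return NS(results=[NS(index=i, relevance_score=1.0) for i in range(keep)])


toy_docs = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
result = answer_with_rag(StubClient(), "hi", toy_docs, top_k=2, top_n=1)
print(result.documents)
```

Note that this sketch embeds the documents on every call. For a larger collection, it would make more sense to embed the documents once and reuse the embeddings across queries.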

Conclusion

In this tutorial, you learned about:

How to get started with RAG
How to generate search queries
How to perform retrieval with Embed
How to perform reranking with Rerank
How to generate responses and citations

RAG is great for building applications that answer questions by grounding their responses in external documents. Beyond answering questions, you can also automate tasks by using a technique called tool use.

In Part 7, you will learn how to leverage tool use to automate tasks and workflows.