Generate Parallel Queries for Better RAG Retrieval

Compare two user queries to a RAG chatbot: “What was Apple’s revenue in 2023?” and “What were Apple’s and Google’s revenue in 2023?”.

The first query is straightforward: we can perform retrieval using essentially the same query the user provides.

But the second query is more complex. We need to break it down into two separate queries, one for Apple and one for Google.

This is an example of a query that requires query expansion. Here, the agentic RAG system needs to transform the original query into an optimized set of queries to use for retrieval.

In this part, we’ll learn how to create an agentic RAG system that can perform query expansion and then run those queries in parallel:

  • Query expansion
  • Query expansion over multiple data sources
  • Query expansion in multi-turn conversations

We’ll learn these by building an agent that answers questions about using Cohere.

Setup

To get started, first we need to install the cohere library and create a Cohere client.

We also need to import the tool definitions that we’ll use in this tutorial.

Important: the source code for tool definitions can be found here. Make sure to have the tool_def.py file in the same directory as this notebook for the imports to work correctly.
PYTHON
! pip install cohere langchain langchain-community pydantic -qq
PYTHON
import json
import os
import cohere

from tool_def import (
    search_developer_docs,
    search_developer_docs_tool,
    search_internet,
    search_internet_tool,
    search_code_examples,
    search_code_examples_tool,
)

co = cohere.ClientV2(
    "COHERE_API_KEY"
)  # Get your free API key: https://dashboard.cohere.com/api-keys

os.environ["TAVILY_API_KEY"] = (
    "TAVILY_API_KEY"  # We'll need the Tavily API key to perform internet search. Get your API key: https://app.tavily.com/home
)

Setting up the tools

We set up the same set of tools as in Part 1; check out Part 1 for further details on how they are defined.

PYTHON
functions_map = {
    "search_developer_docs": search_developer_docs,
    "search_internet": search_internet,
    "search_code_examples": search_code_examples,
}
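To illustrate the role of this mapping: the model returns each tool call as a function name plus JSON-encoded arguments, and functions_map routes the call to the matching Python function. A minimal self-contained sketch, using a hypothetical stub in place of the real tool:

```python
import json

# Hypothetical stub standing in for the real search_developer_docs tool
def stub_docs_search(query):
    return [{"title": "Chat endpoint", "text": f"Documentation about {query}"}]

demo_functions_map = {"search_developer_docs": stub_docs_search}

# A tool call arrives as a function name plus a JSON string of arguments
tool_name = "search_developer_docs"
tool_args = '{"query": "Chat endpoint"}'

# Dispatch: look up the function by name and unpack the decoded arguments
result = demo_functions_map[tool_name](**json.loads(tool_args))
```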

Running an agentic RAG workflow

We create a run_agent function to run the agentic RAG workflow, the same as in Part 1; check out Part 1 for further details on how it works.

PYTHON
tools = [
    search_developer_docs_tool,
    search_internet_tool,
    search_code_examples_tool,
]
PYTHON
system_message = """## Task and Context
You are an assistant who helps developers use Cohere. You are equipped with a number of tools that can provide different types of information. If you can't find the information you need from one tool, you should try other tools if there is a possibility that they could provide the information you need."""
PYTHON
model = "command-a-03-2025"


def run_agent(query, messages=None):
    if messages is None:
        messages = []

    if "system" not in {m.get("role") for m in messages}:
        messages.append({"role": "system", "content": system_message})

    # Step 1: Get user message
    print(f"QUESTION:\n{query}")
    print("=" * 50)

    messages.append({"role": "user", "content": query})

    # Step 2: Generate tool calls (if any)
    response = co.chat(
        model=model, messages=messages, tools=tools, temperature=0.3
    )

    while response.message.tool_calls:

        print("TOOL PLAN:")
        print(response.message.tool_plan, "\n")
        print("TOOL CALLS:")
        for tc in response.message.tool_calls:
            print(
                f"Tool name: {tc.function.name} | Parameters: {tc.function.arguments}"
            )
        print("=" * 50)

        messages.append(
            {
                "role": "assistant",
                "tool_calls": response.message.tool_calls,
                "tool_plan": response.message.tool_plan,
            }
        )

        # Step 3: Get tool results
        for tc in response.message.tool_calls:
            tool_result = functions_map[tc.function.name](
                **json.loads(tc.function.arguments)
            )
            tool_content = []
            for data in tool_result:
                tool_content.append(
                    {
                        "type": "document",
                        "document": {"data": json.dumps(data)},
                    }
                )
                # Optional: add an "id" field in the "document" object, otherwise IDs are auto-generated
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": tool_content,
                }
            )

        # Step 4: Generate response and citations
        response = co.chat(
            model=model,
            messages=messages,
            tools=tools,
            temperature=0.3,
        )

    messages.append(
        {
            "role": "assistant",
            "content": response.message.content[0].text,
        }
    )

    # Print final response
    print("RESPONSE:")
    print(response.message.content[0].text)
    print("=" * 50)

    # Print citations (if any)
    verbose_source = (
        False  # Change to True to display the contents of a source
    )
    if response.message.citations:
        print("CITATIONS:\n")
        for citation in response.message.citations:
            print(
                f"Start: {citation.start}| End:{citation.end}| Text:'{citation.text}' "
            )
            print("Sources:")
            for idx, source in enumerate(citation.sources):
                print(f"{idx+1}. {source.id}")
                if verbose_source:
                    print(f"{source.tool_output}")
            print("\n")

    return messages

Query expansion

Let’s ask the agent a few questions, starting with this one about the Chat endpoint and the RAG feature.

First, the agent correctly chooses the search_developer_docs tool to retrieve the information it needs.

Additionally, because the question asks about two different things, retrieving information using the same query as the user’s may not be the optimal approach. Instead, the query needs to be expanded or split into multiple parts, each retrieving its own set of documents.

Thus, the agent expands the original query into two queries.

This is enabled by the parallel tool calling feature that comes with the Chat endpoint.

This results in a richer and more representative list of documents retrieved, and therefore a more accurate and comprehensive answer.

PYTHON
messages = run_agent("Explain the Chat endpoint and the RAG feature")
QUESTION:
Explain the Chat endpoint and the RAG feature
==================================================
TOOL PLAN:
I will search the Cohere developer documentation for the Chat endpoint and the RAG feature.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"Chat endpoint"}
Tool name: search_developer_docs | Parameters: {"query":"RAG feature"}
==================================================
RESPONSE:
The Chat endpoint facilitates a conversational interface, allowing users to send messages to the model and receive text responses.

Retrieval Augmented Generation (RAG) is a method for generating text using additional information fetched from an external data source, which can greatly increase the accuracy of the response.
==================================================
CITATIONS:

Start: 18| End:56| Text:'facilitates a conversational interface'
Sources:
1. search_developer_docs_c059cbhr042g:3
2. search_developer_docs_beycjq0ejbvx:3


Start: 58| End:130| Text:'allowing users to send messages to the model and receive text responses.'
Sources:
1. search_developer_docs_c059cbhr042g:3
2. search_developer_docs_beycjq0ejbvx:3


Start: 132| End:162| Text:'Retrieval Augmented Generation'
Sources:
1. search_developer_docs_c059cbhr042g:4
2. search_developer_docs_beycjq0ejbvx:4


Start: 174| End:266| Text:'method for generating text using additional information fetched from an external data source'
Sources:
1. search_developer_docs_c059cbhr042g:4
2. search_developer_docs_beycjq0ejbvx:4


Start: 278| End:324| Text:'greatly increase the accuracy of the response.'
Sources:
1. search_developer_docs_c059cbhr042g:4
2. search_developer_docs_beycjq0ejbvx:4

Query expansion over multiple data sources

The earlier example focused on a single data source, the Cohere developer documentation. However, the agentic RAG can also perform query expansion over multiple data sources.

Here, the agent is asked a question that contains two parts: first asking for an explanation of the Embed endpoint and then asking for code examples.

It correctly identifies that this requires both searching the developer documentation and the code examples. Thus, it generates two queries, one for each data source, and performs two separate searches in parallel.

Its response then contains information referenced from both data sources.

PYTHON
messages = run_agent(
    "What is the Embed endpoint? Give me some code tutorials"
)
QUESTION:
What is the Embed endpoint? Give me some code tutorials
==================================================
TOOL PLAN:
I will search for 'what is the Embed endpoint' and 'Embed endpoint code tutorials' at the same time.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"what is the Embed endpoint"}
Tool name: search_code_examples | Parameters: {"query":"Embed endpoint code tutorials"}
==================================================
RESPONSE:
The Embed endpoint returns text embeddings. An embedding is a list of floating point numbers that captures semantic information about the text that it represents.

I'm afraid I couldn't find any code tutorials for the Embed endpoint.
==================================================
CITATIONS:

Start: 19| End:43| Text:'returns text embeddings.'
Sources:
1. search_developer_docs_pgzdgqd3k0sd:1


Start: 62| End:162| Text:'list of floating point numbers that captures semantic information about the text that it represents.'
Sources:
1. search_developer_docs_pgzdgqd3k0sd:1
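Note that while the model generates the expanded tool calls together, the run_agent loop executes them one after the other. If latency matters, the retrievals themselves can also be run concurrently. A hedged sketch using a thread pool, with hypothetical stubs in place of the real tools:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs standing in for the real tools
def stub_docs_search(query):
    return [{"source": "docs", "query": query}]

def stub_code_search(query):
    return [{"source": "code", "query": query}]

demo_functions_map = {
    "search_developer_docs": stub_docs_search,
    "search_code_examples": stub_code_search,
}

# Expanded queries as (tool name, JSON arguments) pairs, mirroring
# what the model returns in response.message.tool_calls
tool_calls = [
    ("search_developer_docs", '{"query": "what is the Embed endpoint"}'),
    ("search_code_examples", '{"query": "Embed endpoint code tutorials"}'),
]

def execute(call):
    name, args = call
    return demo_functions_map[name](**json.loads(args))

# Run both retrievals concurrently instead of one after the other
with ThreadPoolExecutor() as pool:
    results = list(pool.map(execute, tool_calls))
```

This is an optional optimization: the tool calls are independent of each other, so running them in parallel changes latency, not the retrieved results.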

Query expansion in multi-turn conversations

A RAG chatbot needs to be able to infer the user’s intent for a given query, sometimes based on a vague context.

This is especially important in multi-turn conversations, where the user’s intent may not be clear from a single query.

For example, in the first turn, a user might ask “What is A” and in the second turn, they might ask “Compare that with B and C”. So, the agent needs to be able to infer that the user’s intent is to compare A with B and C.

Let’s see an example of this. First, note that the run_agent function is already set up to handle multi-turn conversations. It can take messages from the previous conversation turns and append them to the messages list.
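To make this concrete, here is a sketch of the rough shape the message list takes after one tool-using turn; the names and contents below are placeholders, not real API objects.

```python
# Placeholder sketch of the message history after one tool-using turn
history = [
    {"role": "system", "content": "## Task and Context ..."},
    {"role": "user", "content": "What is the Chat endpoint?"},
    {"role": "assistant", "tool_calls": ["<tool call objects>"], "tool_plan": "..."},
    {"role": "tool", "tool_call_id": "...", "content": ["<document objects>"]},
    {"role": "assistant", "content": "The Chat endpoint facilitates ..."},
]

# Passing this history along with the next question gives the model the
# context it needs to resolve references like "it" in the follow-up query
next_turn = history + [
    {"role": "user", "content": "How is it different from RAG?"}
]
```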

In the first turn, the user asks about the Chat endpoint, to which the agent duly responds.

PYTHON
messages = run_agent("What is the Chat endpoint?")
QUESTION:
What is the Chat endpoint?
==================================================
TOOL PLAN:
I will search the Cohere developer documentation for 'Chat endpoint'.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"Chat endpoint"}
==================================================
RESPONSE:
The Chat endpoint facilitates a conversational interface, allowing users to send messages to the model and receive text responses.
==================================================
CITATIONS:

Start: 18| End:130| Text:'facilitates a conversational interface, allowing users to send messages to the model and receive text responses.'
Sources:
1. search_developer_docs_qx7dht277mg7:3

In the second turn, the user asks a question that has two parts: first, how it’s different from RAG, and then, for code examples.

We pass the messages from the previous conversation turn to the run_agent function.

Because of this, the agent is able to infer that the question is referring to the Chat endpoint even though the user didn’t explicitly mention it.

The agent then expands the query into two separate queries, one for the search_code_examples tool and one for the search_developer_docs tool.

PYTHON
messages = run_agent(
    "How is it different from RAG? Also any code tutorials?", messages
)
QUESTION:
How is it different from RAG? Also any code tutorials?
==================================================
TOOL PLAN:
I will search the Cohere developer documentation for 'Chat endpoint vs RAG' and 'Chat endpoint code tutorials'.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"Chat endpoint vs RAG"}
Tool name: search_code_examples | Parameters: {"query":"Chat endpoint"}
==================================================
RESPONSE:
The Chat endpoint facilitates a conversational interface, allowing users to send messages to the model and receive text responses.

RAG (Retrieval Augmented Generation) is a method for generating text using additional information fetched from an external data source, which can greatly increase the accuracy of the response.

I could not find any code tutorials for the Chat endpoint, but I did find a tutorial on RAG with Chat Embed and Rerank via Pinecone.
==================================================
CITATIONS:

Start: 414| End:458| Text:'RAG with Chat Embed and Rerank via Pinecone.'
Sources:
1. search_code_examples_h8mn6mdqbrc3:2

Summary

In this tutorial, we learned about:

  • How query expansion works in an agentic RAG system
  • How query expansion works over multiple data sources
  • How query expansion works in multi-turn conversations

That said, we may encounter queries even more complex than the ones we’ve seen so far. In particular, some queries require sequential reasoning, where retrieval needs to happen over multiple steps.

In Part 3, we’ll learn how the agentic RAG system can perform sequential reasoning.
