Generate Parallel Queries for Better RAG Retrieval

Compare two user queries to a RAG chatbot: “What was Apple’s revenue in 2023?” and “What were Apple’s and Google’s revenue in 2023?”.

The first query is straightforward: we can perform retrieval using essentially the same query the user provides.

But the second query is more complex. We need to break it down into two separate queries, one for Apple and one for Google.

This is an example of a query that requires query expansion. Here, the agentic RAG system needs to transform the original query into an optimized set of queries to use for retrieval.

In this part, we’ll learn how to create an agentic RAG system that can perform query expansion and then run those queries in parallel:

  • Query expansion
  • Query expansion over multiple data sources
  • Query expansion in multi-turn conversations

We’ll learn these by building an agent that answers questions about using Cohere.

Setup

To get started, first we need to install the cohere library and create a Cohere client.

We also need to import the tool definitions that we’ll use in this tutorial.

Important: the source code for tool definitions can be found here. Make sure to have the tool_def.py file in the same directory as this notebook for the imports to work correctly.
PYTHON
! pip install cohere langchain langchain-community pydantic -qq
PYTHON
import json
import os
import cohere

from tool_def import (
    search_developer_docs,
    search_developer_docs_tool,
    search_internet,
    search_internet_tool,
    search_code_examples,
    search_code_examples_tool,
)

co = cohere.ClientV2(
    "COHERE_API_KEY"
)  # Get your free API key: https://dashboard.cohere.com/api-keys

os.environ["TAVILY_API_KEY"] = (
    "TAVILY_API_KEY"  # We'll need the Tavily API key to perform internet search. Get your API key: https://app.tavily.com/home
)

Setting up the tools

We set up the same set of tools as in Part 1; check out Part 1 for further details on how they are defined.

PYTHON
functions_map = {
    "search_developer_docs": search_developer_docs,
    "search_internet": search_internet,
    "search_code_examples": search_code_examples,
}
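To illustrate the role of this mapping: the model returns each tool call as a function name plus JSON-encoded arguments, and functions_map routes the call to the matching Python function. A minimal self-contained sketch, using a hypothetical stub in place of the real tool:

```python
import json

# Hypothetical stub standing in for the real search_developer_docs tool
def stub_docs_search(query):
    return [{"title": "Chat endpoint", "text": f"Documentation about {query}"}]

demo_functions_map = {"search_developer_docs": stub_docs_search}

# A tool call arrives as a function name plus a JSON string of arguments
tool_name = "search_developer_docs"
tool_args = '{"query": "Chat endpoint"}'

# Dispatch: look up the function by name and unpack the decoded arguments
result = demo_functions_map[tool_name](**json.loads(tool_args))
```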

Running an agentic RAG workflow

We create a run_agent function to run the agentic RAG workflow, the same as in Part 1; check out Part 1 for further details on how it works.

PYTHON
tools = [
    search_developer_docs_tool,
    search_internet_tool,
    search_code_examples_tool,
]
PYTHON
system_message = """## Task and Context
You are an assistant who helps developers use Cohere. You are equipped with a number of tools that can provide different types of information. If you can't find the information you need from one tool, you should try other tools if there is a possibility that they could provide the information you need."""
PYTHON
model = "command-a-03-2025"


def run_agent(query, messages=None):
    if messages is None:
        messages = []

    if "system" not in {m.get("role") for m in messages}:
        messages.append({"role": "system", "content": system_message})

    # Step 1: Get user message
    print(f"QUESTION:\n{query}")
    print("=" * 50)

    messages.append({"role": "user", "content": query})

    # Step 2: Generate tool calls (if any)
    response = co.chat(
        model=model, messages=messages, tools=tools, temperature=0.3
    )

    while response.message.tool_calls:

        print("TOOL PLAN:")
        print(response.message.tool_plan, "\n")
        print("TOOL CALLS:")
        for tc in response.message.tool_calls:
            print(
                f"Tool name: {tc.function.name} | Parameters: {tc.function.arguments}"
            )
        print("=" * 50)

        messages.append(
            {
                "role": "assistant",
                "tool_calls": response.message.tool_calls,
                "tool_plan": response.message.tool_plan,
            }
        )

        # Step 3: Get tool results
        for tc in response.message.tool_calls:
            tool_result = functions_map[tc.function.name](
                **json.loads(tc.function.arguments)
            )
            tool_content = []
            for data in tool_result:
                tool_content.append(
                    {
                        "type": "document",
                        "document": {"data": json.dumps(data)},
                    }
                )
                # Optional: add an "id" field in the "document" object, otherwise IDs are auto-generated
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": tool_content,
                }
            )

        # Step 4: Generate response and citations
        response = co.chat(
            model=model,
            messages=messages,
            tools=tools,
            temperature=0.3,
        )

    messages.append(
        {
            "role": "assistant",
            "content": response.message.content[0].text,
        }
    )

    # Print final response
    print("RESPONSE:")
    print(response.message.content[0].text)
    print("=" * 50)

    # Print citations (if any)
    verbose_source = (
        False  # Change to True to display the contents of a source
    )
    if response.message.citations:
        print("CITATIONS:\n")
        for citation in response.message.citations:
            print(
                f"Start: {citation.start}| End:{citation.end}| Text:'{citation.text}' "
            )
            print("Sources:")
            for idx, source in enumerate(citation.sources):
                print(f"{idx+1}. {source.id}")
                if verbose_source:
                    print(f"{source.tool_output}")
            print("\n")

    return messages

Query expansion

Let’s ask the agent a few questions, starting with this one about the Chat endpoint and the RAG feature.

First, the agent correctly chooses the search_developer_docs tool to retrieve the information it needs.

Additionally, because the question asks about two different things, retrieving information using the same query as the user’s may not be the optimal approach. Instead, the query needs to be expanded or split into multiple parts, each retrieving its own set of documents.

Thus, the agent expands the original query into two queries.

This is enabled by the parallel tool calling feature that comes with the Chat endpoint.

This results in a richer and more representative list of documents retrieved, and therefore a more accurate and comprehensive answer.

PYTHON
messages = run_agent("Explain the Chat endpoint and the RAG feature")
QUESTION:
Explain the Chat endpoint and the RAG feature
==================================================
TOOL PLAN:
I will search the Cohere developer documentation for the Chat endpoint and the RAG feature.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"Chat endpoint"}
Tool name: search_developer_docs | Parameters: {"query":"RAG feature"}
==================================================
RESPONSE:
The Chat endpoint facilitates a conversational interface, allowing users to send messages to the model and receive text responses.

Retrieval Augmented Generation (RAG) is a method for generating text using additional information fetched from an external data source, which can greatly increase the accuracy of the response.
==================================================
CITATIONS:

Start: 18| End:56| Text:'facilitates a conversational interface'
Sources:
1. search_developer_docs_c059cbhr042g:3
2. search_developer_docs_beycjq0ejbvx:3


Start: 58| End:130| Text:'allowing users to send messages to the model and receive text responses.'
Sources:
1. search_developer_docs_c059cbhr042g:3
2. search_developer_docs_beycjq0ejbvx:3


Start: 132| End:162| Text:'Retrieval Augmented Generation'
Sources:
1. search_developer_docs_c059cbhr042g:4
2. search_developer_docs_beycjq0ejbvx:4


Start: 174| End:266| Text:'method for generating text using additional information fetched from an external data source'
Sources:
1. search_developer_docs_c059cbhr042g:4
2. search_developer_docs_beycjq0ejbvx:4


Start: 278| End:324| Text:'greatly increase the accuracy of the response.'
Sources:
1. search_developer_docs_c059cbhr042g:4
2. search_developer_docs_beycjq0ejbvx:4

Query expansion over multiple data sources

The earlier example focused on a single data source, the Cohere developer documentation. However, the agentic RAG can also perform query expansion over multiple data sources.

Here, the agent is asked a question that contains two parts: first asking for an explanation of the Embed endpoint and then asking for code examples.

It correctly identifies that this requires both searching the developer documentation and the code examples. Thus, it generates two queries, one for each data source, and performs two separate searches in parallel.

Its response then contains information referenced from both data sources.

PYTHON
messages = run_agent(
    "What is the Embed endpoint? Give me some code tutorials"
)
QUESTION:
What is the Embed endpoint? Give me some code tutorials
==================================================
TOOL PLAN:
I will search for 'what is the Embed endpoint' and 'Embed endpoint code tutorials' at the same time.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"what is the Embed endpoint"}
Tool name: search_code_examples | Parameters: {"query":"Embed endpoint code tutorials"}
==================================================
RESPONSE:
The Embed endpoint returns text embeddings. An embedding is a list of floating point numbers that captures semantic information about the text that it represents.

I'm afraid I couldn't find any code tutorials for the Embed endpoint.
==================================================
CITATIONS:

Start: 19| End:43| Text:'returns text embeddings.'
Sources:
1. search_developer_docs_pgzdgqd3k0sd:1


Start: 62| End:162| Text:'list of floating point numbers that captures semantic information about the text that it represents.'
Sources:
1. search_developer_docs_pgzdgqd3k0sd:1
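Note that while the model generates the expanded tool calls together, the run_agent loop executes them one after the other. If latency matters, the retrievals themselves can also be run concurrently. A hedged sketch using a thread pool, with hypothetical stubs in place of the real tools:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs standing in for the real tools
def stub_docs_search(query):
    return [{"source": "docs", "query": query}]

def stub_code_search(query):
    return [{"source": "code", "query": query}]

demo_functions_map = {
    "search_developer_docs": stub_docs_search,
    "search_code_examples": stub_code_search,
}

# Expanded queries as (tool name, JSON arguments) pairs, mirroring
# what the model returns in response.message.tool_calls
tool_calls = [
    ("search_developer_docs", '{"query": "what is the Embed endpoint"}'),
    ("search_code_examples", '{"query": "Embed endpoint code tutorials"}'),
]

def execute(call):
    name, args = call
    return demo_functions_map[name](**json.loads(args))

# Run both retrievals concurrently instead of one after the other
with ThreadPoolExecutor() as pool:
    results = list(pool.map(execute, tool_calls))
```

This is an optional optimization: the tool calls are independent of each other, so running them in parallel changes latency, not the retrieved results.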

Query expansion in multi-turn conversations

A RAG chatbot needs to be able to infer the user’s intent for a given query, sometimes based on a vague context.

This is especially important in multi-turn conversations, where the user’s intent may not be clear from a single query.

For example, in the first turn, a user might ask “What is A” and in the second turn, they might ask “Compare that with B and C”. So, the agent needs to be able to infer that the user’s intent is to compare A with B and C.

Let’s see an example of this. First, note that the run_agent function is already set up to handle multi-turn conversations. It can take messages from the previous conversation turns and append them to the messages list.
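To make this concrete, here is a sketch of the rough shape the message list takes after one tool-using turn; the names and contents below are placeholders, not real API objects.

```python
# Placeholder sketch of the message history after one tool-using turn
history = [
    {"role": "system", "content": "## Task and Context ..."},
    {"role": "user", "content": "What is the Chat endpoint?"},
    {"role": "assistant", "tool_calls": ["<tool call objects>"], "tool_plan": "..."},
    {"role": "tool", "tool_call_id": "...", "content": ["<document objects>"]},
    {"role": "assistant", "content": "The Chat endpoint facilitates ..."},
]

# Passing this history along with the next question gives the model the
# context it needs to resolve references like "it" in the follow-up query
next_turn = history + [
    {"role": "user", "content": "How is it different from RAG?"}
]
```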

In the first turn, the user asks about the Chat endpoint, to which the agent duly responds.

PYTHON
messages = run_agent("What is the Chat endpoint?")
QUESTION:
What is the Chat endpoint?
==================================================
TOOL PLAN:
I will search the Cohere developer documentation for 'Chat endpoint'.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"Chat endpoint"}
==================================================
RESPONSE:
The Chat endpoint facilitates a conversational interface, allowing users to send messages to the model and receive text responses.
==================================================
CITATIONS:

Start: 18| End:130| Text:'facilitates a conversational interface, allowing users to send messages to the model and receive text responses.'
Sources:
1. search_developer_docs_qx7dht277mg7:3

In the second turn, the user asks a question that has two parts: first, how it’s different from RAG, and then, for code examples.

We pass the messages from the previous conversation turn to the run_agent function.

Because of this, the agent is able to infer that the question is referring to the Chat endpoint even though the user didn’t explicitly mention it.

The agent then expands the query into two separate queries, one for the search_code_examples tool and one for the search_developer_docs tool.

PYTHON
messages = run_agent(
    "How is it different from RAG? Also any code tutorials?", messages
)
QUESTION:
How is it different from RAG? Also any code tutorials?
==================================================
TOOL PLAN:
I will search the Cohere developer documentation for 'Chat endpoint vs RAG' and 'Chat endpoint code tutorials'.

TOOL CALLS:
Tool name: search_developer_docs | Parameters: {"query":"Chat endpoint vs RAG"}
Tool name: search_code_examples | Parameters: {"query":"Chat endpoint"}
==================================================
RESPONSE:
The Chat endpoint facilitates a conversational interface, allowing users to send messages to the model and receive text responses.

RAG (Retrieval Augmented Generation) is a method for generating text using additional information fetched from an external data source, which can greatly increase the accuracy of the response.

I could not find any code tutorials for the Chat endpoint, but I did find a tutorial on RAG with Chat Embed and Rerank via Pinecone.
==================================================
CITATIONS:

Start: 414| End:458| Text:'RAG with Chat Embed and Rerank via Pinecone.'
Sources:
1. search_code_examples_h8mn6mdqbrc3:2

Summary

In this tutorial, we learned about:

  • How query expansion works in an agentic RAG system
  • How query expansion works over multiple data sources
  • How query expansion works in multi-turn conversations

That said, we may encounter queries even more complex than the ones we’ve seen so far. In particular, some queries require sequential reasoning, where retrieval needs to happen over multiple steps.

In Part 3, we’ll learn how the agentic RAG system can perform sequential reasoning.
