Querying Structured Data (Tables)
In the previous tutorials, we explored how to build agentic RAG applications over unstructured and semi-structured data. Now, we’ll see how to implement agentic RAG over structured data.
This tutorial focuses on querying tables, and the next tutorial will be about querying SQL databases.
Consider a scenario where you have a CSV file containing evaluation results for an LLM application.
A user might ask questions like “What’s the average score for a specific use case?” or “Which configuration has the lowest latency?“. These queries require not just retrieval, but also data analysis and interpretation.
In this tutorial, we’ll cover:
- Creating a function to execute Python code
- Setting up a tool to interact with tabular data
- Building an agent for querying tabular data
- Running the agent
Let’s get started by setting up our environment and defining the necessary tools for our agent.
Setup
To get started, first we need to install the cohere
library and create a Cohere client.
And here’s the data we’ll be working with. evaluation_results.csv
is a CSV file containing evaluation results for a set of LLM applications - name extraction, email drafting, and article summarization.
The file has the following columns:
usecase
: The use case.run
: The run ID.score
: The evaluation score for a particular run.temperature
: The temperature setting of the model for a particular run.tokens
: The number of tokens generated by the model for a particular run.latency
: The latency of the model’s response for a particular run.
Important: the data can be found here. Make sure to have the file in the same directory as this notebook for the imports to work correctly.
usecase | run | score | temperature | tokens | latency | |
---|---|---|---|---|---|---|
0 | extract_names | A | 0.5 | 0.3 | 103 | 1.12 |
1 | draft_email | A | 0.6 | 0.3 | 252 | 2.50 |
2 | summarize_article | A | 0.8 | 0.3 | 350 | 4.20 |
3 | extract_names | B | 0.2 | 0.3 | 101 | 2.85 |
4 | draft_email | B | 0.4 | 0.3 | 230 | 3.20 |
Creating a function to execute Python code
Here, we introduce a new tool that allows the agent to execute Python code and return the result. The agent will use this tool to generate pandas code to query the data.
To create this tool, we’ll use the PythonREPL
class from the langchain_experimental.utilities
module. This class provides a sandboxed environment for executing Python code and returns the result.
First, we define a python_tool
that uses the PythonREPL
class to execute Python code and return the result.
Next, we define a ToolInput
class to handle the input for the python_tool
.
Finally, we create a function analyze_evaluation_results
that takes a string of Python code as input, executes the code using the Python tool we created, and returns the result.
Important: the source code for tool definitions can be found here. Make sure to have the tool_def.py
file in the same directory as this notebook for the imports to work correctly.
Setting up a tool to interact with tabular data
Next, we define the analyze_evaluation_results
tool. There are many ways we can set up a tool to work with CSV data, and in this example, we are using the tool description to provide the agent with the necessary context for working with the CSV file, such as:
- the name of the CSV file to load
- the columns of the CSV file
- additional instructions on what libraries to use (in this case,
pandas
)
The parameter of this tool is the code
string containing the Python code that the agent writes to analyze the data.
Building an agent for querying tabular data
Next, let’s create a run_agent
function to run the agentic RAG workflow, the same as in Part 1.
The only change we are making here is to make the system message simpler and more specific since the agent now only has one tool.
Running the agent
Let’s ask the agent a few questions, starting with this query about the average evaluation score in run A.
To answer this query, the agent needs to write Python code that uses the pandas library to calculate the average evaluation score in run A. And it gets the answer right.
Next, we ask a slightly more complex question, this time about the latency of the highest-scoring run for one use case. This requires the agent to filter based on the use case, find the highest-scoring run, and return the latency value.
Next, we ask a question to compare the use cases in terms of token usage, and to show a markdown table to show the comparison.
Summary
In this tutorial, we learned about:
- How to create a function to execute Python code
- How to set up a tool to interact with tabular data
- How to run the agent
By implementing these techniques, we’ve expanded our agentic RAG system to handle structured data in the form of tables.
While this tutorial demonstrated how to work with tabular data using pandas and Python, the agentic RAG approach can be applied to other forms of structured data as well. This means we can build agents that can translate natural language queries into various data analysis tasks.
In Part 6, we’ll learn how to perform RAG over structured data (SQL).