Shaan Desai

Notebook Overview

Motivation

Tabular data reasoning continues to be an exciting application of Agents, particularly in the financial domain, where multiple reasoning steps may be needed to identify the right answer. For example, a financial analyst working with financial statements may be interested in computing various financial ratios with natural language queries.

Some examples may include:

ROE (Return on Equity) = Net Income / Shareholders’ Equity
Net Profit Margin = Net Income / Revenue
Asset Turnover = Revenue / Average Total Assets
Financial Leverage = Average Total Assets / Shareholders’ Equity
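As a quick illustration of how these ratios relate, the sketch below computes all four from a set of made-up figures and checks that ROE equals the product of the other three (the DuPont decomposition, which follows directly from the definitions above). The function name `dupont_ratios` and the input numbers are hypothetical and are not taken from any filing.

```python
import math


def dupont_ratios(net_income, revenue, avg_total_assets, shareholders_equity):
    """Return the four ratios listed above as a dictionary."""
    return {
        "roe": net_income / shareholders_equity,
        "net_profit_margin": net_income / revenue,
        "asset_turnover": revenue / avg_total_assets,
        "financial_leverage": avg_total_assets / shareholders_equity,
    }


# Hypothetical figures chosen for easy arithmetic (not from any real filing).
ratios = dupont_ratios(
    net_income=20.0,
    revenue=100.0,
    avg_total_assets=200.0,
    shareholders_equity=80.0,
)

# ROE should equal margin * turnover * leverage, because the revenue and
# asset terms cancel out in the product.
product = (
    ratios["net_profit_margin"]
    * ratios["asset_turnover"]
    * ratios["financial_leverage"]
)
print(ratios)
print("ROE matches the product:", math.isclose(ratios["roe"], product))
```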

An Agent that can correctly compute these and other ratios would be a great help to any financial analyst.

Objective

In this notebook we explore how to set up a Cohere ReAct Agent to answer questions over tables in Apple’s 2020 SEC 10-K filing. We show how this can be done with two variants of a LangChain Python tool: one that requires you to pass the full path of the dataframe, and another that requires the dataframe objects to be loaded in memory. While there is no major difference between the two approaches, we present both to show how to manage files that are loaded in memory versus files in storage.

Setup

PYTHON

from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.agents import AgentExecutor
from langchain_cohere.react_multi_hop.agent import create_cohere_react_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_cohere.chat_models import ChatCohere
import pandas as pd
from langchain.agents import Tool
from langchain_experimental.utilities import PythonREPL
from langchain_experimental.tools.python.tool import PythonAstREPLTool

PYTHON

# Uncomment if you need to install the following packages
#!pip install --quiet langchain langchain_cohere langchain_experimental --upgrade
#!pip install sec-api

Introduction

The aim of this notebook is to showcase how Cohere LangChain Agents can be used to answer questions over tabular data. This notebook assumes the input is a set of CSV files extracted from Apple’s SEC 10-K filings.

Data Loading

We use the sec-api to download the income statement and balance sheet from the SEC 10-K .htm file. Please note that the tables need to be parsed such that the index is numerical, because the generated Python code struggles to filter based on a non-numerical index. We have processed the tables and provided them for you. They can be found here.

PYTHON

income_statement = pd.read_csv('income_statement.csv')
balance_sheet = pd.read_csv('balance_sheet.csv')
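To make the note about a numerical index concrete, here is a minimal sketch of one way a table indexed by reporting period could be given a numerical index. The toy table and its values are made up, and the provided CSV files may have been prepared differently. That said, the agent traces later in this notebook do reference a column called `index`, which is what `reset_index()` produces for an unnamed index.

```python
import pandas as pd

# Hypothetical two-period table indexed by reporting period (made-up values).
raw = pd.DataFrame(
    {"GrossProfit": [10.0, 12.0], "Revenue": [40.0, 60.0]},
    index=["2019-Q4", "2020-Q1"],
)

# reset_index() moves the period labels into a regular column named "index"
# and replaces the row index with 0, 1, 2, ...
table = raw.reset_index()

print(table.columns.tolist())
print(table.index.tolist())
```

With the period labels stored in an ordinary column, generated code can select rows with plain column filters instead of index lookups.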

QnA over Single Table

Agent with Python Tool

In the example below, we show how the Python tool can be used to load a dataframe and extract information from it. To do this successfully, we need to:

Pass the file name to the preamble so the model knows how to load the dataframe.
Pass a preview of the dataframe in the preamble so the model knows which columns/rows to query.
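The two requirements above amount to building a string that contains the file name and a short preview. A minimal sketch of a reusable helper is shown below. The names `preview` and `build_preamble`, the demo table, and the fallback to `to_string()` when the optional `tabulate` package is missing are all assumptions made for illustration; they are not part of the notebook itself.

```python
import pandas as pd


def preview(df, n=3):
    """Render the first n rows as text, preferring markdown when available."""
    try:
        # DataFrame.to_markdown() needs the optional `tabulate` package.
        return df.head(n).to_markdown()
    except ImportError:
        return df.head(n).to_string()


def build_preamble(tables, n=3):
    """Build a preamble from a mapping of CSV file name -> DataFrame."""
    parts = [
        "You are an expert who answers the user's question. "
        "You are working with pandas dataframes in Python."
    ]
    for file_name, df in tables.items():
        parts.append(
            f"Here is a preview of the `{file_name}` dataframe:\n{preview(df, n)}"
        )
    return "\n\n".join(parts)


# Hypothetical demo table; only the first three rows end up in the preamble.
demo = pd.DataFrame({"Revenue": [1.5, 2.5, 3.5, 9.75]})
demo_preamble = build_preamble({"demo.csv": demo})
print(demo_preamble)
```

Because the helper accepts a mapping, the same function would also cover a prompt that previews several tables at once.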

First, let’s implement the ReAct agent.

PYTHON

# instantiate the Cohere llm
llm = ChatCohere(model="command-r", temperature=0.1, cohere_api_key="", verbose=True)

# create the python tool
python_repl = PythonREPL()
python_tool = Tool(
    name="python_repl",
    description="Executes python code and returns the result. The code runs in a static sandbox without interactive mode, so print output or save output to a file.",
    func=python_repl.run,
)
python_tool.name = "python_interpreter"

class ToolInput(BaseModel):
    code: str = Field(description="Python code to execute.")
python_tool.args_schema = ToolInput
tools = [python_tool]

# define the prompt template
prompt = ChatPromptTemplate.from_template("{input}")

# instantiate the ReAct agent
agent = create_cohere_react_agent(
    llm=llm,
    tools=tools,
    prompt=prompt,
)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    return_intermediate_steps=True,
)
# finally, define the preamble
preamble = """
You are an expert who answers the user's question. You are working with a pandas dataframe in Python. The name of the dataframe is `income_statement.csv`.
Here is a preview of the dataframe:
{head_df}
""".format(head_df=income_statement.head(3).to_markdown())

Then, we define a dictionary containing the questions we want the Agent to answer, each paired with its ground-truth (GT) answer.

PYTHON

question_dict = {
    'q1': ['what is the highest value of cost of goods and service?', 169559000000],
    'q2': ['what is the largest gross profit margin?', 0.3836194330595236],
    'q3': ['what is the minimum ratio of operating income loss divided by non operating income expense', 35.360599]
}

Let’s now see how the Agent answers each of the questions.

PYTHON

for qsn, val in question_dict.items():
    print(f'question:{qsn}')
    agent_executor.invoke({
        "input": val[0],
        "preamble": preamble,
    })
    print(f'GT Answer:{val[1]}')
    print('-'*50)

Output

question:q1
> Entering new AgentExecutor chain...
I will write and execute Python code to find the highest value of Cost of Goods and Services Sold from the provided dataframe.
{'tool_name': 'python_interpreter', 'parameters': {'code': 'import pandas as pd\n\n# Read the CSV file into a DataFrame\ndf = pd.read_csv(\'income_statement.csv\')\n\n# Find the highest value in the \'CostOfGoodsAndServicesSold\' column\nmax_value = df[\'CostOfGoodsAndServicesSold\'].max()\n\nprint(f"The highest value of Cost of Goods and Services Sold is {max_value}")'}}
The highest value of Cost of Goods and Services Sold is 169559000000.0
Relevant Documents: 0
Cited Documents: 0
Answer: The highest value of Cost of Goods and Services Sold is 169559000000.0.
Grounded answer: The highest value of Cost of Goods and Services Sold is <co: 0>169559000000.0</co: 0>.
> Finished chain.
GT Answer:169559000000
--------------------------------------------------
question:q2
> Entering new AgentExecutor chain...
I will write and execute Python code to calculate the largest gross profit margin from the data frame provided.
{'tool_name': 'python_interpreter', 'parameters': {'code': 'import pandas as pd\n\n# Read the CSV file into a DataFrame\ndf = pd.read_csv(\'income_statement.csv\')\n\n# Calculate the gross profit margin as a percentage\ndf[\'GrossProfitMargin\'] = (df[\'GrossProfit\'] / df[\'RevenueFromContractWithCustomerExcludingAssessedTax\']) * 100\n\n# Find the row with the maximum gross profit margin\nmax_margin_row = df.loc[df[\'GrossProfitMargin\'].idxmax()]\n\nprint(f"The largest gross profit margin is {max_margin_row[\'GrossProfitMargin\']}% on {max_margin_row[\'index\']}")\n'}}
The largest gross profit margin is 38.36194330595236% on 2019-12-29-2020-03-28
Relevant Documents: 0
Cited Documents: 0
Answer: The largest gross profit margin is approximately 38.36% on 2019-12-29 to 2020-03-28.
Grounded answer: The largest gross profit margin is <co: 0>approximately 38.36</co: 0>% on <co: 0>2019-12-29 to 2020-03-28</co: 0>.
> Finished chain.
GT Answer:0.3836194330595236
--------------------------------------------------
question:q3
> Entering new AgentExecutor chain...
I will write and execute Python code to calculate the minimum ratio of Operating Income Loss divided by Non Operating Income Expense.
{'tool_name': 'python_interpreter', 'parameters': {'code': 'import pandas as pd\n\n# Read the CSV file into a DataFrame\ndf = pd.read_csv(\'income_statement.csv\')\n\n# Calculate the ratio\nratios = df[\'OperatingIncomeLoss\'] / df[\'NonoperatingIncomeExpense\']\n\nprint("Minimum ratio:", ratios.min())'}}
Minimum ratio: 35.36059850374065
Relevant Documents: 0
Cited Documents: 0
Answer: The minimum ratio of Operating Income Loss divided by Non Operating Income Expense is approximately **35.36**.
Grounded answer: The minimum ratio of Operating Income Loss divided by Non Operating Income Expense is approximately **<co: 0>35.36</co: 0>**.
> Finished chain.
GT Answer:35.360599
--------------------------------------------------

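Nice! The Agent uses the Python tool to write code that accesses the data in the tables and performs the required calculations. Note that for q2 the Agent reports the margin as a percentage (38.36%), which corresponds to the ground-truth fraction of 0.3836.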
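Comparing the printed answers against the ground truth by eye works for three questions, but it can also be automated. The sketch below is one possible approach, not part of the original notebook: it pulls every number out of the answer text and accepts a match either directly or after dividing by 100, so that a percentage such as 38.36% matches a ground-truth fraction of 0.3836. The helper names `extract_numbers` and `matches_ground_truth` and the default tolerance are hypothetical.

```python
import math
import re


def extract_numbers(text):
    """Return every decimal number found in the text as a float."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]


def matches_ground_truth(answer, expected, rel_tol=1e-3):
    """True if any number in the answer matches expected, as-is or as a percentage."""
    for value in extract_numbers(answer):
        if math.isclose(value, expected, rel_tol=rel_tol):
            return True
        # Allow answers expressed in percent when the ground truth is a fraction.
        if math.isclose(value / 100, expected, rel_tol=rel_tol):
            return True
    return False


# Answers copied from the agent output above.
q2_answer = "The largest gross profit margin is approximately 38.36% on 2019-12-29 to 2020-03-28."
q3_answer = "The minimum ratio of Operating Income Loss divided by Non Operating Income Expense is approximately 35.36."

print(matches_ground_truth(q2_answer, 0.3836194330595236))
print(matches_ground_truth(q3_answer, 35.360599))
```

Inside the loop, the same check could be applied to the final answer text returned by the executor, assuming the result dictionary exposes it under the `"output"` key. The tolerance is loose on purpose, since the Agent rounds its answers.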

Agent with Python Tool that takes tables as input

In the example above, the model needs to load the dataframe in the Python call before carrying out operations. In this example, we show how to pass the dataframes to the Python tool so that the data is already loaded when the code runs.

PYTHON

# call the PythonAstREPLTool in order to pass tables to the tool
df_locals = {'df': pd.read_csv('income_statement.csv')}
tools = [PythonAstREPLTool(locals=df_locals)]

agent = create_cohere_react_agent(
    llm=llm,
    tools=tools,
    prompt=prompt,
)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    return_intermediate_steps=True,
)

preamble = """
You are an expert who answers the user's question. You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
Here is a preview of the dataframe:
{head_df}
""".format(head_df=income_statement.head(3).to_markdown())

Let’s loop again over the same dictionary of questions.

PYTHON

for qsn, val in question_dict.items():
    print(f'question:{qsn}')
    agent_executor.invoke({
        "input": val[0],
        "preamble": preamble,
    })
    print(f'GT Answer:{val[1]}')
    print('-'*50)

Output

question:q1
> Entering new AgentExecutor chain...
I will write and execute Python code to find the highest value of Cost of Goods and Services Sold from the provided dataframe.
{'tool_name': 'python_interpreter', 'parameters': {'code': "import pandas as pd\n\n# Read the CSV file into a DataFrame\ndf = pd.read_csv('income_statement.csv')\n\n# Find the highest value in the 'CostOfGoodsAndServicesSold' column\nmax_value = df['CostOfGoodsAndServicesSold'].max()\n\nprint(f'The highest value of Cost of Goods and Services Sold is {max_value:.2f}')\n"}}
The highest value of Cost of Goods and Services Sold is 169559000000.00
Relevant Documents: 0
Cited Documents: 0
Answer: The highest value of Cost of Goods and Services Sold is 169559000000.00.
Grounded answer: The highest value of Cost of Goods and Services Sold is <co: 0>169559000000.00</co: 0>.
> Finished chain.
GT Answer:169559000000
--------------------------------------------------
question:q2
> Entering new AgentExecutor chain...
I will write and execute Python code to calculate the largest gross profit margin from the data frame provided.
{'tool_name': 'python_interpreter', 'parameters': {'code': 'import pandas as pd\n\n# Read the CSV file into a DataFrame\ndf = pd.read_csv(\'income_statement.csv\')\n\n# Calculate the gross profit margin as a percentage\ndf[\'GrossProfitMargin\'] = (df[\'GrossProfit\'] / df[\'RevenueFromContractWithCustomerExcludingAssessedTax\']) * 100\n\n# Find the row with the maximum gross profit margin\nmax_margin_row = df.loc[df[\'GrossProfitMargin\'].idxmax()]\n\nprint(f"The largest gross profit margin is {max_margin_row[\'GrossProfitMargin\']}% on {max_margin_row[\'index\']}")\n'}}
The largest gross profit margin is 38.36194330595236% on 2019-12-29-2020-03-28
Relevant Documents: 0
Cited Documents: 0
Answer: The largest gross profit margin is approximately 38.36% on 2019-12-29 to 2020-03-28.
Grounded answer: The largest gross profit margin is <co: 0>approximately 38.36</co: 0>% on <co: 0>2019-12-29 to 2020-03-28</co: 0>.
> Finished chain.
GT Answer:0.3836194330595236
--------------------------------------------------
question:q3
> Entering new AgentExecutor chain...
I will write and execute Python code to calculate the minimum ratio of Operating Income Loss divided by Non Operating Income Expense.
{'tool_name': 'python_interpreter', 'parameters': {'code': "import pandas as pd\n\n# Read the CSV file into a DataFrame\ndf = pd.read_csv('income_statement.csv')\n\n# Calculate the ratio\nratios = df['OperatingIncomeLoss'] / df['NonoperatingIncomeExpense']\n\n# Find the minimum ratio\nmin_ratio = ratios.min()\n\nprint(f'Minimum ratio: {min_ratio:.2f}')\n"}}
Minimum ratio: 35.36
Relevant Documents: 0
Cited Documents: 0
Answer: The minimum ratio of Operating Income Loss divided by Non Operating Income Expense is approximately **35.36**.
Grounded answer: The minimum ratio of Operating Income Loss divided by Non Operating Income Expense is approximately **<co: 0>35.36</co: 0>**.
> Finished chain.
GT Answer:35.360599
--------------------------------------------------

In this case too, the Agent correctly answers all the questions.
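To make the idea of a tool with preloaded variables more concrete, the sketch below runs a code string against a namespace that already contains `df` and returns the value of the final expression. This is a simplified illustration of the concept only and should not be read as the actual implementation of `PythonAstREPLTool`. The function `run_with_locals` is hypothetical, and a plain dictionary stands in for the dataframe so the example has no dependencies.

```python
import ast


def run_with_locals(code, namespace):
    """Run code against namespace; return the value of a trailing expression, if any."""
    tree = ast.parse(code)
    if not tree.body:
        return None
    *statements, last = tree.body
    if statements:
        # Execute everything except the final statement.
        module = ast.Module(body=statements, type_ignores=[])
        exec(compile(module, "<agent>", "exec"), namespace)
    if isinstance(last, ast.Expr):
        # A trailing expression is evaluated so its value can be returned.
        expression = ast.Expression(body=last.value)
        return eval(compile(expression, "<agent>", "eval"), namespace)
    # Otherwise the final statement is executed like the others.
    module = ast.Module(body=[last], type_ignores=[])
    exec(compile(module, "<agent>", "exec"), namespace)
    return None


# `df` is preloaded, so the generated code does not need to read any file.
namespace = {"df": {"CostOfGoodsAndServicesSold": [3, 7, 5]}}
result = run_with_locals("col = df['CostOfGoodsAndServicesSold']\nmax(col)", namespace)
print(result)
```

The point of the design is that the caller controls what data the generated code can see, while the code itself stays free of file paths.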

QnA over Multiple Tables

We now make the task more complicated by asking the Agent questions whose answers can be computed only by retrieving relevant information from multiple tables.

PYTHON

# define the Agent
python_repl = PythonREPL()
python_tool = Tool(
    name="python_repl",
    description="Executes python code and returns the result. The code runs in a static sandbox without interactive mode, so print output or save output to a file.",
    func=python_repl.run,
)
python_tool.name = "python_interpreter"

class ToolInput(BaseModel):
    code: str = Field(description="Python code to execute.")
python_tool.args_schema = ToolInput
tools = [python_tool]

prompt = ChatPromptTemplate.from_template("{input}")
agent = create_cohere_react_agent(
    llm=llm,
    tools=tools,
    prompt=prompt,
)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    return_intermediate_steps=True,
)

preamble = """
You are an expert who answers the user's question in complete sentences. You are working with two pandas dataframe in Python. Ensure your output is a string.

Here is a preview of the `income_statement.csv` dataframe:
{table_1}

Here is a preview of the `balance_sheet.csv` dataframe:
{table_2}
""".format(table_1=income_statement.head(3).to_markdown(), table_2=balance_sheet.head(3).to_markdown())

We now define a new question.

PYTHON

question_dict = {
    'q1': ['what is the ratio of the largest stockholders equity to the smallest revenue'],
}

The answer to this question can be obtained only by accessing both the balance sheet and the income statement, as shown below:

PYTHON

# get the largest stockholders equity
x = balance_sheet['StockholdersEquity'].astype(float).max()
print(f"The largest stockholders equity value is: {x}")

# get the smallest revenue
y = income_statement['RevenueFromContractWithCustomerExcludingAssessedTax'].astype(float).min()
print(f"The smallest revenue value is: {y}")

# compute the ratio
ratio = x/y
print(f"Their ratio is: {ratio}")

Output

The largest stockholders equity value is: 134047000000.0
The smallest revenue value is: 53809000000.0
Their ratio is: 2.4911631883142227

Let’s now get the answer from the Agent.

PYTHON

for qsn, val in question_dict.items():
    print(f'question:{qsn}')
    agent_executor.invoke({
        "input": val[0],
        "preamble": preamble,
    })

Output

question:q1
> Entering new AgentExecutor chain...
I will write and execute Python code to find the largest and smallest values in the relevant columns of the dataframes, then calculate the ratio.
{'tool_name': 'python_interpreter', 'parameters': {'code': "import pandas as pd\n\nincome_statement = pd.read_csv('income_statement.csv')\nbalance_sheet = pd.read_csv('balance_sheet.csv')\n\n# Find largest and smallest values in relevant columns\nmax_stockholders_equity = max([balance_sheet['StockholdersEquity'].iloc[i] for i in range(len(balance_sheet))])\nmin_revenue = min([income_statement['RevenueFromContractWithCustomerExcludingAssessedTax'].iloc[i] for i in range(len(income_statement))])\n\n# Calculate ratio\nratio = max_stockholders_equity / min_revenue\n\nprint(f'The ratio of the largest stockholders equity to the smallest revenue is {ratio:.2f}'\n    )"}}
NameError("name 'balance_sheet' is not defined")
My code has thrown an error because the variable `balance_sheet` has not been defined. I will fix this by adding the line `balance_sheet = pd.read_csv('balance_sheet.csv')` to the beginning of my code.
{'tool_name': 'python_interpreter', 'parameters': {'code': "import pandas as pd\n\nincome_statement = pd.read_csv('income_statement.csv')\n\nbalance_sheet = pd.read_csv('balance_sheet.csv')\n\n# Find largest and smallest values in relevant columns\nmax_stockholders_equity = max([balance_sheet['StockholdersEquity'].iloc[i] for i in range(len(balance_sheet))])\nmin_revenue = min([income_statement['RevenueFromContractWithCustomerExcludingAssessedTax'].iloc[i] for i in range(len(income_statement))])\n\n# Calculate ratio\nratio = max_stockholders_equity / min_revenue\n\nprint(f'The ratio of the largest stockholders equity to the smallest revenue is {ratio:.2f}'\n    )"}}
NameError("name 'balance_sheet' is not defined")
The same error has occurred again, so I will check my code carefully to make sure the variable is being defined correctly.
{'tool_name': 'python_interpreter', 'parameters': {'code': "import pandas as pd\n\n# Load data\nincome_statement = pd.read_csv('income_statement.csv')\nbalance_sheet = pd.read_csv('balance_sheet.csv')\n\n# Find largest and smallest values in relevant columns\nmax_stockholders_equity = max(balance_sheet['StockholdersEquity'])\nmin_revenue = min(income_statement['RevenueFromContractWithCustomerExcludingAssessedTax'])\n\n# Calculate ratio\nratio = max_stockholders_equity / min_revenue\n\nprint(f'The ratio of the largest stockholders equity to the smallest revenue is {ratio:.2f}')"}}
The ratio of the largest stockholders equity to the smallest revenue is 2.49
Relevant Documents: 0,1,2
Cited Documents: 2
Answer: The ratio of the largest stockholders equity to the smallest revenue is approximately **2.49**.
Grounded answer: The ratio of the largest stockholders equity to the smallest revenue is approximately **<co: 2>2.49</co: 2>**.
> Finished chain.

The Agent defined a plan (“write and execute Python code to find the largest and smallest values in the relevant columns of the dataframes, then calculate the ratio”) and executed it. Its first two attempts failed with a NameError. The third attempt, which called max and min directly on the columns instead of using list comprehensions, ran successfully and returned 2.49, matching the ratio computed manually above.