Build Chatbots That Know Your Business with MongoDB and Cohere
What you will learn:
- How to leverage semantic search on customer or operational data in MongoDB Atlas.
- How to pass retrieved data to Cohere’s Command R+ generative model for retrieval-augmented generation (RAG).
- How to develop and deploy a RAG-optimized user interface for your app.
- How to create a conversation data store for your RAG chatbot using MongoDB.
Use Case: Develop an advanced chatbot assistant that provides asset managers with information and actionable insights on technology company market reports.
Introduction
- What is Cohere?
- What is MongoDB?
- How do Cohere and MongoDB work together?
What is Cohere?
What is MongoDB?
What exactly are we showing today?
Step 1: Install Libraries and Set Environment Variables
Critical Security Reminder: Safeguard your production environment by never committing sensitive information, such as environment variable values, to public repositories. This practice is essential for maintaining the security and integrity of your systems.
Libraries:
- cohere: A Python library for accessing Cohere’s large language models, enabling natural language processing tasks like text generation, classification, and embedding.
- pymongo: The recommended Python driver for MongoDB, allowing Python applications to interact with MongoDB databases for data storage and retrieval.
- datasets: A library by Hugging Face that provides easy access to a wide range of datasets for machine learning and natural language processing tasks.
- tqdm: A fast, extensible progress bar library for Python, useful for displaying progress in long-running operations or loops.
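As a minimal sketch of this step: the four libraries can typically be installed with `pip install cohere pymongo datasets tqdm`, and secrets can then be read from the environment rather than hard-coded. The helper name `require_env` and the variable names `COHERE_API_KEY` and `MONGO_URI` are assumptions made for illustration, not names taken from the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running."
        )
    return value


# Hypothetical variable names, chosen for this sketch:
# COHERE_API_KEY = require_env("COHERE_API_KEY")
# MONGO_URI = require_env("MONGO_URI")
```

Failing early on a missing variable keeps credentials out of the source file and makes misconfiguration obvious before any API call is made.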
Step 2: Data Loading and Preparation
Dataset Information
This dataset contains detailed information about multiple technology companies in the Information Technology sector. For each company, the dataset includes:
- Company name and stock ticker symbol
- Market analysis reports for recent years (typically 2023 and 2024), which include:
  - Title and author of the report
  - Date of publication
  - Detailed content covering financial performance, product innovations, market position, challenges, and future outlook
  - Stock recommendations and price targets
- Key financial metrics such as:
  - Current stock price
  - 52-week price range
  - Market capitalization
  - Price-to-earnings (P/E) ratio
  - Dividend yield
- Recent news items, typically including:
  - Date of the news
  - Headline
  - Brief summary
The market analysis reports provide in-depth information about each company’s performance, innovations, challenges, and future prospects. They offer insights into the companies’ strategies, market positions, and potential for growth.
| | recent_news | reports | company | ticker | key_metrics | sector |
|---|---|---|---|---|---|---|
| 0 | [{'date': '2024-06-09', 'headline': 'CyberDefe… | [{'author': 'Taylor Smith, Technology Sector L… | CyberDefense Dynamics | CDDY | {'52_week_range': {'high': 387.3, 'low': 41.63… | Information Technology |
| 1 | [{'date': '2024-07-04', 'headline': 'CloudComp… | [{'author': 'Casey Jones, Chief Market Strateg… | CloudCompute Pro | CCPR | {'52_week_range': {'high': 524.23, 'low': 171… | Information Technology |
| 2 | [{'date': '2024-06-27', 'headline': 'VirtualRe… | [{'author': 'Sam Brown, Head of Equity Researc… | VirtualReality Systems | VRSY | {'52_week_range': {'high': 530.59, 'low': 56.4… | Information Technology |
| 3 | [{'date': '2024-07-06', 'headline': 'BioTech I… | [{'author': 'Riley Smith, Senior Tech Analyst… | BioTech Innovations | BTCI | {'52_week_range': {'high': 366.55, 'low': 124… | Information Technology |
| 4 | [{'date': '2024-06-26', 'headline': 'QuantumCo… | [{'author': 'Riley Garcia, Senior Tech Analyst… | QuantumComputing Inc | QCMP | {'52_week_range': {'high': 231.91, 'low': 159… | Information Technology |
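The combined_attributes column shown in the next table concatenates each company's fields into one string for embedding. The following is a rough sketch of how such a column could be built, not the notebook's actual code. The function name `combine_attributes` and the per-report and per-news keys (`year`, `title`, `author`, `content`, `headline`, `summary`) are assumptions inferred from the dataset description above.

```python
def combine_attributes(row: dict) -> str:
    """Flatten one company record into a single text string for embedding."""
    parts = [row["company"], row["sector"]]

    # Append the text of every market report (keys are assumed, see lead-in).
    for report in row.get("reports", []):
        parts += [
            str(report.get("year", "")),
            report.get("title", ""),
            report.get("author", ""),
            report.get("content", ""),
        ]

    # Append the headline and summary of every news item.
    for news in row.get("recent_news", []):
        parts += [news.get("headline", ""), news.get("summary", "")]

    # Skip empty pieces so missing keys do not leave double spaces.
    return " ".join(p for p in parts if p)
```

With a pandas DataFrame, a helper like this could be applied row by row, for example `df["combined_attributes"] = df.apply(combine_attributes, axis=1)`.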
| | company | ticker | combined_attributes |
|---|---|---|---|
| 0 | CyberDefense Dynamics | CDDY | CyberDefense Dynamics Information Technology 2… |
| 1 | CloudCompute Pro | CCPR | CloudCompute Pro Information Technology 2023 C… |
| 2 | VirtualReality Systems | VRSY | VirtualReality Systems Information Technology … |
| 3 | BioTech Innovations | BTCI | BioTech Innovations Information Technology 202… |
| 4 | QuantumComputing Inc | QCMP | QuantumComputing Inc Information Technology 20… |
Step 3: Embedding Generation with Cohere
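A minimal sketch of batch embedding with the Cohere Python SDK follows. It assumes the `embed-english-v3.0` model and a batch size of 96 texts per request; neither is stated in this document. The function name `embed_texts` is hypothetical, and the client is passed in so the helper can be exercised without network access.

```python
from typing import List, Sequence


def embed_texts(
    client,
    texts: Sequence[str],
    model: str = "embed-english-v3.0",  # assumed model name
    batch_size: int = 96,  # assumed per-request limit
) -> List[List[float]]:
    """Embed texts in batches and return one vector per input text."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # input_type="search_document" marks these texts as documents to be searched.
        response = client.embed(texts=batch, model=model, input_type="search_document")
        embeddings.extend(response.embeddings)
    return embeddings


# Usage with the real SDK (requires the `cohere` package and an API key):
# import cohere
# co = cohere.Client(api_key)
# vectors = embed_texts(co, df["combined_attributes"].tolist())
# print(f"We just computed {len(vectors)} embeddings.")
```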
We just computed 63 embeddings.
| | recent_news | reports | company | ticker | key_metrics | sector | combined_attributes | embedding |
|---|---|---|---|---|---|---|---|---|
| 0 | [{'date': '2024-06-09', 'headline': 'CyberDefe… | [{'author': 'Taylor Smith, Technology Sector L… | CyberDefense Dynamics | CDDY | {'52_week_range': {'high': 387.3, 'low': 41.63… | Information Technology | CyberDefense Dynamics Information Technology 2… | [0.01210022, -0.03466797, -0.017562866, -0.025… |
| 1 | [{'date': '2024-07-04', 'headline': 'CloudComp… | [{'author': 'Casey Jones, Chief Market Strateg… | CloudCompute Pro | CCPR | {'52_week_range': {'high': 524.23, 'low': 171… | Information Technology | CloudCompute Pro Information Technology 2023 C… | [-0.058563232, -0.06323242, -0.037139893, -0.0… |
| 2 | [{'date': '2024-06-27', 'headline': 'VirtualRe… | [{'author': 'Sam Brown, Head of Equity Researc… | VirtualReality Systems | VRSY | {'52_week_range': {'high': 530.59, 'low': 56.4… | Information Technology | VirtualReality Systems Information Technology … | [0.024154663, -0.022872925, -0.01751709, -0.05… |
| 3 | [{'date': '2024-07-06', 'headline': 'BioTech I… | [{'author': 'Riley Smith, Senior Tech Analyst'… | BioTech Innovations | BTCI | {'52_week_range': {'high': 366.55, 'low': 124… | Information Technology | BioTech Innovations Information Technology 202… | [0.020736694, -0.041046143, -0.0029773712, -0… |
| 4 | [{'date': '2024-06-26', 'headline': 'QuantumCo… | [{'author': 'Riley Garcia, Senior Tech Analyst… | QuantumComputing Inc | QCMP | {'52_week_range': {'high': 231.91, 'low': 159… | Information Technology | QuantumComputing Inc Information Technology 20… | [-0.009757996, -0.04815674, 0.039611816, 0.023… |
Step 4: MongoDB Vector Database and Connection Setup
MongoDB acts as both an operational and a vector database for the RAG system. MongoDB Atlas specifically provides a database solution that efficiently stores, queries and retrieves vector embeddings.
Creating a database and collection within MongoDB is made simple with MongoDB Atlas.
- First, register for a MongoDB Atlas account. For existing users, sign into MongoDB Atlas.
- Follow the instructions. Select Atlas UI as the procedure to deploy your first cluster.
- Create the database: asset_management_use_case.
- Within the database asset_management_use_case, create the collection market_reports.
- Create a vector search index named vector_index for the market_reports collection. This index enables the RAG application to retrieve records as additional context to supplement user queries via vector search. The index is defined in Atlas with a JSON definition that targets the collection's embedding field.
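As an illustration, a vector search index definition for this collection might look like the output of the sketch below. The field path `embedding` matches the column shown earlier; the 1024 dimensions (which assume Cohere's `embed-english-v3.0` model) and the cosine similarity metric are assumptions, as is the helper name `vector_index_definition`.

```python
import json


def vector_index_definition(
    path: str = "embedding",
    num_dimensions: int = 1024,  # assumption: must match the embedding model's output size
    similarity: str = "cosine",  # assumption: "euclidean" and "dotProduct" are alternatives
) -> dict:
    """Build an Atlas Vector Search index definition for one vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# Print the definition as JSON, ready to paste into the Atlas UI's JSON editor.
print(json.dumps(vector_index_definition(), indent=2))
```

The numDimensions value has to equal the length of the stored vectors, so it should be adjusted if a different embedding model is used.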
Your vector search index created on MongoDB Atlas should look like below:
Follow MongoDB’s steps to get the connection string from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.
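A minimal connection sketch with PyMongo follows, assuming the URI is supplied by the caller (for example from an environment variable). The function names `connect_to_mongo` and `get_collection` are hypothetical; `MongoClient` and the `ping` command are standard PyMongo usage.

```python
def connect_to_mongo(uri: str):
    """Open a client for the Atlas cluster and verify it is reachable."""
    # Imported lazily so get_collection below works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    client.admin.command("ping")  # raises an exception if the cluster cannot be reached
    print("Connection to MongoDB successful")
    return client


def get_collection(
    client,
    db_name: str = "asset_management_use_case",
    collection_name: str = "market_reports",
):
    """Return the collection used throughout this walkthrough."""
    return client[db_name][collection_name]


# Usage (requires the `pymongo` package and a valid connection string):
# client = connect_to_mongo(mongo_uri)
# collection = get_collection(client)
```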
Connection to MongoDB successful
Step 5: Data Ingestion
MongoDB’s document model and its compatibility with Python dictionaries offer several benefits for data ingestion.
- Document-oriented structure:
  - MongoDB stores data in JSON-like documents, encoded as BSON (Binary JSON).
  - This aligns naturally with Python dictionaries, so records can be represented directly as key-value pairs.
- Schema flexibility:
  - MongoDB does not enforce a schema by default, so each document in a collection can have a different structure.
  - This flexibility matches Python’s dynamic nature, allowing you to ingest varied data structures without predefined schemas.
- Efficient ingestion:
  - The similarity between Python dictionaries and MongoDB documents allows for direct ingestion without complex transformations.
  - This leads to faster data insertion and reduced processing overhead.
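The points above can be sketched as a small batched ingestion helper. It assumes the rows are already plain Python dictionaries (for example from `df.to_dict("records")`) and that `collection` is a PyMongo collection; the name `ingest_documents` and the batch size are illustrative choices.

```python
from typing import Iterable, List


def ingest_documents(collection, rows: Iterable[dict], batch_size: int = 100) -> int:
    """Insert dictionaries into a collection in batches; return the number inserted."""
    inserted = 0
    batch: List[dict] = []
    for row in rows:
        # Shallow copy: insert_many adds an _id key to each document it is given.
        batch.append(dict(row))
        if len(batch) == batch_size:
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


# Usage with a real collection:
# count = ingest_documents(collection, df.to_dict("records"))
# print("Data ingestion into MongoDB completed")
```

Note that embeddings stored as NumPy arrays would need to be converted to plain lists first, since BSON encoding expects native Python types.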
Data ingestion into MongoDB completed
Step 6: MongoDB Query Language and Vector Search
Query flexibility
MongoDB’s query language is designed to work well with document structures, making it easy to query and manipulate ingested data using familiar Python-like syntax.
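For instance, a PyMongo query filter is just a Python dictionary. The sketch below builds a filter on the `sector` and `ticker` fields shown in the tables above; the helper name `sector_filter` is hypothetical.

```python
from typing import Iterable


def sector_filter(sector: str, tickers: Iterable[str]) -> dict:
    """Build a filter matching companies in a sector whose ticker is in a given set."""
    return {"sector": sector, "ticker": {"$in": list(tickers)}}


# Usage with a real collection, returning only two fields per match:
# cursor = collection.find(
#     sector_filter("Information Technology", ["CDDY", "QCMP"]),
#     {"_id": 0, "company": 1, "ticker": 1},
# )
```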
Aggregation Pipeline
MongoDB’s aggregation pipeline is a powerful feature that allows for complex data processing and analysis within the database. An aggregation pipeline is similar to a pipeline in data engineering or machine learning: stages run sequentially, each taking an input, performing an operation, and passing its output to the next stage.
Stages
Stages are the building blocks of an aggregation pipeline. Each stage represents a specific data transformation or analysis operation. Common stages include:
- $match: Filters documents (similar to WHERE in SQL)
- $group: Groups documents by specified fields
- $sort: Sorts the documents
- $project: Reshapes documents (select, rename, compute fields)
- $limit: Limits the number of documents
- $unwind: Deconstructs array fields
- $lookup: Performs left outer joins with other collections
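Vector search is expressed as a pipeline stage as well. The sketch below builds a two-stage pipeline: `$vectorSearch` followed by a `$project` stage that keeps a few fields and exposes the similarity score. The index name `vector_index` and the `embedding` path come from Step 4; the function name, the `numCandidates` value, and the projected fields are assumptions.

```python
from typing import List, Sequence


def build_vector_search_pipeline(
    query_embedding: Sequence[float],
    limit: int = 5,
    num_candidates: int = 150,  # assumed value; must be at least `limit`
    index_name: str = "vector_index",
    path: str = "embedding",
) -> List[dict]:
    """Build an aggregation pipeline that runs an Atlas vector search."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be greater than or equal to limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_embedding),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "company": 1,
                "reports": 1,
                "combined_attributes": 1,
                # Surface the similarity score computed by $vectorSearch.
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Usage with a real collection:
# results = list(collection.aggregate(build_vector_search_pipeline(query_vector)))
```

The query vector should come from the same embedding model as the stored documents, typically embedded with a query-oriented input type.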
Step 7: Add the Cohere Reranker
Cohere Rerank functions as a second-stage search that can improve the precision of your first-stage search results.
| | reports | company | combined_attributes | score |
|---|---|---|---|---|
| 0 | [{'author': 'Jordan Garcia, Senior Tech Analys… | GreenEnergy Corp | GreenEnergy Corp Information Technology 2023 G… | 0.659524 |
| 1 | [{'author': 'Morgan Smith, Technology Sector L… | BioTech Therapeutics | BioTech Therapeutics Information Technology 20… | 0.646300 |
| 2 | [{'author': 'Casey Davis, Technology Sector Le… | RenewableEnergy Innovations | RenewableEnergy Innovations Information Techno… | 0.645224 |
| 3 | [{'author': 'Morgan Johnson, Technology Sector… | QuantumSensor Corp | QuantumSensor Corp Information Technology 2023… | 0.644383 |
| 4 | [{'author': 'Morgan Williams, Senior Tech Anal… | BioEngineering Corp | BioEngineering Corp Information Technology 202… | 0.643690 |
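A rough sketch of the reranking step follows, assuming the vector search results are a list of dictionaries with `combined_attributes` and `score` keys as in the table above. The function name `rerank_records` and the model name `rerank-english-v3.0` are assumptions; the client is passed in as a parameter.

```python
from typing import List


def rerank_records(
    client,
    query: str,
    records: List[dict],
    top_n: int = 3,
    model: str = "rerank-english-v3.0",  # assumed model name
) -> List[dict]:
    """Rerank vector search results and attach the relevance score to each record."""
    texts = [record["combined_attributes"] for record in records]
    response = client.rerank(query=query, documents=texts, model=model, top_n=top_n)

    reranked = []
    for result in response.results:
        # result.index points back into the list of documents that was sent.
        record = dict(records[result.index])
        record["vector_search_score"] = record.pop("score", None)
        record["relevance_score"] = result.relevance_score
        reranked.append(record)
    return reranked


# Usage with the real SDK:
# import cohere
# co = cohere.Client(api_key)
# top_results = rerank_records(co, user_query, vector_search_results)
```

Keeping both scores side by side makes it easy to see where the reranker reorders the first-stage results.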
| | company | combined_attributes | reports | vector_search_score | relevance_score |
|---|---|---|---|---|---|
| 0 | GreenEnergy Corp | GreenEnergy Corp Information Technology 2023 G… | [{'author': 'Jordan Garcia, Senior Tech Analys… | 0.659524 | 0.000147 |
| 1 | BioEngineering Corp | BioEngineering Corp Information Technology 202… | [{'author': 'Morgan Williams, Senior Tech Anal… | 0.643690 | 0.000065 |
| 2 | QuantumSensor Corp | QuantumSensor Corp Information Technology 2023… | [{'author': 'Morgan Johnson, Technology Sector… | 0.644383 | 0.000054 |
Step 8: Handling User Queries
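One way to handle a user query is to pass the reranked records to Command R+ as grounding documents. The sketch below is illustrative rather than the notebook's actual code: the function names, the `command-r-plus` model identifier, and the choice of `title` and `text` as document keys are assumptions.

```python
from typing import List


def to_chat_documents(records: List[dict]) -> List[dict]:
    """Convert retrieved records into the string-valued documents the chat call accepts."""
    return [
        {"title": record["company"], "text": record["combined_attributes"]}
        for record in records
    ]


def answer_query(client, query: str, records: List[dict], model: str = "command-r-plus") -> str:
    """Ask the generative model to answer the query using the retrieved records."""
    response = client.chat(
        message=query,
        documents=to_chat_documents(records),
        model=model,
    )
    return response.text


# Usage with the real SDK:
# import cohere
# co = cohere.Client(api_key)
# print("Final answer:", answer_query(co, user_query, top_results))
```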
Final answer: Here is an overview of the companies with negative market reports or sentiment that might deter long-term investment:
GreenEnergy Corp (GRNE):
- Challenges: Despite solid financial performance and a positive market position, GRNE faces challenges due to the volatile political environment and rising trade tensions, resulting in increased tariffs and supply chain disruptions.
- Regulatory Scrutiny: The company is under scrutiny for its data handling practices, raising concerns about potential privacy breaches and ethical dilemmas.
BioEngineering Corp (BENC):
- Regulatory Hurdles: BENC faces delays in obtaining approvals for certain products due to stringent healthcare regulations, impacting their time-to-market.
- Reimbursement and Pricing Pressures: As healthcare costs rise, the company must carefully navigate pricing strategies to balance accessibility and profitability.
- Research and Development Expenses: BENC has experienced a significant increase in R&D expenses, which may impact its ability to maintain a competitive pricing strategy.
QuantumSensor Corp (QSCP):
- Supply Chain Disruptions: QSCP has faced supply chain issues due to global logistics problems and geopolitical tensions, impacting production and delivery.
- Regulatory Scrutiny: The company is under scrutiny for its data collection and handling practices, with potential privacy and ethical concerns.
- Technical Workforce Challenges: Attracting and retaining skilled talent in a competitive market has been challenging for QSCP.