Topic Modeler

Topic Modeler

Topic modeling is a technique used to extract the main topics from a large collection of text documents. It helps people make sense of large volumes of unstructured text data, such as incoming messages to a chatbot. For example, this can be useful in the context of customer service, where teams can analyze common questions and feedback coming from customers about a product or service, so they can better serve customers in the future.

The Topic Modeler sample app analyzes a dataset composed of commands that people give to their AI-based personal assistant, and it then extracts the dataset’s key topics and themes. It leverages text embeddings, which are numerical representations of text data that captures its meaning, and uses the Cohere Embed endpoint to retrieve these text embeddings. The topics are generated via a clustering algorithm, grouping the various text inputs into different clusters based on the topics they represent.

The steps to build the Topic Modeler are:

Step 1: Load the text dataset
Step 2: Create clusters
Step 3: Get cluster keywords
Step 4: Visualize clusters on a plot

Read on for more details on each of these steps.

Building a Topic Modeling App

Step 1: Load the Text Dataset

First, let’s load the datasets. We’ll use the MASSIVE dataset, which contains a list of commands that people give to their AI-based personal assistant (e.g., Alexa). This type of clustering exploration is similar to how a company would analyze incoming customer messages when designing a chatbot.

# To install required dependencies, run the following command in the command line:
# pip install pandas cohere datasets altair topically umap-learn bertopic

# Import required libraries
import pandas as pd
import numpy as np
import cohere
import umap
import altair as alt
from bertopic import BERTopic
from datasets import load_dataset
from typing import Optional, List

# Get (a small sample) the dataset
dataset = load_dataset("AmazonScience/massive", "en-US", split="train" )

# For a simple demo, try only 100 records
df = pd.DataFrame(dataset).sample(100)

We can then embed the texts, which in this dataset reside in a column titled utt.

# Initialize the Cohere client
co = cohere.Client(api_key)

# Embed with Cohere’s embedding model, then convert into a numpy array
embeddings = co.embed(texts=list(df[‘utt’]),
embeddings = np.array(embeds)

title = "Commands to AI personal assistant"

Step 2: Create Clusters

Now that we have retrieved the text embeddings using the Embed endpoint, we can cluster these texts to understand how they are naturally grouped. Similar types of reviews will be grouped together in clusters.

Here, we apply the k-means clustering algorithm on the embeddings to create a number of clusters. This algorithm will identify which texts belong together, and assign each text to a cluster.

from sklearn.cluster import KMeans

n_clusters = 10

# Load and initialize BERTopic to use KMeans clustering with 8 clusters only.
cluster_model = KMeans(n_clusters=n_clusters)
topic_model = BERTopic(hdbscan_model=cluster_model)

# df is a dataframe. df['title'] is the column of text we're modeling
df['topic'], probabilities = topic_model.fit_transform(df['utt'], embeddings)

Step 3: Get Cluster Keywords

The next step is to generate a cluster name that best represents each cluster. For this, we’ll use the keywords retrieved from BERTopic, which is an open-source library focused on topic modeling (see the introduction to BERTopic on Cohere’s YouTube channel).

keywords = topic_model.generate_topic_labels()
df['cluster_keywords'] = df['topic'].map(lambda x: keywords[x])

Step 4: Visualize Clusters on a Plot

Finally, we’ll visualize the clusters on a plot. Here we use the Altair plotting package, which allows us to create an interactive plot that we can use to explore the text.

def interactive_clusters_scatterplot(
        df: pd.DataFrame,
        fields_in_tooltip: List[str] = None,
        title: str = '',
        title_column: str = 'keywords'
    if fields_in_tooltip is None:
        fields_in_tooltip = ['']

    selection = alt.selection_multi(fields=[title_column], bind='legend')

    chart = alt.Chart(df).transform_calculate(
    ).mark_circle(size=20, stroke='#666', strokeWidth=1, opacity=0.1).encode(
              axis=alt.Axis(labels=False, ticks=False, domain=False)
              axis=alt.Axis(labels=False, ticks=False, domain=False)

        opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
        category={'scheme': 'category20'}
    return chart 

# Reduce dimensions to be able to plot the embeddings
n_neighbors = 15
reducer = umap.UMAP(n_neighbors=n_neighbors)
umap_embeds = reducer.fit_transform(embeddings)
df['x'] = umap_embeds[:, 0]
df['y'] = umap_embeds[:, 1]

# Specify the names of columns to plot

title_column = 'cluster_keywords'
fields_in_tooltip = ['utt',  ‘topic’, 'cluster_keywords']

title = "Commands to AI personal assistant"

chart = interactive_clusters_scatterplot(df,
                                            title=title + " - " + str(n_clusters) + " clusters",


And that concludes the process of creating our Topic Modeler app, built with text embedding, clustering, and keyword extraction. To get started building your own version, create a free Cohere account.