Topic Modeling AI Papers

Natural Language Processing (NLP) is a hot topic in machine learning. It involves analyzing and understanding text-based data. Clustering algorithms are quite popular in NLP. They group a set of unlabeled texts so that texts in the same cluster are more similar to one another than to texts in other clusters. Topic modeling is one application of clustering in NLP. It uses unsupervised learning to extract topics from a collection of documents. Other applications include automatic document organization and fast information retrieval or filtering.

You'll learn how to use Cohere’s NLP tools to perform semantic search and clustering of AI papers. This will help you discover trends in AI. You'll scrape the Journal of Artificial Intelligence Research to get a list of recently published AI papers. You’ll then use Cohere’s Embed Endpoint to generate embeddings for the paper titles. Finally, you'll visualize the embeddings and use them to build semantic search and topic modeling.

To follow along with this tutorial, you need to be familiar with Python. Make sure you have Python 3.6 or later installed on your development machine. You can also use Google Colab to try out the project in the cloud. Finally, you need a Cohere account. If you haven’t signed up already, register for a new Cohere account. All new accounts receive $75 in free credits. Once you've used up your credits, you can switch to a Pay-As-You-Go option.

First, install the Python dependencies required to run the project. Use pip to install them with the command below:

pip install requests beautifulsoup4 cohere altair clean-text numpy pandas scikit-learn > /dev/null

Create a new Python file and write all your code in this file. Import the dependencies and initialize Cohere’s client.

import cohere

api_key = '<API_KEY>' 
co = cohere.Client(api_key)
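
Hard-coding the key is fine for a quick experiment, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead follows. The variable name `COHERE_API_KEY` and the `load_api_key` helper are assumptions made for illustration, not something the Cohere SDK requires.

```python
import os

def load_api_key(var_name="COHERE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Usage (requires the variable to be set in your shell):
# co = cohere.Client(load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error from the first API call.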

This tutorial focuses on applying topic modeling to look for recent trends in AI. This task requires you to source a list of titles for AI papers. Use web scraping techniques to collect a list of AI papers. Use the Journal of Artificial Intelligence Research as your data source. Finally, you will clean this data by removing unwanted characters and stop words.

First, import the libraries required to make web requests and process the web content.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from cleantext import clean

Next, make an HTTP request to the source website that has an archive of the AI papers.

URL = ""
page = requests.get(URL)

Use this archive to get the list of published AI papers. The archive lists papers published since 2015. This tutorial considers only recent papers, published in 2020 or later.

soup = BeautifulSoup(page.content, "html.parser")
archive_links = []

for link in soup.select('a.title'):
  vol = link.text
  link = link.get('href')
  year = int(vol[vol.find("(")+1:vol.find(")")])
  if year >= 2020:
    archive_links.append({ 'year': year, 'link': link })
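
The slicing on `vol.find(...)` assumes every link text contains a year in parentheses. A slightly more defensive sketch uses a regular expression and skips labels that do not match. The label format shown here ("Vol. 73 (2022)") and the `parse_year` helper are assumptions for illustration, so check them against the actual archive page.

```python
import re

# Matches a four-digit year wrapped in parentheses, e.g. "(2022)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)")

def parse_year(label):
    """Return the year found in a volume label, or None if there is none."""
    match = YEAR_PATTERN.search(label)
    return int(match.group(1)) if match else None

# Hypothetical labels, used only to show the behaviour.
labels = ["Vol. 73 (2022)", "Vol. 64 (2019)", "About the journal"]
recent = [lbl for lbl in labels if (parse_year(lbl) or 0) >= 2020]
print(recent)  # ['Vol. 73 (2022)']
```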

Finally, fetch each archive page, collect the paper titles, and clean them. Remove trailing whitespace and unwanted characters. Use the NLTK library to get English stop words and filter them out.

papers = []
for archive in archive_links:
  page = requests.get(archive['link'])
  soup = BeautifulSoup(page.content, "html.parser")
  links = soup.select('a')  # selector prefix is missing in the source; narrow it to the paper-title links
  for link in links:
    # clean the title
    title = clean(text=link.text,
            replace_with_url="This is a URL")
    papers.append({ 'year': archive['year'], 'title': title, 'link': link.get('href') })
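
The snippet above handles the character-level cleaning; stop-word removal is not shown. A minimal sketch of that step follows. To stay self-contained it uses a tiny hand-written stop-word set. In practice you would likely swap in NLTK's English list (`nltk.corpus.stopwords.words('english')`, after downloading the `stopwords` corpus). The `remove_stop_words` helper is hypothetical.

```python
# A tiny stand-in for a real stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def remove_stop_words(title, stop_words=STOP_WORDS):
    """Drop stop words from a whitespace-tokenised title."""
    kept = [word for word in title.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words("a survey of opponent modeling in adversarial domains"))
# survey opponent modeling adversarial domains
```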

The dataset created using this process has 258 AI papers published between 2020 and 2022. Use the pandas library to create a data frame to hold the text data.

df = pd.DataFrame(papers)

Word embedding is a technique for learning numerical representations of words. Words with similar meanings have similar representations. You can use these embeddings to:

• cluster large amounts of text
• match a query with other similar sentences
• perform classification tasks like sentiment classification

Cohere’s platform provides an Embed Endpoint that returns text embeddings. An embedding is a list of floating-point numbers that captures the semantic meaning of the text it represents. Models used to create these embeddings are available in 3 sizes: small, medium, and large. Small models are faster while large models offer better performance.

Write a function to create the word embeddings using Cohere. The function should read as follows:

def get_embeddings(text,model='medium'):
  # Embed a single text and return its embedding vector
  output = co.embed(
                model=model,
                texts=[text])
  return output.embeddings[0]

Create a new column in your pandas data frame to hold the embeddings created.

df['title_embeds'] = df['title'].apply(get_embeddings)
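
`apply(get_embeddings)` sends one request per title. The Embed endpoint accepts a list of texts, so the titles can also be sent in batches. The sketch below shows only the batching logic. The `embed_in_batches` helper, its `embed_fn` parameter, and the default batch size are assumptions, so check the current Cohere documentation for the actual per-request limit.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts in chunks, preserving the input order.

    embed_fn takes a list of strings and returns one embedding per string.
    The default batch size is arbitrary (an assumption for this sketch).
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# With the client from earlier, embed_fn could look like this (untested here):
# embed_fn = lambda batch: co.embed(model='medium', texts=batch).embeddings
# df['title_embeds'] = embed_in_batches(df['title'].tolist(), embed_fn)
```

Passing the embedding call in as a function keeps the batching logic easy to test without network access.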

Congratulations! You have created the word embeddings. Now, you will visualize the embeddings using a scatter plot. First, you need to reduce the dimensions of the word embeddings. You’ll use Principal Component Analysis (PCA) to achieve this. Import the necessary packages and create a function to return the principal components.

from sklearn.decomposition import PCA

def get_pc(arr,n):
  pca = PCA(n_components=n)
  embeds_transform = pca.fit_transform(arr)
  return embeds_transform

Next, create a function to generate a scatter plot chart. You’ll use the altair library to create the charts.

import altair as alt

def generate_chart(df,xcol,ycol,lbl='off',color='basic',title=''):
  # Base scatter plot with the axis decorations hidden
  chart = alt.Chart(df).mark_circle(size=500).encode(
    x= alt.X(xcol,
      axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    y= alt.Y(ycol,
      axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    color= alt.value('#333293') if color == 'basic' else color,
  )

  # Optionally label each point with its title
  if lbl == 'on':
    text = chart.mark_text(align='left', baseline='middle',dx=15, size=13,color='black').encode(text='title', color= alt.value('black'))
  else:
    text = chart.mark_text(align='left', baseline='middle',dx=10).encode()

  result = (chart + text).configure(background="#FDF7F0"
  ).properties(title=title).configure_legend(
  orient='bottom', titleFontSize=18,labelFontSize=18)
  return result

Finally, use the embeddings with reduced dimensionality to create a scatter plot.

sample = 200
embeds = np.array(df['title_embeds'].tolist())
embeds_pc2 = get_pc(embeds,2)
df_pc2 = pd.concat([df, pd.DataFrame(embeds_pc2)], axis=1)

df_pc2.columns = df_pc2.columns.astype(str)

     year                                              title  \
0    2022  metric-distortion bounds under limited informa...   
1    2022  recursion in abstract argumentation is hard --...   
2    2022  crossing the conversational chasm: a primer on...   
3    2022  hebo: pushing the limits of sample-efficient h...   
4    2022  learning bayesian networks under sparsity cons...   
..    ...                                                ...   
195  2020  using machine learning for decreasing state un...   
196  2020  mapping the landscape of artificial intelligen...   
197  2020  contrasting the spread of misinformation in on...   
198  2020  to regulate or not: a social dynamics analysis...   
199  2020  qualitative numeric planning: reductions and c...   

                                                  link  \
..                                                 ...   

                                          title_embeds          0          1  
0    [-0.7879443, 0.14064652, -1.1886923, 1.0581255...   3.773588  13.307009  
1    [0.37486058, 2.2867563, 0.48023587, -1.3632454...  13.585676   8.620154  
2    [-0.7070194, -0.5557753, 2.6077378, 0.11462678...  17.785715  -8.959141  
3    [-0.19081053, 0.05036301, -0.48858774, 0.66812...  -0.499191  -5.828358  
4    [0.84096915, -1.0650194, -0.8836163, -1.631231...  -1.372444  -5.549065  
..                                                 ...        ...        ...  
195  [-1.4959816, 0.8587867, 1.1109167, -0.9420541,... -12.141110 -21.291473  
196  [-1.7567614, 0.12333965, -0.41682896, 0.820096...   9.577528  -8.201695  
197  [-0.25555933, -1.6548307, -1.1497015, -1.00241...   4.017263  16.588557  
198  [-1.1415248, 0.9333024, 0.18291989, 2.2976398,...  -2.903703   6.331687  
199  [-0.8152119, 0.7301979, -1.6634299, -0.395152,... -12.204317 -11.396169  

[200 rows x 6 columns]

Here’s a chart showing the word embeddings for the AI papers. Note that the chart covers a sample of 200 papers.

generate_chart(df_pc2.iloc[:sample],'0','1',title='2D Embeddings')

Traditional search techniques rely on keywords to retrieve text-based information. Semantic search goes a step further: it uses the intent and contextual meaning of the search query. In this section, you’ll use Cohere to create an embedding for the search query, then compare it with your dataset’s embeddings. The output is a list of similar AI papers.

First, create a function to get the similarity between a target embedding and a set of candidate embeddings. It uses the cosine similarity function from the scikit-learn library.

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target,candidates):
  # Turn list into array
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)

  # Calculate cosine similarity
  sim = cosine_similarity(target,candidates)
  sim = np.squeeze(sim).tolist()
  sort_index = np.argsort(sim)[::-1]
  sort_score = [sim[i] for i in sort_index]
  similarity_scores = zip(sort_index,sort_score)

  # Return similarity scores
  return similarity_scores
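
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. A plain-Python sketch on toy vectors follows. The `cosine` helper and the vectors are illustrative only; the tutorial itself uses scikit-learn's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal -> 0.0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> -1.0
```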

Next, create an embedding for the search query:

new_query = "graph network strategies"

new_query_embeds = get_embeddings(new_query)

Finally, compute the similarity between the query embedding and the paper embeddings, and display the papers ranked from most to least similar:

similarity = get_similarity(new_query_embeds,embeds[:sample])


print('Similar queries:')
for idx,sim in similarity:
  print(f'Similarity: {sim:.2f};',df.iloc[idx]['title'])

graph network strategies 

Similar queries:
Similarity: 0.49; amp chain graphs: minimal separators and structure learning algorithms
Similarity: 0.46; pure nash equilibria in resource graph games
Similarity: 0.44; general value function networks
Similarity: 0.42; on the online coalition structure generation problem
Similarity: 0.42; efficient local search based on dynamic connectivity maintenance for minimum connected dominating set
Similarity: 0.42; graph kernels: a survey
Similarity: 0.39; rwne: a scalable random-walk based network embedding framework with personalized higher-order proximity preserved
Similarity: 0.39; the petlon algorithm to plan efficiently for task-level-optimal navigation
Similarity: 0.38; election manipulation on social networks: seeding, edge removal, edge addition
Similarity: 0.38; a semi-exact algorithm for quickly computing a maximum weight clique in large sparse graphs
Similarity: 0.37; probabilistic temporal networks with ordinary distributions: theory, robustness and expected utility
Similarity: 0.36; adaptive greedy versus non-adaptive greedy for influence maximization
Similarity: 0.35; classifier chains: a review and perspectives
Similarity: 0.35; learning bayesian networks under sparsity constraints: a parameterized complexity analysis
Similarity: 0.34; optimally deceiving a learning leader in stackelberg games
Similarity: 0.34; planning high-level paths in hostile, dynamic, and uncertain environments
Similarity: 0.34; computational complexity of computing symmetries in finite-domain planning
Similarity: 0.33; on the indecisiveness of kelly-strategyproof social choice functions
Similarity: 0.33; qualitative numeric planning: reductions and complexity
Similarity: 0.33; intelligence in strategic games
Similarity: 0.33; steady-state planning in expected reward multichain mdps
Similarity: 0.32; hybrid-order network consensus for distributed multi-agent systems
Similarity: 0.32; evaluating strategic structures in multi-agent inverse reinforcement learning
Similarity: 0.32; multi-agent advisor q-learning
Similarity: 0.32; fairness in influence maximization through randomization
Similarity: 0.32; a sufficient statistic for influence in structured multiagent environments
Similarity: 0.32; constraint and satisfiability reasoning for graph coloring
Similarity: 0.31; sum-of-products with default values: algorithms and complexity results
Similarity: 0.31; contrastive explanations of plans through model restrictions
Similarity: 0.31; zone path construction (zac) based approaches for effective real-time ridesharing
Similarity: 0.31; an external knowledge enhanced graph-based neural network for sentence ordering
Similarity: 0.31; finding and recognizing popular coalition structures
Similarity: 0.30; constraint-based diversification of jop gadgets
Similarity: 0.30; improved high dimensional discrete bayesian network inference using triplet region construction
Similarity: 0.30; migrating techniques from search-based multi-agent path finding solvers to sat-based approach
Similarity: 0.30; proactive dynamic distributed constraint optimization problems
Similarity: 0.30; safe multi-agent pathfinding with time uncertainty
Similarity: 0.30; a logic-based explanation generation framework for classical and hybrid planning problems
Similarity: 0.30; scalable online planning for multi-agent mdps
Similarity: 0.30; on sparse discretization for graphical games
Similarity: 0.30; fond planning with explicit fairness assumptions
Similarity: 0.29; jointly learning environments and control policies with projected stochastic gradient ascent
Similarity: 0.29; computational benefits of intermediate rewards for goal-reaching policy learning
Similarity: 0.29; two-phase multi-document event summarization on core event graphs
Similarity: 0.29; efficient large-scale multi-drone delivery using transit networks
Similarity: 0.29; fast adaptive non-monotone submodular maximization subject to a knapsack constraint
Similarity: 0.29; path counting for grid-based navigation
Similarity: 0.29; preferences single-peaked on a tree: multiwinner elections and structural results
Similarity: 0.29; merge-and-shrink: a compositional theory of transformations of factored transition systems
Similarity: 0.29; lilotane: a lifted sat-based approach to hierarchical planning
Similarity: 0.29; cost-optimal planning, delete relaxation, approximability, and heuristics
Similarity: 0.29; learning over no-preferred and preferred sequence of items for robust recommendation
Similarity: 0.28; a few queries go a long way: information-distortion tradeoffs in matching
Similarity: 0.28; constrained multiagent markov decision processes: a taxonomy of problems and algorithms
Similarity: 0.28; contiguous cake cutting: hardness results and approximation algorithms
Similarity: 0.28; online relaxation refinement for satisficing planning: on partial delete relaxation, complete hill-climbing, and novelty pruning
Similarity: 0.28; playing codenames with language graphs and word embeddings
Similarity: 0.28; a theoretical perspective on hyperdimensional computing
Similarity: 0.28; reasoning with pcp-nets
Similarity: 0.28; improving the effectiveness and efficiency of stochastic neighbour embedding with isolation kernel
Similarity: 0.27; teaching people by justifying tree search decisions: an empirical study in curling
Similarity: 0.27; learning optimal decision sets and lists with sat
Similarity: 0.27; efficient multi-objective reinforcement learning via multiple-gradient descent with iteratively discovered weight-vector sets
Similarity: 0.27; liquid democracy: an algorithmic perspective
Similarity: 0.27; contrasting the spread of misinformation in online social networks
Similarity: 0.27; on the computational complexity of non-dictatorial aggregation
Similarity: 0.27; planning with critical section macros: theory and practice
Similarity: 0.27; regarding goal bounding and jump point search
Similarity: 0.27; using machine learning for decreasing state uncertainty in planning
Similarity: 0.26; strategyproof mechanisms for additively separable and fractional hedonic games
Similarity: 0.26; ranking sets of objects: the complexity of avoiding impossibility results
Similarity: 0.26; socially responsible ai algorithms: issues, purposes, and challenges
Similarity: 0.26; inductive logic programming at 30: a new introduction
Similarity: 0.26; to regulate or not: a social dynamics analysis of an idealised ai race
Similarity: 0.25; learning temporal causal sequence relationships from real-time time-series
Similarity: 0.25; integrated offline and online decision making under uncertainty
Similarity: 0.25; the computational complexity of understanding binary classifier decisions
Similarity: 0.25; a survey of opponent modeling in adversarial domains
Similarity: 0.25; cooperation and learning dynamics under wealth inequality and diversity in individual risk
Similarity: 0.25; sunny-as2: enhancing sunny for algorithm selection
Similarity: 0.25; game plan: what ai can do for football, and what football can do for ai
Similarity: 0.25; approximating perfect recall when model checking strategic abilities: theory and applications
Similarity: 0.25; efficient retrieval of matrix factorization-based top-k recommendations: a survey of recent approaches
Similarity: 0.25; evolutionary dynamics and phi-regret minimization in games
Similarity: 0.24; labeled bipolar argumentation frameworks
Similarity: 0.24; optimal any-angle pathfinding on a sphere
Similarity: 0.24; learning from disagreement: a survey
Similarity: 0.24; on the cluster admission problem for cloud computing
Similarity: 0.24; aggregation over metric spaces: proposing and voting in elections, budgeting, and legislation
Similarity: 0.24; ordinal maximin share approximation for goods
Similarity: 0.24; constraint solving approaches to the business-to-business meeting scheduling problem
Similarity: 0.24; objective bayesian nets for integrating consistent datasets
Similarity: 0.24; quantum mathematics in artificial intelligence
Similarity: 0.24; task-aware verifiable rnn-based policies for partially observable markov decision processes
Similarity: 0.23; samba: a generic framework for secure federated multi-armed bandits
Similarity: 0.23; automated reinforcement learning (autorl): a survey and open problems
Similarity: 0.23; generic constraint-based block modeling using constraint programming
Similarity: 0.23; a comprehensive framework for learning declarative action models
Similarity: 0.23; optimizing for interpretability in deep neural networks with tree regularization
Similarity: 0.22; impact of imputation strategies on fairness in machine learning
Similarity: 0.22; set-to-sequence methods in machine learning: a review
Similarity: 0.22; the ai liability puzzle and a fund-based work-around
Similarity: 0.22; agent-based markov modeling for improved covid-19 mitigation policies
Similarity: 0.22; the parameterized complexity of motion planning for snake-like robots
Similarity: 0.22; induction and exploitation of subgoal automata for reinforcement learning
Similarity: 0.22; point at the triple: generation of text summaries from knowledge base triples
Similarity: 0.22; trends in integration of vision and language research: a survey of tasks, datasets, and methods
Similarity: 0.21; multi-document summarization with determinantal point process attention
Similarity: 0.21; reward machines: exploiting reward function structure in reinforcement learning
Similarity: 0.21; computing bayes-nash equilibria in combinatorial auctions with verification
Similarity: 0.21; taking principles seriously: a hybrid approach to value alignment in artificial intelligence
Similarity: 0.21; two-facility location games with minimum distance requirement
Similarity: 0.21; structure from randomness in halfspace learning with the zero-one loss
Similarity: 0.21; avoiding negative side effects of autonomous systems in the open world
Similarity: 0.21; learning realistic patterns from visually unrealistic stimuli: generalization and data anonymization
Similarity: 0.21; properties of switch-list representations of boolean functions
Similarity: 0.21; conceptual modeling of explainable recommender systems: an ontological formalization to guide their design and development
Similarity: 0.21; predicting decisions in language based persuasion games
Similarity: 0.20; epidemioptim: a toolbox for the optimization of control policies in epidemiological models
Similarity: 0.20; ffci: a framework for interpretable automatic evaluation of summarization
Similarity: 0.20; analysis of the impact of randomization of search-control parameters in monte-carlo tree search
Similarity: 0.20; output space entropy search framework for multi-objective bayesian optimization
Similarity: 0.20; multiobjective tree-structured parzen estimator
Similarity: 0.19; on the decomposition of abstract dialectical frameworks and the complexity of naive-based semantics
Similarity: 0.19; hebo: pushing the limits of sample-efficient hyper-parameter optimisation
Similarity: 0.19; superintelligence cannot be contained: lessons from computability theory
Similarity: 0.19; goal recognition for deceptive human agents through planning and gaze
Similarity: 0.19; dimensional inconsistency measures and postulates in spatio-temporal databases
Similarity: 0.19; explainable deep learning: a field guide for the uninitiated
Similarity: 0.19; multi-label classification neural networks with hard logical constraints
Similarity: 0.19; declarative algorithms and complexity results for assumption-based argumentation
Similarity: 0.19; worst-case bounds on power vs. proportion in weighted voting games with an application to false-name manipulation
Similarity: 0.19; collie: continual learning of language grounding from language-image embeddings
Similarity: 0.18; a word selection method for producing interpretable distributional semantic word vectors
Similarity: 0.18; the bottleneck simulator: a model-based deep reinforcement learning approach
Similarity: 0.18; metric-distortion bounds under limited information
Similarity: 0.18; neural natural language generation: a survey on multilinguality, multimodality, controllability and learning
Similarity: 0.18; on the evolvability of monotone conjunctions with an evolutionary mutation mechanism
Similarity: 0.18; a metric space for point process excitations
Similarity: 0.18; autotelic agents with intrinsically motivated goal-conditioned reinforcement learning: a short survey
Similarity: 0.18; the application of machine learning techniques for predicting match results in team sport: a review
Similarity: 0.18; incremental event calculus for run-time reasoning
Similarity: 0.18; a survey on the explainability of supervised machine learning
Similarity: 0.18; madras : multi agent driving simulator
Similarity: 0.17; maximin share allocations on cycles
Similarity: 0.17; the rediscovery hypothesis: language models need to meet linguistics
Similarity: 0.17; some inapproximability results of map inference and exponentiated determinantal point processes
Similarity: 0.17; visually grounded models of spoken language: a survey of datasets, architectures and evaluation techniques
Similarity: 0.17; the complexity landscape of outcome determination in judgment aggregation
Similarity: 0.17; survey and evaluation of causal discovery methods for time series
Similarity: 0.17; agent-based modeling for predicting pedestrian trajectories around an autonomous vehicle
Similarity: 0.16; rethinking fairness: an interdisciplinary survey of critiques of hegemonic ml fairness approaches
Similarity: 0.16; mapping the landscape of artificial intelligence applications against covid-19
Similarity: 0.16; crossing the conversational chasm: a primer on natural language processing for multilingual task-oriented dialogue systems
Similarity: 0.16; neural machine translation: a review
Similarity: 0.16; viewpoint: ethical by designer - how to grow ethical designers of artificial intelligence
Similarity: 0.16; core challenges in embodied vision-language planning
Similarity: 0.16; neural character-level syntactic parsing for chinese
Similarity: 0.16; marginal distance and hilbert-schmidt covariances-based independence tests for multivariate functional data
Similarity: 0.16; instance-level update in dl-lite ontologies through first-order rewriting
Similarity: 0.16; the societal implications of deep reinforcement learning
Similarity: 0.16; benchmark and survey of automated machine learning frameworks
Similarity: 0.16; experimental comparison and survey of twelve time series anomaly detection algorithms
Similarity: 0.15; annotator rationales for labeling tasks in crowdsourcing
Similarity: 0.15; representative committees of peers
Similarity: 0.15; fine-grained prediction of political leaning on social media with unsupervised deep learning
Similarity: 0.15; out of context: a new clue for context modeling of aspect-based sentiment analysis
Similarity: 0.15; loss functions, axioms, and peer review
Similarity: 0.15; on the tractability of shap explanations
Similarity: 0.15; doubly robust crowdsourcing
Similarity: 0.15; adversarial framework with certified robustness for time-series domain via statistical features
Similarity: 0.14; on quantifying literals in boolean logic and its applications to explainable ai
Similarity: 0.14; bribery and control in stable marriage
Similarity: 0.14; finding the hardest formulas for resolution
Similarity: 0.14; supervised visual attention for simultaneous multimodal machine translation
Similarity: 0.14; a tight bound for stochastic submodular cover
Similarity: 0.14; ethics and governance of artificial intelligence: evidence from a survey of machine learning researchers
Similarity: 0.13; flexible bayesian nonlinear model configuration
Similarity: 0.13; multilingual machine translation: deep analysis of language-specific encoder-decoders
Similarity: 0.13; multilabel classification with partial abstention: bayes-optimal prediction under label independence
Similarity: 0.13; viewpoint: ai as author bridging the gap between machine learning and literary theory
Similarity: 0.13; confident learning: estimating uncertainty in dataset labels
Similarity: 0.13; belief change and 3-valued logics: characterization of 19,683 belief change operators
Similarity: 0.13; get out of the bag! silos in ai ethics education: unsupervised topic modeling analysis of global ai curricula
Similarity: 0.12; on the distortion value of elections with abstention
Similarity: 0.12; measuring the occupational impact of ai: tasks, cognitive abilities and ai benchmarks
Similarity: 0.11; fair division of indivisible goods for a class of concave valuations
Similarity: 0.10; admissibility in probabilistic argumentation
Similarity: 0.10; recursion in abstract argumentation is hard --- on the complexity of semantics based on weak admissibility
Similarity: 0.10; weighted first-order model counting in the two-variable fragment with counting quantifiers
Similarity: 0.10; incompatibilities between iterated and relevance-sensitive belief revision
Similarity: 0.10; automatic recognition of the general-purpose communicative functions defined by the iso 24617-2 standard for dialog act annotation
Similarity: 0.09; nlp methods for extraction of symptoms from unstructured data for use in prognostic covid-19 analytic models
Similarity: 0.09; welfare guarantees in schelling segregation
Similarity: 0.09; casa: conversational aspect sentiment analysis for dialogue understanding
Similarity: 0.09; a survey of algorithms for black-box safety validation of cyber-physical systems
Similarity: 0.07; relevance in belief update
Similarity: 0.06; on super strong eth
Similarity: 0.06; image captioning as an assistive technology: lessons learned from vizwiz 2020 challenge
Similarity: 0.03; confronting abusive language online: a survey from the ethical and human rights perspective
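
The loop above prints the whole ranking. To show only the best matches, the ranked pairs can be truncated before printing. A minimal sketch follows, assuming an iterable of `(index, score)` pairs already sorted from most to least similar, like the one `get_similarity` returns. The `top_k` helper is hypothetical and the pairs below are placeholders.

```python
from itertools import islice

def top_k(ranked_pairs, k=10):
    """Return the first k (index, score) pairs from a ranked iterable."""
    return list(islice(ranked_pairs, k))

# Placeholder pairs standing in for the output of get_similarity.
ranked = zip([7, 2, 5, 0], [0.49, 0.46, 0.44, 0.42])
for idx, score in top_k(ranked, k=2):
    print(f"Similarity: {score:.2f}; paper index {idx}")
```

`islice` works directly on the `zip` object, so the full ranking never needs to be turned into a list first.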

Visualizing semantic search:

Clustering is a process of grouping similar documents into clusters. It allows you to organize many documents into a smaller number of groups. As a result, you can discover emerging patterns in the documents. In this section, you will use the k-Means clustering algorithm to identify the top 5 clusters.

First, import the k-means algorithm from the scikit-learn package. Then configure two variables: the number of clusters and a duplicate dataset.

from sklearn.cluster import KMeans

n_clusters = 5
df_clust = df_pc2.copy()

Next, initialize the k-means model and use it to fit the embeddings to create the clusters.

kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds).tolist()
df_clust['cluster'] = (list(map(str,classes)))

[2, 0, 3, 4, 4, 3, 1, 1, 0, 3, 0, 2, 0, 1, 1, 0, 0, 2, 0, 1, 3, 2, 1, 3, 0, 2, 2, 0, 2, 1, 1, 2, 2, 1, 0, 1, 1, 1, 2, 2, 2, 4, 3, 3, 3, 3, 2, 1, 2, 4, 3, 0, 2, 0, 1, 1, 0, 4, 0, 2, 2, 3, 1, 2, 4, 1, 2, 1, 4, 0, 3, 3, 4, 2, 0, 2, 2, 2, 0, 0, 0, 4, 1, 4, 1, 2, 0, 4, 1, 1, 4, 1, 4, 1, 4, 1, 0, 0, 4, 2, 4, 3, 4, 3, 2, 0, 2, 1, 1, 4, 2, 4, 2, 2, 0, 3, 1, 3, 2, 3, 1, 2, 0, 4, 4, 1, 0, 0, 4, 1, 1, 2, 2, 1, 2, 3, 0, 0, 1, 1, 1, 0, 4, 1, 4, 2, 4, 2, 4, 3, 2, 0, 1, 4, 1, 1, 2, 2, 0, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 0, 4, 2, 4, 2, 1, 0, 3, 0, 1, 0, 2, 2, 1, 4, 1, 3, 4, 1, 0, 2, 1, 2, 0, 2, 4, 1, 4, 2, 2, 1, 0, 0, 1, 0, 2, 1, 0, 4, 1, 4, 0, 2, 1, 4, 1, 3, 2, 4, 2, 0, 1, 0, 3, 0, 2, 4, 1, 1, 3, 2, 3, 1, 3, 4, 2, 2, 0, 1, 1, 1, 4, 1, 0, 4, 3, 2, 2, 2, 2, 0, 1, 3, 1, 3, 4, 2, 4, 2, 1, 3]
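
Cluster ids on their own say little about topics. One simple way to label them is to group the titles by cluster and count the most frequent words in each group. The sketch below does this with the standard library. The `top_words_per_cluster` helper, the tiny stop-word set, and the three-title sample with made-up labels are assumptions for illustration, not output from the model above.

```python
from collections import Counter, defaultdict

# A tiny stand-in for a real stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def top_words_per_cluster(titles, labels, n=3):
    """Map each cluster label to its n most frequent non-stop words."""
    counts = defaultdict(Counter)
    for title, label in zip(titles, labels):
        words = [w for w in title.lower().split() if w not in STOP_WORDS]
        counts[label].update(words)
    return {label: [w for w, _ in c.most_common(n)] for label, c in counts.items()}

# Illustrative sample: titles taken from the list above, labels made up.
titles = [
    "graph kernels: a survey",
    "pure nash equilibria in resource graph games",
    "neural machine translation: a review",
]
labels = ["0", "0", "1"]
print(top_words_per_cluster(titles, labels, n=1))
```

With the tutorial's data, the call would look like `top_words_per_cluster(df_clust['title'], df_clust['cluster'])`.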

Finally, create a scatter plot to visualize the 5 clusters for the sample.

df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust.iloc[:sample],'0','1',lbl='off',color='cluster',title='Clustering with 5 Clusters')

Let's recap the NLP tasks implemented in this tutorial. You’ve created word embeddings, performed a semantic search, and clustered text. Cohere’s platform provides NLP tools that are easy and intuitive to integrate. You can create digital experiences that support powerful NLP capabilities like text clustering. It’s easy to register a Cohere account and gain access to an API key. New Cohere accounts have $75 in free credits for the first 3 months. Cohere also offers a Pay-as-you-go Pricing Model that bills you based on usage.