Topic Modeling AI Papers
Natural Language Processing (NLP) is a hot topic in machine learning: it involves analyzing and understanding text-based data. Clustering algorithms are popular in NLP; they group a set of unlabeled texts so that texts in the same cluster are more similar to one another than to texts in other clusters. Topic modeling is one application of clustering in NLP: it uses unsupervised learning to extract topics from a collection of documents. Other applications include automatic document organization and fast information retrieval or filtering.
In this tutorial, you’ll learn how to use Cohere’s NLP tools to perform semantic search and clustering of AI papers, which will help you discover trends in AI. You’ll scrape the Journal of Artificial Intelligence Research to build a list of recently published AI papers, then use Cohere’s Embed endpoint to generate embeddings for those papers. Finally, you’ll visualize the embeddings and build semantic search and topic modeling on top of them.
To follow along with this tutorial, you need to be familiar with Python. Make sure you have Python 3.6+ installed on your development machine, or use Google Colab to try out the project in the cloud. Finally, you need a Cohere account. If you haven’t signed up already, register for a new Cohere account; all new accounts receive $75 in free credits, and a Pay-As-You-Go option is available once you’ve used them up.
First, you need to install the Python dependencies required to run the project. Use pip to install them with the command below:
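A sketch of the install command, assuming the libraries used in the later steps (cohere for embeddings, requests and beautifulsoup4 for scraping, nltk for stop words, pandas for the data frame, scikit-learn for PCA, cosine similarity, and k-means, and altair for plotting):

```shell
pip install cohere pandas numpy requests beautifulsoup4 nltk scikit-learn altair
```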
Create a new Python file named cohere_nlp.py and write all your code in this file. Import the dependencies and initialize Cohere’s client.
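A minimal setup sketch; replace the placeholder string with the API key from your Cohere dashboard:

```python
# cohere_nlp.py -- dependency imports and client setup.
import cohere
import pandas as pd

# Initialize Cohere's client with your API key (placeholder shown here).
co = cohere.Client("YOUR_API_KEY")
```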
This tutorial focuses on applying topic modeling to find recent trends in AI. This task requires a list of AI paper titles, which you’ll collect with web scraping techniques, using the Journal of Artificial Intelligence Research as your data source. Finally, you’ll clean this data by removing unwanted characters and stop words.
First, import the required libraries to make web requests and process the web content.
Next, make an HTTP request to the source website that has an archive of the AI papers.
Use this archive to get the list of published AI papers. The archive goes back to 2015, but this tutorial considers only papers published in 2020 or later.
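A sketch of the scraping step. The archive URL and the CSS classes used in `extract_titles` are assumptions about the page structure, not the journal’s actual markup; adapt the selectors after inspecting the live page:

```python
import requests
from bs4 import BeautifulSoup

# Assumed location of the JAIR issue archive.
ARCHIVE_URL = "https://www.jair.org/index.php/jair/issue/archive"

def fetch_archive_html(url=ARCHIVE_URL):
    """Download the archive page, raising on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def extract_titles(html, min_year=2020):
    """Pull paper titles out of the archive HTML, keeping only
    papers published in min_year or later.
    The tag and class names here are illustrative assumptions."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="paper"):
        title = item.find("a").get_text(strip=True)
        year = int(item.find("span", class_="year").get_text(strip=True))
        if year >= min_year:
            titles.append(title)
    return titles
```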
Finally, clean the titles of the AI papers you gathered. Remove trailing white spaces and unwanted characters, then use the NLTK library to get English stop words and filter them out.
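A cleaning sketch. It lowercases titles so stop-word matching works, and falls back to a small built-in stop-word list if the NLTK corpus isn’t available (run `nltk.download("stopwords")` to get the full list):

```python
import re

try:
    from nltk.corpus import stopwords
    STOP_WORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Fallback subset used only when the NLTK corpus is unavailable.
    STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def clean_title(title):
    """Strip unwanted characters, extra whitespace, and English stop words."""
    title = re.sub(r"[^A-Za-z0-9\s]", " ", title)  # drop punctuation
    words = title.lower().split()                   # split also trims whitespace
    return " ".join(w for w in words if w not in STOP_WORDS)
```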
The dataset created using this process has 258 AI papers published between 2020 and 2022. Use the pandas library to create a data frame to hold your text data.
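The data frame can be built directly from the list of cleaned titles; the two entries below are placeholders standing in for the scraped dataset:

```python
import pandas as pd

# In the real pipeline, titles comes from the scraping and cleaning steps.
titles = ["example paper title one", "example paper title two"]

df = pd.DataFrame({"title": titles})
print(df.shape)  # (rows, columns)
```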
Word embedding is a technique for learning a vector representation of words. Words with similar meanings have similar representations. You can use these embeddings to:
• cluster large amounts of text
• match a query with other similar sentences
• perform classification tasks like sentiment classification
Cohere’s platform provides an Embed endpoint that returns text embeddings. An embedding is a list of floating-point numbers that captures the semantic meaning of the represented text. The models used to create these embeddings are available in three sizes: small, medium, and large. Small models are faster, while large models offer better performance.
Write a function to create the word embeddings using Cohere. The function should read as follows:
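A sketch of the embedding function. The `"small"` model name and call signature follow the classic Cohere Python SDK; newer SDK versions use different model identifiers, so check your SDK’s documentation:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # replace with your API key

def get_embeddings(texts, model="small"):
    """Call Cohere's Embed endpoint and return one vector per input text."""
    response = co.embed(texts=texts, model=model)
    return response.embeddings
```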
Create a new column in your pandas data frame to hold the embeddings created.
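Adding the column looks like this. A stand-in `get_embeddings` returning fixed-size dummy vectors replaces the real Cohere call here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"title": ["paper one", "paper two"]})

def get_embeddings(texts):
    """Dummy stand-in for the Cohere Embed call: one 3-d vector per text."""
    return [[float(len(t)), 1.0, 0.0] for t in texts]

# One embedding per row, stored alongside the title.
df["embedding"] = get_embeddings(df["title"].tolist())
```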
Congratulations! You have created the word embeddings. Now you will visualize them using a scatter plot. First, you need to reduce the dimensionality of the word embeddings; you’ll use the Principal Component Analysis (PCA) method to achieve this task. Import the necessary packages and create a function that returns the principal components.
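A minimal PCA sketch using scikit-learn; it projects the high-dimensional embeddings onto their first two principal components for plotting:

```python
import numpy as np
from sklearn.decomposition import PCA

def get_pc(embeddings, n_components=2):
    """Reduce embeddings to their first n_components principal components."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(np.asarray(embeddings))
```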
Next, create a function to generate a scatter plot chart. You’ll use the altair library to create the charts.
Finally, use the embeddings with reduced dimensionality to create a scatter plot.
Here’s a chart demonstrating the word embeddings for AI papers. It is important to note that the chart represents a sample size of 200 papers.
Traditional search techniques match keywords to retrieve text-based information. Semantic search takes this a level higher by using the intent and contextual meaning of a query. In this section, you’ll use Cohere to create embeddings for the search query, then use those embeddings to compute similarity with your dataset’s embeddings. The output is a list of similar AI papers.
First, create a function to get the similarity between two embeddings. This will use the cosine similarity implementation from the scikit-learn library.
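A sketch of the similarity function, comparing one query embedding against many candidate embeddings at once:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target, candidates):
    """Cosine similarity between one query embedding and many candidates."""
    target = np.asarray(target).reshape(1, -1)
    candidates = np.asarray(candidates)
    return cosine_similarity(target, candidates)[0]
```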
Next, create embeddings for the search query.
Finally, check the similarity between the two embeddings and display the top 10 most similar papers from your result.
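A ranking sketch. In the tutorial, the query embedding comes from the Cohere Embed endpoint; the vectors below are dummies so the snippet is self-contained:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_similar(query_embedding, embeddings, titles, n=10):
    """Return the n titles most similar to the query, with their scores."""
    sims = cosine_similarity(
        np.asarray(query_embedding).reshape(1, -1), np.asarray(embeddings)
    )[0]
    order = np.argsort(-sims)[:n]  # indices of the highest similarities
    return [(titles[i], float(sims[i])) for i in order]

# Dummy data standing in for the query and dataset embeddings.
results = top_similar([1, 0], [[0, 1], [1, 0]], ["paper a", "paper b"], n=1)
```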
Visualizing semantic search: https://github.com/cohere-ai/cohere-developer-experience/blob/main/notebooks/Visualizing_Text_Embeddings.ipynb
Clustering is a process of grouping similar documents into clusters. It allows you to organize many documents into a smaller number of groups. As a result, you can discover emerging patterns in the documents. In this section, you will use the k-Means clustering algorithm to identify the top 5 clusters.
First, import the k-means algorithm from the scikit-learn package. Then configure two variables: the number of clusters and a duplicate dataset.
Next, initialize the k-means model and use it to fit the embeddings to create the clusters.
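A k-means sketch with scikit-learn. Random vectors stand in for the Cohere embeddings so the snippet runs on its own; in the tutorial you would fit on the data frame’s embedding column instead:

```python
import numpy as np
from sklearn.cluster import KMeans

n_clusters = 5  # top 5 clusters, as in the tutorial
rng = np.random.RandomState(0)
embeddings = rng.rand(40, 8)  # stand-in for the real embeddings

# Fit k-means and assign each paper to one of the clusters.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)
```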
Finally, create a scatter plot to visualize the five clusters in your sample.
Let’s recap the NLP tasks implemented in this tutorial: you created word embeddings, performed a semantic search, and clustered text. Cohere’s platform provides NLP tools that are easy and intuitive to integrate, letting you create digital experiences that support powerful NLP capabilities like text clustering. It’s easy to register a Cohere account and gain access to an API key. New Cohere accounts have $75 in free credits for the first three months, and a Pay-As-You-Go pricing model bills you based on usage after that.