Article Recommender with Text Embedding, Classification, and Extraction
This is a simple demonstration of how we can stack multiple NLP models together
to get an output much closer to our desired outcome.
Embeddings can capture the meaning of a piece of text beyond keyword matching. In this article, we will build a simple news article recommender system that computes the embeddings of all available articles and recommends the most relevant ones based on embedding similarity.
We will also make the recommendation tighter by using text classification to recommend only articles within the same category. We will then extract a list of tags from each recommended article, which can further help readers discover new articles.
All this will be done via three Cohere API endpoints stacked together: Embed, Classify, and Chat.
We will implement the following steps:
1: Find the articles most similar to the one the reader is currently reading, using embeddings.
2: Keep only articles of the same category using text classification.
3: Extract tags from these articles.
4: Show the top 5 recommended articles.
1.1: Get articles
Throughout this article, we’ll use the BBC news article dataset as an example [Source]. This dataset consists of articles from a few categories: business, politics, tech, entertainment, and sport.
We’ll extract a subset of the data and, in Step 1, use the first 100 data points.
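To make the steps concrete, here is a minimal sketch of loading the data with pandas, assuming the dataset is available locally as a CSV file with `Text` and `Category` columns (the file name is hypothetical):

```python
import pandas as pd

# Load the BBC news dataset (file name is a placeholder)
df = pd.read_csv("bbc_news_subset.csv")

# Step 1 uses only the article text of the first 100 data points
df_articles = df[["Text"]].iloc[:100]
print(df_articles.head())
```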
| | Text |
|---|---|
| 0 | worldcom ex-boss launches defence lawyers defe… |
| 1 | german business confidence slides german busin… |
| 2 | bbc poll indicates economic gloom citizens in … |
| 3 | lifestyle governs mobile choice faster bett… |
| 4 | enron bosses in $168m payout eighteen former e… |
1.2: Turn articles into embeddings
Next we turn each article text into embeddings. An embedding is a list of numbers that our models use to represent a piece of text, capturing its context and meaning.
We do this by calling Cohere’s Embed endpoint, which takes in text as input and returns embeddings as output.
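A minimal sketch of this call with the Cohere Python SDK might look as follows; the model name and `input_type` value are assumptions and may differ depending on the SDK and model version you use:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

# Embed all article texts in one call
response = co.embed(
    texts=df_articles["Text"].tolist(),
    model="embed-english-v3.0",    # assumed model name
    input_type="search_document",  # required by v3 embedding models
)
embeddings = response.embeddings   # one vector per article
```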
1.3: Pick one article and find the most similar articles
Next, we pick any one article to be the one the reader is currently reading (let’s call this the target) and find other articles with the most similar embeddings (let’s call these candidates) using cosine similarity.
Cosine similarity is a metric that measures how similar sequences of numbers are (embeddings in our case), and we compute it for each target-candidate pair.
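A sketch of the similarity computation with NumPy, using the `embeddings` list from the previous step (the choice of target index is arbitrary):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Treat article 0 as the target; rank all other articles as candidates
target_idx = 0
scores = [
    (i, cosine_similarity(embeddings[target_idx], emb))
    for i, emb in enumerate(embeddings)
    if i != target_idx
]
scores.sort(key=lambda x: x[1], reverse=True)  # most similar first
```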
Two articles may be similar but they may not necessarily belong to the same category. For example, an article about a sports team manager facing a fine may be similar to another about a business entity facing a fine, but they are not of the same category.
Perhaps we can make the system better by only recommending articles of the same category. For this, let’s build a news category classifier.
2.1: Build a classifier
We use Cohere’s Classify endpoint to build a news category classifier, classifying articles into five categories: Business, Politics, Tech, Entertainment, and Sport.
A typical text classification model requires hundreds or thousands of data points to train, but with this endpoint, we can build a classifier with as few as five examples per class.
To build the classifier, we need a set of examples consisting of text (news text) and labels (news category). The BBC News dataset happens to have both (the ‘Text’ and ‘Category’ columns), so this time we’ll use the categories to build our examples. For this, we will set aside another portion of the dataset.
| | Text | Category |
|---|---|---|
| 100 | honda wins china copyright ruling japan s hond… | business |
| 101 | ukip could sue veritas defectors the uk indepe… | politics |
| 102 | security warning over fbi virus the us feder… | tech |
| 103 | europe backs digital tv lifestyle how people r… | tech |
| 104 | celebrities get to stay in jungle all four con… | entertainment |
With the Classify endpoint, there is a limit of 512 tokens per input. This means full articles won’t be able to fit in the examples, so we will approximate and limit each article to its first 300 characters.
The Classify endpoint needs a minimum of 2 examples for each category. We’ll have 5 examples each, sampled randomly from the dataset. We have 5 categories, so we will have a total of 25 examples.
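Here is a sketch of assembling those examples; the index range is illustrative, and the `ClassifyExample` import reflects one recent version of the Cohere Python SDK and may vary:

```python
from cohere import ClassifyExample

NUM_PER_CLASS = 5
MAX_CHARS = 300  # approximate the 512-token limit

# Set aside a different portion of the dataset for the examples
df_examples = df.iloc[100:200]

examples = []
for category, group in df_examples.groupby("Category"):
    sampled = group.sample(NUM_PER_CLASS, random_state=42)
    for _, row in sampled.iterrows():
        examples.append(
            ClassifyExample(text=row["Text"][:MAX_CHARS], label=category)
        )
```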
Once the examples are ready, we can now get the classifications. Here is a function that returns the classification given an input.
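A sketch of such a function, reusing the client and examples defined above:

```python
def classify_text(text: str) -> str:
    """Return the predicted news category for a piece of text."""
    response = co.classify(
        inputs=[text[:300]],  # stay within the input length limit
        examples=examples,
    )
    return response.classifications[0].prediction
```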
2.2: Measure its performance
Before actually using the classifier, let’s first test its performance. Here we take another 100 data points as the test dataset, and the classifier will predict the class, i.e. the news category, of each.
| | Text | Category |
|---|---|---|
| 200 | sa return to mauritius top seeds south africa … | sport |
| 201 | snow patrol feted at irish awards snow patrol … | entertainment |
| 202 | clyde 0-5 celtic celtic brushed aside clyde to… | sport |
| 203 | bad weather hits nestle sales a combination of… | business |
| 204 | net fingerprints combat attacks eighty large n… | tech |
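A sketch of the evaluation loop, assuming the test set occupies rows 200–299 of the dataset:

```python
# Predict the category of each test article and compare with the label
df_test = df.iloc[200:300]
predictions = [classify_text(text) for text in df_test["Text"]]
accuracy = sum(
    pred == label for pred, label in zip(predictions, df_test["Category"])
) / len(df_test)
print(f"Accuracy: {accuracy:.0%}")
```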
We get a good accuracy score of 91%, so the classifier is ready to be implemented in our recommender system.
3.1: Extract tags
We now proceed to the tag extraction step. Compared to the previous two steps, this step is not about sorting or filtering articles, but rather enriching them with more information.
We do this with the Chat endpoint.
We call the endpoint by specifying a few settings, and it will generate the corresponding extractions.
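A minimal sketch of a tag-extraction call; the prompt wording and model name are illustrative assumptions:

```python
def extract_tags(article_text: str) -> str:
    """Ask the model for a short list of tags describing an article."""
    response = co.chat(
        model="command-r",  # assumed model name
        message=(
            "Extract a comma-separated list of tags from the following "
            "news article:\n\n" + article_text[:300]
        ),
    )
    return response.text.strip()
```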
4.1: Putting it all together
Let’s now put everything together for our article recommender system.
First, we select the target article and compute the similarity scores against the candidate articles.
Next, we filter the articles via classification. Finally, we extract the keywords from each article and show the recommendations.
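A sketch of the complete flow, combining the pieces defined in the earlier sections:

```python
TOP_N = 5

# Classify the target article so we can filter candidates by category
target_category = classify_text(df_articles["Text"][target_idx])

recommendations = []
for idx, score in scores:  # candidates ranked by similarity (Section 1.3)
    text = df_articles["Text"][idx]
    if classify_text(text) != target_category:
        continue  # skip articles from other categories
    recommendations.append((idx, score, extract_tags(text)))
    if len(recommendations) == TOP_N:
        break

for idx, score, tags in recommendations:
    print(f"[{idx}] similarity={score:.2f} | tags: {tags}")
```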
Keeping to the Section 1.3 example, here we see how the classification and extraction steps have improved our recommendation outcome.
First, the article with ID 73 (a non-sport article) no longer gets recommended. Second, each recommended article now comes with a list of generated tags.
Let’s try a couple of other articles in business and tech and see the output…
Business article (returning recommendations around German economy and economic growth/slump):
Tech article (returning recommendations around consumer devices):
In conclusion, this article demonstrates how we can stack multiple NLP endpoints together to get an output much closer to our desired outcome.
In practice, hosting and maintaining multiple models can quickly turn into a complex undertaking. But by leveraging Cohere endpoints, this task is reduced to a few simple API calls.