Article Recommender with Text Embedding, Classification, and Extraction

This is a simple demonstration of how we can stack multiple NLP models together
to get an output much closer to our desired outcome.

Embeddings can capture the meaning of a piece of text beyond keyword matching. In this article, we will build a simple news article recommender system that computes the embeddings of all available articles and recommends the most relevant ones based on embedding similarity.

We will also make the recommendation tighter by using text classification to recommend only articles within the same category. We will then extract a list of tags from each recommended article, which can further help readers discover new articles.

All this will be done via three Cohere API endpoints stacked together: Embed, Classify, and Chat.

Article recommender with Embed, Classify, and Chat

We will implement the following steps:

1: Find the articles most similar to the one currently being read, using embeddings.

2: Keep only articles of the same category using text classification.

3: Extract tags from these articles.

4: Show the top 5 recommended articles.

PYTHON
! pip install cohere
PYTHON
import numpy as np
import pandas as pd
import re
import cohere

co = cohere.Client("COHERE_API_KEY") # Get your API key: https://dashboard.cohere.com/api-keys
Step 1 - Embed

1.1: Get articles

Throughout this article, we’ll use the BBC news article dataset as an example [Source]. This dataset consists of articles from a few categories: business, politics, tech, entertainment, and sport.

We’ll work with a subset of the data and, in Step 1, use the first 100 data points.

PYTHON
df = pd.read_csv('https://raw.githubusercontent.com/cohere-ai/cohere-developer-experience/main/notebooks/data/bbc_news_subset.csv', delimiter=',')

INP_START = 0
INP_END = 100
df_inputs = df.iloc[INP_START:INP_END]
df_inputs = df_inputs.copy()

df_inputs.drop(['ArticleId','Category'],axis=1,inplace=True)

df_inputs.head()
   Text
0  worldcom ex-boss launches defence lawyers defe…
1  german business confidence slides german busin…
2  bbc poll indicates economic gloom citizens in …
3  lifestyle governs mobile choice faster bett…
4  enron bosses in $168m payout eighteen former e…

1.2: Turn articles into embeddings

Next we turn each article text into embeddings. An embedding is a list of numbers that our models use to represent a piece of text, capturing its context and meaning.

We do this by calling Cohere’s Embed endpoint, which takes in text as input and returns embeddings as output.

PYTHON
articles = df_inputs['Text'].tolist()

output = co.embed(
    model='embed-english-v3.0',
    input_type='search_document',
    texts=articles)
embeds = output.embeddings

print('Number of articles:', len(embeds))
Number of articles: 100

1.3: Pick one article and find the most similar articles

Next, we pick any one article to be the one the reader is currently reading (let’s call this the target) and find other articles with the most similar embeddings (let’s call these candidates) using cosine similarity.

Cosine similarity is a metric that measures how similar two sequences of numbers are (embeddings in our case): the closer the score is to 1, the more closely the two vectors point in the same direction. We compute it for each target-candidate pair.

PYTHON
print(f'Choose one article ID between {INP_START} and {INP_END-1} below...')
Choose one article ID between 0 and 99 below...
PYTHON
READING_IDX = 70

reading = embeds[READING_IDX]
PYTHON
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target,candidates):
    # Turn list into array
    candidates = np.array(candidates)
    target = np.expand_dims(np.array(target),axis=0)

    # Calculate cosine similarity
    similarity_scores = cosine_similarity(target,candidates)
    similarity_scores = np.squeeze(similarity_scores).tolist()

    # Sort by descending order in similarity
    similarity_scores = list(enumerate(similarity_scores))
    similarity_scores = sorted(similarity_scores, key=lambda x:x[1], reverse=True)

    # Return similarity scores
    return similarity_scores
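
For intuition, the same ranking can be sketched without scikit-learn. The block below is a minimal illustration, not the code used in this article: the `cosine` and `rank_by_similarity` helpers and the three-dimensional toy vectors are made up for the example, and real embeddings have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(target, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(target, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy "embeddings" (hypothetical values, for illustration only)
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranked = rank_by_similarity(toy[0], toy)
print(ranked)
```

As in `get_similarity`, the target itself comes first with a score of 1 (up to floating-point rounding), which is why the article skips the first entry when listing candidates.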
PYTHON
similarity = get_similarity(reading,embeds)

print('Target:')
print(f'[ID {READING_IDX}]',df_inputs['Text'][READING_IDX][:100],'...','\n')

print('Candidates:')
for i in similarity[1:6]: # Exclude the target article
    print(f'[ID {i[0]}]',df_inputs['Text'][i[0]][:100],'...')
Target:
[ID 70] aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanis ...
Candidates:
[ID 23] ferguson urges henry punishment sir alex ferguson has called on the football association to punish a ...
[ID 51] mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and ...
[ID 73] balco case trial date pushed back the trial date for the bay area laboratory cooperative (balco) ste ...
[ID 41] mcleish ready for criticism rangers manager alex mcleish accepts he is going to be criticised after ...
[ID 42] premier league planning cole date the premier league is attempting to find a mutually convenient dat ...
Step 2 - Classify

Two articles may be similar but they may not necessarily belong to the same category. For example, an article about a sports team manager facing a fine may be similar to another about a business entity facing a fine, but they are not of the same category.

Perhaps we can make the system better by only recommending articles of the same category. For this, let’s build a news category classifier.
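
The filtering idea itself is independent of any API and can be sketched with plain lists. The block below assumes each article already has a predicted label (the `labels` list and the `filter_same_category` helper are hypothetical names for this illustration); the article's own implementation later classifies candidates inside the loop instead.

```python
def filter_same_category(ranked, labels, reading_idx, top_n):
    """Keep the top_n most similar articles that share the target's label.

    ranked: list of (index, score) pairs, most similar first
    labels: predicted category for each article, indexed by article ID
    """
    target_label = labels[reading_idx]
    picks = []
    for idx, _score in ranked:
        if len(picks) >= top_n:
            break
        # Skip the target itself and anything from another category
        if idx != reading_idx and labels[idx] == target_label:
            picks.append(idx)
    return picks

# Hypothetical similarity ranking and labels for five articles
ranked = [(0, 1.0), (3, 0.8), (1, 0.7), (2, 0.6), (4, 0.5)]
labels = ['sport', 'business', 'sport', 'sport', 'sport']
print(filter_same_category(ranked, labels, reading_idx=0, top_n=2))
```

Here article 1 is skipped despite its high similarity because its label differs from the target's.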

2.1: Build a classifier

We use Cohere’s Classify endpoint to build a news category classifier, classifying articles into five categories: Business, Politics, Tech, Entertainment, and Sport.

A typical text classification model requires hundreds or thousands of data points to train, but with this endpoint, we can build a classifier with as few as five examples per class.

To build the classifier, we need a set of examples consisting of text (news text) and labels (news category). The BBC News dataset happens to have both (columns ‘Text’ and ‘Category’), so this time we’ll use the categories for building our examples. For this, we will set aside another portion of the dataset.

PYTHON
EX_START = 100
EX_END = 200
df_examples = df.iloc[EX_START:EX_END]
df_examples = df_examples.copy()

df_examples.drop(['ArticleId'],axis=1,inplace=True)

df_examples.head()
     Text                                              Category
100  honda wins china copyright ruling japan s hond…   business
101  ukip could sue veritas defectors the uk indepe…   politics
102  security warning over fbi virus the us feder…     tech
103  europe backs digital tv lifestyle how people r…   tech
104  celebrities get to stay in jungle all four con…   entertainment

With the Classify endpoint, there is a limit of 512 tokens per input. This means full articles won’t be able to fit in the examples, so we will approximate and limit each article to its first 300 characters.

PYTHON
MAX_CHARS = 300

def shorten_text(text):
    return text[:MAX_CHARS]

df_examples['Text'] = df_examples['Text'].apply(shorten_text)
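
A fixed character cut can end in the middle of a word. As an optional variation (not what this article uses), the sketch below cuts at the last space that fits within the limit. The `shorten_at_word` helper is a hypothetical name, and the character limit remains only a rough proxy for the token limit.

```python
def shorten_at_word(text, max_chars=300):
    """Truncate text to at most max_chars, ending on a word boundary if possible."""
    if len(text) <= max_chars:
        return text
    # Look one character past the limit so a word that ends exactly at the limit is kept
    window = text[:max_chars + 1]
    last_space = window.rfind(' ')
    # Fall back to a hard cut when there is no space to break on
    return window[:last_space] if last_space > 0 else text[:max_chars]

print(shorten_at_word('hello world again', max_chars=13))
```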

The Classify endpoint needs a minimum of 2 examples for each category. We’ll have 5 examples each, sampled randomly from the dataset. We have 5 categories, so we will have a total of 25 examples.

PYTHON
EX_PER_CAT = 5

categories = df_examples['Category'].unique().tolist()

ex_texts = []
ex_labels = []
for category in categories:
    df_category = df_examples[df_examples['Category'] == category]
    samples = df_category.sample(n=EX_PER_CAT, random_state=42)
    ex_texts += samples['Text'].tolist()
    ex_labels += samples['Category'].tolist()

print(f'Number of examples per category: {EX_PER_CAT}')
print(f'List of categories: {categories}')
print(f'Number of categories: {len(categories)}')
print(f'Total number of examples: {len(ex_texts)}')
Number of examples per category: 5
List of categories: ['business', 'politics', 'tech', 'entertainment', 'sport']
Number of categories: 5
Total number of examples: 25
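
The same per-category sampling can be sketched without pandas, which makes the logic easier to see. This is an illustrative equivalent under assumed names (`sample_per_category` and the toy data are hypothetical), and it will not reproduce the exact rows that `DataFrame.sample` picks for the same seed.

```python
import random
from collections import defaultdict

def sample_per_category(texts, labels, k, seed=42):
    """Pick k random texts from each category; returns parallel text/label lists."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    ex_texts, ex_labels = [], []
    for label, items in by_label.items():
        if len(items) < k:
            raise ValueError(f'category {label!r} has {len(items)} texts, need {k}')
        ex_texts += rng.sample(items, k)
        ex_labels += [label] * k
    return ex_texts, ex_labels

# Toy data: 12 placeholder articles spread evenly over 3 categories
toy_texts = [f'article {i}' for i in range(12)]
toy_labels = ['sport', 'tech', 'business'] * 4
picked_texts, picked_labels = sample_per_category(toy_texts, toy_labels, k=2)
print(len(picked_texts), picked_labels)
```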

Once the examples are ready, we can now get the classifications. Here is a function that returns the classification given an input.

PYTHON
from cohere import ClassifyExample

examples = []
for txt, lbl in zip(ex_texts,ex_labels):
    examples.append(ClassifyExample(text=txt, label=lbl))

def classify_text(texts, examples):
    classifications = co.classify(
        inputs=texts,
        examples=examples
    )

    return [c.prediction for c in classifications.classifications]

2.2: Measure its performance

Before actually using the classifier, let’s first test its performance. Here we take another 100 data points as the test dataset and have the classifier predict the class (i.e., the news category) of each one.

PYTHON
TEST_START = 200
TEST_END = 300
df_test = df.iloc[TEST_START:TEST_END]
df_test = df_test.copy()

df_test.drop(['ArticleId'],axis=1,inplace=True)

df_test['Text'] = df_test['Text'].apply(shorten_text)

df_test.head()
     Text                                              Category
200  sa return to mauritius top seeds south africa …   sport
201  snow patrol feted at irish awards snow patrol …   entertainment
202  clyde 0-5 celtic celtic brushed aside clyde to…   sport
203  bad weather hits nestle sales a combination of…   business
204  net fingerprints combat attacks eighty large n…   tech
PYTHON
predictions = []
BATCH_SIZE = 90 # The API accepts a maximum of 96 inputs
for i in range(0, len(df_test['Text']), BATCH_SIZE):
    batch_texts = df_test['Text'][i:i+BATCH_SIZE].tolist()
    predictions.extend(classify_text(batch_texts, examples))

actual = df_test['Category'].tolist()
PYTHON
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(actual, predictions)
print(f'Accuracy: {accuracy*100}')
Accuracy: 89.0

We get a good accuracy score of 89%, so the classifier is ready to be implemented in our recommender system.
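
A single accuracy number can hide a category that the classifier handles poorly. As an optional check, the sketch below breaks accuracy down per category. The `per_category_accuracy` helper and the toy label lists are hypothetical and are not results from this article's classifier.

```python
from collections import Counter

def per_category_accuracy(actual, predicted):
    """Fraction of correctly predicted items for each actual category."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}

# Made-up labels for illustration only
toy_actual = ['sport', 'sport', 'tech', 'tech', 'business']
toy_predicted = ['sport', 'tech', 'tech', 'tech', 'business']
print(per_category_accuracy(toy_actual, toy_predicted))
```

With the article's variables, the call would be `per_category_accuracy(actual, predictions)`.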

Step 3 - Extract

We now proceed to the tags extraction step. Compared to the previous two steps, this step is not about sorting or filtering articles, but rather enriching them with more information.

We do this with the Chat endpoint.

We call the endpoint by specifying a few settings, and it will generate the corresponding extractions.

PYTHON
def extract_tags(article):
    prompt = f"""Given an article, extract a list of tags containing keywords of that article.

Article: japanese banking battle at an end japan s sumitomo mitsui \
financial has withdrawn its takeover offer for rival bank ufj holdings enabling the \
latter to merge with mitsubishi tokyo. sumitomo bosses told counterparts at ufj of its \
decision on friday clearing the way for it to conclude a 3 trillion

Tags: sumitomo mitsui financial, ufj holdings, mitsubishi tokyo, japanese banking

Article:france starts digital terrestrial france has become the last big european country to \
launch a digital terrestrial tv (dtt) service. initially more than a third of the \
population will be able to receive 14 free-to-air channels. despite the long wait for a \
french dtt roll-out the new platform s bac

Tags: france, digital terrestrial

Article: apple laptop is greatest gadget the apple powerbook 100 has been chosen as the greatest \
gadget of all time by us magazine mobile pc. the 1991 laptop was chosen because it was \
one of the first lightweight portable computers and helped define the layout of all future \
notebook pcs. the magazine h

Tags: apple, apple powerbook 100, laptop


Article:{article}

Tags:"""


    response = co.chat(
        model='command-r',
        message=prompt,
        preamble="")

    return response.text
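
`extract_tags` returns the raw generated text. If the tags are needed as a list (for example, to link articles that share a tag), a small parser can help. The sketch below assumes the model follows the comma-separated format shown in the prompt, which is not guaranteed; `parse_tags` is a hypothetical helper, not part of the Cohere SDK.

```python
def parse_tags(raw):
    """Split a comma-separated tag string into a clean, de-duplicated list."""
    seen = set()
    tags = []
    for part in raw.split(','):
        tag = part.strip().lower()
        # Skip empty fragments and repeats, keeping first-seen order
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags

print(parse_tags(' football, Sir Alex Ferguson, football,\n'))
```
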
Complete all steps

Let’s now put everything together for our article recommender system.

First, we select the target article and compute the similarity scores against the candidate articles.

PYTHON
print(f'Choose one article ID between {INP_START} and {INP_END-1} below...')
Choose one article ID between 0 and 99 below...
PYTHON
READING_IDX = 70

reading = embeds[READING_IDX]

similarity = get_similarity(reading,embeds)

Next, we filter the articles via classification. Finally, we extract the keywords from each article and show the recommendations.

PYTHON
SHOW_TOP = 5

df_inputs = df_inputs.copy()
df_inputs['Text'] = df_inputs['Text'].apply(shorten_text)

def get_recommendations(reading_idx,similarity,show_top):

    # Show the current article (use the function argument, not the global READING_IDX)
    print('------ You are reading... ------')
    print(f'[ID {reading_idx}] Article:',df_inputs['Text'][reading_idx][:MAX_CHARS]+'...\n')

    # Show the recommended articles
    print('------ You might also like... ------')

    # Classify the target article
    target_class = classify_text([df_inputs['Text'][reading_idx]],examples)
    print(target_class)

    count = 0
    for idx,score in similarity:

        # Classify each candidate article
        candidate_class = classify_text([df_inputs['Text'][idx]],examples)

        # Show recommendations
        if target_class == candidate_class and idx != reading_idx:
            selection = df_inputs['Text'][idx][:MAX_CHARS]
            print(f'[ID {idx}] Article:',selection+'...')

            # Extract and show tags
            tags = extract_tags(selection)
            if tags:
                print(f'Tags: {tags.strip()}\n')
            else:
                print('Tags: none\n')

            # Increment the article count
            count += 1

            # Stop once articles reach the SHOW_TOP number
            if count == show_top:
                break
PYTHON
get_recommendations(READING_IDX,similarity,SHOW_TOP)
------ You are reading... ------
[ID 70] Article: aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanish football federation for his comments about thierry henry. the 66-year-old criticised his 3000 euros (£2 060) punishment even though it was far below the maximum penalty. i am not guilty nor do i ...
------ You might also like... ------
[ID 23] Article: ferguson urges henry punishment sir alex ferguson has called on the football association to punish arsenal s thierry henry for an incident involving gabriel heinze. ferguson believes henry deliberately caught heinze on the head with his knee during united s controversial win. the united boss said i...
Tags: football, sir alex ferguson, thierry henry, arsenal, manchester united
[ID 51] Article: mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and arsene wenger would swap places with him. mourinho s side were knocked out of the fa cup by newcastle last sunday before seeing barcelona secure a 2-1 champions league first-leg lead in the nou camp....
Tags: chelsea, jose mourinho, sir alex ferguson, arsene wenger, fa cup, newcastle, barcelona, champions league
[ID 41] Article: mcleish ready for criticism rangers manager alex mcleish accepts he is going to be criticised after their disastrous uefa cup exit at the hands of auxerre at ibrox on wednesday. mcleish told bbc radio five live: we were in pole position to get through to the next stage but we blew it we absolutel...
Tags: rangers, alex mcleish, auxerre, uefa cup, ibrox
[ID 42] Article: premier league planning cole date the premier league is attempting to find a mutually convenient date to investigate allegations chelsea made an illegal approach for ashley cole. both chelsea and arsenal will be asked to give evidence to a premier league commission but no deadline has been put on ...
Tags: premier league, chelsea, arsenal, ashley cole
[ID 14] Article: ireland 21-19 argentina an injury-time dropped goal by ronan o gara stole victory for ireland from underneath the noses of argentina at lansdowne road on saturday. o gara kicked all of ireland s points with two dropped goals and five penalties to give the home side a 100% record in their autumn i...
Tags: rugby, ireland, argentina, ronan o gara

Keeping to the example from Section 1.3, we can see how the classification and extraction steps have improved the recommendations.

First, the article with ID 73 (non-sport) is no longer recommended. Second, each recommended article now comes with a list of generated tags.

Let’s try a couple of other articles in business and tech and see the output…

Business article (returning recommendations around German economy and economic growth/slump):

PYTHON
READING_IDX = 1

reading = embeds[READING_IDX]

similarity = get_similarity(reading,embeds)

get_recommendations(READING_IDX,similarity,SHOW_TOP)
------ You are reading... ------
[ID 1] Article: german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy. munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january its first decline in three months. the stu...
------ You might also like... ------
[ID 56] Article: borussia dortmund near bust german football club and former european champion borussia dortmund has warned it will go bankrupt if rescue talks with creditors fail. the company s shares tumbled after it said it has entered a life-threatening profitability and financial situation . borussia dortmund...
Tags: borussia dortmund, german football, bankruptcy
[ID 2] Article: bbc poll indicates economic gloom citizens in a majority of nations surveyed in a bbc world service poll believe the world economy is worsening. most respondents also said their national economy was getting worse. but when asked about their own family s financial outlook a majority in 14 countries...
Tags: bbc, economy, financial outlook
[ID 8] Article: car giant hit by mercedes slump a slump in profitability at luxury car maker mercedes has prompted a big drop in profits at parent daimlerchrysler. the german-us carmaker saw fourth quarter operating profits fall to 785m euros ($1bn) from 2.4bn euros in 2003. mercedes-benz s woes - its profits slid...
Tags: daimlerchrysler, mercedes, luxury car, profitability
[ID 32] Article: china continues rapid growth china s economy has expanded by a breakneck 9.5% during 2004 faster than predicted and well above 2003 s 9.1%. the news may mean more limits on investment and lending as beijing tries to take the economy off the boil. china has sucked in raw materials and energy to fee...
Tags: china, economy, beijing
[ID 96] Article: bmw to recall faulty diesel cars bmw is to recall all cars equipped with a faulty diesel fuel-injection pump supplied by parts maker robert bosch. the faulty part does not represent a safety risk and the recall only affects pumps made in december and january. bmw said that it was too early to say h...
Tags: bmw, diesel cars, robert bosch, fuel injection pump

Tech article (returning recommendations around consumer devices):

PYTHON
READING_IDX = 71

reading = embeds[READING_IDX]

similarity = get_similarity(reading,embeds)

get_recommendations(READING_IDX,similarity,SHOW_TOP)
Output
------ You are reading... ------
[ID 71] Article: camera phones are must-haves four times more mobiles with cameras in them will be sold in europe by the end of 2004 than last year says a report from analysts gartner. globally the number sold will reach 159 million an increase of 104%. the report predicts that nearly 70% of all mobile phones ...
------ You might also like... ------
[ID 3] Article: lifestyle governs mobile choice faster better or funkier hardware alone is not going to help phone firms sell more handsets research suggests. instead phone firms keen to get more out of their customers should not just be pushing the technology for its own sake. consumers are far more interest...
Tags: mobile, lifestyle, phone firms, handsets
[ID 69] Article: gates opens biggest gadget fair bill gates has opened the consumer electronics show (ces) in las vegas saying that gadgets are working together more to help people manage multimedia content around the home and on the move. mr gates made no announcement about the next generation xbox games console ...
Tags: bill gates, consumer electronics show, gadgets, xbox
[ID 46] Article: china ripe for media explosion asia is set to drive global media growth to 2008 and beyond with china and india filling the two top spots analysts have predicted. japan south korea and singapore will also be strong players but china s demographics give it the edge a media conference in londo...
Tags: china, india, japan, south korea, singapore, global media growth
[ID 19] Article: moving mobile improves golf swing a mobile phone that recognises and responds to movements has been launched in japan. the motion-sensitive phone - officially titled the v603sh - was developed by sharp and launched by vodafone s japanese division. devised mainly for mobile gaming users can also ac...
Tags: mobile phone, japan, sharp, vodafone, golf swing
[ID 63] Article: what high-definition will do to dvds first it was the humble home video then it was the dvd and now hollywood is preparing for the next revolution in home entertainment - high-definition. high-definition gives incredible 3d-like pictures and surround sound. the dvd disks and the gear to play the...
Tags: high-definition, dvd, hollywood, home entertainment

In conclusion, this demonstrates an example of how we can stack multiple NLP endpoints together to get an output much closer to our desired outcome.

In practice, hosting and maintaining multiple models can quickly become complex. By leveraging Cohere endpoints, each of these tasks is reduced to a simple API call.
