Article Recommender with Text Embedding, Classification, and Extraction

This is a simple demonstration of how we can stack multiple NLP models together to get an output much closer to our desired outcome.

Embeddings can capture the meaning of a piece of text beyond keyword-matching. In this article, we will build a simple news article recommender system that computes the embeddings of all available articles and recommends the most relevant ones based on embedding similarity.

We will also make the recommendation tighter by using text classification to recommend only articles within the same category. We will then extract a list of tags from each recommended article, which can further help readers discover new articles.

All this will be done via three Cohere API endpoints stacked together: Embed, Classify, and Chat.

Article recommender with Embed, Classify, and Chat

We will implement the following steps:

1: Find the articles most similar to the one currently being read, using embeddings.

2: Keep only articles of the same category using text classification.

3: Extract tags from these articles.

4: Show the top 5 recommended articles.

! pip install cohere
import numpy as np
import pandas as pd
import re
import cohere

co = cohere.Client("COHERE_API_KEY") # Get your API key: https://dashboard.cohere.com/api-keys

Step 1 - Embed

1.1: Get articles

Throughout this article, we'll use the BBC news article dataset as an example [Source]. This dataset consists of articles from a few categories: business, politics, tech, entertainment, and sport.

We'll work with a subset of the data and, in Step 1, use its first 100 data points.

df = pd.read_csv('https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/bbc_news_subset.csv', delimiter=',')

INP_START = 0
INP_END = 100
df_inputs = df.iloc[INP_START:INP_END]
df_inputs = df_inputs.copy()

df_inputs.drop(['ArticleId','Category'],axis=1,inplace=True)

df_inputs.head()
Text
0 worldcom ex-boss launches defence lawyers defe...
1 german business confidence slides german busin...
2 bbc poll indicates economic gloom citizens in ...
3 lifestyle governs mobile choice faster bett...
4 enron bosses in $168m payout eighteen former e...

1.2: Turn articles into embeddings

Next we turn each article text into embeddings. An embedding is a list of numbers that our models use to represent a piece of text, capturing its context and meaning.

We do this by calling Cohere's Embed endpoint, which takes in text as input and returns embeddings as output.

articles = df_inputs['Text'].tolist()

output = co.embed(
            model ='embed-english-v3.0',
            input_type='search_document',
            texts = articles)
embeds = output.embeddings

print('Number of articles:', len(embeds))
Number of articles: 100

1.3: Pick one article and find the most similar articles

Next, we pick any one article to be the one the reader is currently reading (let's call this the target) and find other articles with the most similar embeddings (let's call these candidates) using cosine similarity.

Cosine similarity is a metric that measures how similar sequences of numbers are (embeddings in our case), and we compute it for each target-candidate pair.
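To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The function name `cosine` and the toy three-dimensional vectors are illustrative assumptions. Real embeddings have many more dimensions, and the rest of this article relies on scikit-learn's implementation rather than this one.

```python
import math

def cosine(a, b):
    # Dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean length (norm) of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumed for illustration only)
target = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # a scaled copy of the target
orthogonal = [3.0, 0.0, -1.0]     # its dot product with the target is 0

print(cosine(target, same_direction))  # very close to 1: same direction
print(cosine(target, orthogonal))      # 0: unrelated directions
```

Because the score depends only on direction and not on vector length, a scaled copy of the target scores as highly as the target itself.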

print(f'Choose one article ID between {INP_START} and {INP_END-1} below...')
Choose one article ID between 0 and 99 below...
READING_IDX = 70

reading = embeds[READING_IDX]

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target,candidates):
  # Turn list into array
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)

  # Calculate cosine similarity
  similarity_scores = cosine_similarity(target,candidates)
  similarity_scores = np.squeeze(similarity_scores).tolist()

  # Sort by descending order in similarity
  similarity_scores = list(enumerate(similarity_scores))
  similarity_scores = sorted(similarity_scores, key=lambda x:x[1], reverse=True)

  # Return similarity scores
  return similarity_scores
similarity = get_similarity(reading,embeds)

print('Target:')
print(f'[ID {READING_IDX}]',df_inputs['Text'][READING_IDX][:100],'...','\n')

print('Candidates:')
for i in similarity[1:6]: # Exclude the target article
  print(f'[ID {i[0]}]',df_inputs['Text'][i[0]][:100],'...')
Target:
[ID 70] aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanis ... 

Candidates:
[ID 23] ferguson urges henry punishment sir alex ferguson has called on the football association to punish a ...
[ID 51] mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and  ...
[ID 73] balco case trial date pushed back the trial date for the bay area laboratory cooperative (balco) ste ...
[ID 41] mcleish ready for criticism rangers manager alex mcleish accepts he is going to be criticised after  ...
[ID 42] premier league planning cole date the premier league is attempting to find a mutually convenient dat ...
Step 2 - Classify

Two articles may be similar but they may not necessarily belong to the same category. For example, an article about a sports team manager facing a fine may be similar to another about a business entity facing a fine, but they are not of the same category.

Perhaps we can make the system better by only recommending articles of the same category. For this, let's build a news category classifier.

2.1: Build a classifier

We use Cohere's Classify endpoint to build a news category classifier, classifying articles into five categories: Business, Politics, Tech, Entertainment, and Sport.

A typical text classification model requires hundreds or thousands of data points to train, but with this endpoint, we can build a classifier with as few as five examples per class.

To build the classifier, we need a set of examples consisting of text (news text) and labels (news category). The BBC News dataset happens to have both (columns 'Text' and 'Category'), so this time we’ll use the categories for building our examples. For this, we will set aside another portion of the dataset.

EX_START = 100
EX_END = 200
df_examples = df.iloc[EX_START:EX_END]
df_examples = df_examples.copy()

df_examples.drop(['ArticleId'],axis=1,inplace=True)

df_examples.head()
Text Category
100 honda wins china copyright ruling japan s hond... business
101 ukip could sue veritas defectors the uk indepe... politics
102 security warning over fbi virus the us feder... tech
103 europe backs digital tv lifestyle how people r... tech
104 celebrities get to stay in jungle all four con... entertainment

With the Classify endpoint, there is a limit of 512 tokens per input. This means full articles won't be able to fit in the examples, so we will approximate and limit each article to its first 300 characters.
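A fixed character cut can end in the middle of a word. As a hedged alternative, the sketch below backs up to the last space before the limit. The helper name `shorten_at_word` is hypothetical and is not what this article uses; it is only meant to show one way the approximation could be refined.

```python
def shorten_at_word(text, max_chars=300):
    # Hypothetical variant of a plain character cut: avoid ending mid-word
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut landed inside a word, drop the trailing partial word
    if not text[max_chars].isspace() and ' ' in cut:
        cut = cut.rsplit(' ', 1)[0]
    return cut.rstrip()

sample = "german business confidence slides"
print(shorten_at_word(sample, max_chars=20))  # prints "german business"
```

Note that character counts are only a rough proxy for tokens, so either approach remains an approximation of the token limit.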

MAX_CHARS = 300

def shorten_text(text):
  return text[:MAX_CHARS]

df_examples['Text'] = df_examples['Text'].apply(shorten_text)

The Classify endpoint needs a minimum of 2 examples for each category. We'll have 5 examples each, sampled randomly from the dataset. We have 5 categories, so we will have a total of 25 examples.
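Before sending examples to the endpoint, it can help to confirm that every category meets the minimum. The sketch below is one way to do that with the standard library; the names `check_examples` and `MIN_PER_LABEL` and the toy label list are assumptions made for illustration.

```python
from collections import Counter

MIN_PER_LABEL = 2  # the minimum per category stated above

def check_examples(labels, min_per_label=MIN_PER_LABEL):
    # Count how many examples each label has
    counts = Counter(labels)
    # Collect the labels that fall short of the minimum
    short = {lbl: n for lbl, n in counts.items() if n < min_per_label}
    return counts, short

# Toy labels (assumed): 'tech' deliberately has too few examples
toy_labels = ['business'] * 5 + ['sport'] * 5 + ['tech'] * 1
counts, short = check_examples(toy_labels)
print(counts['business'], counts['sport'], counts['tech'])  # 5 5 1
print(short)  # {'tech': 1}
```

An empty `short` dictionary means every category has enough examples.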

EX_PER_CAT = 5 

categories = df_examples['Category'].unique().tolist()

ex_texts = []
ex_labels = []
for category in categories:
  df_category = df_examples[df_examples['Category'] == category]
  samples = df_category.sample(n=EX_PER_CAT, random_state=42)
  ex_texts += samples['Text'].tolist()
  ex_labels += samples['Category'].tolist()

print(f'Number of examples per category: {EX_PER_CAT}')
print(f'List of categories: {categories}')
print(f'Number of categories: {len(categories)}')
print(f'Total number of examples: {len(ex_texts)}')
Number of examples per category: 5
List of categories: ['business', 'politics', 'tech', 'entertainment', 'sport']
Number of categories: 5
Total number of examples: 25

Once the examples are ready, we can now get the classifications. Here is a function that returns the classification given an input.


from cohere import ClassifyExample

examples = []
for txt, lbl in zip(ex_texts,ex_labels):
  examples.append(ClassifyExample(text=txt, label=lbl))

def classify_text(texts, examples):
    classifications = co.classify(
        inputs=texts,
        examples=examples
    )

    return [c.prediction for c in classifications.classifications]

2.2: Measure its performance

Before actually using the classifier, let's first test its performance. Here we take another 100 data points as the test dataset, and the classifier predicts the class (i.e., the news category) of each one.

TEST_START = 200
TEST_END = 300
df_test = df.iloc[TEST_START:TEST_END]
df_test = df_test.copy()

df_test.drop(['ArticleId'],axis=1,inplace=True)

df_test['Text'] = df_test['Text'].apply(shorten_text)

df_test.head()
Text Category
200 sa return to mauritius top seeds south africa ... sport
201 snow patrol feted at irish awards snow patrol ... entertainment
202 clyde 0-5 celtic celtic brushed aside clyde to... sport
203 bad weather hits nestle sales a combination of... business
204 net fingerprints combat attacks eighty large n... tech
predictions = []
BATCH_SIZE = 90 # The API accepts a maximum of 96 inputs
for i in range(0, len(df_test['Text']), BATCH_SIZE):
    batch_texts = df_test['Text'][i:i+BATCH_SIZE].tolist()
    predictions.extend(classify_text(batch_texts, examples))    
    
actual = df_test['Category'].tolist()
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(actual, predictions)
print(f'Accuracy: {accuracy*100}')
Accuracy: 89.0

We get a good accuracy score of 89%, so the classifier is ready to be implemented in our recommender system.
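A single overall score can hide a category that the classifier handles poorly. As a rough sketch, a per-category breakdown can be computed from two parallel lists of actual and predicted labels. The function name `per_category_accuracy` and the toy lists below are assumptions for illustration, not results from the dataset.

```python
from collections import defaultdict

def per_category_accuracy(actual, predicted):
    correct = defaultdict(int)
    total = defaultdict(int)
    for a, p in zip(actual, predicted):
        total[a] += 1          # count every article of this true category
        if a == p:
            correct[a] += 1    # count the ones predicted correctly
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy labels (assumed): one 'tech' article is misclassified as 'business'
toy_actual = ['sport', 'sport', 'tech', 'tech', 'business']
toy_predicted = ['sport', 'sport', 'tech', 'business', 'business']
print(per_category_accuracy(toy_actual, toy_predicted))
# {'sport': 1.0, 'tech': 0.5, 'business': 1.0}
```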

Step 3 - Extract

We now proceed to the tag extraction step. Unlike the previous two steps, this step is not about sorting or filtering articles, but rather about enriching them with more information.

We do this with the Chat endpoint.

We call the endpoint by specifying a few settings, and it will generate the corresponding extractions.

def extract_tags(article):
  prompt = f"""Given an article, extract a list of tags containing keywords of that article.

Article: japanese banking battle at an end japan s sumitomo mitsui \
financial has withdrawn its takeover offer for rival bank ufj holdings enabling the \
latter to merge with mitsubishi tokyo.  sumitomo bosses told counterparts at ufj of its \
decision on friday  clearing the way for it to conclude a 3 trillion

Tags: sumitomo mitsui financial, ufj holdings, mitsubishi tokyo, japanese banking

Article:france starts digital terrestrial france has become the last big european country to \
launch a digital terrestrial tv (dtt) service.  initially  more than a third of the \
population will be able to receive 14 free-to-air channels. despite the long wait for a \
french dtt roll-out  the new platform s bac

Tags: france, digital terrestrial

Article: apple laptop is  greatest gadget  the apple powerbook 100 has been chosen as the greatest \
gadget of all time  by us magazine mobile pc.  the 1991 laptop was chosen because it was \
one of the first  lightweight  portable computers and helped define the layout of all future \
notebook pcs. the magazine h

Tags: apple, apple powerbook 100, laptop


Article:{article}

Tags:"""
  
  
  response = co.chat(
    model='command-r',
    message=prompt,
    preamble="")

  return response.text
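
The function above returns the tags as one comma-separated string, following the format of the examples in the prompt. If a list is more convenient downstream, a small parser along these lines could be used. The helper name `parse_tags` is hypothetical, and the sketch assumes the model keeps to the comma-separated format, which is not guaranteed.

```python
def parse_tags(raw):
    tags = []
    for part in raw.split(','):
        # Normalize whitespace and case
        tag = part.strip().lower()
        # Skip empty pieces and repeated tags, keeping the original order
        if tag and tag not in tags:
            tags.append(tag)
    return tags

print(parse_tags(" apple, apple powerbook 100, laptop, apple "))
# ['apple', 'apple powerbook 100', 'laptop']
```
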
Complete all steps

Let's now put everything together for our article recommender system.

First, we select the target article and compute the similarity scores against the candidate articles.

print(f'Choose one article ID between {INP_START} and {INP_END-1} below...')
Choose one article ID between 0 and 99 below...
READING_IDX = 70

reading = embeds[READING_IDX]

similarity = get_similarity(reading,embeds)

Next, we filter the articles via classification. Finally, we extract the keywords from each article and show the recommendations.
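The filtering logic itself is independent of the API. As a minimal sketch, suppose the predicted category of every article were already available in a list (for example, from classifying all articles in batches up front). The ranked candidates could then be filtered locally. The function name `filter_by_category` and the toy data are assumptions for illustration.

```python
def filter_by_category(similarity, labels, reading_idx, show_top):
    # similarity: (index, score) pairs sorted by descending score
    # labels: predicted category per article, indexed like the articles
    target_label = labels[reading_idx]
    picks = []
    for idx, score in similarity:
        # Stop once enough recommendations have been collected
        if len(picks) >= show_top:
            break
        # Keep candidates in the target's category, excluding the target
        if idx != reading_idx and labels[idx] == target_label:
            picks.append((idx, score))
    return picks

# Toy data (assumed): article 0 is the target and is a 'sport' article
toy_similarity = [(0, 1.0), (2, 0.9), (1, 0.8), (3, 0.7), (4, 0.6)]
toy_labels = ['sport', 'business', 'sport', 'sport', 'tech']
print(filter_by_category(toy_similarity, toy_labels, reading_idx=0, show_top=2))
# [(2, 0.9), (3, 0.7)]
```

Classifying once and filtering locally trades a larger up-front request for fewer API calls per recommendation.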

SHOW_TOP = 5

df_inputs = df_inputs.copy()
df_inputs['Text'] = df_inputs['Text'].apply(shorten_text)

def get_recommendations(reading_idx,similarity,show_top):

  # Show the current article
  print('------  You are reading...  ------')
  print(f'[ID {reading_idx}] Article:',df_inputs['Text'][reading_idx][:MAX_CHARS]+'...\n')

  # Show the recommended articles
  print('------  You might also like...  ------')

  # Classify the target article (returns a one-element list of predictions)
  target_class = classify_text([df_inputs['Text'][reading_idx]],examples)

  count = 0
  for idx,score in similarity:

    # Classify each candidate article
    candidate_class = classify_text([df_inputs['Text'][idx]],examples)
    
    # Show recommendations
    if target_class == candidate_class and idx != reading_idx:
      selection = df_inputs['Text'][idx][:MAX_CHARS]
      print(f'[ID {idx}] Article:',selection+'...')

      # Extract and show tags
      tags = extract_tags(selection)
      if tags:
          print(f'Tags: {tags.strip()}\n')
      else:
          print(f'Tags: none\n')      

      # Increment the article count
      count += 1

      # Stop once articles reach the SHOW_TOP number
      if count == show_top:
        break
get_recommendations(READING_IDX,similarity,SHOW_TOP)
------  You are reading...  ------
[ID 70] Article: aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanish football federation for his comments about thierry henry.  the 66-year-old criticised his 3000 euros (£2 060) punishment even though it was far below the maximum penalty.  i am not guilty  nor do i ...

------  You might also like...  ------
[ID 23] Article: ferguson urges henry punishment sir alex ferguson has called on the football association to punish arsenal s thierry henry for an incident involving gabriel heinze.  ferguson believes henry deliberately caught heinze on the head with his knee during united s controversial win. the united boss said i...
Tags: football, sir alex ferguson, thierry henry, arsenal, manchester united

[ID 51] Article: mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and arsene wenger would swap places with him.  mourinho s side were knocked out of the fa cup by newcastle last sunday before seeing barcelona secure a 2-1 champions league first-leg lead in the nou camp....
Tags: chelsea, jose mourinho, sir alex ferguson, arsene wenger, fa cup, newcastle, barcelona, champions league

[ID 41] Article: mcleish ready for criticism rangers manager alex mcleish accepts he is going to be criticised after their disastrous uefa cup exit at the hands of auxerre at ibrox on wednesday.  mcleish told bbc radio five live:  we were in pole position to get through to the next stage but we blew it  we absolutel...
Tags: rangers, alex mcleish, auxerre, uefa cup, ibrox

[ID 42] Article: premier league planning cole date the premier league is attempting to find a mutually convenient date to investigate allegations chelsea made an illegal approach for ashley cole.  both chelsea and arsenal will be asked to give evidence to a premier league commission  but no deadline has been put on ...
Tags: premier league, chelsea, arsenal, ashley cole

[ID 14] Article: ireland 21-19 argentina an injury-time dropped goal by ronan o gara stole victory for ireland from underneath the noses of argentina at lansdowne road on saturday.  o gara kicked all of ireland s points  with two dropped goals and five penalties  to give the home side a 100% record in their autumn i...
Tags: rugby, ireland, argentina, ronan o gara

Keeping to the Section 1.3 example, here we see how the classification and extraction steps have improved our recommendation outcome.

First, the article with ID 73 (not a sport article) is no longer recommended. Second, each recommended article now comes with a list of generated tags.

Let's try a couple of other articles in business and tech and see the output...

Business article (returning recommendations around the German economy and economic growth/slump):


READING_IDX = 1

reading = embeds[READING_IDX]

similarity = get_similarity(reading,embeds)

get_recommendations(READING_IDX,similarity,SHOW_TOP)
------  You are reading...  ------
[ID 1] Article: german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy.  munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january  its first decline in three months. the stu...

------  You might also like...  ------
[ID 56] Article: borussia dortmund near bust german football club and former european champion borussia dortmund has warned it will go bankrupt if rescue talks with creditors fail.  the company s shares tumbled after it said it has  entered a life-threatening profitability and financial situation . borussia dortmund...
Tags: borussia dortmund, german football, bankruptcy

[ID 2] Article: bbc poll indicates economic gloom citizens in a majority of nations surveyed in a bbc world service poll believe the world economy is worsening.  most respondents also said their national economy was getting worse. but when asked about their own family s financial outlook  a majority in 14 countries...
Tags: bbc, economy, financial outlook

[ID 8] Article: car giant hit by mercedes slump a slump in profitability at luxury car maker mercedes has prompted a big drop in profits at parent daimlerchrysler.  the german-us carmaker saw fourth quarter operating profits fall to 785m euros ($1bn) from 2.4bn euros in 2003. mercedes-benz s woes - its profits slid...
Tags: daimlerchrysler, mercedes, luxury car, profitability

[ID 32] Article: china continues rapid growth china s economy has expanded by a breakneck 9.5% during 2004  faster than predicted and well above 2003 s 9.1%.  the news may mean more limits on investment and lending as beijing tries to take the economy off the boil. china has sucked in raw materials and energy to fee...
Tags: china, economy, beijing

[ID 96] Article: bmw to recall faulty diesel cars bmw is to recall all cars equipped with a faulty diesel fuel-injection pump supplied by parts maker robert bosch.  the faulty part does not represent a safety risk and the recall only affects pumps made in december and january. bmw said that it was too early to say h...
Tags: bmw, diesel cars, robert bosch, fuel injection pump

Tech article (returning recommendations around consumer devices):


READING_IDX = 71

reading = embeds[READING_IDX]

similarity = get_similarity(reading,embeds)

get_recommendations(READING_IDX,similarity,SHOW_TOP)
------  You are reading...  ------
[ID 71] Article: camera phones are  must-haves  four times more mobiles with cameras in them will be sold in europe by the end of 2004 than last year  says a report from analysts gartner.  globally  the number sold will reach 159 million  an increase of 104%. the report predicts that nearly 70% of all mobile phones ...

------  You might also like...  ------
[ID 3] Article: lifestyle  governs mobile choice  faster  better or funkier hardware alone is not going to help phone firms sell more handsets  research suggests.  instead  phone firms keen to get more out of their customers should not just be pushing the technology for its own sake. consumers are far more interest...
Tags: mobile, lifestyle, phone firms, handsets

[ID 69] Article: gates opens biggest gadget fair bill gates has opened the consumer electronics show (ces) in las vegas  saying that gadgets are working together more to help people manage multimedia content around the home and on the move.  mr gates made no announcement about the next generation xbox games console ...
Tags: bill gates, consumer electronics show, gadgets, xbox

[ID 46] Article: china  ripe  for media explosion asia is set to drive global media growth to 2008 and beyond  with china and india filling the two top spots  analysts have predicted.  japan  south korea and singapore will also be strong players  but china s demographics give it the edge  a media conference in londo...
Tags: china, india, japan, south korea, singapore, global media growth

[ID 19] Article: moving mobile improves golf swing a mobile phone that recognises and responds to movements has been launched in japan.  the motion-sensitive phone - officially titled the v603sh - was developed by sharp and launched by vodafone s japanese division. devised mainly for mobile gaming  users can also ac...
Tags: mobile phone, japan, sharp, vodafone, golf swing

[ID 63] Article: what high-definition will do to dvds first it was the humble home video  then it was the dvd  and now hollywood is preparing for the next revolution in home entertainment - high-definition.  high-definition gives incredible  3d-like pictures and surround sound. the dvd disks and the gear to play the...
Tags: high-definition, dvd, hollywood, home entertainment

In conclusion, this example shows how we can stack multiple NLP endpoints together to get an output much closer to our desired outcome.

In practice, hosting and maintaining multiple models can quickly become complex. By leveraging Cohere endpoints, each step is reduced to a simple API call.