Analyzing Hacker News with Six Language Understanding Methods

Analyzing Hacker News with Six Language Understanding Methods

Large language models give machines a vastly improved representation and understanding of language. These abilities give developers more options for content recommendation, analysis, and filtering.

In this notebook we take thousands of the most popular posts from Hacker News and demonstrate some of these functionalities:

  1. Given an existing post title, retrieve the most similar posts (nearest neighbor search using embeddings)
  2. Given a query that we write, retrieve the most similar posts
  3. Plot the archive of articles by similarity (where similar posts are close together and different ones are far)
  4. Cluster the posts to identify the major common themes
  5. Extract major keywords from each cluster so we can identify what the clsuter is about
  6. (Experimental) Name clusters with a generative language model

Setup

Let's start by installing the tools we'll need and then importing them.

!pip install cohere umap-learn altair annoy bertopic
Requirement already satisfied: cohere in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (5.1.5)
Requirement already satisfied: umap-learn in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (0.5.5)
Requirement already satisfied: altair in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (5.2.0)
Requirement already satisfied: annoy in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (1.17.3)
Requirement already satisfied: bertopic in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (0.16.0)
Requirement already satisfied: httpx>=0.21.2 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from cohere) (0.27.0)
Requirement already satisfied: pydantic>=1.9.2 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from cohere) (2.6.0)
Requirement already satisfied: typing_extensions>=4.0.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from cohere) (4.10.0)
Requirement already satisfied: numpy>=1.17 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from umap-learn) (1.24.3)
Requirement already satisfied: scipy>=1.3.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from umap-learn) (1.11.1)
Requirement already satisfied: scikit-learn>=0.22 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from umap-learn) (1.3.0)
Requirement already satisfied: numba>=0.51.2 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from umap-learn) (0.57.0)
Requirement already satisfied: pynndescent>=0.5 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from umap-learn) (0.5.12)
Requirement already satisfied: tqdm in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from umap-learn) (4.65.0)
Requirement already satisfied: jinja2 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from altair) (3.1.2)
Requirement already satisfied: jsonschema>=3.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from altair) (4.17.3)
Requirement already satisfied: packaging in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from altair) (23.2)
Requirement already satisfied: pandas>=0.25 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from altair) (2.0.3)
Requirement already satisfied: toolz in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from altair) (0.12.0)
Requirement already satisfied: hdbscan>=0.8.29 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from bertopic) (0.8.33)
Requirement already satisfied: sentence-transformers>=0.4.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from bertopic) (2.6.1)
Requirement already satisfied: plotly>=4.7.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from bertopic) (5.9.0)
Requirement already satisfied: cython<3,>=0.27 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from hdbscan>=0.8.29->bertopic) (0.29.37)
Requirement already satisfied: joblib>=1.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from hdbscan>=0.8.29->bertopic) (1.2.0)
Requirement already satisfied: anyio in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from httpx>=0.21.2->cohere) (3.5.0)
Requirement already satisfied: certifi in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from httpx>=0.21.2->cohere) (2023.11.17)
Requirement already satisfied: httpcore==1.* in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from httpx>=0.21.2->cohere) (1.0.2)
Requirement already satisfied: idna in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from httpx>=0.21.2->cohere) (3.4)
Requirement already satisfied: sniffio in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from httpx>=0.21.2->cohere) (1.2.0)
Requirement already satisfied: h11<0.15,>=0.13 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from httpcore==1.*->httpx>=0.21.2->cohere) (0.14.0)
Requirement already satisfied: attrs>=17.4.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from jsonschema>=3.0->altair) (22.1.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from jsonschema>=3.0->altair) (0.18.0)
Requirement already satisfied: llvmlite<0.41,>=0.40.0dev0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from numba>=0.51.2->umap-learn) (0.40.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from pandas>=0.25->altair) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from pandas>=0.25->altair) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from pandas>=0.25->altair) (2023.3)
Requirement already satisfied: tenacity>=6.2.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from plotly>=4.7.0->bertopic) (8.2.2)
Requirement already satisfied: annotated-types>=0.4.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from pydantic>=1.9.2->cohere) (0.6.0)
Requirement already satisfied: pydantic-core==2.16.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from pydantic>=1.9.2->cohere) (2.16.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from scikit-learn>=0.22->umap-learn) (2.2.0)
Requirement already satisfied: transformers<5.0.0,>=4.32.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from sentence-transformers>=0.4.1->bertopic) (4.39.3)
Requirement already satisfied: torch>=1.11.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from sentence-transformers>=0.4.1->bertopic) (2.2.2)
Requirement already satisfied: huggingface-hub>=0.15.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from sentence-transformers>=0.4.1->bertopic) (0.22.2)
Requirement already satisfied: Pillow in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from sentence-transformers>=0.4.1->bertopic) (10.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from jinja2->altair) (2.1.1)
Requirement already satisfied: filelock in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from huggingface-hub>=0.15.1->sentence-transformers>=0.4.1->bertopic) (3.9.0)
Requirement already satisfied: fsspec>=2023.5.0 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from huggingface-hub>=0.15.1->sentence-transformers>=0.4.1->bertopic) (2024.3.1)
Requirement already satisfied: pyyaml>=5.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from huggingface-hub>=0.15.1->sentence-transformers>=0.4.1->bertopic) (6.0)
Requirement already satisfied: requests in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from huggingface-hub>=0.15.1->sentence-transformers>=0.4.1->bertopic) (2.31.0)
Requirement already satisfied: six>=1.5 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas>=0.25->altair) (1.16.0)
Requirement already satisfied: sympy in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (1.11.1)
Requirement already satisfied: networkx in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (3.1)
Requirement already satisfied: regex!=2019.12.17 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers>=0.4.1->bertopic) (2022.7.9)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers>=0.4.1->bertopic) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers>=0.4.1->bertopic) (0.4.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from requests->huggingface-hub>=0.15.1->sentence-transformers>=0.4.1->bertopic) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from requests->huggingface-hub>=0.15.1->sentence-transformers>=0.4.1->bertopic) (1.26.18)
Requirement already satisfied: mpmath>=0.19 in /Users/alexiscook/anaconda3/lib/python3.11/site-packages (from sympy->torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (1.3.0)
import cohere
import numpy as np
import pandas as pd
import umap
import altair as alt
from annoy import AnnoyIndex
import warnings
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

Fill in your Cohere API key in the next cell. To do this, begin by signing up to Cohere (for free!) if you haven't yet. Then get your API key here.

co = cohere.Client("COHERE_API_KEY") # Insert your Cohere API key

Dataset: Top 3,000 Ask HN posts

We will use the top 3,000 posts from the Ask HN section of Hacker News. We provide a CSV containing the posts.

df = pd.read_csv('https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_df.csv', index_col=0)

print(f'Loaded a DataFrame with {len(df)} rows')
Loaded a DataFrame with 3000 rows
df.head()
title url text dead by score time timestamp type id parent descendants ranking deleted
0 I'm a software engineer going blind, how should I prepare? NaN I&#x27;m a 24 y&#x2F;o full stack engineer (I know some of you are rolling your eyes right now, just highlighting that I have experience on frontend apps as well as backend architecture). I&#x27;ve been working professionally for ~7 years building mostly javascript projects but also some PHP. Two years ago I was diagnosed with a condition called &quot;Usher&#x27;s Syndrome&quot; - characterized by hearing loss, balance issues, and progressive vision loss.<p>I know there are blind software engineers out there. My main questions are:<p>- Are there blind frontend engineers?<p>- What kinds of software engineering lend themselves to someone with limited vision? Backend only?<p>- Besides a screen reader, what are some of the best tools for building software with limited vision?<p>- Does your company employ blind engineers? How well does it work? What kind of engineer are they?<p>I&#x27;m really trying to get ahead of this thing and prepare myself as my vision is degrading rather quickly. I&#x27;m not sure what I can do if I can&#x27;t do SE as I don&#x27;t have any formal education in anything. I&#x27;ve worked really hard to get to where I am and don&#x27;t want it to go to waste.<p>Thank you for any input, and stay safe out there!<p>Edit:<p>Thank you all for your links, suggestions, and moral support, I really appreciate it. Since my diagnosis I&#x27;ve slowly developed a crippling anxiety centered around a feeling that I need to figure out the rest of my life before it&#x27;s too late. I know I shouldn&#x27;t think this way but it is hard not to. I&#x27;m very independent and I feel a pressure to &quot;show up.&quot; I will look into these opportunities mentioned and try to get in touch with some more members of the blind engineering community. NaN zachrip 3270 1587332026 2020-04-19 21:33:46+00:00 story 22918980 NaN 473.0 NaN NaN
1 Am I the longest-serving programmer – 57 years and counting? NaN In May of 1963, I started my first full-time job as a computer programmer for Mitchell Engineering Company, a supplier of steel buildings. At Mitchell, I developed programs in Fortran II on an IBM 1620 mostly to improve the efficiency of order processing and fulfillment. Since then, all my jobs for the past 57 years have involved computer programming. I am now a data scientist developing cloud-based big data fraud detection algorithms using machine learning and other advanced analytical technologies. Along the way, I earned a Master’s in Operations Research and a Master’s in Management Science, studied artificial intelligence for 3 years in a Ph.D. program for engineering, and just two years ago I received Graduate Certificates in Big Data Analytics from the schools of business and computer science at a local university (FAU). In addition, I currently hold the designation of Certified Analytics Professional (CAP). At 74, I still have no plans to retire or to stop programming. NaN genedangelo 2634 1590890024 2020-05-31 01:53:44+00:00 story 23366546 NaN 531.0 NaN NaN
2 Is S3 down? NaN I&#x27;m getting<p>{\n &quot;errorCode&quot; : &quot;InternalError&quot;\n}<p>When I attempt to use the AWS Console to view s3 NaN iamdeedubs 2589 1488303958 2017-02-28 17:45:58+00:00 story 13755673 NaN 1055.0 NaN NaN
3 What tech job would let me get away with the least real work possible? NaN Hey HN,<p>I&#x27;ll probably get a lot of flak for this. Sorry.<p>I&#x27;m an average developer looking for ways to work as little as humanely possible.<p>The pandemic made me realize that I do not care about working anymore. The software I build is useless. Time flies real fast and I have to focus on my passions (which are not monetizable).<p>Unfortunately, I require shelter, calories and hobby materials. Thus the need for some kind of job.<p>Which leads me to ask my fellow tech workers, what kind of job (if any) do you think would fit the following requirements :<p>- No &#x2F; very little involvement in the product itself (I do not care.)<p>- Fully remote (You can&#x27;t do much when stuck in the office. Ideally being done in 2 hours in the morning then chilling would be perfect.)<p>- Low expectactions &#x2F; vague job description.<p>- Salary can be on the lower side.<p>- No career advancement possibilities required. Only tech, I do not want to manage people.<p>- Can be about helping other developers, setting up infrastructure&#x2F;deploy or pure data management since this is fun.<p>I think the only possible jobs would be some kind of backend-only dev or devops&#x2F;sysadmin work. But I&#x27;m not sure these exist anymore, it seems like you always end up having to think about the product itself. Web dev jobs always required some involvement in the frontend.<p>Thanks for any advice (or hate, which I can&#x27;t really blame you for). NaN lmueongoqx 2022 1617784863 2021-04-07 08:41:03+00:00 story 26721951 NaN 1091.0 NaN NaN
4 What books changed the way you think about almost everything? NaN I was reflecting today about how often I think about Freakonomics. I don&#x27;t study it religiously. I read it one time more than 10 years ago. I can only remember maybe a single specific anecdote from the book. And yet the simple idea that basically every action humans take can be traced back to an incentive has fundamentally changed the way I view the world. Can anyone recommend books that have had a similar impact on them? NaN anderspitman 2009 1549387905 2019-02-05 17:31:45+00:00 story 19087418 NaN 1165.0 NaN NaN

We calculate the embeddings using Cohere's embed-english-v3.0 model. The resulting embeddings matrix has 3,000 rows (one for each post) and 1024 columns (meaning each post title is represented with a 1024-dimensional embedding).

batch_size = 90

embeds_list = []
for i in range(0, len(df), batch_size):
    batch = df[i : min(i + batch_size, len(df))]
    texts = list(batch["title"])
    embs_batch = co.embed(
        texts=texts, model="embed-english-v3.0", input_type="search_document"
    ).embeddings
    embeds_list.extend(embs_batch)

embeds = np.array(embeds_list)
embeds.shape
(3000, 1024)

Building a semantic search index

For nearest-neighbor search, we can use the open-source Annoy library. Let's create a semantic search index and feed it all the embeddings.

search_index = AnnoyIndex(embeds.shape[1], 'angular')
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('askhn.ann')
True

1- Given an existing post title, retrieve the most similar posts (nearest neighbor search using embeddings)

We can query neighbors of a specific post using get_nns_by_item.

example_id = 50

similar_item_ids = search_index.get_nns_by_item(example_id,
                                                10, # Number of results to retrieve
                                                include_distances=True)
results = pd.DataFrame(data={'post titles': df.iloc[similar_item_ids[0]]['title'],
                             'distance': similar_item_ids[1]}).drop(example_id)

print(f"Query post:'{df.iloc[example_id]['title']}'\nNearest neighbors:")
results
Query post:'Pick startups for YC to fund'
Nearest neighbors:
post titles distance
2991 Best Bank for Startups? 0.883494
2910 Who's looking for a cofounder? 0.885087
31 What startup/technology is on your 'to watch' list? 0.887212
685 What startup/technology is on your 'to watch' list? 0.887212
2123 Who is seeking a cofounder? 0.889451
727 Agriculture startups doing interesting work? 0.899192
2972 How should I evaluate a startup as I job hunt? 0.901621
2589 What methods do you use to gain early customers for your startup? 0.903065
2708 Is there VC appetite for defense related startups? 0.904016

2- Given a query that we write, retrieve the most similar posts

We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

query = "How can I improve my knowledge of calculus?"

query_embed = co.embed(texts=[query],
                       model="embed-english-v3.0",
                       truncate="RIGHT",
                       input_type="search_query").embeddings

similar_item_ids = search_index.get_nns_by_vector(query_embed[0], 10, include_distances=True)

results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['title'],
                             'distance': similar_item_ids[1]})
print(f"Query:'{query}'\nNearest neighbors:")
results
Query:'How can I improve my knowledge of calculus?'
Nearest neighbors:
texts distance
2457 How do I improve my command of mathematical language? 0.931286
1235 How to learn new things better? 1.024635
145 How to self-learn math? 1.044135
1317 How can I learn to read mathematical notation? 1.050976
910 How Do You Learn? 1.061253
2432 How did you learn math notation? 1.070800
1994 How do I become smarter? 1.083434
1529 How do you personally learn? 1.086088
796 How do you keep improving? 1.087251
1286 How do I learn drawing? 1.088468

3- Plot the archive of articles by similarity

What if we want to browse the archive instead of only searching it? Let's plot all the questions in a 2D chart so you're able to visualize the posts in the archive and their similarities.

reducer = umap.UMAP(n_neighbors=100)
umap_embeds = reducer.fit_transform(embeds)
df['x'] = umap_embeds[:,0]
df['y'] = umap_embeds[:,1]

chart = alt.Chart(df).mark_circle(size=60).encode(
    x=#'x',
    alt.X('x',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    y=
    alt.Y('y',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    tooltip=['title']
    ).configure(background="#FDF7F0"
    ).properties(
        width=700,
        height=400,
        title='Ask HN: top 3,000 posts'
        )

chart.interactive()

4- Cluster the posts to identify the major common themes

Let's proceed to cluster the embeddings using KMeans from scikit-learn.

n_clusters = 8

kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds)

5- Extract major keywords from each cluster so we can identify what the cluster is about

documents =  df['title']
documents = pd.DataFrame({"Document": documents,
                          "ID": range(len(documents)),
                          "Topic": None})
documents['Topic'] = classes
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
count_vectorizer = CountVectorizer(stop_words="english").fit(documents_per_topic.Document)
count = count_vectorizer.transform(documents_per_topic.Document)
words = count_vectorizer.get_feature_names_out()
ctfidf = ClassTfidfTransformer().fit_transform(count).toarray()
words_per_class = {label: [words[index] for index in ctfidf[label].argsort()[-10:]] for label in documents_per_topic.Topic}
df['cluster'] = classes
df['keywords'] = df['cluster'].map(lambda topic_num: ", ".join(np.array(words_per_class[topic_num])[:]))

Plot with clusters and keywords information

We can now plot the documents with their clusters and keywords

selection = alt.selection_multi(fields=['keywords'], bind='legend')

chart = alt.Chart(df).transform_calculate(
    url='https://news.ycombinator.com/item?id=' + alt.datum.id
).mark_circle(size=60, stroke='#666', strokeWidth=1, opacity=0.3).encode(
    x=#'x',
    alt.X('x',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    y=
    alt.Y('y',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    href='url:N',
    color=alt.Color('keywords:N',
                    legend=alt.Legend(columns=1, symbolLimit=0, labelFontSize=14)
                   ),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
    tooltip=['title', 'keywords', 'cluster', 'score', 'descendants']
).properties(
    width=800,
    height=500
).add_selection(
    selection
).configure_legend(labelLimit= 0).configure_view(
    strokeWidth=0
).configure(background="#FDF7F0").properties(
    title='Ask HN: Top 3,000 Posts'
)
chart.interactive()

6- (Experimental) Naming clusters with a generative language model

While the extracted keywords do add a lot of information to help us identify the clusters at a glance, we should be able to have a generative model look at these keywords and suggest a name. So far I have reasonable results from a prompt that looks like this:

The common theme of the following words: books, book, read, the, you, are, what, best, in, your
is that they all relate to favorite books to read.
---
The common theme of the following words: startup, company, yc, failed
is that they all relate to startup companies and their failures.
---
The common theme of the following words: freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april
is that they all relate to hiring for a freelancer to join the team of a startup.
---
The common theme of the following words: <insert keywords here>
is that they all relate to

There's a lot of room for improvement though. I'm really excited by this use case because it adds so much information. Imagine if the in the following tree of topics, you assigned each cluster an intelligible name. Then imagine if you assigned each branching a name as well

We can’t wait to see what you start building! Share your projects or find support on our Discord server.