Analyzing Hacker News with Six Language Understanding Methods
Large language models give machines a vastly improved representation and understanding of language. These abilities give developers more options for content recommendation, analysis, and filtering.
In this notebook we take thousands of the most popular posts from Hacker News and demonstrate some of these functionalities:
- Given an existing post title, retrieve the most similar posts (nearest neighbor search using embeddings)
- Given a query that we write, retrieve the most similar posts
- Plot the archive of articles by similarity (where similar posts are close together and different ones are far)
- Cluster the posts to identify the major common themes
- Extract major keywords from each cluster so we can identify what the cluster is about
- (Experimental) Name clusters with a generative language model
Setup
Let’s start by installing the tools we’ll need and then importing them.
Fill in your Cohere API key in the next cell. To do this, begin by signing up to Cohere (for free!) if you haven’t yet. Then get your API key here.
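A minimal setup sketch, assuming the packages named in this notebook are installed with pip and that the key is kept in an environment variable called `COHERE_API_KEY`. The variable name and both helper functions are illustrative choices, not part of the Cohere SDK.

```python
# Assumed install command for the libraries this notebook mentions:
#   pip install cohere annoy scikit-learn
import os


def load_api_key(env_var: str = "COHERE_API_KEY") -> str:
    """Read the API key from the environment instead of hard-coding it."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Cohere API key.")
    return key


def make_client(api_key: str):
    """Create a Cohere client; the import is deferred until a client is actually needed."""
    import cohere  # installed with: pip install cohere

    return cohere.Client(api_key)
```

With the key exported in your shell, `co = make_client(load_api_key())` gives a client object for the embedding calls later on.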
Dataset: Top 3,000 Ask HN posts
We will use the top 3,000 posts from the Ask HN section of Hacker News. We provide a CSV containing the posts.
| | title | url | text | dead | by | score | time | timestamp | type | id | parent | descendants | ranking | deleted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I’m a software engineer going blind, how should I prepare? | NaN | I'm a 24 y/o full stack engineer (I know some of you are rolling your eyes right now, just highlighting that I have experience on frontend apps as well as backend architecture). I've been working professionally for ~7 years building mostly javascript projects but also some PHP. Two years ago I was diagnosed with a condition called "Usher's Syndrome" - characterized by hearing loss, balance issues, and progressive vision loss.<p>I know there are blind software engineers out there. My main questions are:<p>- Are there blind frontend engineers?<p>- What kinds of software engineering lend themselves to someone with limited vision? Backend only?<p>- Besides a screen reader, what are some of the best tools for building software with limited vision?<p>- Does your company employ blind engineers? How well does it work? What kind of engineer are they?<p>I'm really trying to get ahead of this thing and prepare myself as my vision is degrading rather quickly. I'm not sure what I can do if I can't do SE as I don't have any formal education in anything. I've worked really hard to get to where I am and don't want it to go to waste.<p>Thank you for any input, and stay safe out there!<p>Edit:<p>Thank you all for your links, suggestions, and moral support, I really appreciate it. Since my diagnosis I've slowly developed a crippling anxiety centered around a feeling that I need to figure out the rest of my life before it's too late. I know I shouldn't think this way but it is hard not to. I'm very independent and I feel a pressure to "show up." I will look into these opportunities mentioned and try to get in touch with some more members of the blind engineering community. | NaN | zachrip | 3270 | 1587332026 | 2020-04-19 21:33:46+00:00 | story | 22918980 | NaN | 473.0 | NaN | NaN |
1 | Am I the longest-serving programmer – 57 years and counting? | NaN | In May of 1963, I started my first full-time job as a computer programmer for Mitchell Engineering Company, a supplier of steel buildings. At Mitchell, I developed programs in Fortran II on an IBM 1620 mostly to improve the efficiency of order processing and fulfillment. Since then, all my jobs for the past 57 years have involved computer programming. I am now a data scientist developing cloud-based big data fraud detection algorithms using machine learning and other advanced analytical technologies. Along the way, I earned a Master’s in Operations Research and a Master’s in Management Science, studied artificial intelligence for 3 years in a Ph.D. program for engineering, and just two years ago I received Graduate Certificates in Big Data Analytics from the schools of business and computer science at a local university (FAU). In addition, I currently hold the designation of Certified Analytics Professional (CAP). At 74, I still have no plans to retire or to stop programming. | NaN | genedangelo | 2634 | 1590890024 | 2020-05-31 01:53:44+00:00 | story | 23366546 | NaN | 531.0 | NaN | NaN |
2 | Is S3 down? | NaN | I'm getting<p>{\n "errorCode" : "InternalError"\n}<p>When I attempt to use the AWS Console to view s3 | NaN | iamdeedubs | 2589 | 1488303958 | 2017-02-28 17:45:58+00:00 | story | 13755673 | NaN | 1055.0 | NaN | NaN |
3 | What tech job would let me get away with the least real work possible? | NaN | Hey HN,<p>I'll probably get a lot of flak for this. Sorry.<p>I'm an average developer looking for ways to work as little as humanely possible.<p>The pandemic made me realize that I do not care about working anymore. The software I build is useless. Time flies real fast and I have to focus on my passions (which are not monetizable).<p>Unfortunately, I require shelter, calories and hobby materials. Thus the need for some kind of job.<p>Which leads me to ask my fellow tech workers, what kind of job (if any) do you think would fit the following requirements :<p>- No / very little involvement in the product itself (I do not care.)<p>- Fully remote (You can't do much when stuck in the office. Ideally being done in 2 hours in the morning then chilling would be perfect.)<p>- Low expectactions / vague job description.<p>- Salary can be on the lower side.<p>- No career advancement possibilities required. Only tech, I do not want to manage people.<p>- Can be about helping other developers, setting up infrastructure/deploy or pure data management since this is fun.<p>I think the only possible jobs would be some kind of backend-only dev or devops/sysadmin work. But I'm not sure these exist anymore, it seems like you always end up having to think about the product itself. Web dev jobs always required some involvement in the frontend.<p>Thanks for any advice (or hate, which I can't really blame you for). | NaN | lmueongoqx | 2022 | 1617784863 | 2021-04-07 08:41:03+00:00 | story | 26721951 | NaN | 1091.0 | NaN | NaN |
4 | What books changed the way you think about almost everything? | NaN | I was reflecting today about how often I think about Freakonomics. I don't study it religiously. I read it one time more than 10 years ago. I can only remember maybe a single specific anecdote from the book. And yet the simple idea that basically every action humans take can be traced back to an incentive has fundamentally changed the way I view the world. Can anyone recommend books that have had a similar impact on them? | NaN | anderspitman | 2009 | 1549387905 | 2019-02-05 17:31:45+00:00 | story | 19087418 | NaN | 1165.0 | NaN | NaN |
We calculate the embeddings using Cohere’s embed-english-v3.0 model. The resulting embeddings matrix has 3,000 rows (one for each post) and 1024 columns (meaning each post title is represented with a 1024-dimensional embedding).
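The embedding step could be sketched as follows, assuming `co` is a `cohere.Client` and `titles` is the list of post titles. The helper name `embed_titles` and the batch size of 96 are assumptions made for illustration; check the limit that applies to your account.

```python
from typing import List, Sequence

EMBED_MODEL = "embed-english-v3.0"
BATCH_SIZE = 96  # assumed per-request cap on texts; adjust if your limit differs


def embed_titles(co, titles: Sequence[str], batch_size: int = BATCH_SIZE) -> List[List[float]]:
    """Embed post titles in batches and return one vector per title, in input order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(titles), batch_size):
        batch = list(titles[start:start + batch_size])
        response = co.embed(
            texts=batch,
            model=EMBED_MODEL,
            input_type="search_document",  # v3 embed models require an input_type
        )
        embeddings.extend(response.embeddings)
    return embeddings
```

Called on the 3,000 titles, this should return 3,000 vectors of length 1024, matching the matrix shape described above.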
Building a semantic search index
For nearest-neighbor search, we can use the open-source Annoy library. Let’s create a semantic search index and feed it all the embeddings.
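A minimal sketch of building such an index, assuming the embeddings are a list (or array) of equal-length rows and that the index uses Annoy's "angular" metric. The metric, the number of trees, and the helper names are assumptions. The `angular_distance` helper is included only to show what that metric measures.

```python
import math
from typing import Sequence


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Distance used by Annoy's "angular" metric: sqrt(2 * (1 - cosine_similarity))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against tiny floating-point overshoot before the square root.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def build_index(embeddings: Sequence[Sequence[float]], n_trees: int = 10):
    """Build an in-memory Annoy index with one item per embedding row."""
    from annoy import AnnoyIndex  # installed with: pip install annoy

    index = AnnoyIndex(len(embeddings[0]), "angular")
    for item_id, vector in enumerate(embeddings):
        index.add_item(item_id, vector)
    index.build(n_trees)  # more trees: better recall, larger index
    return index
```

Under this metric distances range from 0 (same direction) to 2 (opposite direction), so a distance of about 0.88 corresponds to a cosine similarity of roughly 0.61.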
1- Given an existing post title, retrieve the most similar posts (nearest neighbor search using embeddings)
We can query neighbors of a specific post using get_nns_by_item.
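A lookup of this kind might be sketched as below, assuming `index` is a built Annoy index and `titles` is the list of post titles in the same order as the indexed items. The helper name `similar_posts` is hypothetical.

```python
from typing import List, Sequence, Tuple


def similar_posts(index, titles: Sequence[str], item_id: int, n: int = 10) -> List[Tuple[int, str, float]]:
    """Return up to n (id, title, distance) rows for the neighbors of an indexed post."""
    # Ask for one extra result because the post itself normally comes back at distance 0.
    ids, distances = index.get_nns_by_item(item_id, n + 1, include_distances=True)
    rows = [(i, titles[i], d) for i, d in zip(ids, distances) if i != item_id]
    return rows[:n]
```

Each returned row carries the post's position in the dataset, its title, and its distance from the query post, sorted from nearest to farthest.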
| | post titles | distance |
---|---|---|
2991 | Best Bank for Startups? | 0.883494 |
2910 | Who’s looking for a cofounder? | 0.885087 |
31 | What startup/technology is on your ‘to watch’ list? | 0.887212 |
685 | What startup/technology is on your ‘to watch’ list? | 0.887212 |
2123 | Who is seeking a cofounder? | 0.889451 |
727 | Agriculture startups doing interesting work? | 0.899192 |
2972 | How should I evaluate a startup as I job hunt? | 0.901621 |
2589 | What methods do you use to gain early customers for your startup? | 0.903065 |
2708 | Is there VC appetite for defense related startups? | 0.904016 |
2- Given a query that we write, retrieve the most similar posts
We’re not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.
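One way to sketch this, assuming `co` is a `cohere.Client`, `index` is the Annoy index built over the title embeddings, and `titles` is in the same order as the indexed items. The helper name `search_posts` is hypothetical, and embedding the query with `input_type="search_query"` is an assumption; reusing the input type chosen for the titles is another reasonable option when queries look like the indexed titles.

```python
from typing import List, Sequence, Tuple


def search_posts(co, index, titles: Sequence[str], query: str, n: int = 10) -> List[Tuple[int, str, float]]:
    """Embed a free-text query and return (id, title, distance) rows for its nearest posts."""
    response = co.embed(
        texts=[query],
        model="embed-english-v3.0",  # must match the model used for the indexed titles
        input_type="search_query",   # assumption: the query is embedded as a search query
    )
    query_vector = response.embeddings[0]
    ids, distances = index.get_nns_by_vector(query_vector, n, include_distances=True)
    return [(i, titles[i], d) for i, d in zip(ids, distances)]
```

Unlike the item lookup, nothing needs to be filtered out here, because the query itself is not part of the index.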
| | texts | distance |
---|---|---|
2457 | How do I improve my command of mathematical language? | 0.931286 |
1235 | How to learn new things better? | 1.024635 |
145 | How to self-learn math? | 1.044135 |
1317 | How can I learn to read mathematical notation? | 1.050976 |
910 | How Do You Learn? | 1.061253 |
2432 | How did you learn math notation? | 1.070800 |
1994 | How do I become smarter? | 1.083434 |
1529 | How do you personally learn? | 1.086088 |
796 | How do you keep improving? | 1.087251 |
1286 | How do I learn drawing? | 1.088468 |
3- Plot the archive of articles by similarity
What if we want to browse the archive instead of only searching it? Let’s plot all the questions in a 2D chart so you’re able to visualize the posts in the archive and their similarities.
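A rough sketch of the projection step, assuming UMAP (from the `umap-learn` package) as the dimensionality reducer. The text does not name the reducer, so this choice, the parameter values, and both helper names are assumptions. The second helper only reshapes the result into records that any plotting library can consume.

```python
from typing import Dict, List, Sequence


def reduce_to_2d(embeddings, n_neighbors: int = 15, random_state: int = 42):
    """Project an embeddings array to two dimensions with UMAP (an assumed choice of reducer)."""
    import umap  # installed with: pip install umap-learn

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,
        n_components=2,
        metric="cosine",
        random_state=random_state,
    )
    return reducer.fit_transform(embeddings)


def to_chart_records(titles: Sequence[str], coords: Sequence[Sequence[float]]) -> List[Dict[str, object]]:
    """Pair each title with its 2D position so it can be drawn as a scatter plot."""
    if len(titles) != len(coords):
        raise ValueError("titles and coords must have the same length")
    return [{"title": t, "x": float(c[0]), "y": float(c[1])} for t, c in zip(titles, coords)]
```

Plotting `x` against `y` with the title as a tooltip gives a browsable map in which similar questions sit near each other.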
4- Cluster the posts to identify the major common themes
Let’s proceed to cluster the embeddings using KMeans from scikit-learn.
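A minimal sketch of this step, assuming the embeddings are an array with one row per post. The cluster count of 8, the seed, and the helper names are assumptions, not values taken from the notebook.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def cluster_embeddings(embeddings, n_clusters: int = 8, random_state: int = 0):
    """Assign each embedding row to one of n_clusters groups with scikit-learn's KMeans."""
    from sklearn.cluster import KMeans  # installed with: pip install scikit-learn

    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return model.fit_predict(embeddings)  # array of cluster labels, one per row


def group_titles(titles: Sequence[str], labels: Sequence[int]) -> Dict[int, List[str]]:
    """Collect the titles that fall into each cluster label."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for title, label in zip(titles, labels):
        groups[int(label)].append(title)
    return dict(groups)
```

Grouping the titles by label makes it easy to read a few posts from each cluster and judge whether the chosen number of clusters yields coherent themes.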
5- Extract major keywords from each cluster so we can identify what the cluster is about
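The text does not say which keyword extraction method is used, so the sketch below swaps in a simple cluster-level TF-IDF purely as an illustration: each cluster is treated as one document, and a word scores highly when it is frequent in that cluster and rare across the others. The stopword list and function names are assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence

# A tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "do", "for", "how", "i", "in", "is",
             "my", "of", "the", "to", "what", "you", "your"}


def tokenize(text: str) -> List[str]:
    """Lowercase a title and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cluster_keywords(groups: Dict[int, Sequence[str]], top_n: int = 5) -> Dict[int, List[str]]:
    """Rank each cluster's words by term frequency times inverse cluster frequency."""
    counts = {label: Counter(w for t in titles for w in tokenize(t))
              for label, titles in groups.items()}
    # Number of clusters in which each word appears at least once.
    cluster_freq = Counter(w for c in counts.values() for w in c)
    n_clusters = len(counts)
    keywords: Dict[int, List[str]] = {}
    for label, c in counts.items():
        total = sum(c.values()) or 1
        scores = {w: (n / total) * math.log(n_clusters / cluster_freq[w]) for w, n in c.items()}
        # Sort by descending score, then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = ranked[:top_n]
    return keywords
```

Words that appear in every cluster get a score of zero, which pushes generic terms to the bottom of each cluster's list.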
Plot with clusters and keywords information
We can now plot the documents with their clusters and keywords.
6- (Experimental) Naming clusters with a generative language model
While the extracted keywords do add a lot of information to help us identify the clusters at a glance, we should be able to have a generative model look at these keywords and suggest a name. So far I have reasonable results from a prompt that looks like this:
There’s a lot of room for improvement though. I’m really excited by this use case because it adds so much information. Imagine if, in the following tree of topics, you assigned each cluster an intelligible name. Then imagine if you assigned each branching a name as well.
We can’t wait to see what you start building! Share your projects or find support on our Discord server.