Clustering Using Embeddings

Colab Notebook

This chapter uses the same notebook as the previous chapter.

For the setup, please refer to the Setting Up chapter at the beginning of this module.

Clustering

As the amount of unstructured text data grows, organizations increasingly want to understand what it contains. One example is discovering the underlying topics in a collection of documents so we can explore trends and insights. Another is segmenting customers based on their preferences and activity.

These kinds of tasks fall under a category called clustering. In machine learning, clustering is the process of grouping similar items (here, documents) into clusters, which organizes a large number of documents into a smaller number of groups. It also lets us discover emerging patterns in a collection of documents without having to specify much beyond the data itself.

Now that each piece of text is represented by its embedding, putting the texts through a clustering algorithm becomes simple. Let’s look at an example using the same nine data points.

Implementation-wise, we use the K-means algorithm to cluster these data points (if you’d like to learn more about it, please check this video about the K-means algorithm).
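To give a feel for what K-means does under the hood, here is a minimal NumPy sketch of its two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is an illustration only, not the scikit-learn implementation used in this chapter. The function name kmeans_sketch, the toy points, and the choice to start from the first few points as centroids are all assumptions made for brevity (scikit-learn uses the k-means++ initialization by default).

```python
import numpy as np

def kmeans_sketch(points, n_clusters, n_iters=10):
    """Toy K-means (hypothetical helper); returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    # Simplified initialization: use the first n_clusters points as centroids
    centroids = points[:n_clusters].copy()
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = points[labels == k].mean(axis=0)
    return labels, centroids

# Two well-separated toy groups standing in for embeddings
toy_points = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
])
toy_labels, toy_centroids = kmeans_sketch(toy_points, n_clusters=2)
print(toy_labels)
```

In practice a library implementation is preferable, since it handles initialization, convergence checks, and repeated restarts.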

Other than the embeddings, the only other key piece of information the algorithm needs is the number of clusters we want to find. This number is usually larger in real applications, but since our dataset is small, we’ll set it to 2.
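When the right number of clusters is not obvious, one common heuristic (not used in this chapter) is to try several values and compare their silhouette scores, where a higher score suggests tighter, better-separated clusters. The sketch below assumes synthetic 2-D points in place of real embeddings; toy_embeds, the three blob centers, and the range of candidate values are all made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for embeddings: three tight, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_embeds = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in centers]
)

# Fit K-means for each candidate number of clusters and score the result
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_embeds)
    scores[k] = silhouette_score(toy_embeds, labels)

# Keep the candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On a dataset as small as the nine sentences here, such scores are likely to be noisy, so they are best treated as a rough guide rather than a rule.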

PYTHON

from sklearn.cluster import KMeans

# df_pc2, embeds, sample, and generate_chart are defined earlier in the notebook

# Copy the 2-D projection and pick the number of clusters
df_clust = df_pc2.copy()
n_clusters = 2

# Cluster the embeddings and store each point's cluster label as a string
kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds).tolist()
df_clust["cluster"] = list(map(str, classes))

# Plot on a chart, coloring each point by its cluster
df_clust.columns = df_clust.columns.astype(str)
generate_chart(
    df_clust.iloc[:sample],
    "0",
    "1",
    lbl="on",
    color="cluster",
    title="Clustering with 2 Clusters",
)

The plot below shows the clusters that the algorithm returned. The grouping looks spot on: one cluster contains the queries about airline information, and the other contains the queries about ground service information.
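Besides reading the chart, it can help to list the texts that landed in each cluster and check them by eye. The sketch below assumes a small hypothetical DataFrame with a "query" column and string cluster labels like the ones created above; the sentences and labels are invented for illustration and are not the notebook's data.

```python
import pandas as pd

# Hypothetical queries and cluster labels, for illustration only
toy_df = pd.DataFrame({
    "query": [
        "which airlines fly from boston to denver",
        "is there ground transportation at the airport",
        "show me the airlines with evening flights",
        "how do i get downtown from the terminal",
    ],
    "cluster": ["0", "1", "0", "1"],
})

# Collect the queries that belong to each cluster label
members = toy_df.groupby("cluster")["query"].apply(list).to_dict()

for label, queries in members.items():
    print(f"Cluster {label}:")
    for q in queries:
        print(f"  - {q}")
```

Reading a handful of members per cluster is a quick way to decide whether a cluster maps onto a recognizable topic.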

Conclusion

In this chapter, you learned how to cluster a dataset of sentences, and you observed that each cluster corresponds to a particular topic.

Original Source

This material comes from the post Text Embeddings Visually Explained.