The Embed Endpoint
In this chapter, you'll learn how to use embeddings and Cohere's Embed endpoint to explore and get insights on a dataset of sentences.
In Module 2 you learned about text embeddings and how they turn text into numbers that capture its meaning and context. In this chapter you'll put them into practice using the Embed endpoint. You'll use it to explore a dataset of sentences, plot them in the plane, and see graphically that similar sentences are indeed mapped to nearby points in the embedding space.
This chapter comes with a corresponding Colab notebook, and we encourage you to follow along in it as you read.
For the setup, please refer to the Setting Up chapter at the beginning of this module.
The dataset we'll use consists of 50 top search terms on the web about "Hello, World!".
    import pandas as pd

    # Load the search terms into a single-column dataframe
    df = pd.read_csv("https://github.com/cohere-ai/notebooks/raw/main/notebooks/data/hello-world-kw.csv", names=["search_term"])
    df.head()
The following are a few examples:
| | search_term |
|---|---|
| 0 | how to print hello world in python |
| 1 | what is hello world |
| 2 | how do you write hello world in an alert box |
| 3 | how to print hello world in java |
| 4 | how to write hello world in eclipse |
The Embed endpoint is quite straightforward to use:
Prepare input — The input is the list of text you want to embed.
Define model type — At the time of writing, there are three models available:
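| Model Name | Language Support | Embedding Size |
|---|---|---|
| embed-english-v2.0 | English | 4096 |
| embed-english-light-v2.0 | English | 1024 |
| embed-multilingual-v2.0 | Multilingual (100+ languages) | 768 |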
We'll use embed-english-v2.0 for our example.
Generate output — The output is the corresponding embeddings for the input text.
The code looks like this:
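    def embed_text(texts):
        # Send the texts to the Embed endpoint; `co` is assumed to be the
        # Cohere client created in the Setting Up chapter
        output = co.embed(model="embed-english-v2.0", texts=texts)
        # One embedding (a list of floats) comes back per input text
        embedding = output.embeddings
        return embedding

    # Embed every search term and store the result in a new column
    df["search_term_embeds"] = embed_text(df["search_term"].tolist())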
The embed-english-v2.0 model generates embeddings of 4,096 dimensions. This means that, for every piece of text passed to the Embed endpoint, a sequence of 4,096 numbers will be generated. Each number represents a piece of information about the meaning contained in that piece of text.
To get a feel for what these numbers represent, we can use dimensionality reduction techniques to compress the embeddings down to just two dimensions while retaining as much information as possible. Once the embeddings are in two dimensions, we can draw them on a 2D plot.
We can make use of the UMAP technique to do this. The code is as follows:
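    import numpy as np
    import umap

    # Stack the stored embeddings into one array of shape (n_texts, n_dimensions)
    embeds = np.array(df["search_term_embeds"].tolist())

    # Compress the embeddings to 2 dimensions (UMAP's default reduction is to 2 dimensions)
    reducer = umap.UMAP(n_neighbors=49)
    umap_embeds = reducer.fit_transform(embeds)

    # Store the compressed embeddings in the dataframe/table
    df['x'] = umap_embeds[:,0]
    df['y'] = umap_embeds[:,1]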
You can then use any plotting library to visualize these compressed embeddings on a 2D plot.
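As one possible illustration, the sketch below uses matplotlib to draw a labelled scatter plot. It assumes a dataframe with `x`, `y`, and `search_term` columns like the one built above. The `plot_embeddings` helper and the three-row stand-in dataframe (with made-up coordinates) are hypothetical and exist only so the snippet runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line when working in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_embeddings(df, label_col="search_term"):
    """Scatter-plot the 2D points in df['x'], df['y'] and label each one (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(df["x"], df["y"])
    # Write each search term next to its point
    for x, y, label in zip(df["x"], df["y"], df[label_col]):
        ax.annotate(label, (x, y), fontsize=8)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig, ax


# Stand-in data with made-up coordinates; in the chapter, df already holds the UMAP output
demo_df = pd.DataFrame({
    "search_term": [
        "what is hello world",
        "how to print hello world in python",
        "how to print hello world in java",
    ],
    "x": [0.1, 2.0, 2.2],
    "y": [1.5, 0.3, 0.4],
})
fig, ax = plot_embeddings(demo_df)
```

With the chapter's data, you would call `plot_embeddings(df)` and then save the figure with `fig.savefig(...)` or display it in the notebook.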
Here is the plot showing all 50 data points:
And here are a few zoomed-in plots, clearly showing text of similar meaning being closer to each other.
Example #1: Hello, World! In Python
Example #2: Origins of Hello, World!
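The same closeness can also be checked numerically rather than by eye. The sketch below computes cosine similarity, a common choice for comparing embeddings, although the chapter itself does not prescribe a metric. The `cosine_similarity` helper and the three-dimensional toy vectors are hypothetical stand-ins; real embeddings from the endpoint would take their place.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for embeddings (made-up values, not real model output)
python_term = [0.9, 0.1, 0.0]
java_term = [0.8, 0.2, 0.1]
origin_term = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(python_term, java_term)
sim_far = cosine_similarity(python_term, origin_term)
print(sim_close > sim_far)
```

Here the first two toy vectors point in nearly the same direction, so their similarity is higher than that of the first and third.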
These kinds of insights enable various downstream analyses and applications, such as topic modeling, by clustering documents into groups. In other words, text embeddings allow us to take a huge corpus of unstructured text and turn it into a structured form, making it possible to objectively compare, dissect, and derive insights from all that text.
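To make the clustering idea concrete, here is a minimal sketch using k-means from scikit-learn. K-means is one possible choice rather than the method this chapter uses, and the two-dimensional toy vectors below are made-up stand-ins for real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up vectors standing in for embeddings: two visibly separate groups
toy_embeds = np.array([
    [0.10, 0.20], [0.15, 0.22], [0.12, 0.18],   # group A
    [0.90, 0.80], [0.88, 0.85], [0.93, 0.79],   # group B
])

# Assign each vector to one of two clusters; random_state makes runs repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(toy_embeds)
print(labels)
```

With the chapter's data, `toy_embeds` would be replaced by the array of search-term embeddings, and the number of clusters becomes a choice to experiment with.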
In the coming chapters, we'll dive deeper into these topics.
In this chapter you learned about the Embed endpoint. Text embeddings make possible a wide array of downstream applications such as semantic search, clustering, and classification. You'll learn more about those in the subsequent chapters.
This material comes from the post Hello, World! Meet Language AI: Part 2