The Embed Endpoint

In Module 1 you learned about text embeddings, a useful way to turn text into numbers that capture its meaning and context. In this chapter you’ll learn how to put them into practice using the Embed endpoint. You’ll use it to explore a dataset of sentences, plot them in the plane, and observe graphically that similar sentences are indeed mapped to nearby points in the embedding space.

Colab Notebook

This chapter comes with a corresponding notebook, and we encourage you to follow along in it as you read the chapter.

For the setup, please refer to the Setting Up chapter at the beginning of this module.

Semantic Exploration

The dataset we’ll use consists of 50 top search terms on the web about “Hello, World!”.

PYTHON

import pandas as pd

# Load the search terms into a single column named "search_term"
df = pd.read_csv(
    "https://github.com/cohere-ai/cohere-developer-experience/raw/main/notebooks/data/hello-world-kw.csv",
    names=["search_term"],
)
df.head()

The following are a few examples:

No.	Keyword
0	how to print hello world in python
1	what is hello world
2	how do you write hello world in an alert box
3	how to print hello world in java
4	how to write hello world in eclipse

Here’s how to use the Embed endpoint:

Prepare input: The input is the list of texts you want to embed.
Define model type: At the time of writing, there are four models available:
- embed-english-v3.0 (English)
- embed-english-light-v3.0 (English)
- embed-multilingual-v3.0 (Multilingual: 100+ languages)
- embed-multilingual-light-v3.0 (Multilingual: 100+ languages)
We’ll use embed-english-v3.0 for our example.
Generate output: The output is the corresponding embeddings for the input texts.

The code looks like this:

PYTHON

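# `co` is assumed to be the Cohere client created in the Setting Up chapter
def embed_text(texts):
    # Embed a list of strings; one embedding is returned per input text
    output = co.embed(
        model="embed-english-v3.0",
        input_type="search_document",
        texts=texts,
    )
    embedding = output.embeddings

    return embedding


# Store each search term's embedding alongside it in the dataframe
df["search_term_embeds"] = embed_text(df["search_term"].tolist())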

For every piece of text passed to the Embed endpoint, a sequence of numbers will be generated. Each number represents a piece of information about the meaning contained in that piece of text.

Note that we set the input_type parameter to search_document. There are several options available, and you should choose one according to how the embeddings will be used:

search_document: Use this for the documents against which search is performed.
search_query: Use this for the search query itself.
classification: Use this when you use the embeddings as an input to a text classifier.
clustering: Use this when you want to cluster the embeddings.
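As a rough illustration of how the two search-related input types fit together, the sketch below embeds documents with search_document and a query with search_query, then ranks the documents by similarity to the query. The helper names `cosine_similarity` and `rank_documents` are hypothetical, and the use of cosine similarity as the comparison measure is an assumption made for this example rather than something this chapter prescribes. `co` is assumed to be the client from the Setting Up chapter.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(co, query, documents, model="embed-english-v3.0"):
    # Hypothetical helper: `co` is an already-configured Cohere client.
    # Documents are embedded with input_type="search_document" ...
    doc_embeds = co.embed(
        model=model,
        input_type="search_document",
        texts=documents,
    ).embeddings
    # ... while the query is embedded with input_type="search_query"
    query_embed = co.embed(
        model=model,
        input_type="search_query",
        texts=[query],
    ).embeddings[0]
    # Score every document against the query, most similar first
    scores = [cosine_similarity(query_embed, d) for d in doc_embeds]
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
```

With a configured client, a call such as `rank_documents(co, "hello world in python", df["search_term"].tolist())` would return the search terms paired with their scores, ordered from most to least similar.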

To get a sense of what these numbers represent, we can use techniques that compress the embeddings down to just two dimensions while retaining as much information as possible. Once the embeddings are in two dimensions, we can display them on a 2D plot.

We can make use of the UMAP technique to do this. The code is as follows:

PYTHON

# If you don't have umap installed, please run `pip install umap-learn` first!
import umap

embeds = list(df["search_term_embeds"])

# Compress the embeddings to 2 dimensions (UMAP's default reduction is to 2 dimensions)
reducer = umap.UMAP(n_neighbors=49)
umap_embeds = reducer.fit_transform(embeds)

# Store the compressed embeddings in the dataframe/table
df["x"] = umap_embeds[:, 0]
df["y"] = umap_embeds[:, 1]

You can then use any plotting library to visualize these compressed embeddings on a 2D plot.
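For instance, a minimal sketch with matplotlib might look like the following. The choice of matplotlib and the helper name `plot_embeddings` are assumptions made for illustration; the chapter's own plots may have been produced with a different library.

```python
import matplotlib.pyplot as plt


def plot_embeddings(xs, ys, labels, title="Search terms in 2D"):
    # Hypothetical helper: scatter the 2D points and label each one with its text
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.scatter(xs, ys, s=20)
    for x, y, label in zip(xs, ys, labels):
        # Offset each label slightly so it does not sit on top of its point
        ax.annotate(label, (x, y), xytext=(3, 3), textcoords="offset points", fontsize=8)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title(title)
    return ax
```

With the dataframe from this chapter, calling `plot_embeddings(df["x"], df["y"], df["search_term"])` followed by `plt.show()` would display all of the search terms on one chart.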

Here is the plot showing all 50 data points:

And here are a few zoomed-in plots, clearly showing text of similar meaning being closer to each other.

Example #1: Hello, World! In Python

Example #2: Origins of Hello, World!

These kinds of insights enable various downstream analyses and applications, such as topic modeling, by clustering documents into groups. In other words, text embeddings allow us to take a huge corpus of unstructured text and turn it into a structured form, making it possible to objectively compare, dissect, and derive insights from all that text.
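To make the idea of grouping by embedding concrete, here is a small sketch that clusters vectors with k-means from scikit-learn. K-means is only one possible choice and is not named by this chapter; the helper `cluster_embeddings` and the toy vectors are hypothetical stand-ins for real embeddings.

```python
from sklearn.cluster import KMeans


def cluster_embeddings(embeds, n_clusters, seed=0):
    # Hypothetical helper: assign each embedding to one of `n_clusters` groups
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(embeds).tolist()


# Toy 2D vectors standing in for real embeddings: two tight, well-separated groups
toy_embeds = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = cluster_embeddings(toy_embeds, n_clusters=2)
```

On the chapter's data, something like `cluster_embeddings(list(df["search_term_embeds"]), n_clusters=5)` would return one cluster label per search term; the number of clusters here is an arbitrary value chosen for illustration.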

In the coming chapters, we’ll dive deeper into these topics.

Conclusion

In this chapter you learned about the Embed endpoint. Text embeddings make possible a wide array of downstream applications such as semantic search, clustering, and classification. You’ll learn more about those in the subsequent chapters.

Original Source

This material comes from the post “Hello, World! Meet Language AI: Part 2”.
