A Deeper Dive Into Semantic Search

Colab Notebook

This chapter comes with a corresponding Colab notebook, and we encourage you to follow along with it as you read the chapter.

For the setup, please refer to the Setting Up chapter at the beginning of this module.

Introduction

In module 2, you learned about semantic search, and in a previous chapter of this module, you built a simple semantic search model using text embeddings. In this chapter, you’ll build a similar semantic search model on a much larger dataset, made up of questions. Since the dataset is larger, we’ll use a tool that speeds up the nearest neighbour search.

As you’ve seen before, semantic search goes way beyond keyword search. Its applications also go beyond building a web search engine: it can power a private search engine for internal documents or records, or features like StackOverflow’s “similar questions” feature.

In this chapter, we’ll go through the following steps:

Get the archive of questions
Embed the archive
Search using an index and nearest neighbour search
Visualize the archive based on the embeddings

1. Download the Dependencies

PYTHON

# @title Import libraries (Run this cell to execute required code) {display-mode: "form"}

import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_colwidth", None)

2. Get the Archive of Questions

We’ll use the trec dataset, which is made up of questions and their categories.

PYTHON

# Get dataset
dataset = load_dataset("trec", split="train")
# Import into a pandas dataframe, take only the first 1000 rows
df = pd.DataFrame(dataset)[:1000]
# Preview the data to ensure it has loaded correctly
df.head(10)

label-coarse	label-fine	text
0	0	How did serfdom develop in and then leave Russia ?
1	1	What films featured the character Popeye Doyle ?
2	2	How can I find a list of celebrities ’ real names ?
3	3	What fowl grabs the spotlight after the Chinese Year of the Monkey ?
4	4	What is the full form of .com ?
5	5	What contemptible scoundrel stole the cork from my lunch ?
6	6	What team did baseball ‘s St. Louis Browns become ?
7	7	What is the oldest profession ?
8	8	What are liver enzymes ?
9	9	Name the scar-faced bounty hunter of The Old West .

3. Embed the Archive

Let’s now embed the text of the questions.

Getting a thousand embeddings of this length should take a few seconds.

PYTHON

# Create and retrieve a Cohere API key from dashboard.cohere.ai/welcome/register
# Paste your API key here. Remember not to share it publicly
api_key = ""

co = cohere.Client(api_key)

# Get the embeddings
embeds = co.embed(
    texts=list(df["text"]),
    model="embed-english-v3.0",
    input_type="search_document",
).embeddings
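For archives much larger than a thousand rows, it can be convenient to embed the texts in batches, for example to report progress or retry a failed chunk. The following is a minimal sketch of that idea, not part of the notebook: `batched`, `embed_in_batches`, the `embed_fn` callable, and the batch size of 100 are all hypothetical, and `fake_embed` merely stands in for a real embedding call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call `embed_fn` on each batch and concatenate the resulting vectors.

    `embed_fn` is a hypothetical callable: list of strings in, list of vectors out.
    The default batch size is an arbitrary illustrative value.
    """
    all_embeds = []
    for batch in batched(texts, batch_size):
        all_embeds.extend(embed_fn(batch))
    return all_embeds


def fake_embed(batch):
    """Stand-in embedder: one 1-dimensional "vector" per text (its length)."""
    return [[float(len(text))] for text in batch]


toy_texts = [f"question {i}" for i in range(250)]
toy_embeds = embed_in_batches(toy_texts, fake_embed, batch_size=100)
print(len(toy_embeds))
```

In a real run, `embed_fn` could wrap the embedding call used above so that each batch is sent separately; the order of the returned vectors matches the order of the input texts.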

4. Build the Index, Search Using an Index and Conduct Nearest Neighbour Search

Let’s build an index using the library called annoy. Annoy is a library created by Spotify for approximate nearest neighbour search. Nearest neighbour search is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point.
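To make the problem definition concrete, here is a minimal sketch of exact nearest neighbour search by linear scan, written in plain Python. The helper names and the toy points are illustrative assumptions, not part of the notebook. An index such as Annoy exists precisely to avoid comparing the query against every point, as this sketch does.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbours(query, points, k=1):
    """Return the indices of the k points closest to `query` (exact linear scan)."""
    return heapq.nsmallest(
        k, range(len(points)), key=lambda i: euclidean(query, points[i])
    )


# Toy 2-D "embeddings" (made-up values, for illustration only)
toy_points = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
toy_query = [1.0, 1.1]

closest = nearest_neighbours(toy_query, toy_points, k=2)
print(closest)
```

The scan costs one distance computation per stored point for every query, which is fine for a handful of vectors but grows linearly with the size of the archive.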

PYTHON

# Create the search index, pass the size of embedding
search_index = AnnoyIndex(np.array(embeds).shape[1], "angular")
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])
search_index.build(10)  # 10 trees
search_index.save("test.ann")

After building the index, we can use it to retrieve the nearest neighbours either of existing questions (section 4a), or of new questions that we embed (section 4b).

4a. Find the Neighbours of an Example from the Dataset

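If we’re only interested in measuring the similarities between the questions in the dataset (no outside queries), a simple way is to calculate the similarities between every pair of embeddings we have. Since the index is already built, though, we can simply ask it for the nearest neighbours of an existing item.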
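The pairwise approach mentioned above can be sketched as follows. This is a plain-Python illustration on made-up toy vectors; the function names are assumptions introduced here, not part of the notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_similarities(vectors):
    """Full n x n matrix of cosine similarities (n * n comparisons)."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def most_similar(sim_matrix, item_id, k):
    """Indices of the k items most similar to `item_id`, excluding itself."""
    others = [j for j in range(len(sim_matrix)) if j != item_id]
    return sorted(others, key=lambda j: sim_matrix[item_id][j], reverse=True)[:k]


# Toy vectors standing in for question embeddings (made-up values)
toy_embeds = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
sims = pairwise_similarities(toy_embeds)
print(most_similar(sims, item_id=0, k=2))
```

With the libraries imported at the top of the notebook, `cosine_similarity(embeds)` from scikit-learn computes the same kind of matrix for the real embeddings. The matrix grows quadratically with the number of questions, which is one reason to prefer an index for larger archives.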

PYTHON

# Choose an example (we'll retrieve others similar to it)
example_id = 92
# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(
    example_id, 10, include_distances=True
)
# Format and print the text and distances
results = pd.DataFrame(
    data={
        "texts": df.iloc[similar_item_ids[0]]["text"],
        "distance": similar_item_ids[1],
    }
).drop(example_id)
print(f"Question:'{df.iloc[example_id]['text']}'\nNearest neighbors:")
results

Output:

Question:'What are bear and bull markets ?'
Nearest neighbors:

	texts	distance
614	What animals do you find in the stock market ?	0.896121
137	What are equity securities ?	0.970260
601	What is “ the bear of beers ” ?	0.978348
307	What does NASDAQ stand for ?	0.997819
683	What is the rarest coin ?	1.027727
112	What are the world ‘s four oceans ?	1.049661
864	When did the Dow first reach ?	1.050362
547	Where can stocks be traded on-line ?	1.053685
871	What are the Benelux countries ?	1.054899
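The distance column can be easier to read once converted to cosine similarity. Annoy's "angular" metric is the Euclidean distance between normalized vectors, which equals sqrt(2 * (1 - cos)), so the conversion is a one-liner. The sketch below assumes that definition; the two helper names are introduced here for illustration.

```python
import math


def angular_to_cosine(distance):
    """Cosine similarity implied by an Annoy "angular" distance."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular(similarity):
    """Inverse mapping: sqrt(2 * (1 - cos))."""
    return math.sqrt(2.0 * (1.0 - similarity))


# Distances copied from the first and last rows of the results table above
for d in (0.896121, 1.054899):
    print(f"distance {d:.6f} -> cosine similarity {angular_to_cosine(d):.3f}")
```

Under this mapping, a distance of 0 corresponds to identical directions (cosine similarity 1), and a distance of sqrt(2), roughly 1.414, corresponds to orthogonal vectors (cosine similarity 0).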

4b. Find the Neighbours of a User Query

We’re not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbours from the dataset.

PYTHON

query = "What is the tallest mountain in the world?"

# Get the query's embedding
query_embed = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings

# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(
    query_embed[0], 10, include_distances=True
)
# Format the results
results = pd.DataFrame(
    data={
        "texts": df.iloc[similar_item_ids[0]]["text"],
        "distance": similar_item_ids[1],
    }
)

print(f"Query:'{query}'\nNearest neighbors:")
results

	texts	distance
236	What is the name of the tallest mountain in the world ?	0.431913
670	What is the highest mountain in the world ?	0.436290
907	What mountain range is traversed by the highest railroad in the world ?	0.715265
435	What is the highest peak in Africa ?	0.717943
354	What ocean is the largest in the world ?	0.762917
412	What was the highest mountain on earth before Mount Everest was discovered ?	0.767649
109	Where is the highest point in Japan ?	0.784319
114	What is the largest snake in the world ?	0.789743
656	What ‘s the tallest building in New York City ?	0.793982
901	What ‘s the longest river in the world ?	0.794352

5. Visualize the Archive

PYTHON

# @title Plot the archive {display-mode: "form"}

# UMAP reduces the dimensions from 1024 to 2 dimensions that we can plot
reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeds)
# Prepare the data to plot an interactive visualization
# using Altair
df_explore = pd.DataFrame(data={"text": df["text"]})
df_explore["x"] = umap_embeds[:, 0]
df_explore["y"] = umap_embeds[:, 1]

# Plot
chart = (
    alt.Chart(df_explore)
    .mark_circle(size=60)
    .encode(
        x=alt.X("x", scale=alt.Scale(zero=False)),
        y=alt.Y("y", scale=alt.Scale(zero=False)),
        tooltip=["text"],
    )
    .properties(width=700, height=400)
)
chart.interactive()

Conclusion

This concludes this introductory guide to semantic search using sentence embeddings. As you continue down the path of building a search product, additional considerations arise, such as dealing with long texts or training to improve the embeddings for a specific use case.

Original Source

This material comes from the post Semantic Search.
