Wikipedia Semantic Search with Cohere Embedding Archives
This notebook contains starter code for simple semantic search over the Wikipedia embeddings archives published by Cohere. These archives contain embeddings of Wikipedia in multiple languages. In this example, we’ll use Simple English Wikipedia.
Let’s now download 1,000 records from the Simple English Wikipedia embeddings archive so we can search them afterwards.
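The download step can be sketched as follows. This is a minimal sketch, not necessarily the notebook's original cell. It assumes the archive is available through the Hugging Face `datasets` package under the name "Cohere/wikipedia-22-12-simple-embeddings" and that each record stores its vector under an "emb" key. The helper names collect_records and open_archive_stream are hypothetical.

```python
from itertools import islice


def collect_records(stream, max_docs=1000):
    """Take the first max_docs records from an iterable of archive rows."""
    docs = []
    doc_embeddings = []
    for record in islice(stream, max_docs):
        docs.append(record)
        # Assumption: each record stores its embedding under the "emb" key.
        doc_embeddings.append(record["emb"])
    return docs, doc_embeddings


def open_archive_stream():
    """Open the archive lazily so only the records we iterate over are fetched."""
    # Assumption: the Hugging Face `datasets` package is installed and the
    # archive is published under this dataset name.
    from datasets import load_dataset

    return load_dataset(
        "Cohere/wikipedia-22-12-simple-embeddings",
        split="train",
        streaming=True,
    )


# Usage (needs network access, so it is left commented out here):
# docs, doc_embeddings = collect_records(open_archive_stream(), max_docs=1000)
```

Streaming avoids downloading the whole archive when only the first 1,000 records are needed.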
Now, doc_embeddings holds the embeddings of the first 1,000 documents in the dataset. Each document is represented as an embedding vector of 768 values.
We can now search these vectors for any query we want. For this toy example, we’ll ask a question about Wikipedia, since we know the Wikipedia page is included in the first 1,000 documents we used here.
To search, we embed the query, then get the nearest neighbors to its embedding (using dot product).
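The nearest-neighbor step can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions: the query embedding is taken as given (it would come from the same Cohere model that produced the archive, for example through the Cohere client's embed call), and the function name top_k_by_dot_product is hypothetical. A tensor library would do the same computation faster on 1,000 vectors of 768 values. The toy vectors are made up for the demonstration.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(query_embedding, doc_embeddings, k=3):
    """Return the k (index, score) pairs with the highest dot-product score."""
    scores = [dot(query_embedding, emb) for emb in doc_embeddings]
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Toy data standing in for real embeddings (assumption: illustration only).
toy_doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_query_embedding = [1.0, 0.0]

for index, score in top_k_by_dot_product(toy_query_embedding, toy_doc_embeddings, k=2):
    print(index, score)
```

With the real data, each returned index points back into docs, so the title and text of each hit can be printed alongside its score.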
This shows the top three passages that are relevant to the query. We can retrieve more results by changing the k value. The question in this simple demo is about Wikipedia because we know that the Wikipedia page is part of the documents in this subset of the archive.