
Wikipedia Semantic Search with Cohere Embedding Archives


This notebook contains starter code for simple semantic search over the Wikipedia embedding archives published by Cohere. These archives contain embeddings of Wikipedia in multiple languages. In this example, we’ll use the Simple English Wikipedia archive.

Let’s stream the first 1,000 records from the Simple English Wikipedia embeddings archive so we can search them afterwards.

PYTHON
from datasets import load_dataset
import torch
import cohere

# Add your Cohere API key here
co = cohere.Client("")

# Load at most 1,000 documents and their embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

# Streaming yields one record at a time, so we can stop early
for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

# Stack the embeddings into a single (num_docs, dim) tensor
doc_embeddings = torch.tensor(doc_embeddings)
Output
Downloading: 0%| | 0.00/1.29k [00:00<?, ?B/s]
Using custom data configuration Cohere--wikipedia-22-12-simple-embeddings-94deea3d55a22093
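
The early-exit loop above can also be written with `itertools.islice`, which stops pulling from a stream after a fixed number of items. The sketch below is only an illustration: `fake_record_stream` is a hypothetical stand-in generator with made-up two-value embeddings, not the real dataset, so it runs without any download.

```python
from itertools import islice


def fake_record_stream():
    """Hypothetical stand-in for a streaming dataset: yields records one at a time, forever."""
    i = 0
    while True:
        yield {"title": f"Doc {i}", "emb": [float(i), float(i) + 0.5]}
        i += 1


sample_max_docs = 5  # small cap for illustration

# islice stops consuming the stream after sample_max_docs records,
# which has the same effect as the explicit `break` in the loop above.
sample_docs = list(islice(fake_record_stream(), sample_max_docs))
sample_embeddings = [record["emb"] for record in sample_docs]

print(len(sample_docs), len(sample_embeddings[0]))
```

Because only the requested records are ever produced, this pattern works on streams that are too large (or, as here, unbounded) to materialize in full.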

Now, doc_embeddings holds the embeddings of the first 1,000 documents in the dataset. Each document is represented as an embedding vector of 768 values.

PYTHON
doc_embeddings.shape
Output
torch.Size([1000, 768])

We can now search these vectors for any query we want. For this toy example, we’ll ask a question about Wikipedia, since we know the Wikipedia article is among the first 1,000 documents loaded here.

To search, we embed the query, then get the nearest neighbors to its embedding (using dot product).

PYTHON
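query = 'Who founded Wikipedia'

# The query must be embedded with the same model as the documents, otherwise
# the vectors are not comparable. embed-v4.0 does not produce 768-dimensional
# vectors, so it cannot be scored against this archive. Assumption: the
# archive's dataset card names multilingual-22-12 (768 dimensions) as the
# model; check that it is available to your account.
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = torch.tensor(response.embeddings)  # shape: (1, 768)

# Score every document against the query, then keep the 3 best
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")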
Output
Query: Who founded Wikipedia
Wikipedia
Larry Sanger and Jimmy Wales are the ones who started Wikipedia. Wales is credited with defining the goals of the project. Sanger created the strategy of using a wiki to reach Wales' goal. On January 10, 2001, Larry Sanger proposed on the Nupedia mailing list to create a wiki as a "feeder" project for Nupedia. Wikipedia was launched on January 15, 2001. It was launched as an English-language edition at www.wikipedia.com, and announced by Sanger on the Nupedia mailing list. Wikipedia's policy of "neutral point-of-view" was enforced in its initial months, and was similar to Nupedia's earlier "nonbiased" policy. Otherwise, there weren't very many rules initially, and Wikipedia operated independently of Nupedia.
Wikipedia
Wikipedia began as a related project for Nupedia. Nupedia was a free English-language online encyclopedia project. Nupedia's articles were written and owned by Bomis, Inc which was a web portal company. The important people of the company were Jimmy Wales, the person in charge of Bomis, and Larry Sanger, the editor-in-chief of Nupedia. Nupedia was first licensed under the Nupedia Open Content License which was changed to the GNU Free Documentation License before Wikipedia was founded and made their first article when Richard Stallman requested them.
Wikipedia
Wikipedia was started on January 10, 2001, by Jimmy Wales and Larry Sanger as part of an earlier online encyclopedia named Nupedia. On January 15, 2001, Wikipedia became a separate website of its own. It is a wiki that uses the software MediaWiki (like all other Wikimedia Foundation projects).

The output shows the three passages most relevant to the query. We can retrieve more results by increasing the k value passed to torch.topk. The question in this demo is about Wikipedia because we know the Wikipedia article is in this subset of the archive.
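
The search step (score every document by its dot product with the query, then keep the k highest) can be sketched in plain Python without the API or the dataset. Everything below is illustrative: the three-value vectors and titles are made-up toy data, and `dot` and `top_k_by_dot_product` are hypothetical helpers, not part of torch or the Cohere SDK.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        # Query and document vectors must come from the same embedding space
        raise ValueError(f"dimension mismatch: {len(u)} vs {len(v)}")
    return sum(a * b for a, b in zip(u, v))


def top_k_by_dot_product(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k highest dot-product scores."""
    scores = [(i, dot(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up toy data standing in for real embeddings
toy_titles = ["Wikipedia", "Nupedia", "Bananas", "MediaWiki"]
toy_doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.2],
]
toy_query_vec = [1.0, 0.2, 0.0]

for idx, score in top_k_by_dot_product(toy_query_vec, toy_doc_vecs, k=2):
    print(toy_titles[idx], round(score, 2))
```

Raising `k` returns more of the ranked list, up to the number of documents. The dimension check fails early when the query and the documents were embedded with different vector sizes.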