Multilingual Search with Cohere and LangChain

Read the accompanying blog post here.

This notebook contains two examples of performing multilingual search using Cohere and LangChain. LangChain is a library that assists in the development of applications built on top of large language models (LLMs), such as Cohere's models.

In short, Cohere makes it easy for developers to leverage LLMs, and LangChain makes it easy to build applications with these models.

We’ll go through the following examples:

  • Example 1 - Basic Multilingual Search

    This is a simple example of multilingual search over a list of documents.

    The steps in summary:

    • Import a list of documents
    • Embed the documents and store them in an index
    • Enter a query
    • Return the document most similar to the query
  • Example 2 - Search-Based Question Answering

    This is a more involved example in which search is combined with text generation to answer questions about long-form documents.

    The steps in summary:

    • Add an article and chunk it into smaller passages
    • Embed the passages and store them in an index
    • Enter a question
    • Answer the question based on the most relevant documents
PYTHON
from langchain.embeddings.cohere import CohereEmbeddings
from langchain.llms import Cohere
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA
from langchain.vectorstores import Qdrant
from langchain.document_loaders import TextLoader
import textwrap as tr
import random
import dotenv
import os

dotenv.load_dotenv(".env")  # Load an '.env' file containing an environment variable named 'COHERE_API_KEY' set to your Cohere API key
Output
True
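If you prefer not to use an '.env' file, you can set the key directly in the environment before constructing any Cohere objects. A minimal alternative, with a placeholder you would replace with your own key:

PYTHON
import os

# Alternative to dotenv: set the API key directly (placeholder value shown)
os.environ["COHERE_API_KEY"] = "<YOUR_COHERE_API_KEY>"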
Example 1 - Basic Multilingual Search

Import a list of documents

PYTHON
import tensorflow_datasets as tfds

dataset = tfds.load('trec', split='train')
texts = [item['text'].decode('utf-8') for item in tfds.as_numpy(dataset)]
print(f"Number of documents: {len(texts)}")
Output
Downloading and preparing dataset 350.79 KiB (download: 350.79 KiB, generated: 636.90 KiB, total: 987.69 KiB) to /root/tensorflow_datasets/trec/1.0.0...
Dataset trec downloaded and prepared to /root/tensorflow_datasets/trec/1.0.0. Subsequent calls will reuse this data.
Number of documents: 5452
PYTHON
random.seed(11)
for item in random.sample(texts, 5):
    print(item)
Output
What is the starting salary for beginning lawyers ?
Where did Bill Gates go to college ?
What task does the Bouvier breed of dog perform ?
What are the top boy names in the U.S. ?
What is a female rabbit called ?

Embed the documents and store them in an index

PYTHON
embeddings = CohereEmbeddings(model="multilingual-22-12")

db = Qdrant.from_texts(texts, embeddings, location=":memory:", collection_name="my_documents", distance_func="Dot")
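As a quick sanity check, you can embed a single text and inspect the resulting vector, using the embed_query method from LangChain's embeddings interface:

PYTHON
# Embed one document and inspect the vector's dimensionality
sample_vector = embeddings.embed_query(texts[0])
print(f"Vector dimensions: {len(sample_vector)}")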

Enter a query

PYTHON
queries = ["How to get in touch with Bill Gates",
           "Comment entrer en contact avec Bill Gates",
           "Cara menghubungi Bill Gates"]

queries_lang = ["English", "French", "Indonesian"]

Return the document most similar to the query

PYTHON
answers = []
for query in queries:
    docs = db.similarity_search(query)
    answers.append(docs[0].page_content)
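If you also want to see how strong each match is, the Qdrant vector store exposes similarity_search_with_score, which returns (document, score) pairs. A minimal sketch:

PYTHON
# Retrieve the top match for each query along with its similarity score
for query in queries:
    doc, score = db.similarity_search_with_score(query, k=1)[0]
    print(f"{score:.4f}  {doc.page_content}")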
PYTHON
for idx, query in enumerate(queries):
    print(f"Query language: {queries_lang[idx]}")
    print(f"Query: {query}")
    print(f"Most similar existing question: {answers[idx]}")
    print("-" * 20, "\n")
Output
Query language: English
Query: How to get in touch with Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
--------------------
Query language: French
Query: Comment entrer en contact avec Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
--------------------
Query language: Indonesian
Query: Cara menghubungi Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
--------------------
Example 2 - Search-Based Question Answering

Add an article and chunk it into smaller passages

PYTHON
!wget 'https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F' -O steve-jobs-commencement.txt
Output
--2023-06-08 06:11:19-- https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F
Resolving docs.google.com (docs.google.com)... 74.125.200.101, 74.125.200.138, 74.125.200.102, ...
Connecting to docs.google.com (docs.google.com)|74.125.200.101|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0g-84-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/84t4moii9dmg08hmrh6rfpp8ecrjh6jq/1686204675000/12721472133292131824/*/1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F?e=download&uuid=a26288c7-ad0c-4707-ae0b-72cb94c224dc [following]
Warning: wildcards not supported in HTTP.
--2023-06-08 06:11:19-- https://doc-0g-84-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/84t4moii9dmg08hmrh6rfpp8ecrjh6jq/1686204675000/12721472133292131824/*/1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F?e=download&uuid=a26288c7-ad0c-4707-ae0b-72cb94c224dc
Resolving doc-0g-84-docs.googleusercontent.com (doc-0g-84-docs.googleusercontent.com)... 74.125.130.132, 2404:6800:4003:c01::84
Connecting to doc-0g-84-docs.googleusercontent.com (doc-0g-84-docs.googleusercontent.com)|74.125.130.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11993 (12K) [text/plain]
Saving to: ‘steve-jobs-commencement.txt’
steve-jobs-commence 100%[===================>] 11.71K --.-KB/s in 0s
2023-06-08 06:11:20 (115 MB/s) - ‘steve-jobs-commencement.txt’ saved [11993/11993]
PYTHON
loader = TextLoader("steve-jobs-commencement.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
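To verify the split, you can count the passages and preview one; split_documents returns Document objects whose text is stored in page_content:

PYTHON
# Inspect the chunking result
print(f"Number of passages: {len(texts)}")
print(texts[0].page_content[:200])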

Embed the passages and store them in an index

PYTHON
embeddings = CohereEmbeddings(model="multilingual-22-12")
db = Qdrant.from_documents(texts, embeddings, location=":memory:", collection_name="my_documents", distance_func="Dot")

Enter a question

PYTHON
questions = [
    "What did the author liken The Whole Earth Catalog to?",
    "What was Reed College great at?",
    "What was the author diagnosed with?",
    "What is the key lesson from this article?",
    "What did the article say about Michael Jackson?",
]

Answer the question based on the most relevant documents

PYTHON
prompt_template = """Text: {context}

Question: {question}

Answer the question based on the text provided. If the text doesn't contain the answer, reply that the answer is not available."""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
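To see exactly what the LLM will receive, you can render the template with dummy values for context and question:

PYTHON
# Preview the rendered prompt with placeholder values
print(PROMPT.format(context="<retrieved passages>", question="<user question>"))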
PYTHON
chain_type_kwargs = {"prompt": PROMPT}

qa = RetrievalQA.from_chain_type(llm=Cohere(model="command", temperature=0),
                                 chain_type="stuff",
                                 retriever=db.as_retriever(),
                                 chain_type_kwargs=chain_type_kwargs,
                                 return_source_documents=True)

for question in questions:
    answer = qa({"query": question})
    result = answer["result"].replace("\n", "").replace("Answer:", "")
    sources = answer['source_documents']
    print("-" * 150, "\n")
    print(f"Question: {question}")
    print(f"Answer: {result}")

    ### COMMENT OUT THE 4 LINES BELOW TO HIDE THE SOURCES
    print("\nSources:")
    for idx, source in enumerate(sources):
        source_wrapped = tr.fill(str(source.page_content), width=150)
        print(f"{idx+1}: {source_wrapped}")
Output
------------------------------------------------------------------------------------------------------------------------------------------------------
Question: What did the author liken The Whole Earth Catalog to?
Answer: It was sort of like Google in paperback form, 35 years before Google came along
Sources:
1: When I was young, there was an amazing publication called The Whole Earth Catalog, which was one of the bibles of my generation. It was created by a
fellow named Stewart Brand not far from here in Menlo Park, and he brought it to life with his poetic touch. This was in the late 1960s, before
personal computers and desktop publishing, so it was all made with typewriters, scissors and Polaroid cameras. It was sort of like Google in paperback
form, 35 years before Google came along: It was
2: Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the
mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find
yourself hitchhiking on if you were so adventurous. Beneath it were the words: “Stay Hungry. Stay Foolish.” It was their farewell message as they
signed off. Stay Hungry. Stay Foolish. And I have always
3: idealistic, and overflowing with neat tools and great notions.
4: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.
------------------------------------------------------------------------------------------------------------------------------------------------------
Question: What was Reed College great at?
Answer: Reed College was great at calligraphy instruction.
Sources:
1: Reed College at that time offered perhaps the best calligraphy instruction in the country. Throughout the campus every poster, every label on every
drawer, was beautifully hand calligraphed. Because I had dropped out and didn’t have to take the normal classes, I decided to take a calligraphy class
to learn how to do this. I learned about serif and sans serif typefaces, about varying the amount of space between different letter combinations,
about what makes great typography great. It was
2: I dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why
did I drop out?
3: never dropped out, I would have never dropped in on this calligraphy class, and personal computers might not have the wonderful typography that they
do. Of course it was impossible to connect the dots looking forward when I was in college. But it was very, very clear looking backward 10 years
later.
4: OK. It was pretty scary at the time, but looking back it was one of the best decisions I ever made. The minute I dropped out I could stop taking the
required classes that didn’t interest me, and begin dropping in on the ones that looked interesting.
------------------------------------------------------------------------------------------------------------------------------------------------------
Question: What was the author diagnosed with?
Answer: The author was diagnosed with cancer.
Sources:
1: I lived with that diagnosis all day. Later that evening I had a biopsy, where they stuck an endoscope down my throat, through my stomach and into my
intestines, put a needle into my pancreas and got a few cells from the tumor. I was sedated, but my wife, who was there, told me that when they viewed
the cells under a microscope the doctors started crying because it turned out to be a very rare form of pancreatic cancer that is curable with
surgery. I had the surgery and I’m fine now.
2: About a year ago I was diagnosed with cancer. I had a scan at 7:30 in the morning, and it clearly showed a tumor on my pancreas. I didn’t even know
what a pancreas was. The doctors told me this was almost certainly a type of cancer that is incurable, and that I should expect to live no longer than
three to six months. My doctor advised me to go home and get my affairs in order, which is doctor’s code for prepare to die. It means to try to tell
your kids everything you thought you’d have the
3: Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the
mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find
yourself hitchhiking on if you were so adventurous. Beneath it were the words: “Stay Hungry. Stay Foolish.” It was their farewell message as they
signed off. Stay Hungry. Stay Foolish. And I have always
4: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.
------------------------------------------------------------------------------------------------------------------------------------------------------
Question: What is the key lesson from this article?
Answer: The key lesson from this article is that you have to trust that the dots will somehow connect in your future. You have to trust in something -- your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life.
Sources:
1: Again, you can’t connect the dots looking forward; you can only connect them looking backward. So you have to trust that the dots will somehow connect
in your future. You have to trust in something — your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all
the difference in my life. My second story is about love and loss.
2: Remembering that I’ll be dead soon is the most important tool I’ve ever encountered to help me make the big choices in life. Because almost everything
— all external expectations, all pride, all fear of embarrassment or failure — these things just fall away in the face of death, leaving only what is
truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are
already naked. There is no reason not to follow your
3: Your time is limited, so don’t waste it living someone else’s life. Don’t be trapped by dogma — which is living with the results of other people’s
thinking. Don’t let the noise of others’ opinions drown out your own inner voice. And most important, have the courage to follow your heart and
intuition. They somehow already know what you truly want to become. Everything else is secondary.
4: I really didn’t know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton
as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and
I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple
had not changed that one bit. I had been rejected,
------------------------------------------------------------------------------------------------------------------------------------------------------
Question: What did the article say about Michael Jackson?
Answer: The text did not provide information about Michael Jackson.
Sources:
1: baby boy; do you want him?” They said: “Of course.” My biological mother later found out that my mother had never graduated from college and that my
father had never graduated from high school. She refused to sign the final adoption papers. She only relented a few months later when my parents
promised that I would someday go to college.
2: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.
3: I really didn’t know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton
as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and
I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple
had not changed that one bit. I had been rejected,
4: This was the closest I’ve been to facing death, and I hope it’s the closest I get for a few more decades. Having lived through it, I can now say this
to you with a bit more certainty than when death was a useful but purely intellectual concept:
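By default the retriever returns four passages per question, which is why each answer above lists four sources. If you want tighter context, as_retriever accepts search_kwargs. A sketch reusing the chain setup from above (qa_top2 is just an illustrative name):

PYTHON
# Build a variant of the chain that retrieves only the top 2 passages per query
retriever_top2 = db.as_retriever(search_kwargs={"k": 2})
qa_top2 = RetrievalQA.from_chain_type(llm=Cohere(model="command", temperature=0),
                                      chain_type="stuff",
                                      retriever=retriever_top2,
                                      chain_type_kwargs=chain_type_kwargs,
                                      return_source_documents=True)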

Questions in French

PYTHON
questions_fr = [
    "À quoi se compare The Whole Earth Catalog ?",
    "Dans quoi Reed College était-il excellent ?",
    "De quoi l'auteur a-t-il été diagnostiqué ?",
    "Quelle est la leçon clé de cet article ?",
    "Que disait l'article sur Michael Jackson ?",
]
PYTHON
chain_type_kwargs = {"prompt": PROMPT}

qa = RetrievalQA.from_chain_type(llm=Cohere(model="command", temperature=0),
                                 chain_type="stuff",
                                 retriever=db.as_retriever(),
                                 chain_type_kwargs=chain_type_kwargs,
                                 return_source_documents=True)

for question in questions_fr:
    answer = qa({"query": question})
    result = answer["result"].replace("\n", "").replace("Answer:", "")
    sources = answer['source_documents']
    print("-" * 20, "\n")
    print(f"Question: {question}")
    print(f"Answer: {result}")
Output
--------------------
Question: À quoi se compare The Whole Earth Catalog ?
Answer: The Whole Earth Catalog was like Google in paperback form, 35 years before Google came along.
--------------------
Question: Dans quoi Reed College était-il excellent ?
Answer: Reed College offered the best calligraphy instruction in the country.
--------------------
Question: De quoi l'auteur a-t-il été diagnostiqué ?
Answer: The author was diagnosed with a very rare form of pancreatic cancer that is curable with surgery.
--------------------
Question: Quelle est la leçon clé de cet article ?
Answer: The key lesson of this article is that remembering that you will die soon is the most important tool to help one make the big choices in life.
--------------------
Question: Que disait l'article sur Michael Jackson ?
Answer: The text does not contain the answer to the question.
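
Note that the answers come back in English even though the questions are in French. If you want the model to reply in the language of the question, one option is to say so in the prompt. A hypothetical variation (PROMPT_ML is an illustrative name, and the wording is untested):

PYTHON
# A hypothetical prompt variation that asks for answers
# in the same language as the question
prompt_template_ml = """Text: {context}

Question: {question}

Answer the question based on the text provided, in the same language as the question. If the text doesn't contain the answer, reply that the answer is not available."""

PROMPT_ML = PromptTemplate(
    template=prompt_template_ml, input_variables=["context", "question"]
)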