Multilingual Embed Models

At Cohere, we are committed to breaking down barriers and expanding access to cutting-edge NLP technologies that power projects across the globe. By making our innovative multilingual language models available to all developers, we continue to move toward our goal of empowering developers, researchers, and innovators with state-of-the-art NLP technologies that push the boundaries of Language AI.

Our multilingual model maps text to a semantic vector space, positioning texts with similar meanings close together regardless of the language they are written in. This unlocks a range of valuable use cases in multilingual settings. For example, during a search one can map a query into this vector space and locate the relevant documents nearby. This often returns far more relevant results than keyword search alone.
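
To make the idea of "nearby" concrete, here is a minimal sketch of nearest-neighbor search using cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings (which have many more dimensions), and the helper names `cosine_similarity` and `search` are hypothetical; they are not part of the Cohere SDK.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for embeddings of three documents (illustrative only).
doc_vecs = [
    [0.9, 0.1, 0.0],  # e.g. an English article about cooking
    [0.0, 0.2, 0.9],  # e.g. a German article about football
    [0.8, 0.3, 0.1],  # e.g. a Spanish article about cooking
]
query_vec = [1.0, 0.2, 0.0]  # e.g. a query about recipes

results = search(query_vec, doc_vecs)
print(results)  # the two cooking documents (indices 0 and 2) rank first
```

Because the comparison happens in vector space, the language of each document plays no direct role in the ranking; only the position of its embedding does.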

Use Cases

Multilingual Semantic Search: Improve your search results regardless of the language.
Aggregate Customer Feedback: Organize customer feedback across hundreds of languages, simplifying a major challenge for international operations.
Cross-Lingual Zero-Shot Content Moderation: Identifying harmful content in online communities is challenging, especially as users speak hundreds of languages. Train a model with a few English examples, then detect harmful content in 100+ languages.
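
The cross-lingual moderation idea can be sketched as a nearest-centroid classifier: average the embeddings of a few labeled English examples per class, then assign any new text (in any language) to the class whose centroid is closest. This is one simple approach under stated assumptions, not necessarily how a production system would be built. The two-dimensional vectors below are made-up placeholders for real embeddings, and `centroid` and `classify` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def classify(vec: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine_similarity(vec, centroids[label]))


# Placeholder embeddings of a few labeled English training examples.
train = {
    "harmful": [[0.9, 0.1], [0.8, 0.3]],
    "benign": [[0.1, 0.9], [0.2, 0.7]],
}
centroids = {label: centroid(vecs) for label, vecs in train.items()}

# Placeholder embedding of a new comment written in another language.
new_comment_vec = [0.7, 0.2]
print(classify(new_comment_vec, centroids))
```

In practice the vectors would come from the embed endpoint rather than being written by hand, and a trained classifier could replace the centroid comparison.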

Get Started

To get started using the multilingual embed models, you can either query our endpoints or install our SDK to use the model within Python:

import cohere

co = cohere.Client(api_key="<YOUR API KEY>")

# The same greeting in nine languages
texts = [
    'Hello from Cohere!', 'مرحبًا من كوهير!', 'Hallo von Cohere!',
    'Bonjour de Cohere!', '¡Hola desde Cohere!', 'Olá do Cohere!',
    'Ciao da Cohere!', '您好，来自 Cohere！', 'कोहेरे से नमस्ते!'
]

response = co.embed(
    texts=texts,
    input_type='classification',
    embedding_types=['float'],
    model='embed-multilingual-v3.0',
)
embeddings = response.embeddings.float  # All text embeddings
print(embeddings[0][:5])  # Print the first five values of the first text's embedding
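
For the semantic search use case, the v3 embed models distinguish between documents and queries through the `input_type` parameter (`search_document` and `search_query`). The sketch below wraps the same `co.embed` call used above so that documents and a query are each embedded with the matching type and then ranked by dot product. `embed_texts` and `rank` are hypothetical helper names, and the client is passed in as an argument so the ranking logic can be tried with any object that exposes a compatible `embed` method.

```python
from typing import List, Sequence, Tuple


def embed_texts(co, texts: Sequence[str], input_type: str) -> List[List[float]]:
    """Embed texts with the given input_type, mirroring the call shown above."""
    response = co.embed(
        texts=list(texts),
        input_type=input_type,
        embedding_types=["float"],
        model="embed-multilingual-v3.0",
    )
    return response.embeddings.float


def rank(co, query: str, documents: Sequence[str]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs sorted from most to least relevant."""
    doc_vecs = embed_texts(co, documents, "search_document")
    query_vec = embed_texts(co, [query], "search_query")[0]
    # Dot product as the relevance score; cosine similarity is a common alternative.
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
```

With a real client, a call such as `rank(co, "Where is the train station?", documents)` would return the documents ordered by score, whatever language each one is written in.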

Model Performance

Model	Clustering	Search (English)	Search (Multilingual)	Cross-lingual Classification
Cohere: `embed-multilingual-v3.0`		55.07	66.8
Cohere: `embed-multilingual-light-v3.0`		5.0.75	65.8
Cohere: `embed-multilingual-v2.0`	51.0	55.8	51.4	64.6
Sentence-transformers: `paraphrase-multilingual-mpnet-base-v2`	46.7	44.4	15.3	56.1
Google: `LaBSE`	41.0	20.9	13.2	59.2
Google: `Universal Sentence Encoder`	40.1	14.3	3.4	59.8

List of Supported Languages

Our multilingual embed model supports over 100 languages, including Chinese, Spanish, and French.

ISO Code	Language Name
af	Afrikaans
am	Amharic
ar	Arabic
as	Assamese
az	Azerbaijani
be	Belarusian
bg	Bulgarian
bn	Bengali
bo	Tibetan
bs	Bosnian
ca	Catalan
ceb	Cebuano
co	Corsican
cs	Czech
cy	Welsh
da	Danish
de	German
el	Greek
en	English
eo	Esperanto
es	Spanish
et	Estonian
eu	Basque
fa	Persian
fi	Finnish
fr	French
fy	Frisian
ga	Irish
gd	Scots Gaelic
gl	Galician
gu	Gujarati
ha	Hausa
haw	Hawaiian
he	Hebrew
hi	Hindi
hmn	Hmong
hr	Croatian
ht	Haitian Creole
hu	Hungarian
hy	Armenian
id	Indonesian
ig	Igbo
is	Icelandic
it	Italian
ja	Japanese
jv	Javanese
ka	Georgian
kk	Kazakh
km	Khmer
kn	Kannada
ko	Korean
ku	Kurdish
ky	Kyrgyz
la	Latin
lb	Luxembourgish
lo	Laothian
lt	Lithuanian
lv	Latvian
mg	Malagasy
mi	Maori
mk	Macedonian
ml	Malayalam
mn	Mongolian
mr	Marathi
ms	Malay
mt	Maltese
my	Burmese
ne	Nepali
nl	Dutch
no	Norwegian
ny	Nyanja
or	Oriya
pa	Punjabi
pl	Polish
pt	Portuguese
ro	Romanian
ru	Russian
rw	Kinyarwanda
si	Sinhalese
sk	Slovak
sl	Slovenian
sm	Samoan
sn	Shona
so	Somali
sq	Albanian
sr	Serbian
st	Sesotho
su	Sundanese
sv	Swedish
sw	Swahili
ta	Tamil
te	Telugu
tg	Tajik
th	Thai
tk	Turkmen
tl	Tagalog
tr	Turkish
tt	Tatar
ug	Uighur
uk	Ukrainian
ur	Urdu
uz	Uzbek
vi	Vietnamese
wo	Wolof
xh	Xhosa
yi	Yiddish
yo	Yoruba
zh	Chinese
zu	Zulu