Cohere Transcribe

About the Cohere Transcribe model

Cohere Transcribe is an open-source research release of a dedicated, 2B-parameter automatic speech recognition (ASR) model that takes audio in and produces text out. The model supports 14 languages.

Model details

  • Input: Audio waveform
  • Output: Text
  • Model name: cohere-transcribe-03-2026
  • Languages covered: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Vietnamese, Chinese, Arabic, Japanese, Korean
  • Maximum file size: 25MB
  • License: Apache 2.0
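
The limits listed above can be checked on the client before a file is sent anywhere. The following is a minimal sketch, not part of any official SDK: the function name `validate_request` is hypothetical, the language identifiers are assumed to be ISO 639-1 codes (the API may expect a different form), and "25MB" is read as 25,000,000 bytes, the stricter of the two common interpretations.

```python
# Values taken from the model details above.
MODEL_NAME = "cohere-transcribe-03-2026"

# Assumption: "25MB" is interpreted as decimal megabytes (the stricter reading).
MAX_FILE_SIZE_BYTES = 25 * 1000 * 1000

# Assumption: ISO 639-1 codes for the 14 listed languages.
SUPPORTED_LANGUAGES = {
    "en": "English", "de": "German", "fr": "French", "it": "Italian",
    "es": "Spanish", "pt": "Portuguese", "el": "Greek", "nl": "Dutch",
    "pl": "Polish", "vi": "Vietnamese", "zh": "Chinese", "ar": "Arabic",
    "ja": "Japanese", "ko": "Korean",
}


def validate_request(file_size_bytes: int, language: str) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language code: {language!r}")
    if file_size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(
            f"file is {file_size_bytes} bytes; limit is {MAX_FILE_SIZE_BYTES}"
        )
    return problems


print(validate_request(3_000_000, "en"))  # → []
```

In practice the size would come from something like `os.path.getsize(path)`; it is passed in as a number here so the check stays independent of the filesystem.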

Availability

Cohere Transcribe is available through our API for free, low-setup experimentation, subject to rate limits.

For production deployment without rate limits, provision a dedicated Model Vault. This enables low-latency, private cloud inference without having to manage infrastructure. Pricing is calculated per hour-instance, with discounted plans for longer-term commitments. Contact our team to discuss your requirements.

Strengths

Cohere Transcribe demonstrates best-in-class transcription accuracy across its 14 supported languages. As a dedicated speech recognition model, it is also efficient: its real-time factor is up to three times better than that of other dedicated ASR models in the same size range. The model was trained from scratch, and from the outset we focused on minimizing word error rate (WER) while keeping production readiness in mind.
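
For readers unfamiliar with the two metrics named above, here is a minimal sketch of how they are commonly computed. It assumes the usual definitions: WER as word-level edit distance divided by the number of reference words, and real-time factor as processing time divided by audio duration (lower is faster). The function names are illustrative, whitespace tokenization is a simplification, and this is not necessarily the exact evaluation procedure behind the figures quoted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; values below 1 beat real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # → 0.333
print(real_time_factor(6.0, 60.0))  # → 0.1
```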

Limitations

  • Single language: The model performs best when the audio stays within a single, pre-specified language from the 14 it supports. It has no explicit automatic language detection.

  • Timestamps and speaker diarization: The model provides neither.

Model architecture

Cohere Transcribe is built on a speech-optimized Transformer variant: a Conformer. Input audio waveforms are converted into a Mel spectrogram and then processed by a Conformer encoder that holds the majority of the model’s parameters. The encoder’s representations are then passed to a lightweight Transformer decoder that generates text tokens. Cohere Transcribe is trained with a standard supervised cross-entropy loss.
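
To make the training objective concrete, the sketch below computes a token-level cross-entropy loss in plain Python. It is an illustration under stated assumptions, not the actual training code: it assumes the decoder emits one vector of logits per output position, that the loss is averaged over positions, and it uses a toy four-token vocabulary. The function name `token_cross_entropy` is hypothetical.

```python
import math


def token_cross_entropy(logits: list[list[float]], targets: list[int]) -> float:
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    if not targets or len(logits) != len(targets):
        raise ValueError("need one non-empty logits row per target token")

    total = 0.0
    for step_logits, target in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(step_logits)
        log_normalizer = peak + math.log(
            sum(math.exp(x - peak) for x in step_logits)
        )
        # -log softmax(step_logits)[target]
        total += log_normalizer - step_logits[target]
    return total / len(targets)


# With uniform logits over 4 tokens, every target has probability 1/4,
# so the loss equals ln(4).
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(round(token_cross_entropy(uniform, [0, 1, 2]), 3))  # → 1.386
```

Raising the logit of the correct token at each position lowers the loss, which is the signal that pushes the decoder toward the reference transcript during supervised training.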

Further Resources