
Cohere's Command A Vision Model

Command A Vision model details and specifications

Command A Vision is Cohere's first multimodal model, capable of understanding and interpreting visual data alongside text. With a 128K context length and support for up to 20 images per request, Command A Vision excels at enterprise use cases including document analysis, chart interpretation, optical character recognition (OCR), and processing images featuring multiple languages. The model maintains the same API interface as other Command models, making it easy to integrate vision capabilities into existing applications.

Model Details

| Model Name | Description | Modality | Context Length | Maximum Output Tokens | Endpoints |
| --- | --- | --- | --- | --- | --- |
| `command-a-vision-07-2025` | Command A Vision is our first model capable of processing images, excelling in enterprise use cases such as analyzing charts, graphs, and diagrams, table understanding, OCR, document Q&A, and scene analysis. It officially supports English, Portuguese, Italian, French, German, and Spanish. | Text, Images | 128K | 8K | Chat |
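Because the model shares the Chat API interface with other Command models, a request simply attaches images to a user message. The sketch below builds such a message with an inline base64 data-URL image; the `build_vision_message` helper and the exact content-part field names are assumptions here, so verify them against the current Cohere API reference before use.

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes,
                         mime_type: str = "image/png") -> dict:
    # Hypothetical helper: pair a text prompt with one inline image,
    # encoded as a base64 data URL, in a single user chat message.
    data_url = "data:{};base64,{}".format(
        mime_type, base64.b64encode(image_bytes).decode("ascii"))
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

# Sending the request (requires the `cohere` SDK and an API key):
#
#   import cohere
#   co = cohere.ClientV2(api_key="<YOUR_API_KEY>")
#   response = co.chat(
#       model="command-a-vision-07-2025",
#       messages=[build_vision_message("Summarize this chart.", png_bytes)],
#   )
```

Up to 20 images can be included per request by appending additional `image_url` content parts to the same message.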

What Can Command A Vision be Used For?

Command A Vision excels at enterprise use cases such as:

  • Analysis of charts, graphs, and diagrams;
  • Extracting and understanding in-image tables;
  • Document optical character recognition (OCR) and question answering;
  • Natural-language image processing.

Limitations

Tool use is not supported with this model.

Also, note that Command A Vision accepts images as input but does not generate them.

For more detailed breakdowns of these and other applications, check out our cookbooks. To learn how token counts work with images, the maximum number of images per request, and related details, see our dedicated Image Inputs document.