Cohere Text Generation Tutorial
Command is Cohere’s flagship LLM. It generates a response based on a user message or prompt. It is trained to follow user commands and to be instantly useful in practical business applications, like summarization, copywriting, extraction, and question-answering.
Command R and Command R+ are the most recent models in the Command family. They are the market-leading models that balance high efficiency with strong accuracy to enable enterprises to move from proof of concept into production-grade AI.
You’ll use Chat, the Cohere endpoint for accessing the Command models.
In this tutorial, you’ll learn about:
- Basic text generation
- Prompt engineering
- Parameters for controlling output
- Structured output generation
- Streamed output
You’ll learn these by building an onboarding assistant for new hires.
Setup
To get started, first we need to install the `cohere` library and create a Cohere client.
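For example (a minimal sketch, assuming the Python SDK's `ClientV2` and an API key stored in an environment variable of your choice):

```python
# Install the SDK first, for example: pip install -U cohere
import os

import cohere

# Assumes the API key is available as an environment variable (name is illustrative).
co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
```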
Basic text generation
To get started with Chat, we need to pass two parameters: `model` for the LLM model ID and `messages`, to which we add a single user message. We then call the Chat endpoint through the client we created earlier.
The response contains several objects. For simplicity, what we want right now is the `message.content[0].text` object.
Here’s an example of the assistant responding to a new hire’s query asking for help to make introductions.
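A minimal sketch of this call might look like the following; the model ID and prompt text are illustrative:

```python
# A single user message asking the assistant for help with introductions.
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[
        {
            "role": "user",
            "content": "I'm joining a new startup as a product designer next week. "
            "Write a short introduction message I can send to my new team.",
        }
    ],
)

# The generated text lives in the first content block of the message.
print(response.message.content[0].text)
```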
Further reading:
- Chat endpoint API reference
- Documentation on Chat fine-tuning
- Documentation on Command R+
- LLM University module on text generation
Prompt engineering
Prompting is at the heart of working with LLMs. The prompt provides context for the text that we want the model to generate. The prompts we create can be anything from simple instructions to more complex pieces of text, and they are used to encourage the model to produce a specific type of output.
In this section, we’ll look at a couple of prompting techniques.
The first is to add more specific instructions to the prompt. The more instructions you provide in the prompt, the closer you can get to the response you need.
The limit on how long a prompt can be depends on the maximum context length that a model can support (in the case of Command R/R+, it’s 128k tokens).
Below, we’ll add one additional instruction to the earlier prompt: the length we need the response to be.
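For instance (a sketch; the exact wording and the five-sentence limit are just examples):

```python
# Same request as before, with an explicit instruction about response length.
message = (
    "I'm joining a new startup as a product designer next week. "
    "Write a short introduction message I can send to my new team. "
    "Keep it to about five sentences."
)

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": message}],
)
print(response.message.content[0].text)
```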
All our prompts so far use what is called zero-shot prompting, which means that we provide instructions without any examples. But in many cases, it is extremely helpful to provide examples to the model to guide its response. This is called few-shot prompting.
Few-shot prompting is especially useful when we want the model response to follow a particular style or format. Also, it is sometimes hard to explain what you want in an instruction, and easier to show examples.
Below, we want the response to be similar in style and length to the convention shown in the examples.
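A sketch of a few-shot prompt might look like this; the example questions and titles are illustrative, not part of the original notebook:

```python
# Few-shot prompt: show the model the style and length we want via examples,
# then ask it to follow the same pattern for a new input.
message = """Write a concise title for each new-hire question, following the examples.

Question: Where can I find the company's policy on remote work?
Title: Remote work policy

Question: How do I set up my development environment on the first day?
Title: Dev environment setup

Question: Who do I contact if my laptop hasn't arrived before my start date?
Title:"""

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": message}],
)
print(response.message.content[0].text)
```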
Parameters for controlling output
The Chat endpoint provides developers with an array of options and parameters.
For example, you can choose from several variations of the Command model. Different models have different performance profiles, such as output quality and latency.
Often, you’ll need to control the level of randomness of the output. You can control this using a few parameters.
The most commonly used parameter is `temperature`, which is a number used to tune the degree of randomness. You can enter values between 0.0 and 1.0.
A lower temperature gives more predictable outputs, and a higher temperature gives more “creative” outputs.
Here’s an example of setting `temperature` to 0.
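For example (a sketch; the prompt is illustrative):

```python
# Temperature 0: the most predictable, least random output.
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": "Suggest one name for my new-hire onboarding buddy program."}],
    temperature=0,
)
print(response.message.content[0].text)
```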
And here’s an example of setting `temperature` to 1.
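With the same illustrative prompt, running the request a few times at `temperature=1` should produce more varied suggestions:

```python
# Temperature 1: more "creative" output; repeat the call to see the variation.
for _ in range(3):
    response = co.chat(
        model="command-r-plus-08-2024",
        messages=[{"role": "user", "content": "Suggest one name for my new-hire onboarding buddy program."}],
        temperature=1,
    )
    print(response.message.content[0].text)
```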
Further reading:
- Available models for the Chat endpoint
- Documentation on predictable outputs
- Documentation on advanced generation parameters
Structured output generation
By adding the `response_format` parameter, you can get the model to generate the output as a JSON object. By generating JSON objects, you can structure and organize the model’s responses in a way that can be used in downstream applications.
The `response_format` parameter allows you to specify the schema the JSON object must follow. The Chat call takes the following parameters:
- `messages`: The user message
- `response_format`: The schema of the JSON object
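A sketch of a JSON-mode request might look like the following. The checklist schema is an illustrative assumption, and the exact `response_format` fields can vary by SDK version:

```python
import json

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[
        {
            "role": "user",
            "content": "Generate a first-week onboarding checklist for a new hire.",
        }
    ],
    response_format={
        "type": "json_object",
        # Illustrative schema: a titled checklist with a list of steps.
        "schema": {
            "type": "object",
            "required": ["title", "steps"],
            "properties": {
                "title": {"type": "string"},
                "steps": {"type": "array", "items": {"type": "string"}},
            },
        },
    },
)

# The model returns a JSON string that follows the schema above.
checklist = json.loads(response.message.content[0].text)
print(checklist["title"], checklist["steps"])
```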
Streaming responses
All the previous examples generate responses in a non-streamed manner. This means that the endpoint returns a response object only after the model has generated the text in full.
The Chat endpoint also provides streaming support. In a streamed response, the endpoint would return a response object for each token as it is being generated. This means you can display the text incrementally without having to wait for the full completion.
To activate it, use `co.chat_stream()` instead of `co.chat()`.
In streaming mode, the endpoint generates a series of event objects. To get the actual text contents, we take the objects whose event type is `content-delta`.
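A streaming sketch, assuming the v2 Python SDK where each event carries a `type` field and text arrives in `content-delta` events (attribute paths may differ in other SDK versions):

```python
response = co.chat_stream(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": "Write a short welcome note for a new hire's first day."}],
)

for event in response:
    # Print each text chunk as soon as it is generated.
    if event.type == "content-delta":
        print(event.delta.message.content.text, end="")
```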
Conclusion
In this tutorial, you learned about:
- How to get started with a basic text generation
- How to improve outputs with prompt engineering
- How to control outputs using parameter changes
- How to generate structured outputs
- How to stream text generation outputs
So far, however, we have only covered direct, single-turn text generation. As its name implies, the Chat endpoint can also support building chatbots, which require features for handling multi-turn conversations and maintaining conversation state.
In the next tutorial, you’ll learn how to build chatbots with the Chat endpoint.