Cohere Text Generation Tutorial
Command is Cohere’s flagship LLM. It generates a response based on a user message or prompt. It is trained to follow user commands and to be instantly useful in practical business applications, like summarization, copywriting, extraction, and question-answering.
Command R and Command R+ are the most recent models in the Command family. They are the market-leading models that balance high efficiency with strong accuracy to enable enterprises to move from proof of concept into production-grade AI.
You’ll use Chat, the Cohere endpoint for accessing the Command models.
In this tutorial, you’ll learn about:
- Basic text generation
- Prompt engineering
- Parameters for controlling output
- Structured output generation
- Streamed output
You’ll learn these by building an onboarding assistant for new hires.
Setup
To get started, first we need to install the `cohere` library and create a Cohere client.
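For example (a minimal sketch, assuming the Python SDK's `ClientV2` and an API key stored in an environment variable of your choice):

```python
# Install the SDK first, for example: pip install -U cohere
import os

import cohere

# Assumes the API key is available as an environment variable (name is illustrative).
co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
```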
Basic text generation
To get started with Chat, we need to pass two parameters: `model` for the LLM model ID and `messages`, to which we add a single user message. We then call the Chat endpoint through the client we created earlier.
The response contains several objects. For simplicity, what we want right now is the `message.content[0].text` object.
Here’s an example of the assistant responding to a new hire’s query asking for help to make introductions.
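A minimal sketch of this call might look like the following; the model ID and prompt text are illustrative:

```python
# A single user message asking the assistant for help with introductions.
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[
        {
            "role": "user",
            "content": "I'm joining a new startup as a product designer next week. "
            "Write a short introduction message I can send to my new team.",
        }
    ],
)

# The generated text lives in the first content block of the message.
print(response.message.content[0].text)
```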
Further reading:
- Chat endpoint API reference
- Documentation on Chat fine-tuning
- Documentation on Command R+
- LLM University module on text generation
Prompt engineering
Prompting is at the heart of working with LLMs. The prompt provides context for the text that we want the model to generate. The prompts we create can be anything from simple instructions to more complex pieces of text, and they are used to encourage the model to produce a specific type of output.
In this section, we’ll look at a couple of prompting techniques.
The first is to add more specific instructions to the prompt. The more instructions you provide in the prompt, the closer you can get to the response you need.
The limit on how long a prompt can be depends on the maximum context length that a model can support (in the case of Command R/R+, it’s 128k tokens).
Below, we’ll add one additional instruction to the earlier prompt: the length we need the response to be.
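For instance (a sketch; the exact wording and the five-sentence limit are just examples):

```python
# Same request as before, with an explicit instruction about response length.
message = (
    "I'm joining a new startup as a product designer next week. "
    "Write a short introduction message I can send to my new team. "
    "Keep it to about five sentences."
)

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": message}],
)
print(response.message.content[0].text)
```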
All our prompts so far use what is called zero-shot prompting, which means that we provide instructions without any examples. But in many cases, it is extremely helpful to provide examples to the model to guide its response. This is called few-shot prompting.
Few-shot prompting is especially useful when we want the model response to follow a particular style or format. Also, it is sometimes hard to explain what you want in an instruction, and easier to show examples.
Below, we want the response to be similar in style and length to the convention shown in the examples.
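A sketch of a few-shot prompt might look like this; the example questions and titles are illustrative, not part of the original notebook:

```python
# Few-shot prompt: show the model the style and length we want via examples,
# then ask it to follow the same pattern for a new input.
message = """Write a concise title for each new-hire question, following the examples.

Question: Where can I find the company's policy on remote work?
Title: Remote work policy

Question: How do I set up my development environment on the first day?
Title: Dev environment setup

Question: Who do I contact if my laptop hasn't arrived before my start date?
Title:"""

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": message}],
)
print(response.message.content[0].text)
```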
Parameters for controlling output
The Chat endpoint provides developers with an array of options and parameters.
For example, you can choose from several variations of the Command model. Different models have different performance profiles, such as output quality and latency.
Often, you’ll need to control the level of randomness of the output. You can control this using a few parameters.
The most commonly used parameter is `temperature`, which is a number used to tune the degree of randomness. You can enter values between 0.0 and 1.0.
A lower temperature gives more predictable outputs, and a higher temperature gives more “creative” outputs.
Here’s an example of setting `temperature` to 0.
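For example (a sketch; the prompt is illustrative):

```python
# Temperature 0: the most predictable, least random output.
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": "Suggest one name for my new-hire onboarding buddy program."}],
    temperature=0,
)
print(response.message.content[0].text)
```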
And here’s an example of setting `temperature` to 1.
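With the same illustrative prompt, running the request a few times at `temperature=1` should produce more varied suggestions:

```python
# Temperature 1: more "creative" output; repeat the call to see the variation.
for _ in range(3):
    response = co.chat(
        model="command-r-plus-08-2024",
        messages=[{"role": "user", "content": "Suggest one name for my new-hire onboarding buddy program."}],
        temperature=1,
    )
    print(response.message.content[0].text)
```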
Further reading:
- Available models for the Chat endpoint
- Documentation on predictable outputs
- Documentation on advanced generation parameters
Structured output generation
By adding the `response_format` parameter, you can get the model to generate the output as a JSON object. By generating JSON objects, you can structure and organize the model’s responses in a way that can be used in downstream applications.
The `response_format` parameter allows you to specify the schema the JSON object must follow. The Chat call takes the following parameters:
- `messages`: The user message
- `response_format`: The schema of the JSON object
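A sketch of a JSON-mode request might look like the following. The checklist schema is an illustrative assumption, and the exact `response_format` fields can vary by SDK version:

```python
import json

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[
        {
            "role": "user",
            "content": "Generate a first-week onboarding checklist for a new hire.",
        }
    ],
    response_format={
        "type": "json_object",
        # Illustrative schema: a titled checklist with a list of steps.
        "schema": {
            "type": "object",
            "required": ["title", "steps"],
            "properties": {
                "title": {"type": "string"},
                "steps": {"type": "array", "items": {"type": "string"}},
            },
        },
    },
)

# The model returns a JSON string that follows the schema above.
checklist = json.loads(response.message.content[0].text)
print(checklist["title"], checklist["steps"])
```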
Streaming responses
All the previous examples generate responses in a non-streamed manner. This means that the endpoint returns a response object only after the model has generated the text in full.
The Chat endpoint also provides streaming support. In a streamed response, the endpoint would return a response object for each token as it is being generated. This means you can display the text incrementally without having to wait for the full completion.
To activate it, use `co.chat_stream()` instead of `co.chat()`.
In streaming mode, the endpoint generates a series of event objects. To get the actual text contents, we take the objects whose event type is `content-delta`.
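A streaming sketch, assuming the v2 Python SDK where each event carries a `type` field and text arrives in `content-delta` events (attribute paths may differ in other SDK versions):

```python
response = co.chat_stream(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": "Write a short welcome note for a new hire's first day."}],
)

for event in response:
    # Print each text chunk as soon as it is generated.
    if event.type == "content-delta":
        print(event.delta.message.content.text, end="")
```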
Conclusion
In this tutorial, you learned about:
- How to get started with a basic text generation
- How to improve outputs with prompt engineering
- How to control outputs using parameter changes
- How to generate structured outputs
- How to stream text generation outputs
So far, however, we have only covered direct, single-turn text generation. As its name implies, the Chat endpoint can also support building chatbots, which require features for handling multi-turn conversations and maintaining conversation state.
In the next tutorial, you’ll learn how to build chatbots with the Chat endpoint.