A Guide to Streaming Responses
The Chat API is capable of streaming events (such as text generation) as they come. This means that partial results from the model can be displayed within moments, even if the full generation takes longer.
You’re likely already familiar with streaming. When you ask the model a question using the Coral UI, the interface doesn’t output a single block of text; instead, it streams the text out a few words at a time. In many user interfaces, enabling streaming improves the user experience by lowering the perceived latency.
Example
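As a minimal sketch of a streaming loop, the code below prints each chunk of text as it arrives and then prints why the stream ended. It assumes every event is an object exposing event_type, plus text or finish_reason where relevant. The fake_stream generator and its values (including the "COMPLETE" finish reason) are hypothetical stand-ins for the iterator a client library would return; with the Cohere Python SDK that iterator would likely come from a call such as co.chat_stream(message=...), which is an assumption rather than something this page confirms.

```python
from types import SimpleNamespace


def fake_stream():
    """Hypothetical stand-in for the event iterator a streaming client returns."""
    yield SimpleNamespace(event_type="stream-start", generation_id="example-id")
    yield SimpleNamespace(event_type="text-generation", text="An LLM is")
    yield SimpleNamespace(event_type="text-generation", text=" a large language model.")
    yield SimpleNamespace(event_type="stream-end", finish_reason="COMPLETE")


def print_stream(events):
    """Print text chunks as they arrive, then the reason the stream ended."""
    printed = []
    for event in events:
        if event.event_type == "text-generation":
            print(event.text)
            printed.append(event.text)
        elif event.event_type == "stream-end":
            print(event.finish_reason)
            printed.append(event.finish_reason)
    return printed


lines = print_stream(fake_stream())
```

Events of any other type are ignored here, which keeps the loop short but discards metadata such as the generation_id.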
Stream Events
When streaming is enabled, the API sends events one at a time. Each event has an event_type, and each type carries different data, so your code should branch on event_type when handling events.
Basic Stream Events
stream-start
The first event in the stream contains metadata for the request, such as the generation_id. Only one stream-start event will be emitted.
stream-end
A stream-end event is the final event of the stream, and is returned only when streaming is finished. This event contains aggregated data from all the other events, such as the complete text, as well as a finish_reason to indicate why the stream ended (i.e. either because it finished or due to an error).
Only one stream-end event will be returned.
text-generation
A text-generation event is emitted whenever the next chunk of text comes back from the model. As the model continues generating text, multiple events of this type will be emitted.
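The three basic events can be consumed together, as in the sketch below: it records the generation_id from stream-start, joins the text-generation chunks, and stops at stream-end. The event objects are built by hand with SimpleNamespace and their values are made up for illustration; the function name collect_text is hypothetical.

```python
from types import SimpleNamespace


def collect_text(events):
    """Accumulate text-generation chunks and return them with stream metadata."""
    generation_id = None
    chunks = []
    finish_reason = None
    for event in events:
        if event.event_type == "stream-start":
            generation_id = event.generation_id
        elif event.event_type == "text-generation":
            chunks.append(event.text)
        elif event.event_type == "stream-end":
            finish_reason = event.finish_reason
            break  # stream-end is the final event, so stop reading here
    return generation_id, "".join(chunks), finish_reason


# Hypothetical events, made up for illustration.
events = [
    SimpleNamespace(event_type="stream-start", generation_id="gen-123"),
    SimpleNamespace(event_type="text-generation", text="Hello"),
    SimpleNamespace(event_type="text-generation", text=", world"),
    SimpleNamespace(event_type="stream-end", finish_reason="COMPLETE"),
]
generation_id, text, finish_reason = collect_text(events)
print(generation_id, repr(text), finish_reason)
```

Because the stream-end event already carries the complete text, joining the chunks yourself is mainly useful when you want to render partial output while the stream is still open.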
Retrieval Augmented Generation Stream Events
These events are generated when using the API with various RAG parameters.
search-queries-generation
Emitted when search queries are generated by the model. Only happens when the Chat API is used with the search_queries_only or connectors parameters.
search-results
Emitted when the specified connectors respond with search results. Only one event of this type will be returned for a given stream.
citation-generation
This event contains streamed citations and references to the documents being cited (if citations have been generated by the model). Multiple citation-generation events will be returned.
For an illustration of a generated citation with document-specific indices, look at the “Example Responses” below. As you can see, each document has an id, and when that document is used as part of the response, it’s cited by that id.
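To make the id-based lookup concrete, here is a rough sketch that pairs each streamed citation with the documents it cites. Only the document id is described on this page; the other field names (documents on the search-results event, citations on the citation-generation event, and text, document_ids, and title inside them) are assumptions for illustration, as are the sample values.

```python
from types import SimpleNamespace


def resolve_citations(events):
    """Pair each streamed citation with the titles of the documents it cites."""
    documents_by_id = {}
    resolved = []
    for event in events:
        if event.event_type == "search-results":
            # Index the returned documents by id (field names are assumed).
            for doc in event.documents:
                documents_by_id[doc["id"]] = doc
        elif event.event_type == "citation-generation":
            for citation in event.citations:
                titles = [
                    documents_by_id[doc_id]["title"]
                    for doc_id in citation["document_ids"]
                    if doc_id in documents_by_id
                ]
                resolved.append((citation["text"], titles))
    return resolved


# Hypothetical events, made up for illustration.
events = [
    SimpleNamespace(
        event_type="search-results",
        documents=[{"id": "doc_0", "title": "LLM overview"}],
    ),
    SimpleNamespace(
        event_type="citation-generation",
        citations=[{"text": "large language model", "document_ids": ["doc_0"]}],
    ),
]
resolved = resolve_citations(events)
print(resolved)
```

Citation ids with no matching document are skipped rather than raising an error, which is one possible design choice for a user-facing display.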
Tool Use Stream Events
tool-calls-chunk
Emitted when the next token of the tool plan or the tool call is generated.
tool-calls-generation
Emitted when the model generates tool calls that need to be acted upon. The event contains a list of tool_calls.
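Acting on a tool-calls-generation event might look like the sketch below, which looks up each requested tool in a local registry and runs it. The shape of each entry in tool_calls (a name and a parameters mapping) is an assumption, and get_weather, TOOLS, and run_tool_calls are hypothetical names introduced only for this illustration.

```python
from types import SimpleNamespace


def get_weather(city):
    """Hypothetical local tool used only for this illustration."""
    return f"Sunny in {city}"


# Registry mapping tool names to local Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(events):
    """Execute every tool call found in tool-calls-generation events."""
    results = []
    for event in events:
        if event.event_type != "tool-calls-generation":
            continue  # tool-calls-chunk and other events are skipped here
        for call in event.tool_calls:
            tool = TOOLS[call["name"]]
            results.append(tool(**call["parameters"]))
    return results


# Hypothetical events, made up for illustration.
events = [
    SimpleNamespace(event_type="tool-calls-chunk"),
    SimpleNamespace(
        event_type="tool-calls-generation",
        tool_calls=[{"name": "get_weather", "parameters": {"city": "Toronto"}}],
    ),
]
results = run_tool_calls(events)
print(results)
```

The sketch waits for the complete tool-calls-generation event instead of assembling calls from tool-calls-chunk events, since the chunks carry only partial output.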
Example Responses
Below, we have a stream of events which shows the full output you might see during a streaming session:
It contains information about whether the streaming session is finished, what type of event is being fired, and the text that was generated by the model.
Of course, the print(event.text) and print(event.finish_reason) lines in the code snippet above peel a lot of the extra information away, so your output would look more like this:
It should be (more or less) the same text, but that text is on its own rather than being accompanied by search queries, event types, etc.
Note that the citation objects in the response are returned as part of a RAG request, which you can learn more about in the Retrieval Augmented Generation guide.
When the model has finished generating, it returns the full text, some metadata, citations, and the documents that were used to ground the reply.