Prompt Truncation
LLMs come with limitations; specifically, they can only handle so much text as input. This means that you will often need to decide which parts of a document or chat history to keep and which to omit.
To make this easier, the Chat API comes with a helpful prompt_truncation parameter. When prompt_truncation is set to AUTO, the API will automatically break the documents up into smaller chunks, rerank those chunks according to how relevant they are, and then start dropping the least relevant documents until the text fits within the model’s context length limit.
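To illustrate, here is a minimal sketch of a Chat API call with prompt_truncation set to AUTO, using the Cohere Python SDK. The API key, document titles, snippets, and user message are placeholders.

```python
# Minimal sketch: pass more documents than fit in the context window and let
# the API handle truncation with prompt_truncation="AUTO".
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

documents = [
    {"title": "Doc 1", "snippet": "First reference text..."},
    {"title": "Doc 2", "snippet": "Second reference text..."},
    # ...potentially many more documents than the context window can hold
]

response = co.chat(
    message="Summarize what these documents say about the topic.",
    documents=documents,
    # With AUTO, the API chunks the documents, reranks the chunks by
    # relevance, and drops the least relevant documents until the prompt
    # fits within the model's context length limit.
    prompt_truncation="AUTO",
)

print(response.text)
```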
Note: The last few messages in the chat history will never be truncated or dropped. The RAG API will throw a 400 Too Many Tokens error if it can’t fit those messages along with a single document under the context limit.
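Because those protected messages can still push a request over the limit, one hedged way to handle this is to catch the error and retry with a shorter chat history. The exact exception class raised on a 400 response depends on the SDK version, so this sketch catches a broad exception and inspects its message; the function name and the trimming strategy are illustrative, not part of the API.

```python
# Hedged sketch: retry with a trimmed chat history when the request fails
# because the protected messages plus a single document exceed the context
# limit. The exception class varies across SDK versions, so a broad except
# with a message check is used here.
def chat_with_fallback(co, message, documents, chat_history):
    try:
        return co.chat(
            message=message,
            documents=documents,
            chat_history=chat_history,
            prompt_truncation="AUTO",
        )
    except Exception as err:
        if "Too Many Tokens" not in str(err):
            raise
        # Drop the oldest turns and retry with only the most recent history.
        return co.chat(
            message=message,
            documents=documents,
            chat_history=chat_history[-2:],
            prompt_truncation="AUTO",
        )
```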