This notebook demonstrates how to create a chatbot (single turn) that answers user questions based on technical documentation made available to the model.
We use the aws-documentation dataset (link) for representativeness. This dataset contains 26k+ AWS documentation pages, preprocessed into 120k+ chunks, and 100 questions based on real user questions.
We proceed as follows:
- Build the index with llama_index
- Add rerank for better accuracy, lower inference costs and lower latency

We embed the chunks with CohereEmbedding and index them with VectorStoreIndex from LlamaIndex.

Because this process is lengthy (~2h for all documents on a MacBook Pro), we store the index to disk for future reuse. We also provide a (commented) code snippet to index only a subset of the data. If you use this snippet, bear in mind that many documents will become unavailable to the model and, as a result, performance will suffer!
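The build-once, reuse-later pattern described above might look roughly like the sketch below. It is not the notebook's original cell: the helper names, the persist_dir default and the idea of passing in an already-configured embed_model (for example a CohereEmbedding instance) are assumptions made for illustration.

```python
import os
from typing import Sequence


def take_subset(chunks: Sequence[str], max_chunks: int) -> list:
    """Keep only the first `max_chunks` chunks (faster to index, but hurts recall)."""
    return list(chunks[:max_chunks])


def build_or_load_index(chunks, embed_model, persist_dir="aws_index"):
    """Load a persisted index if one exists, otherwise embed the chunks and persist.

    `embed_model` is assumed to be a configured LlamaIndex embedding model.
    `persist_dir` is a hypothetical directory name.
    """
    # Imported lazily so take_subset works without llama_index installed.
    from llama_index.core import (
        Document,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if os.path.isdir(persist_dir):
        storage = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage, embed_model=embed_model)

    documents = [Document(text=chunk) for chunk in chunks]
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

Calling take_subset before build_or_load_index gives the faster, lower-quality variant; the subset size is a trade-off between indexing time and how many golden documents remain reachable.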
The vector database we built using VectorStoreIndex comes with a built-in retriever. We can call that retriever to fetch the top documents most relevant to the user question.
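A minimal sketch of that retrieval call, assuming the index object is a LlamaIndex VectorStoreIndex. The function name and the returned dictionary shape are illustrative choices, and top_k=20 mirrors the value used later in the evaluation.

```python
def retrieve_top_documents(index, question, top_k=20):
    """Fetch the `top_k` chunks most similar to `question` from the index."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    nodes = retriever.retrieve(question)
    # Each result wraps a node (the chunk) together with its similarity score.
    return [{"text": n.node.get_content(), "score": n.score} for n in nodes]
```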
We recently released Rerank-3 (April ’24), which we can use to improve the quality of retrieval, as well as reduce latency and the cost of inference. To use the retriever with rerank, we create a thin wrapper around index.as_retriever.
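One way such a wrapper could look is sketched below. It assumes co is a Cohere client exposing rerank, and that "rerank-english-v3.0" is the model identifier to use for Rerank-3; the function name, top_n=5 and the output format are hypothetical.

```python
def retrieve_with_rerank(co, index, question, top_k=20, top_n=5,
                         model="rerank-english-v3.0"):
    """Retrieve `top_k` candidates from the index, then keep the `top_n` best after rerank."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    texts = [n.node.get_content() for n in retriever.retrieve(question)]
    if not texts:
        return []
    response = co.rerank(model=model, query=question, documents=texts, top_n=top_n)
    # Each result points back to a candidate through its position in `texts`.
    return [
        {"text": texts[r.index], "relevance_score": r.relevance_score}
        for r in response.results
    ]
```

Passing fewer, better-ordered documents to the generation step is what lowers latency and inference cost: the model reads top_n chunks instead of top_k.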
This works! With co.chat, you get the additional benefit that citations are returned for every span of text. Here’s a simple function to display the citations inside square brackets.
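The original display function is not reproduced here; the sketch below is one possible version. It assumes each citation has been converted to a plain dictionary with start and end character offsets into the answer and a list of document_ids, which is an illustrative format rather than the exact response schema.

```python
def insert_citations(text, citations):
    """Append '[doc ids]' in square brackets after each cited span of `text`."""
    # Work from the end of the text so earlier offsets stay valid.
    for c in sorted(citations, key=lambda c: c["end"], reverse=True):
        marker = " [" + ", ".join(c["document_ids"]) + "]"
        text = text[: c["end"]] + marker + text[c["end"]:]
    return text


answer = "S3 stores objects in buckets."
cites = [
    {"start": 0, "end": 2, "document_ids": ["doc_0"]},
    {"start": 21, "end": 28, "document_ids": ["doc_1", "doc_2"]},
]
print(insert_citations(answer, cites))
# S3 [doc_0] stores objects in buckets [doc_1, doc_2].
```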
Now that we have a running pipeline, we need to assess its performance.
The author of the repository provides 100 QA pairs that we can test the model on. Let’s download these questions, then run inference on all 100 questions. Later, we will use Command A — Cohere’s largest and most powerful model — to measure performance.
We’ll use the fields as follows:
- Question: the user question, passed to co.chat to generate the answer
- Answer_True: treat as the ground truth; compare to the model-generated answer to determine its correctness
- Document_True: treat as the (single) golden document; check the rank of this document inside the model’s retrieved documents

We’ll loop over each question and generate our model answer. We’ll also complete two steps that will be useful for evaluating our model next: computing the rank of the golden document within the retrieved documents, and building the grading prompt for each answer.
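The two helper steps could be sketched as follows. The function name get_rank_of_golden_within_retrieved comes from the notebook, but its signature here is a guess: we assume documents are compared by an identifier string, that ranks are 0-based, and that a missing golden document gets rank top_k. The grading template is a hypothetical stand-in for the notebook's prompt.

```python
def get_rank_of_golden_within_retrieved(golden_id, retrieved_ids, top_k=20):
    """Return the 0-based rank of the golden document, or `top_k` if it was not retrieved."""
    for rank, doc_id in enumerate(retrieved_ids[:top_k]):
        if doc_id == golden_id:
            return rank
    return top_k


# Hypothetical grading prompt; the notebook's actual wording is not shown here.
GRADING_TEMPLATE = (
    "Question: {question}\n"
    "Golden answer: {answer_true}\n"
    "Model answer: {answer_model}\n"
    "Does the model answer convey the same information as the golden answer? "
    "Reply with 1 for yes or 0 for no."
)


def build_grading_prompt(question, answer_true, answer_model):
    """Fill the grading template for one QA pair."""
    return GRADING_TEMPLATE.format(
        question=question, answer_true=answer_true, answer_model=answer_model
    )
```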
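We want to test our model performance on two dimensions: the correctness of the generated answers, and the quality of retrieval (the rank of the golden document among the retrieved documents).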
Note that this pipeline is for illustration only. To measure performance in practice, we would want to run more in-depth tests on a broader, representative dataset.
We’ll use Command A as a judge of whether the answers produced by our model convey the same information as the golden answers. Since we’ve defined the grading prompts earlier, we can simply ask our LLM judge to evaluate that grading prompt. After a little bit of postprocessing, we can then extract our model scores.
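The postprocessing step might look like the sketch below. It assumes the judge was asked to reply with 1 or 0 and may wrap that digit in extra text; the function names are hypothetical, and the call to the judge model itself is left out.

```python
import re


def extract_score(judge_output):
    """Return the first standalone 0 or 1 in the judge's reply, or None if there is none."""
    match = re.search(r"\b([01])\b", judge_output)
    return int(match.group(1)) if match else None


def mean_score(scores):
    """Average the valid scores, skipping replies that could not be parsed."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else 0.0
```

Replies that cannot be parsed are worth inspecting by hand rather than silently counting as failures.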
We’ve already computed the rank of the golden documents using get_rank_of_golden_within_retrieved. Here, we’ll plot the histogram of ranks, using blue when the answer scored a 1, and red when the answer scored a 0.
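A possible plotting sketch, assuming ranks and scores are parallel lists with one entry per question and that matplotlib is installed. The function names and axis labels are illustrative.

```python
def split_ranks_by_score(ranks, scores):
    """Separate golden-document ranks into correct (score 1) and wrong (score 0) answers."""
    correct = [r for r, s in zip(ranks, scores) if s == 1]
    wrong = [r for r, s in zip(ranks, scores) if s == 0]
    return correct, wrong


def plot_rank_histogram(ranks, scores, top_k=20):
    """Stacked histogram of ranks: blue for score 1, red for score 0."""
    # Imported lazily so the helper above works without matplotlib.
    import matplotlib.pyplot as plt

    correct, wrong = split_ranks_by_score(ranks, scores)
    bins = list(range(top_k + 2))  # one bin per rank from 0 to top_k
    plt.hist([correct, wrong], bins=bins, stacked=True,
             color=["blue", "red"], label=["score = 1", "score = 0"])
    plt.xlabel("Rank of golden document (top_k = not retrieved)")
    plt.ylabel("Number of questions")
    plt.legend()
    plt.show()
```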
We see that retrieval works well overall: for 80% of questions, the golden document is within the top 5 documents. However, we also notice that approx. half the false answers come from instances where the golden document wasn’t retrieved (rank = top_k = 20). This should be improved, e.g. by adding metadata to the documents such as their section headings, or altering the chunking strategy.
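The two figures quoted above can be recomputed from the ranks and scores with a few lines. This sketch assumes 0-based ranks (so "top 5" means a rank below 5) and that unretrieved documents carry rank top_k; both conventions are assumptions, not confirmed by the notebook.

```python
def fraction_within_top_n(ranks, n=5):
    """Share of questions whose golden document is among the first `n` retrieved."""
    if not ranks:
        return 0.0
    return sum(r < n for r in ranks) / len(ranks)


def fraction_wrong_not_retrieved(ranks, scores, top_k=20):
    """Among wrong answers, the share where the golden document was not retrieved."""
    wrong = [r for r, s in zip(ranks, scores) if s == 0]
    if not wrong:
        return 0.0
    return sum(r == top_k for r in wrong) / len(wrong)
```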
There is also a non-negligible number of false answers where the top document was retrieved. On closer inspection, many of these are due to the model phrasing its answers more verbosely than the (very laconic) golden answers. This highlights the importance of checking eval results before jumping to conclusions about model performance.
In this notebook, we’ve built a QA bot that answers user questions based on technical documentation. We’ve learnt:
- How to index technical documentation with llama_index
- How to improve retrieval with rerank