Data Statement

Dataset Names: coheretext-filtered, coheretext-unfiltered
Dataset Developers: Cohere Infrastructure Team
Data Statement Authors: Cohere Safety Team & Responsibility Council
Size: ~200GB filtered, ~3TB unfiltered

Overview of Training Datasets

The unfiltered dataset is used to train Representation models that reflect the world, including its current harms and biases. This enables them to be effective for use cases such as content moderation.

The filtered dataset is used to train the Generation models to complete sentences based on a prompt, while minimizing harmful generations. Our use of filtered training data is motivated by the observation by Bender et al. (2021) that uncurated data used to train language models encodes the dominant view, further harming people at the margins. Cohere continues to invest significant resources in dataset curation to prevent harm.

Document Collection

Cohere takes measures to ensure responsible collection and processing of data, in compliance with best practices, industry guidelines, and applicable regulations. We are deeply committed to building and training our models in a responsible, safe, and ethical manner, free from toxicity and bias, and we take all commercially reasonable measures and precautions to protect customer data and treat it with respect.

Source Demographics

The scraped data is similar in composition to many other large, Internet-sourced language modeling datasets, and hence reflects perspectives that skew young, white, and male (Bender et al., 2021). Language models trained on such data encode the hegemonic viewpoint; Jo and Gebru (2020) detail issues and solutions around this topic in depth. Enhancing the diversity of our training data is a top priority as we continue to iterate on our data collection process.

Document Curation

Filtering harmful, biased, or otherwise undesirable documents from training data can improve language model performance (Raffel et al., 2020) and reduce the chances of the model perpetuating harm. However, doing so with precision is critical so that we do not silence marginalized voices (Bender et al., 2021).

With these considerations in mind, we designed a document curation process which aims to minimize undesirable text within our training data. The best way to do this is an active area of research within Cohere and the broader machine learning research community (Sharoff, 2020). As Cohere learns more about the types of harm large language models exhibit, it will adapt the composition of its datasets accordingly.

More Nuanced than Blockwords

We recognize the dangers of using a blockword list (i.e., removing any document containing a word from a list of selected words). Our filtration techniques are designed to retain counterspeech by taking language and context into account in a nuanced way. For example, we do not want to remove documents addressing racism, but we do want to filter out racist texts. An example of a harm filtration technique we use has been published on arXiv.
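
To make the distinction concrete, the sketch below contrasts a naive blockword filter with a score-based filter that judges a document as a whole. It is illustrative only and is not Cohere's published technique; the blockword list, the harm_score function, and the threshold are placeholder assumptions.

BLOCKWORDS = {"slur_a", "slur_b"}  # placeholder terms, not a real list

def blockword_filter(doc: str) -> bool:
    """Naive approach: drop any document that contains a blockword.
    This also drops counterspeech, e.g. documents criticizing racism."""
    tokens = set(doc.lower().split())
    return tokens.isdisjoint(BLOCKWORDS)

def harm_score(doc: str) -> float:
    """Toy stand-in for a learned, context-aware classifier: the fraction
    of tokens that are blockwords. A real system would model the
    surrounding context rather than count isolated words."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(t in BLOCKWORDS for t in tokens) / len(tokens)

def nuanced_filter(doc: str, threshold: float = 0.1) -> bool:
    """Keep a document unless it is judged harmful as a whole, so that
    documents quoting or discussing harmful language can be retained."""
    return harm_score(doc) < threshold

Under this toy scheme, a document that mentions a harmful term once while criticizing it scores low and is retained, whereas a document dominated by such terms is removed.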

Language Filtration

We currently train our language models on English documents only, and model performance is evaluated on English benchmarks. The heuristics we use to detect non-English text during document curation are imperfect and other languages may still remain in the dataset. Multilingual datasets and benchmarks will be incorporated into future iterations of our data and evaluation pipelines.
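
As an illustration of why such heuristics are imperfect, the sketch below keeps a document only if a minimum share of its tokens are common English function words. This is a simplified assumption for illustration, not our actual detector; the word list and threshold are arbitrary.

# Common English function words used as a crude language signal.
ENGLISH_FUNCTION_WORDS = {
    "the", "and", "of", "to", "a", "in", "is", "it", "that", "for",
    "was", "on", "with", "as", "are", "this", "be", "at", "by", "not",
}

def probably_english(doc: str, min_ratio: float = 0.05) -> bool:
    """Return True if at least min_ratio of the tokens are common English
    function words. Documents that are short, list-like, or full of code
    defeat this check, which is one way non-English text can slip through."""
    tokens = doc.lower().split()
    if not tokens:
        return False
    hits = sum(t in ENGLISH_FUNCTION_WORDS for t in tokens)
    return hits / len(tokens) >= min_ratio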