Dataset Torrent

Dataset Card for AO3

A dataset of works archived from Archive of Our Own (AO3).

  • text-generation
  • text-classification
  • multilingual
  • language-modeling

Dataset Summary

This dataset contains approximately 12.6 million publicly available works from AO3. It was created by processing every publicly accessible work with an ID from 1 to 63,200,000. Each entry contains the full text of the work along with its metadata, including title, author, fandom, relationships, characters, tags, warnings, and other classification information.

Languages

The dataset is multilingual, with works in many different languages, though English is predominant.

Dataset Structure

Data Files

The dataset is stored in compressed JSONL files (jsonl.zst format), with each archive containing 100,000 sequential IDs. For example, ao3_40500001-40600000.jsonl.zst contains works with IDs in that range.
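
To find the archive that holds a given work, the ID can be mapped to its 100,000-wide range. The following is a minimal sketch, assuming every archive follows the naming pattern of the single example above (no zero-padding, ranges starting at 1); the helper name archive_name is hypothetical and not part of the dataset.

```python
ARCHIVE_SIZE = 100_000  # IDs per archive, as stated in this card


def archive_name(work_id: int) -> str:
    """Return the assumed archive filename for a work ID.

    Assumption: all archives are named like the documented example,
    ao3_<first id>-<last id>.jsonl.zst, with ranges 1-100000,
    100001-200000, and so on.
    """
    if work_id < 1:
        raise ValueError("work IDs start at 1")
    # First ID of the range that contains work_id.
    start = (work_id - 1) // ARCHIVE_SIZE * ARCHIVE_SIZE + 1
    end = start + ARCHIVE_SIZE - 1
    return f"ao3_{start}-{end}.jsonl.zst"


print(archive_name(40_512_345))  # ao3_40500001-40600000.jsonl.zst
```

Because only publicly accessible works were processed, a given ID may be absent from its archive even when the file exists.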

Data Fields

Each record includes the fields id, title, metadata, and text. The metadata field covers archive warnings, category, characters, fandom, language, rating, relationships, series, author, chapters, completion status, publication date, and word count.
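
The records can be streamed one line at a time without decompressing a whole archive to disk. The sketch below is illustrative only: it assumes the top-level keys are spelled exactly id, title, metadata, and text, and that metadata is a JSON object; the exact key names inside metadata (such as "language") are guesses and should be checked against a real record. The helpers iter_records and open_archive are hypothetical, and open_archive relies on the third-party zstandard package.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_archive(path: str) -> io.TextIOWrapper:
    """Open a .jsonl.zst archive as a UTF-8 text stream.

    Requires the third-party `zstandard` package (pip install zstandard);
    it is imported here so that iter_records works without it.
    """
    import zstandard

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


# Made-up sample line for illustration; not real dataset content.
sample = [
    '{"id": 1, "title": "Example", '
    '"metadata": {"language": "English"}, "text": "..."}'
]
for record in iter_records(sample):
    # .get() avoids a KeyError if the assumed metadata key is missing.
    print(record["id"], record["title"], record["metadata"].get("language"))
```

With a real file, the same loop would read from the archive instead of the sample list, for example `with open_archive("ao3_40500001-40600000.jsonl.zst") as f: for record in iter_records(f): ...`.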

Data Splits

All examples are in a single split.