BookCorpus
BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 11,000 unpublished books scraped from the Internet. It was the main corpus used to train OpenAI's initial GPT model,[1] and has been used as training data for other early large language models, including Google's BERT.[2] The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.[2]
The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors".[3][4] The dataset was initially hosted on a University of Toronto webpage.[4] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[5] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[4][5]
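Because the original corpus is no longer distributed officially, researchers typically work with community substitutes such as BookCorpusOpen. The sketch below shows one way such a substitute might be loaded and inspected with the Hugging Face `datasets` library; the dataset name `bookcorpusopen`, its availability on the Hugging Face Hub, and its `title`/`text` fields are assumptions for illustration, not part of the original release.

```python
# Minimal sketch, assuming a public mirror of BookCorpusOpen exists on the
# Hugging Face Hub under the name "bookcorpusopen" (an assumption, not a
# documented fact about the original BookCorpus).
from datasets import load_dataset

# Each record is assumed to hold one book, with the full text in a "text"
# field and the book's name in a "title" field.
books = load_dataset("bookcorpusopen", split="train")

# Rough whitespace-based word count, for comparison with the ~985 million
# words reported for the original BookCorpus.
total_words = sum(len(book["text"].split()) for book in books)
print(f"{len(books)} books, ~{total_words:,} words")
```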
References
- "Improving Language Understanding by Generative Pre-Training" (PDF). Archived (PDF) from the original on January 26, 2021. Retrieved June 9, 2020.
- Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
- Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Lea, Richard (28 September 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian.
- Bandy, Jack; Vincent, Nicholas (2021). "Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus" (PDF). Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.