BookCorpus
BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 11,000 unpublished books scraped from the Internet. It was the main corpus used to train OpenAI's initial GPT model,[1] and has been used as training data for other early large language models, including Google's BERT.[2] The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.[2]
The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors".[3][4] The dataset was initially hosted on a University of Toronto webpage.[4] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[5] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[4][5]
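Because the original corpus is no longer distributed officially, researchers typically work with community substitutes such as BookCorpusOpen. The sketch below shows one way such a substitute might be loaded and inspected with the Hugging Face `datasets` library; the dataset name `bookcorpusopen`, its availability on the Hugging Face Hub, and its `title`/`text` fields are assumptions for illustration, not part of the original release.

```python
# Minimal sketch, assuming a public mirror of BookCorpusOpen exists on the
# Hugging Face Hub under the name "bookcorpusopen" (an assumption, not a
# documented fact about the original BookCorpus).
from datasets import load_dataset

# Each record is assumed to hold one book, with the full text in a "text"
# field and the book's name in a "title" field.
books = load_dataset("bookcorpusopen", split="train")

# Rough whitespace-based word count, for comparison with the ~985 million
# words reported for the original BookCorpus.
total_words = sum(len(book["text"].split()) for book in books)
print(f"{len(books)} books, ~{total_words:,} words")
```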
References
- "Improving Language Understanding by Generative Pre-Training" (PDF). Archived (PDF) from the original on January 26, 2021. Retrieved June 9, 2020.
- Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
- Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Lea, Richard (28 September 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian.
- Bandy, Jack; Vincent, Nicholas (2021). "Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus" (PDF). Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.