9 Corpora

9.1 Sample Research Questions

How many subjects exist within the early modern corpus?
How pervasive is poetry in early modern print?
Which texts are most similar or different?

9.2 Tools and Websites

9.3 Activities

9.3.1 Viewing the Whole Corpus

Explore the Bibliographia. First, search for a specific text and note the subject headings assigned to it. Then find all of the texts that share one of those headings and check whether the “map” has placed them in the same area.
Compare this map to the version we made in Nomic Atlas, using semantic search to find topics that interest you. How do the subject headings from the Bibliographia compare to the automatically generated topics and search results?
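One way to make that comparison concrete is to tabulate, for each subject heading, which automatically generated topic its texts most often land in. The following is a minimal sketch with invented records: the titles, headings, and topic labels are hypothetical placeholders, and neither the Bibliographia nor Nomic Atlas is assumed to export data in this shape.

```python
from collections import Counter, defaultdict

# Hypothetical records: (title, subject heading, automatically generated topic).
# These values are invented for illustration and do not come from either tool.
records = [
    ("Text A", "Sermons", "topic_1"),
    ("Text B", "Sermons", "topic_1"),
    ("Text C", "Sermons", "topic_2"),
    ("Text D", "Poetry", "topic_2"),
    ("Text E", "Poetry", "topic_2"),
]


def cross_tabulate(rows):
    """Count how many texts fall under each (heading, topic) pair."""
    table = defaultdict(Counter)
    for _title, heading, topic in rows:
        table[heading][topic] += 1
    return table


def dominant_topic_share(table):
    """For each heading, return its most common topic and that topic's share."""
    shares = {}
    for heading, topic_counts in table.items():
        topic, count = topic_counts.most_common(1)[0]
        shares[heading] = (topic, count / sum(topic_counts.values()))
    return shares


table = cross_tabulate(records)
shares = dominant_topic_share(table)
for heading, (topic, share) in shares.items():
    print(f"{heading}: mostly {topic} ({share:.0%} of texts)")
```

A share close to 100% suggests the heading and the generated topic pick out roughly the same texts. A lower share points to texts worth reading closely to see why the two ways of organizing the corpus disagree.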

9.3.2 Modeling Early Modern Corpora

Follow the tutorials in the EarlyPrint + Python notebooks, especially the Word Embeddings and Unsupervised Clustering exercises.
- You can work through the EarlyPrint notebooks using Google Colab:
  - Word Embeddings
  - Text Clustering
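
Before opening the notebooks, it may help to see the basic idea behind comparing texts as vectors. The sketch below is not the notebooks' method: it substitutes simple TF-IDF word weights for trained word embeddings, uses four invented one-line "texts", and finds each text's nearest neighbor by cosine similarity. All names in it (`texts`, `tf_idf_vectors`, `most_similar`) are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy stand-ins for early modern texts (invented for illustration only).
texts = {
    "sermon_a": "the lord god hath mercy upon the faithful and the church",
    "sermon_b": "god and the church call the faithful to prayer and mercy",
    "poem_a": "my love is like a rose that blooms in sweet spring",
    "poem_b": "sweet love the rose of spring doth bloom for thee",
}


def tf_idf_vectors(docs):
    """Return a sparse TF-IDF vector (word -> weight) for each document."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each word appear?
    doc_freq = Counter(
        word for words in tokenized.values() for word in set(words)
    )
    vectors = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        vectors[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Return the name of the other text with the highest cosine similarity."""
    scored = [
        (cosine_similarity(vectors[name], vector), other)
        for other, vector in vectors.items()
        if other != name
    ]
    return max(scored)[1]


vectors = tf_idf_vectors(texts)
for name in texts:
    print(name, "is closest to", most_similar(name, vectors))
```

Clustering methods build on the same similarity measure: instead of asking for one nearest neighbor, they group all texts so that members of a group are more similar to each other than to texts outside it.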

9.4 References

Siefring and Meyer (2013)
D’Souza and Mimno (2023)
Bender et al. (2021)
Mitchell (2024)
Gadd (2009)
Gavin (2022)
Yang and Eisenstein (2016)
Kulick and Ryant (2020)
Bode (2017)
Basu (2025)

Basu, Anupam. 2025. Shakespeare and Scale: The Archive of Early Printed English. Cambridge Elements: Shakespeare and Text. Cambridge University Press. https://www.cambridge.org/core/elements/shakespeare-and-scale/B3EB39E9F6049B8A203150FCB2DE4E5E.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.

Bode, Katherine. 2017. “The Equivalence of ‘Close’ and ‘Distant’ Reading; or, Toward a New Object for Data-Rich Literary History.” Modern Language Quarterly 78 (1): 77–106. https://doi.org/10.1215/00267929-3699787.

D’Souza, Lyra, and David Mimno. 2023. “The Chatbot and the Canon: Poetry Memorization in LLMs.” https://www.semanticscholar.org/paper/The-Chatbot-and-the-Canon%3A-Poetry-Memorization-in-D'Souza-Mimno/c4e6167d156c8d20e7f3579ad1edd9bac8b5bbca.

Gadd, Ian. 2009. “The Use and Misuse of Early English Books Online.” Literature Compass 6 (3): 680–92. https://doi.org/10.1111/j.1741-4113.2009.00632.x.

Gavin, Michael. 2022. Literary Mathematics: Quantitative Theory for Textual Studies. Stanford: Stanford University Press.

Kulick, Seth, and Neville Ryant. 2020. “Parsing Early Modern English for Linguistic Search.” arXiv:2002.10546 [Cs], February. http://arxiv.org/abs/2002.10546.

Mitchell, Melanie. 2024. “The Metaphors of Artificial Intelligence.” Science 386 (6723): eadt6140. https://doi.org/10.1126/science.adt6140.

Siefring, Judith, and Eric T. Meyer. 2013. “Sustaining the EEBO-TCP Corpus in Transition: Report on the TIDSR Benchmarking Study.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network. https://doi.org/10.2139/ssrn.2236202.

Yang, Yi, and Jacob Eisenstein. 2016. “Part-of-Speech Tagging for Historical English.” arXiv:1603.03144 [Cs], April. http://arxiv.org/abs/1603.03144.