A collection of Encoded Archival Description (EAD) XML documents for text and content analysis.
These materials were collected between January and April 2024 to form an unannotated text corpus for content analysis. Text was identified and extracted using the eadretrieve wrapper script, which invokes xmllint, and then compiled with Unix text processing utilities.
Elements selected for the project include abstract, scopecontent, bioghist, custodhist, and head. Content was evaluated using the style command, and the results are listed in the readability-data table.
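
To give a rough sense of what the extraction step does, here is a minimal sketch in Python. It is not the eadretrieve script, which is not reproduced here; it swaps in the Python standard library for xmllint and collects the text of the five selected elements from a single EAD document. The names TARGETS, local_name, extract_text, and the SAMPLE document are hypothetical and exist only for illustration. The sketch assumes the file is well-formed XML and matches elements by local name, so it should work whether or not the document declares an EAD namespace.

```python
import xml.etree.ElementTree as ET

# Element names selected for the corpus (taken from the description above).
TARGETS = ("abstract", "scopecontent", "bioghist", "custodhist", "head")


def local_name(tag):
    """Drop a '{namespace}' prefix so namespaced and plain EAD both match."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_source, targets=TARGETS):
    """Return (element name, normalized text) pairs in document order.

    Hypothetical helper: nested elements are reported separately, so a
    <head> inside <scopecontent> appears both on its own and as part of
    its parent's text.
    """
    root = ET.fromstring(xml_source)
    results = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in targets:
            # Gather all descendant text and collapse runs of whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                results.append((name, text))
    return results


# Made-up sample document, used only to exercise the function.
SAMPLE = """<ead xmlns="urn:isbn:1-931666-22-9">
  <archdesc level="collection">
    <did>
      <abstract>Papers of a  fictional
        botanist.</abstract>
    </did>
    <scopecontent>
      <head>Scope and Content</head>
      <p>Letters and <emph>field</emph> diaries.</p>
    </scopecontent>
  </archdesc>
</ead>"""

for name, text in extract_text(SAMPLE):
    print(f"{name}\t{text}")
```

Writing one tab-separated line per element keeps the output easy to pass to Unix text utilities or to a readability tool afterwards.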