- Bitsavers PDF Document Archive
An archive of 27,600 computer manuals containing over 2.9 million pages.
- British Parliament (Hansard) 1803-2005
7.6 million speeches, 1.6 billion words. This Hansard corpus (a collection of texts) contains nearly every speech given in the British Parliament from 1803 to 2005, and it allows you to search these speeches (including semantically based searches) in ways that are not possible with any other resource.
- Corpus of Contemporary American English
The only large and "balanced" corpus of American English. It contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, TV and movie subtitles, blogs, and other web pages.
- The DreamBank
A collection of over 20,000 dream reports. The reports come from a variety of sources and research studies, from people ages 7 to 74, and they can be analyzed using the search engine and statistical programs built into the site.
- European Parliament Proceedings Parallel Corpus 1996-2011
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
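  Each language pair is distributed as two line-aligned plain-text files, so sentence pairs can be read by zipping the files together. A minimal sketch, assuming a French-English pair whose file names follow the `europarl-v7.fr-en.*` naming used by recent releases (adjust to whatever files you actually download):

  ```python
  # Read (source, target) sentence pairs from two line-aligned Europarl files.
  # The file names below are assumptions based on the usual release naming.
  from itertools import islice

  def read_parallel(src_path: str, tgt_path: str):
      """Yield (source, target) sentence pairs from two line-aligned text files."""
      with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
          for s, t in zip(src, tgt):
              yield s.strip(), t.strip()

  # Print the first three aligned French-English sentence pairs.
  for fr, en in islice(read_parallel("europarl-v7.fr-en.fr", "europarl-v7.fr-en.en"), 3):
      print(f"FR: {fr}\nEN: {en}\n")
  ```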
- The University of Oxford Text Archive
The University of Oxford Text Archive (OTA) develops, collects, catalogues and preserves electronic literary and linguistic resources for use in higher education research, teaching and learning. The OTA also gives advice on the creation and use of these resources, and is involved in the development of standards and infrastructure for electronic language resources.
- The FOIA XML Schema
Federal agencies publish FOIA information in accordance with guidelines prepared by the U.S. Department of Justice Office of Information Policy. These guidelines describe the format and meaning of FOIA annual report information. In addition, a FOIA Annual Report XML schema has been developed, allowing agency FOIA annual report information to be represented and exchanged in a standardized format. This XML schema closely follows the structure and terminology of the guidance document, and conforms to the NIEM standard (http://niem.gov).
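  To get a feel for how an agency's annual-report XML instance is organized before reading the schema itself, one quick approach is to list the element tags it contains. A minimal sketch, assuming you have downloaded one agency's report to a local file (the filename below is a placeholder):

  ```python
  # Count the element tags that appear in a FOIA annual-report XML file.
  # "agency_foia_report.xml" is a placeholder filename; substitute the file
  # you actually downloaded from an agency's FOIA reporting page.
  import xml.etree.ElementTree as ET
  from collections import Counter

  tree = ET.parse("agency_foia_report.xml")
  tags = Counter(elem.tag for elem in tree.iter())

  # NIEM-based documents are heavily namespaced, so tags print as
  # "{namespace-uri}LocalName"; the counts still show which elements dominate.
  for tag, count in tags.most_common(20):
      print(f"{count:6d}  {tag}")
  ```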
- Harvard Library Bibliographic Dataset
This dataset contains over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
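  If the records are distributed in the library-standard MARC format (an assumption here; check the dataset's documentation for the formats actually offered), they can be read with the pymarc library. A minimal sketch with a placeholder filename:

  ```python
  # Print the title statement (MARC tag 245) of the first ten records.
  # "harvard_records.mrc" is a placeholder filename, and MARC is assumed
  # to be the distribution format; verify both against the dataset docs.
  from pymarc import MARCReader  # pip install pymarc

  with open("harvard_records.mrc", "rb") as fh:
      reader = MARCReader(fh)
      for i, record in enumerate(reader):
          if record is None:
              continue  # skip records pymarc could not parse
          title_fields = record.get_fields("245")
          if title_fields:
              print(title_fields[0].value())
          if i >= 9:
              break
  ```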
- Library of Congress Twitter Archive
In April 2010, the Library and Twitter signed an agreement providing the Library the public tweets from the company's inception through the date of the agreement, an archive of tweets from 2006 through April 2010. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. The Library's first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date. This month, all those objectives will be completed. The Library now has an archive of approximately 170 billion tweets and growing. The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012.
- NSA Primary Sources
The NSA's domestic spying program, known in official government documents as the "President's Surveillance Program" ("The Program"), was implemented by President George W. Bush shortly after the attacks of September 11, 2001. The US Government still considers the Program officially classified, but a tremendous amount of information has been exposed by various whistleblowers, acknowledged by government officials during Congressional hearings and in public statements, and reported in investigations by major newspapers across the country.
- NYC Open Data
NYC Open Data makes the wealth of public data generated by various New York City agencies and other City organizations available for public use. As part of an initiative to improve the accessibility, transparency, and accountability of City government, this catalog offers access to a repository of government-produced, machine-readable data sets.
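  Most datasets in the catalog can also be pulled programmatically as JSON. A minimal sketch, assuming the dataset you want is exposed through the portal's Socrata-style resource endpoints (the dataset identifier below is a placeholder; each dataset's page lists its real API endpoint):

  ```python
  # Fetch the first few rows of an NYC Open Data dataset as JSON.
  # "abcd-1234" is a placeholder dataset identifier.
  import requests

  url = "https://data.cityofnewyork.us/resource/abcd-1234.json"
  resp = requests.get(url, params={"$limit": 5}, timeout=30)
  resp.raise_for_status()

  for row in resp.json():
      print(row)
  ```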
- Perseus Corpus and the Leipzig Corpus of Open Greek and Latin
One of the most significant classics collections online. The Perseus project comprises about 17 million words of Greek and Latin along with about 9 million words of English translation.
- Public Data Sets on AWS
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications.
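  Many of these data sets are exposed as public Amazon S3 buckets, so they can be browsed without an AWS account by making unsigned requests. A minimal sketch, assuming the data set you are interested in lives in such a bucket (the bucket name below is a placeholder; use the one given on the data set's detail page):

  ```python
  # List a few objects in a public S3 bucket without AWS credentials.
  # "example-public-dataset" is a placeholder bucket name.
  import boto3
  from botocore import UNSIGNED
  from botocore.config import Config

  s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
  resp = s3.list_objects_v2(Bucket="example-public-dataset", MaxKeys=10)

  for obj in resp.get("Contents", []):
      print(f'{obj["Size"]:>12}  {obj["Key"]}')
  ```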
- Sci-Hub download data
These data include 28 million download request events from the server logs of Sci-Hub from 1 September 2015 through 29 February 2016. The uncompressed 2.7 gigabytes of data are separated into 6 data files, one for each month, in tab-delimited text format.
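  Because the files are plain tab-delimited text, one per month, they load directly into standard tools. A minimal sketch using pandas, assuming you have downloaded one of the monthly files (the filename is a placeholder, and the column layout is whatever the data description specifies):

  ```python
  # Load one month of Sci-Hub request logs from a tab-delimited file.
  # "scihub_2015-09.tab" is a placeholder filename; check the dataset's
  # documentation for the real file names and column meanings.
  import pandas as pd

  df = pd.read_csv("scihub_2015-09.tab", sep="\t", header=None)

  print(df.shape)   # (rows, columns) for this month's download events
  print(df.head())  # first few request records
  ```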
- Webb Spam Corpus 2011
Web spam is defined as Web pages that are created to manipulate search engines and deceive Web users. As such, Web spam is regarded as one of the most important challenges currently facing search engines and Web users, and recent studies suggest that it accounts for a significant portion of all Web content. Although the problems associated with Web spam have been widely acknowledged, research efforts have been limited by the lack of a publicly available Web spam data set. To help address this, the Webb Spam Corpus was created. The Webb Spam Corpus 2011 was collected by De Wang (the original Webb Spam Corpus was collected by Steve Webb) as part of the Denial of Information Project at the Georgia Institute of Technology. It is a first-of-its-kind, large-scale, publicly available Web spam data set that was created using a novel, fully automated Web spam collection method. The corpus consists of nearly 350,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited Web spam data set.
- Yahoo Labs
Yahoo Labs shares various types of data, categorized into Ratings, Language, Graph, Advertising and Market Data, Computing Systems, and an appendix of other relevant data and resources available via the Yahoo! Developer Network.