Corpus-DB is a textual corpus database for the digital humanities. This project aggregates public domain texts, enhances their metadata from sources like Wikipedia, and makes those texts available according to that metadata. This will make it easy to download subcorpora like:
- Bildungsromans
- Dickens novels
- Poetry published in the 1880s
- Novels set in London
Corpus-DB has several components:
- Scripts for aggregating metadata, written in Python
- The database, currently a few SQLite databases
- A REST API for querying the database, written in Haskell (currently in progress)
- Analytic experiments, mostly in Python
Read more about the database at this introductory blog post. Scripts used to generate the database are in the gitenberg-experiments repo.
I could use some help with this, especially if you know Python or Haskell, have library or bibliography experience, or simply like books. Get in touch in the chat room, or contact me via email.
If you want to build the website and API, you'll need the Haskell tool stack. With it installed, run:
stack build
cd src
export ENV=dev
stack runhaskell Main.hs
Setting ENV=dev sets the database path to /data/dev.db, a 30-row subset of the main database. The main database is too big (16 GB at the moment) to put on GitHub. You can use this dev database for hacking around. If you need the full database for some reason, let me know.
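If you'd like to look inside the dev database before running the API, you can inspect it from Python. The following is a minimal sketch, assuming only that the file is an ordinary SQLite database. The helper name `list_tables` is hypothetical, and the path you pass should point to wherever dev.db lives in your checkout. It reads table names from SQLite's built-in `sqlite_master` catalog, so it makes no assumptions about the Corpus-DB schema.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file.

    Hypothetical helper, not part of Corpus-DB.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's own catalog of tables, indexes, and views.
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            # Names come from the catalog itself; quote them as identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling `list_tables("path/to/dev.db")` should give a quick overview of which tables exist and how many rows each holds.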
I'm rewriting corpus-db from scratch (see issues labeled 2.0). This is to make the whole toolchain in Corpus-DB repeatable, in case of data loss, and future-proof, so that it can ingest new texts from Project Gutenberg and other sources as they arrive. Feel free to help out with this!
- Parse Project Gutenberg RDF/XML metadata, and put it into a database.
- Mirror PG, using an rsync script.
- Clean PG texts, and add them to that database. Also add HTML files.
- Write an ORM-level database layer, using Persistent, for more native DB interactions and typesafe queries.
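As a starting point for the first step above, here is a rough Python sketch of parsing one Project Gutenberg RDF/XML record and storing it in SQLite. It is an illustration under stated assumptions, not the project's actual pipeline: the function names `parse_ebook` and `store`, the `metadata` table layout, and the hand-written `SAMPLE` record are all hypothetical, and the namespace URIs and element layout reflect my understanding of the PG catalog format, so check them against a real catalog file.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespaces assumed to match Project Gutenberg's RDF catalog files.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written sample record for illustration; real records carry more fields.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:title>Pride and Prejudice</dcterms:title>
    <dcterms:creator>
      <pgterms:agent>
        <pgterms:name>Austen, Jane</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>"""


def parse_ebook(rdf_xml):
    """Extract (id, title, author) from one RDF/XML record."""
    root = ET.fromstring(rdf_xml)
    ebook = root.find("pgterms:ebook", NS)
    about = ebook.get("{%s}about" % NS["rdf"])  # e.g. "ebooks/1342"
    title = ebook.findtext("dcterms:title", default=None, namespaces=NS)
    author = ebook.findtext(
        "dcterms:creator/pgterms:agent/pgterms:name",
        default=None, namespaces=NS)
    return about.rsplit("/", 1)[-1], title, author


def store(conn, record):
    """Insert or update one record in a hypothetical metadata table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(id TEXT PRIMARY KEY, title TEXT, author TEXT)")
    # INSERT OR REPLACE keeps re-runs idempotent.
    conn.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)", record)
    conn.commit()
```

Keying the table on the Gutenberg ID and using `INSERT OR REPLACE` means the script can be re-run over the same files without duplicating rows, which fits the goal of a repeatable toolchain.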