The show and tell for code from our similar but different DH group projects.
Remember to credit people and write your names on what you make <3
clustering_documents This is very much a work in progress: an attempt to apply a k-means clustering algorithm to parts of the OB corpus.
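For readers new to k-means, here is a minimal sketch of the idea in plain Python. It is not the code in this folder: the example texts are invented placeholders rather than real corpus documents, the function names (vectorize, squared_distance, kmeans) are made up for illustration, and the choice of raw term counts and of the first k documents as starting centroids is an assumption made to keep the example small and deterministic.

```python
import re
from collections import Counter


def vectorize(texts):
    """Turn texts into raw term-count vectors over a shared, sorted vocabulary."""
    counts = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[word] for word in vocab] for c in counts]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Lloyd's algorithm. Assumption: the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented placeholder texts, not taken from the corpus.
texts = [
    "stole a silver watch",
    "murder with a knife",
    "stole a gold watch",
    "stabbed with a knife",
]
labels = kmeans(vectorize(texts), k=2)
print(labels)
```

On real documents you would probably want tf-idf weighting and a library implementation instead of raw counts and a hand-rolled loop; this only shows the assign-then-update cycle.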
tf-idf Contains a notebook that runs a multiword search query (given as regex patterns) over any portion of the OB corpus. It calculates some basic word-frequency statistics and finally computes tf-idf for the search terms in the retrieved documents (this is still to be completed).
The code was written by Soeren Fomsgaard and Stella Verkijk.
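The steps described above (regex query, word frequencies, tf-idf) can be sketched roughly as follows. This is not the notebook's code. The documents are invented placeholders, the function names (tokenize, search, word_frequencies, tf_idf) are hypothetical, and the formulas are one common choice among several: tf is the number of regex matches divided by the document's token count, and idf is log(N / df) over the whole collection.

```python
import math
import re
from collections import Counter

# Invented placeholder documents, not real corpus text.
documents = {
    "doc1": "the prisoner stole a silver watch from the shop",
    "doc2": "the witness saw the prisoner near the shop",
    "doc3": "a gold watch and a silver spoon were found",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def search(docs, patterns):
    """Return ids of documents in which every regex pattern matches."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [d for d, text in docs.items() if all(c.search(text) for c in compiled)]


def word_frequencies(docs, doc_ids):
    """Count token occurrences across the given documents."""
    freq = Counter()
    for d in doc_ids:
        freq.update(tokenize(docs[d]))
    return freq


def tf_idf(docs, doc_ids, patterns):
    """Return {doc_id: {pattern: score}} for the given documents."""
    n_docs = len(docs)
    scores = {d: {} for d in doc_ids}
    for p in patterns:
        regex = re.compile(p, re.IGNORECASE)
        # Document frequency: documents in the whole collection that match.
        df = sum(1 for text in docs.values() if regex.search(text))
        for d in doc_ids:
            matches = sum(1 for _ in regex.finditer(docs[d]))
            n_tokens = len(tokenize(docs[d]))
            tf = matches / n_tokens if n_tokens else 0.0
            idf = math.log(n_docs / df) if df else 0.0
            scores[d][p] = tf * idf
    return scores


query = [r"\bsilver\b", r"\bwatch(es)?\b"]
hits = search(documents, query)
freq = word_frequencies(documents, hits)
scores = tf_idf(documents, hits, query)
print(hits)
print(freq.most_common(3))
print(scores)
```

A term that occurs in every document gets an idf of zero under this formula, which is why smoothed variants are often preferred on small collections.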
speech Quirine Smit's work on speech extraction and gender-identification?
occupation Vivian Claes' code for extracting occupations (pre- and post-1834), accounting for the change in formatting. The SPARQL script (found in the txt file) works better than the XML code.