DH_collab

The show and tell for code from our similar but different DH group projects.

Remember to credit people and write your names on what you make <3

Description of contents

clustering_documents This is very much a work in progress: an attempt to apply a k-means clustering algorithm to parts of the OB corpus.
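For anyone new to the idea, here is a minimal sketch of k-means on documents. It is not the code in the folder: the toy texts, the plain word-count vectors and the helper names (`vectorize`, `sq_dist`, `kmeans`) are all assumptions made for illustration, and a real run would more likely use a library such as scikit-learn.

```python
import random
from collections import Counter

# Hypothetical toy texts standing in for corpus documents.
texts = [
    "theft watch silver theft",
    "theft silver watch shop",
    "murder knife blood murder",
    "murder blood knife wound",
]

def vectorize(texts):
    """Turn each text into a word-count vector over the shared vocabulary."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # start from k distinct documents
    labels = []
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans(vectorize(texts), k=2)
print(labels)
```

On this toy input the two theft texts end up in one cluster and the two murder texts in the other; which cluster gets label 0 depends on the random start.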

tf-idf Contains a notebook that runs a multi-word search query (written as regex patterns) against any portion of the OB corpus. It calculates some basic word-frequency statistics and finally computes tf-idf for the search terms in the retrieved documents (this last step is still to be completed).

The code was written by Soeren Fomsgaard and Stella Verkijk.
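The tf-idf workflow described above (search with regex patterns, then score the search terms in the retrieved documents) could look roughly like the sketch below. This is not the notebook itself: the toy documents, the function names (`search`, `tokenize`, `tf_idf`) and the plain `tf * log(N / df)` weighting are assumptions, and the notebook may use a different tf-idf variant.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents standing in for corpus texts.
documents = {
    "doc1": "the prisoner stole a silver watch from the shop",
    "doc2": "the watch was found in the prisoner's pocket",
    "doc3": "the witness saw nothing at the shop",
}

def search(docs, patterns):
    """Return the documents matching at least one of the regex patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return {name: text for name, text in docs.items()
            if any(rx.search(text) for rx in compiled)}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs, term):
    """tf-idf of `term` per document: (count / doc length) * log(N / df)."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    df = sum(1 for toks in tokens.values() if term in toks)
    if df == 0:  # term absent everywhere: score is zero for every document
        return {name: 0.0 for name in docs}
    idf = math.log(len(docs) / df)
    return {name: Counter(toks)[term] / len(toks) * idf
            for name, toks in tokens.items()}

# A multi-word query expressed as regex patterns.
hits = search(documents, [r"silver\s+watch", r"\bshop\b"])
scores = tf_idf(hits, "watch")
print(sorted(hits), scores)
```

Note that a term occurring in every retrieved document gets an idf of zero under this weighting, which is one reason to compare scores across a larger retrieved set.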

speech Quirine Smit's work on speech extraction and gender identification?

occupation Vivian Claes' code for extracting occupations (pre- and post-1834), accounting for the change in formatting. The SPARQL script (found in the txt file) works better than the XML code.