This repository contains the code for my PhD project "What's left in the bag for latent variable language modelling", a joint doctorate between Radboud University Nijmegen, the Netherlands, and KU Leuven, Belgium. In this project I study bag-of-words representations for language modelling and try to find information in the bag of words that is currently unexploited, such as skipgrams. Bayesian models are a prime example of latent variable models, and it is this intersection of language modelling and Bayesian statistics that I find interesting.
Our main model is a hierarchical Pitman-Yor language model based on skipgrams. The models generated by this toolkit are language agnostic.
The toolkit is based on a fork of cpyp (https://github.com/redpony/cpyp), which I extended with Colibri-core (https://github.com/proycon/colibri-core).
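To make the notion of a skipgram concrete, here is a minimal Python sketch. It is not this toolkit's code and does not use the cpyp or Colibri-core APIs. It assumes a skipgram is an n-gram in which one or more interior tokens are replaced by a gap; the `{*}` placeholder and the `skipgrams` function name are hypothetical and chosen for illustration only.

```python
from itertools import combinations

GAP = "{*}"  # hypothetical placeholder marking a skipped token


def skipgrams(tokens, n):
    """Return all skipgrams of length n found in tokens.

    Assumption: a skipgram is an n-gram with at least one interior
    token replaced by GAP; the first and last tokens are always kept.
    """
    results = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        interior = range(1, n - 1)  # positions that may become gaps
        for k in range(1, n - 1):   # number of gaps: 1 .. n-2
            for gap_positions in combinations(interior, k):
                pattern = tuple(
                    GAP if i in gap_positions else tok
                    for i, tok in enumerate(window)
                )
                results.append(pattern)
    return results


if __name__ == "__main__":
    for pattern in skipgrams("to be or not".split(), 3):
        print(" ".join(pattern))
```

For the sentence "to be or not" and n = 3, this yields the patterns "to {*} or" and "be {*} not". In a skipgram language model, such patterns can act as additional contexts alongside the ordinary n-grams.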
More info and results will be added later.