This project applies topic modeling to various Data Science course calendars. The goal is to use Latent Dirichlet Allocation (LDA), a three-level hierarchical Bayesian model, to model pathways through degrees. LDA treats each document in a collection as a random mixture over an underlying set of topics, where each topic is itself a distribution over the vocabulary. This lets us cluster words into latent topics and discover common learning outcomes in Data Science curricula.
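To make the "mixture over topics" idea concrete, here is a minimal sketch of the LDA generative story. It is written in Python with the standard library only, purely for illustration; it is not the project's R code, and the vocabulary, the number of topics, and the Dirichlet parameters are all made-up assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, n_words, rng):
    """Generate one document by following the LDA generative story."""
    # Document level: topic mixture theta ~ Dirichlet(alpha)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Word level: pick a topic index z according to theta ...
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # ... then pick a word from that topic's distribution over the vocabulary
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words

rng = random.Random(0)

# Hypothetical toy vocabulary (not taken from the project's data)
vocab = ["regression", "probability", "inference", "database", "query", "sql"]

# Corpus level: each topic is a distribution over the vocabulary
topics = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(2)]

theta, words = generate_document(topics, vocab, alpha=[0.8, 0.8], n_words=8, rng=rng)
print("topic mixture:", theta)
print("document:", words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topics and each document's mixture.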
The data were collected by Hedgemon4. From their dataset, we extract the list of course codes in a given university's course calendar, along with the corresponding course descriptions. We then construct degree pathways by sampling courses from the calendars, which gives a reasonable approximation of a student's journey through a curriculum. Two limitations are worth noting: we do not check prerequisites, and we do not include non-science courses.
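The pathway construction described above can be sketched as follows. This is an illustrative Python sketch, not the project's R implementation: the course codes and descriptions are hypothetical, and uniform sampling without replacement is an assumption about how courses are drawn.

```python
import random

# Hypothetical calendar: course code -> description (illustrative entries only)
calendar = {
    "DATA 101": "Introduction to data wrangling and visualization.",
    "STAT 230": "Probability, random variables, and statistical inference.",
    "COSC 304": "Relational databases, SQL, and query processing.",
    "MATH 221": "Matrix algebra and linear systems.",
    "DATA 311": "Supervised and unsupervised machine learning.",
}

def sample_pathway(calendar, n_courses, rng):
    """Sample distinct courses uniformly at random.

    Prerequisites are deliberately not checked, mirroring the
    limitation noted in the text.
    """
    codes = rng.sample(sorted(calendar), n_courses)
    return [(code, calendar[code]) for code in codes]

rng = random.Random(42)
pathway = sample_pathway(calendar, n_courses=3, rng=rng)
for code, description in pathway:
    print(code, "-", description)
```

Sorting the course codes before sampling keeps the result reproducible for a fixed seed, regardless of how the calendar was built.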
In our application of LDA we model all courses and their descriptions as our documents. This then means our terms are individual words and LDA clusters our words into
The notebook introduces LDA and common Natural Language Processing (NLP) practices to clean text data. We then explore how one discovers an optimal
The notebook is hosted on GitHub Pages and may be viewed directly here. The analysis is conducted in R and presented using Quarto.