Data Science Curricula Topic Modeling

This project applies topic modeling to various Data Science course calendars. The goal is to use Latent Dirichlet Allocation (LDA), a three-level hierarchical Bayesian model, to model pathways through degrees. LDA treats each document in a collection as a random mixture over a set of latent topics, where each topic is itself a distribution over the vocabulary. This lets us cluster words into latent topics and discover common learning outcomes in Data Science curricula.
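
For reference, the three levels can be written out using the standard formulation of LDA. The symbols $\theta$ (a document's topic proportions), $z_n$ (the topic assigned to word $n$), $w_n$ (the observed word), $\alpha$ (the Dirichlet prior) and $N$ (the number of words in the document) are notation introduced here for illustration and are not taken from the notebook; $\beta$ is the topic-word matrix. For a single document:

$$
p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
$$

Here $\theta \sim \mathrm{Dirichlet}(\alpha)$, each $z_n \sim \mathrm{Multinomial}(\theta)$, and each $w_n$ is drawn from the word distribution of topic $z_n$. The $\gamma$ matrix mentioned under Data Analysis presumably holds the fitted estimates of $\theta$ for each document.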

Data Generating Process

The data was collected by Hedgemon4. From their dataset we extract the list of course codes in a given university's course calendar and the corresponding course descriptions. We then construct degree pathways by sampling courses from the calendars, which gives a reasonable approximation of a student's journey through a curriculum. Two notable limitations are that we do not check prerequisites and do not include non-science courses.
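
The sampling step can be sketched as follows. This is a minimal illustration in Python (the project itself is written in R), and it assumes uniform sampling without replacement; the calendar contents, the function name `sample_pathway`, and the pathway length are all hypothetical and not taken from the notebook.

```python
import random

# Hypothetical course calendar: course code -> description (illustrative only).
calendar = {
    "DATA 101": "Introduction to data science and statistical thinking.",
    "STAT 230": "Probability, random variables, and distributions.",
    "COSC 111": "Programming fundamentals and algorithm design.",
    "MATH 221": "Matrix algebra and linear systems.",
    "DATA 311": "Machine learning and predictive modelling.",
}


def sample_pathway(calendar, n_courses, rng):
    """Draw n_courses distinct courses uniformly at random.

    Prerequisites are deliberately ignored, mirroring the limitation
    described above.
    """
    codes = rng.sample(sorted(calendar), n_courses)
    return [(code, calendar[code]) for code in codes]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
pathway = sample_pathway(calendar, 3, rng)
for code, description in pathway:
    print(code, "-", description)
```

Seeding the generator keeps repeated runs comparable, which matters when many pathways are sampled per university.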

Data Analysis

In our application of LDA, the documents are the courses and their descriptions. The terms are therefore individual words, and LDA clusters these words into $K$ topics.
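
To make the document and term structure concrete, the sketch below builds per-course word counts, which is the kind of input LDA consumes. It is a Python illustration only (the analysis itself is in R); the two course descriptions, the tiny stop-word list, and the `tokenize` helper are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for"}

# Hypothetical documents: one per course, keyed by course code.
docs = {
    "STAT 230": "Introduction to probability and the theory of random variables.",
    "DATA 311": "Introduction to machine learning for the analysis of data.",
}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


# One bag of words per document: term -> count.
term_counts = {code: Counter(tokenize(desc)) for code, desc in docs.items()}

# The vocabulary is the set of all terms seen across documents.
vocabulary = sorted(set().union(*term_counts.values()))
print(vocabulary)
```

Each `Counter` corresponds to one row of a document-term matrix whose columns are the entries of `vocabulary`.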

The notebook introduces LDA and common Natural Language Processing (NLP) practices for cleaning text data. We then explore how to choose an optimal number of topics $K$ and visualize the resulting topics using LDAvis. From there we use the $\gamma$ matrix (per-document topic proportions) to visualize which topics each university's curriculum tends to comprise, and conclude with word clouds built from the $\beta$ matrix (per-topic word probabilities).
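
One plausible way to summarize topics by university is to average the document-level topic proportions over each university's documents. The sketch below shows that aggregation in Python (the notebook itself uses R and may do this differently); the university labels, the numbers, and the `mean_topic_mix` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of the gamma matrix: (university, topic proportions).
# Each row sums to 1 across the K = 3 topics.
gamma = [
    ("University A", [0.7, 0.2, 0.1]),
    ("University A", [0.5, 0.4, 0.1]),
    ("University B", [0.1, 0.3, 0.6]),
]


def mean_topic_mix(rows):
    """Average the topic proportions over all documents of each university."""
    sums = {}
    counts = defaultdict(int)
    for university, proportions in rows:
        if university not in sums:
            sums[university] = [0.0] * len(proportions)
        sums[university] = [s + p for s, p in zip(sums[university], proportions)]
        counts[university] += 1
    return {u: [s / counts[u] for s in total] for u, total in sums.items()}


mix = mean_topic_mix(gamma)
for university, proportions in mix.items():
    print(university, [round(p, 2) for p in proportions])
```

Because every input row sums to 1, each averaged row does too, so the result can be read directly as a university's overall topic mixture.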

The notebook is available on GitHub Pages and may be viewed directly here. The analysis is conducted in R and presented using Quarto.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Science Curricula Topic Modeling

Data Generating Process

Data Analysis

About

Releases

Packages

Languages

Danyulll/Curricula-Topic-Modeling
