Algorithm to analyze student questions from Piazza for Infocamp 2020
I pulled our sample of ~200 questions from the Piazza discussion board for CS10: The Beauty and Joy of Computing. The data contained three data structures with student, contribution, and post information.
I used a modified version of Bloom's taxonomy and the Garrison et al. (2001) framework to label the data into four categories (a minimal encoding sketch follows the list):
- Other/Logistics
- Recall/Remember
- Exploration/Analysis
- Create/Expand
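As a minimal sketch, these four categories can be encoded as an ordinal score for modeling; the exact numeric scale used in the project is an assumption.

```python
# Hypothetical mapping from category name to the numeric "bloom" score used
# for training; the project's actual encoding may differ.
BLOOM_LABELS = {
    "Other/Logistics": 0,
    "Recall/Remember": 1,
    "Exploration/Analysis": 2,
    "Create/Expand": 3,
}
```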
We examined the data from Piazza's interface.
The `all_data` DataFrame contains the labeled data that we used to train our model. It contains 21 columns, but the ones we care about are:
- `id`: An identifier for the training example.
- `bloom`: The modified Bloom-Garrison score based on the 4 categories above.
- `question`: The text of the question.
- `folder`: The type of discussion post; we chose the categories `exams`, `problemset`, and `projects`.
- `subject`: The original discussion post title.
- `module`: The subject module, which came from another layer we labeled between `subject` and `folder`.
The original data was hierarchical, with `thread_id` connecting each post to its parent question and many layers of tags and authors. The post-level title was too low level and the folder level was too broad, so we categorized the subjects further into modules (which you can view in the `labeled/subject.csv` file).
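As a rough sketch, the relevant columns can be pulled out with pandas. The path `labeled/all_data.csv` is an assumption; only `labeled/subject.csv` is named above.

```python
import pandas as pd

# Minimal sketch: load the labeled data and keep the six columns described
# above. The file path is hypothetical.
all_data = pd.read_csv("labeled/all_data.csv")
questions = all_data[["id", "bloom", "question", "folder", "subject", "module"]]

# Quick sanity check on the label and folder distributions.
print(questions["bloom"].value_counts())
print(questions["folder"].value_counts())
```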
I split the data into an 80/20 train/validation split using scikit-learn.
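A sketch of that split, continuing from the `questions` frame above; stratifying on the label is an assumed choice, not a confirmed detail of the project.

```python
from sklearn.model_selection import train_test_split

# 80/20 train/validation split; stratifying keeps the four categories in
# similar proportions across both sets (an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    questions["question"],
    questions["bloom"],
    test_size=0.2,
    stratify=questions["bloom"],
    random_state=42,
)
```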
We would like to take the text of a question and predict its category under the Bloom-Garrison score. This is a classification problem, so we can use logistic regression to train a classifier. Recall that to train a logistic regression model we need a numeric feature matrix: each row corresponds to one question and each column to a feature derived from it.

To identify features that distinguish the question categories, we compared the distribution of a single feature within each category.
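One way to build that feature matrix from the question text is a TF-IDF vectorizer feeding the logistic regression; the vectorizer choice here is an assumption, not the project's confirmed featurization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each row of the TF-IDF matrix is one question, each column one term weight.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english", min_df=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```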
Here is a histogram of each folder's bloom scores:
And a correlation plot across some of the other features.
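A rough sketch of how plots like these could be produced with pandas and matplotlib; the project's actual plotting code is not shown here, and `question_len` is a hypothetical derived feature.

```python
import matplotlib.pyplot as plt

# Overlaid histograms of bloom scores, one per folder.
for folder, grp in questions.groupby("folder"):
    grp["bloom"].plot(kind="hist", alpha=0.5, label=folder)
plt.xlabel("bloom score")
plt.legend()
plt.show()

# Correlation across a couple of numeric features, e.g. question length.
numeric = questions.assign(question_len=questions["question"].str.len())
print(numeric[["bloom", "question_len"]].corr())
```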
I aggregated the data into total question counts and predicted bloom proportions at the "class", "subject", and "student" levels. This made the data easy to visualize on the dashboard.
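A sketch of that aggregation; `student` and `predicted_bloom` are assumed column names (the latter holding the classifier's predictions), not columns listed above.

```python
def aggregate(df, level):
    """Total question counts and predicted-bloom proportions at one level."""
    counts = df.groupby(level).size().rename("n_questions")
    proportions = (
        df.groupby(level)["predicted_bloom"]  # assumed column of model predictions
          .value_counts(normalize=True)
          .rename("proportion")
          .reset_index()
    )
    return counts, proportions

# e.g. per-subject summaries for the dashboard
subject_counts, subject_props = aggregate(questions, "subject")
```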
Here is a plot of the student data:
Our sample size was very small, so our model has high variance and a single train/validation split could give a misleading estimate of performance. We used 5-fold cross-validation to resample and retrain our model.
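A sketch of the 5-fold cross-validation over the full labeled set, reusing the `clf` pipeline above; accuracy as the scoring metric is an assumption.

```python
from sklearn.model_selection import cross_val_score

# 5-fold CV gives a more stable performance estimate than a single split.
scores = cross_val_score(clf, questions["question"], questions["bloom"], cv=5)
print("mean accuracy:", scores.mean(), "+/-", scores.std())
```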
Our final dashboard is a full-stack web app built with Django and HTML/JavaScript. The GitHub repo is here.