- author: Jenson Chang, Jingyuan Wang, Catherine Meng, Siddarth Subrahmanian
Demo of a data analysis project for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.
Here we build a classification model using the k-Nearest Neighbors algorithm to predict student dropout and academic success from information available at enrollment (including academic path, demographics, and socio-economic factors). Our final classifier achieved a cross-validation score of 0.72 on the training data and a similar score on unseen test data, so although its accuracy is moderate, it performs consistently. Because the data was collected from a single institution, a larger dataset may be necessary to generalize predictions to other institutions or countries. We believe this model can be a starting point for institutions to identify and support students at risk of dropout. However, the model can be developed further by combining academic data with social/economic data to improve the predictions and give stakeholders a more comprehensive view of the potential causes of student dropout. We recommend this improvement because it would enable institutions to direct their limited resources toward maximum student support.
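To make the modelling idea concrete, here is a minimal, dependency-free sketch of k-Nearest Neighbors classification scored with k-fold cross-validation and a held-out test set. It is an illustration only, not the project's pipeline: the toy data, the class labels, the choice of k = 5 and five folds, and all function names are assumptions made for this example.

```python
import random
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k=5):
    """Label x by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=5):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def cross_val_accuracy(X, y, k=5, folds=5):
    """Mean validation accuracy over `folds` interleaved folds."""
    scores = []
    for fold in range(folds):
        held_out = set(range(fold, len(X), folds))
        fit_X = [x for i, x in enumerate(X) if i not in held_out]
        fit_y = [label for i, label in enumerate(y) if i not in held_out]
        val_X = [X[i] for i in sorted(held_out)]
        val_y = [y[i] for i in sorted(held_out)]
        scores.append(accuracy(fit_X, fit_y, val_X, val_y, k))
    return sum(scores) / folds


def make_toy_data(n_per_class=60, seed=0):
    """Two well-separated Gaussian clusters with hypothetical class labels."""
    rng = random.Random(seed)
    centres = {"dropout": (0.0, 0.0), "graduate": (4.0, 4.0)}  # made-up values
    rows = [((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label)
            for label, (cx, cy) in centres.items()
            for _ in range(n_per_class)]
    rng.shuffle(rows)
    X, y = zip(*rows)
    return list(X), list(y)


if __name__ == "__main__":
    X, y = make_toy_data()
    cut = int(0.8 * len(X))  # hold out the last 20% as unseen test data
    cv_score = cross_val_accuracy(X[:cut], y[:cut])
    test_score = accuracy(X[:cut], y[:cut], X[cut:], y[cut:])
    print(f"cross-validation accuracy: {cv_score:.2f}")
    print(f"test accuracy: {test_score:.2f}")
```

Comparing the cross-validation score with the held-out test score is what "performs consistently" refers to: the two numbers should be close if the model is not overfitting the training data.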
The data set was created by Mónica Vieira Martins, Jorge Machado, Luís Baptista and Valentim Realinho at the Instituto Politécnico de Portalegre (M. V. Martins, D. Tolledo, J. Machado, L. M. T. Baptista, V. Realinho. 2021). It is sourced from UC Irvine's Machine Learning Repository and can be found here. The data contains demographic, enrollment and academic (1st and 2nd semesters) information on the students. Each row in the data set represents a student record, and there are 36 columns in total. We use these data to build a model that predicts the academic outcome of each student.
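A quick sanity check on a downloaded copy of the data is to confirm the column count stated above. The sketch below uses only the standard library; the function name, the file path you pass in, and the `;` delimiter are assumptions for illustration and may not match the actual file.

```python
import csv

EXPECTED_COLUMNS = 36  # column count stated in this README


def check_column_count(path, expected=EXPECTED_COLUMNS, delimiter=";"):
    """Read the header row of a delimited file and verify its width.

    `path` and `delimiter` are hypothetical inputs; adjust them to the
    file you actually downloaded.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter))
    if len(header) != expected:
        raise ValueError(f"expected {expected} columns, found {len(header)}")
    return header
```

The function returns the header so the column names can be inspected, and raises `ValueError` when the width differs, which usually signals a wrong delimiter or a truncated download.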
The final report can be found here.
Run Jupyter Notebook
- Clone this GitHub repository.
- Navigate to the root of the project and run the following command in the terminal:
  docker compose up
  This container will run Jupyter Notebook on the default port, 8888. Make sure no other application is using this port.
- Navigate to the root of this project on your computer using the command line and enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):
  make clean
- To run the analysis in its entirety, enter the following command in the terminal in the project root:
  make all
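Since the container needs port 8888, it can help to check that the port is free before running docker compose up. The following is a small sketch using Python's standard socket module; the function name and the 0.5 second timeout are arbitrary choices for this example, and it only detects services that accept connections on the local loopback address.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when a listener accepted the connection
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    if port_is_free(8888):
        print("Port 8888 looks free.")
    else:
        print("Port 8888 is already in use; stop the other application first.")
```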
Clean Up
- Press Ctrl + C in the terminal to shut down the Jupyter Notebook.
- Use the following command to remove the container:
  docker compose rm
Folder Structure
- data: Contains both raw and processed data
- img: Contains images used in the README
- report: Contains .html and .pdf versions of the final report, as well as the .qmd file used to generate the report
- results: Contains figures and models exported by the analysis scripts in scripts
- scripts: Contains Python scripts used to perform data processing, analysis and model training
- src: Contains source code for functions used by analysis scripts in scripts
- test: Contains unit tests for functions in src
- Add the dependency to the environment.yml file on a new branch.
- Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.
- Re-build the Docker image locally to ensure it builds and runs properly.
- Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA for the commit that changed the file.
- Update the docker-compose.yml file on your branch to use the new container image (make sure to update the tag specifically).
- Send a pull request to merge the changes into the main branch.
The Academic Success Prediction report contained herein is licensed under the Creative Commons Attribution 2.5 Canada License (CC BY 2.5 CA). See the license file for more information. If re-using/re-mixing, please provide attribution and link to this webpage. The software code contained within this repository is licensed under the MIT license. See the license file for more information.
Bantilan, Niels. 2020. “Pandera: Statistical Data Validation of Pandas Dataframes.” In SciPy, 116–24.
Kramer, Oliver. 2016. “Scikit-Learn.” Machine Learning for Evolution Strategies, 45–53.
McKinney, Wes et al. 2011. “Pandas: A Foundational Python Library for Data Analysis and Statistics.” Python for High Performance and Scientific Computing 14 (9): 1–9.
Python, Why. 2021. “Python.” Python Releases for Windows 24.
Realinho, Valentim, Jorge Machado, Luís Baptista, and Mónica V Martins. 2022. “Predicting Student Dropout and Academic Success.” Data 7 (11): 146.
VanderPlas, Jacob, Brian Granger, Jeffrey Heer, Dominik Moritz, Kanit Wongsuphasawat, Arvind Satyanarayan, Eitan Lees, Ilia Timofeev, Ben Welsh, and Scott Sievert. 2018. “Altair: Interactive Statistical Visualizations for Python.” Journal of Open Source Software 3 (32): 1057.