UBC-MDS/academic-success-prediction

Academic Success Prediction

  • authors: Jenson Chang, Jingyuan Wang, Catherine Meng, Siddarth Subrahmanian

Demo of a data analysis project for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.

About

Here we attempt to build a classification model using the k-Nearest Neighbors algorithm to predict student dropout and academic success based on information available at enrollment (including academic path, demographics, and socio-economic factors). Our final classifier achieved a cross-validation score of 0.72 on the training data and a similar score on unseen test data. Although this accuracy is moderate, the model performs consistently. Given that the data was collected from a single institution, a larger dataset may be necessary to generalize predictions to other institutions or countries. We believe this model can be a starting point for institutions to identify and support students at risk of dropout. However, the model can be developed further by combining academic data with social/economic data to improve the prediction and provide stakeholders with a more comprehensive view of the potential causes of student dropout. We recommend this improvement because it would enable institutions to focus their limited resources for maximum student support.

The data set was created by Mónica Vieira Martins, Jorge Machado, Luís Baptista and Valentim Realinho at the Instituto Politécnico de Portalegre (M.V.Martins, D. Tolledo, J. Machado, L. M.T. Baptista, V.Realinho. 2021). It is sourced from UC Irvine's Machine Learning Repository and can be found here. The data contains demographic, enrollment and academic (1st and 2nd semesters) information on the students. Each row in the data set represents a student record, and there are 36 columns in total. We use these data to build a model that predicts the academic outcome of each student.
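The modelling approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: it uses synthetic data from scikit-learn in place of the real student records, and the number of features, the binary target, the number of neighbours, the 80/20 split, the scaling step and the random seed are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student data (NOT the real data set).
# Feature count and the two-class target are assumptions for illustration.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_classes=2,
    random_state=522,
)

# Hold out a test set so the final score is computed on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=522
)

# k-NN relies on distances between rows, so features are scaled first.
# n_neighbors=5 is an arbitrary choice here, not the project's tuned value.
knn_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validation accuracy on the training portion only.
cv_scores = cross_val_score(knn_pipeline, X_train, y_train, cv=5)
mean_cv_score = cv_scores.mean()

# Refit on the full training set and score once on the held-out test set.
knn_pipeline.fit(X_train, y_train)
test_score = knn_pipeline.score(X_test, y_test)

print(f"Mean cross-validation accuracy: {mean_cv_score:.2f}")
print(f"Test accuracy: {test_score:.2f}")
```

Comparing the mean cross-validation score with the test score is one simple way to check the kind of consistency reported above: a large gap between the two would suggest overfitting.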

Report

The final report can be found here.

Dependencies

Usage

Run Jupyter Notebook

  1. Clone this GitHub repository

  2. Navigate to the root of the project and run the following command from the command line:

    docker compose up

    This container will run Jupyter Notebook using the default port of 8888. Make sure no other applications are using this port.

  3. Navigate to the root of this project on your computer using the command line and enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):

    make clean

  4. To run the analysis in its entirety, enter the following command in the terminal in the project root:

    make all

Clean Up

  1. Press Ctrl + C in the terminal to shut down the Jupyter Notebook.

  2. Use the following command to remove the container.

    docker compose rm

Folder Structure

  • data: Contains both raw and processed data
  • img: Contains images used in the README
  • report: Contains .html and .pdf versions of the final report, as well as the .qmd file used to generate the report.
  • results: Contains figures and models exported by the analysis scripts in scripts
  • scripts: Contains Python scripts used to perform data processing, analysis and model training
  • src: Contains source code for functions used by analysis scripts in scripts
  • test: Contains unit tests for functions in src

Adding a new dependency

  1. Add the dependency to the environment.yml file on a new branch.

  2. Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.

  3. Re-build the Docker image locally to ensure it builds and runs properly.

  4. Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA for the commit that changed the file.

  5. Update the docker-compose.yml file on your branch to use the new container image (make sure to update the tag specifically).

  6. Send a pull request to merge the changes into the main branch.

License

The Academic Success Prediction report contained herein is licensed under the Creative Commons Attribution 2.5 Canada License (CC BY 2.5 CA). See the license file for more information. If re-using/re-mixing, please provide attribution and link to this webpage. The software code contained within this repository is licensed under the MIT license. See the license file for more information.

Reference

Bantilan, Niels. 2020. “Pandera: Statistical Data Validation of Pandas Dataframes.” In SciPy, 116–24.
Kramer, Oliver, and Oliver Kramer. 2016. “Scikit-Learn.” Machine Learning for Evolution Strategies, 45–53.
McKinney, Wes et al. 2011. “Pandas: A Foundational Python Library for Data Analysis and Statistics.” Python for High Performance and Scientific Computing 14 (9): 1–9.
Python, Why. 2021. “Python.” Python Releases for Windows 24.
Realinho, Valentim, Jorge Machado, Luís Baptista, and Mónica V Martins. 2022. “Predicting Student Dropout and Academic Success.” Data 7 (11): 146.
VanderPlas, Jacob, Brian Granger, Jeffrey Heer, Dominik Moritz, Kanit Wongsuphasawat, Arvind Satyanarayan, Eitan Lees, Ilia Timofeev, Ben Welsh, and Scott Sievert. 2018. “Altair: Interactive Statistical Visualizations for Python.” Journal of Open Source Software 3 (32): 1057.
