This project builds predictive models to forecast student success or dropout risk from academic trajectory, demographics, and socio-economic factors. The dataset contains 3,500 instances with 36 attributes, including marital status, admission grades, and parental occupation.
The objective is to develop a machine learning model to predict student outcomes using several techniques learned throughout the Data Science course, including:
- KNN
- Linear Discriminant Analysis
- Logistic Regression
- Random Forests
- Support Vector Machines
- Neural Networks
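To make the first technique in the list concrete, here is a minimal k-nearest-neighbours sketch in plain Python. It is an illustration only: the function name `knn_predict`, the two features, the labels `"dropout"` and `"graduate"`, and all data values are assumptions invented for the example, not taken from the project dataset or its code.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training row
    distances = [
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy rows (made up): [admission grade, curricular units approved in semester 1]
train_X = [[110.0, 2], [115.0, 1], [120.0, 3], [140.0, 6], [150.0, 5], [155.0, 6]]
train_y = ["dropout", "dropout", "dropout", "graduate", "graduate", "graduate"]

print(knn_predict(train_X, train_y, [145.0, 5], k=3))
```

With real data the features would normally be scaled first, since unscaled Euclidean distance lets the feature with the largest numeric range dominate the vote.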
The dataset's attributes include marital status, application mode, academic course, admission grades, parental occupation, educational special needs, tuition status, and various curricular unit performance indicators, along with economic indicators such as unemployment and inflation rates.
The following tasks were carried out:
- Data preprocessing and cleaning.
- Exploratory data analysis (EDA).
- Model selection and evaluation using various machine learning techniques.
- Evaluation of model complexity and performance metrics such as accuracy, F1 score, and AUC.
- Justification of the final selected model based on both accuracy and complexity.
- Algorithms Implemented: KNN, Logistic Regression, SVM, Random Forest, and Neural Networks.
- Tools Used: R, Python, and associated data analysis libraries.
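The three metrics named in the task list can be sketched in plain Python as follows. This is a simplified binary illustration, assuming dropout is encoded as the positive class `1`; the helper names and the toy labels and scores are hypothetical and are not results from the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): true labels and predicted dropout probabilities
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, while AUC is computed from the raw scores, which is one reason to report them together.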
The final model predicts student outcomes and was chosen to balance accuracy against interpretability. The report includes a detailed comparison of the models applied and justifies the final selection.
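One possible way to formalise a trade-off between accuracy and simplicity is to pick the simplest candidate whose accuracy is within a small tolerance of the best one. The sketch below assumes that rule; the source does not state the rule actually used. The `Candidate` type, the `select_model` function, the model names, the complexity ranks, and every number are placeholders, not results from the report.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float   # e.g. cross-validated accuracy (placeholder values below)
    complexity: int   # hypothetical rank: lower means simpler to interpret


def select_model(candidates, tolerance=0.01):
    """Return the simplest candidate within `tolerance` of the best accuracy."""
    best = max(c.accuracy for c in candidates)
    close_enough = [c for c in candidates if best - c.accuracy <= tolerance]
    # Prefer lower complexity; break ties by higher accuracy
    return min(close_enough, key=lambda c: (c.complexity, -c.accuracy))


# Placeholder candidates, not measured results
candidates = [
    Candidate("model_a", 0.80, 1),
    Candidate("model_b", 0.84, 2),
    Candidate("model_c", 0.845, 4),
]

print(select_model(candidates, tolerance=0.01).name)
```

Setting `tolerance=0.0` reduces the rule to picking the most accurate model, so the tolerance is where the preference for interpretability is expressed.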
Additional techniques, such as boosting and unsupervised methods, were also explored to improve predictive accuracy.