A project to help community college students transfer to well-known universities. Based on questionnaire data, I built Random Forest and Support Vector Classification models. The main goals are to achieve high prediction accuracy and to interpret feature importance.
Attribute | Data Type | Detail
---|---|---
Act_commu_ser_num | Numeric | number of community service activities the applicant had done
Hours_commu_ser | Numeric | number of hours of community service the applicant had done
Chall_rate | Categorical | chall_rate of the applicant
Sex | Categorical | gender of the applicant
House_income | Categorical | household income level of the applicant
State | Categorical | state the applicant lives in
Major | Categorical | major the applicant applied for
Declare_transfer_major_dummy | Categorical | whether the applicant had declared a transfer major
Credits_num | Numeric | number of credits the applicant earned
Courses_major_num | Numeric | number of courses the applicant had taken in their major
GPA | Numeric | GPA of the applicant
Part_time_dummy | Categorical | whether the applicant had held a part-time job before
Work_hours | Categorical | number of hours the applicant had worked
Residency | Categorical | nationality of the applicant
Extra_curr_dummy | Categorical | whether the applicant had extracurricular activities
University_enc | Categorical | encoded ID of each university
Admission | Categorical | whether the applicant was admitted (target label)
The dataset contains:
- 348 unique rows
- 16 attributes, selected by the significance of univariate logistic regressions
- 217 rows with label 1 (admitted) and 131 rows with label 0
- 217/348 ≈ 0.624, so always predicting the majority class (label 1) yields 62.4% accuracy. This is our naive benchmark.
Missing Values
For attributes that can legitimately be zero, such as Act_commu_ser_num and Hours_commu_ser, I filled missing values with 0. For attributes that cannot be zero, such as GPA and Credits_num, I filled missing values with the mean of that attribute among applicants with the same major. If an applicant has a unique major, the missing value is filled with the global mean.
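As a minimal sketch, this is how the imputation could look in pandas, assuming the data sit in a DataFrame `df` with the column names from the table above (the function name `impute` is just illustrative):

```python
import pandas as pd

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values using the strategy described above."""
    df = df.copy()

    # Attributes that can legitimately be zero: missing means "none".
    for col in ["Act_commu_ser_num", "Hours_commu_ser"]:
        df[col] = df[col].fillna(0)

    # Attributes that cannot be zero: fill with the mean within the
    # applicant's major; fall back to the global mean when the major is
    # unique (its group mean over non-missing values is then NaN).
    for col in ["GPA", "Credits_num"]:
        major_mean = df.groupby("Major")[col].transform("mean")
        df[col] = df[col].fillna(major_mean).fillna(df[col].mean())

    return df
```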
I trained a Random Forest classifier and a bootstrapped (bagged) SVC. Both are ensemble models; here is a brief introduction.
Random forest is a supervised machine learning algorithm constructed from decision trees: one random forest consists of many decision trees, and the "forest" is trained through bagging, or bootstrap aggregating. A forest is difficult to interpret (unlike a single tree), but one piece of information that is still available is feature importance.
The main objective of SVC is to find the optimal hyperplane that separates data points of different classes. For an SVM with a non-linear kernel, it is not possible to interpret feature importance in terms of the original attributes: the data are mapped into a higher-dimensional space that is quite different from the original feature space, so the learned weights no longer correspond to individual input features. For this reason I did not use a non-linear SVM in this project.
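For reference, a bagged linear SVC can be assembled with scikit-learn's `BaggingClassifier`. This is only a sketch on stand-in data of the same shape as the dataset (348 rows, 16 features), not the exact configuration used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data with the same shape as the questionnaire set (348 x 16).
X, y = make_classification(n_samples=348, n_features=16, random_state=0)

# Bagging (bootstrap aggregating): each linear SVC is trained on a
# bootstrap sample, and predictions are aggregated by majority vote.
# NOTE: `estimator` is called `base_estimator` in scikit-learn < 1.2.
bagged_svc = BaggingClassifier(
    estimator=make_pipeline(StandardScaler(), LinearSVC(C=1.0)),
    n_estimators=50,
    bootstrap=True,
    random_state=42,
)
bagged_svc.fit(X, y)
```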
I used accuracy and the F1 score to evaluate the trained models.
Confusion Matrix | Actually Positive (1) | Actually Negative (0)
---|---|---
Predicted Positive (1) | TP | FP
Predicted Negative (0) | FN | TN
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 / (1/precision + 1/recall)
F1 is the harmonic mean of precision and recall, and it gives a better measure of the incorrectly classified cases than accuracy alone, especially when the classes are imbalanced.
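A small worked example with made-up labels, just to show how the three quantities relate:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # TP=4, FP=1, FN=1, TN=2

precision = precision_score(y_true, y_pred)  # 4 / (4 + 1) = 0.80
recall = recall_score(y_true, y_pred)        # 4 / (4 + 1) = 0.80
f1 = f1_score(y_true, y_pred)                # harmonic mean = 0.80
print(precision, recall, f1)
```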
I first trained a random forest model with default hyperparameters.
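A minimal sketch of this step, again using stand-in data of the dataset's shape rather than the real questionnaire features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the real project uses the 16 questionnaire attributes.
X, y = make_classification(n_samples=348, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(random_state=42)  # default hyperparameters
rf.fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))

# Feature importance: mean decrease in impurity, averaged over all trees.
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```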
Feature importance
Once we have fitted a linear SVM, we can access the classifier coefficients via the `.coef_` attribute of the trained model. The weights in `svm.coef_` are the coordinates of a vector orthogonal to the separating hyperplane, and their signs indicate the predicted class direction. The absolute sizes of the coefficients relative to each other can then be used to gauge feature importance for the separation task.
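A sketch of reading feature weights off a fitted linear SVM (again on stand-in data; scaling the features first makes the coefficient magnitudes comparable):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data with the dataset's shape.
X, y = make_classification(n_samples=348, n_features=16, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

svm = LinearSVC(C=1.0).fit(X_scaled, y)

coefs = svm.coef_.ravel()                # one weight per input feature
order = np.argsort(np.abs(coefs))[::-1]  # rank by absolute magnitude
for i in order[:5]:
    print(f"feature {i}: weight {coefs[i]:+.3f}")
```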
Model | Accuracy | F1 |
---|---|---|
Naive | 0.62 | – |
Random Forest | 0.71 | 0.77 |
SVC | 0.71 | 0.78 |