Pro_Dream

Intro

A project to help community college students transfer to well-known universities. Using questionnaire data, I trained Random Forest and Support Vector Classification models. The main goals are to achieve higher prediction accuracy and to interpret feature importance.

Datasets

Pro_Dream questionnaire data:

| Attribute | DataType | Detail |
| --- | --- | --- |
| Act_commu_ser_num | Numeric | Number of community service activities the applicant has done |
| Hours_commu_ser | Numeric | Number of hours of community service the applicant has done |
| Chall_rate | Categorical | Chall_rate of the applicant |
| Sex | Categorical | Gender of the applicant |
| House_income | Categorical | Household income level of the applicant |
| State | Categorical | State the applicant lives in |
| Major | Categorical | Major the applicant applied for |
| Declare_transfer_major_dummy | Categorical | Whether the applicant has declared a transfer major |
| Credits_num | Numeric | Number of credits the applicant has earned |
| Courses_major_num | Numeric | Number of courses the applicant has taken in the major |
| GPA | Numeric | GPA of the applicant |
| Part_time_dummy | Categorical | Whether the applicant has held a part-time job |
| Work_hours | Categorical | Number of hours the applicant has worked |
| Residency | Categorical | Nationality of the applicant |
| Extra_curr_dummy | Categorical | Whether the applicant has extracurricular activities |
| University_enc | Categorical | Encoded identifier of each university |
| Admission | Categorical | Whether the applicant was admitted |

The dataset contains:

  • 348 Unique rows
  • 16 attributes selected based on significance of univariate logistic regression

Data preprocess

Screen Shot 2022-05-16 at 2 13 10 PM

The classes are slightly imbalanced, but not severely, so I did not oversample.
217 rows have label 1 and 131 rows have label 0.
217/348 = 62.356%. This is our naive benchmark: the accuracy of always predicting the majority class.
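The benchmark arithmetic can be checked in a few lines. This is only a sketch of the calculation from the counts stated above; the variable names are illustrative.

```python
# Class counts reported above: 217 rows with label 1, 131 rows with label 0.
n_pos, n_neg = 217, 131
n_total = n_pos + n_neg

# A majority-class classifier predicts label 1 for every applicant,
# so its accuracy equals the share of the larger class.
baseline_accuracy = max(n_pos, n_neg) / n_total
print(f"{baseline_accuracy:.5f}")  # 0.62356
```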


Missing values

Screen Shot 2022-05-16 at 2 14 18 PM

For attributes that can legitimately be zero, such as Act_commu_ser_num and Hours_commu_ser, I filled missing values with 0. For attributes that cannot be zero, such as GPA and Credits_num, I filled missing values with the mean of that attribute among applicants with the same major. If the applicant's major is unique in the dataset, the missing value is filled with the global mean.
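A minimal pandas sketch of this imputation rule is shown here. The data frame is made up for illustration, and computing the global mean from the observed values before any filling is an assumption; the project code may differ.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; column names follow the attribute table, values are illustrative.
df = pd.DataFrame({
    "Major": ["CS", "CS", "CS", "Math", "Math", "Art"],
    "GPA": [3.0, np.nan, 4.0, 3.2, 3.6, np.nan],
    "Act_commu_ser_num": [2, np.nan, 1, np.nan, 0, 3],
})

# Attributes where zero is a valid value: treat missing as 0.
df["Act_commu_ser_num"] = df["Act_commu_ser_num"].fillna(0)

# Global mean of the observed GPA values (assumption: computed before filling).
global_mean = df["GPA"].mean()

# Attributes that cannot be zero: fill with the mean GPA of the same major.
major_mean = df.groupby("Major")["GPA"].transform("mean")
df["GPA"] = df["GPA"].fillna(major_mean)

# A major with no observed GPA (here "Art") still has NaN: fall back to the global mean.
df["GPA"] = df["GPA"].fillna(global_mean)

print(df)
```

In this toy frame the missing CS value becomes the CS mean (3.5), while the single Art applicant receives the global mean.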

Model and Metrics

I trained a Random Forest classifier and a bootstrapped SVC. Both are ensemble models. Here is a brief introduction.

Random Forest

Random forest is a supervised machine learning algorithm built from many decision trees. The "forest" is trained through bagging (bootstrap aggregating): each tree is fit on a bootstrap sample of the training data. Unlike a single tree, a forest is difficult to interpret, but one piece of information that remains available is feature importance.
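As a rough illustration of how feature importance is read from a fitted forest, the sketch below uses scikit-learn on synthetic data. The data, the number of trees, and the random seeds are assumptions chosen for the example, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the questionnaire data (348 rows, 16 features).
X, y = make_classification(
    n_samples=348, n_features=16, n_informative=5, random_state=0
)

# Each tree is fit on a bootstrap sample of the rows (bootstrap=True is the default).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```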

Support Vector Classification

Screen Shot 2022-05-16 at 2 31 34 PM


The main objective in SVC is to find the optimal hyperplane that separates data points of different classes. With a non-linear kernel, the data is implicitly mapped into a higher-dimensional space whose dimensions no longer correspond to the original attributes, so feature importance cannot be read directly from the model. For that reason I did not use a non-linear SVM in this project.

Metrics

I used accuracy and F1 score to evaluate the trained models.

Confusion Matrix
| | Actually Positive (1) | Actually Negative (0) |
| --- | --- | --- |
| Predicted Positive (1) | TP | FP |
| Predicted Negative (0) | FN | TN |

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 / (1/Precision + 1/Recall)

F1 is the harmonic mean of precision and recall, and it gives a better measure of the incorrectly classified cases than accuracy does.
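The formulas above translate directly into code. The confusion-matrix counts below are made up for illustration and are not results from this project; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts.

    Assumes tp > 0, so that precision and recall are both non-zero
    and no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
    return accuracy, precision, recall, f1


# Hypothetical counts, chosen only to exercise the formulas.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts accuracy is 0.700 while F1 is about 0.727, because F1 ignores the true negatives and depends only on TP, FP, and FN.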

Results

Random Forest

Random forest model with default hyperparameters:

Screen Shot 2022-05-16 at 4 55 17 PM

Random forest after tuning by grid search

Screen Shot 2022-05-16 at 4 57 53 PM
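For readers unfamiliar with grid search, a minimal scikit-learn sketch follows. The parameter grid, the F1 scoring choice, the 5-fold cross-validation, the train/test split, and the synthetic data are all assumptions made for the example; the grid actually searched in this project is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the questionnaire data.
X, y = make_classification(
    n_samples=348, n_features=16, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: every combination is evaluated with 5-fold cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=5
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated F1:", round(search.best_score_, 3))
# score() uses the same metric as `scoring`, here F1 on the held-out test set.
print("test F1:", round(search.score(X_test, y_test), 3))
```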


Feature importance

Screen Shot 2022-05-16 at 4 59 17 PM

SVC

SVC model with penalty 1.
Screen Shot 2022-05-16 at 5 03 06 PM

SVC model with tuned penalty

Screen Shot 2022-05-16 at 5 03 31 PM


Bootstrap 95% confidence interval

Screen Shot 2022-05-16 at 5 03 58 PM
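The text does not say which statistic the interval was computed for. As one possible reading, the sketch below builds a percentile bootstrap interval for test-set accuracy with NumPy. The correctness vector, the number of resamples, and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-row correctness indicators for 100 test rows (1 = correct prediction).
correct = np.array([1] * 71 + [0] * 29)

# Resample the rows with replacement and recompute the accuracy each time.
n_boot = 2000
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    sample = rng.choice(correct, size=correct.size, replace=True)
    boot_acc[b] = sample.mean()

# Percentile method: the middle 95% of the bootstrap distribution.
lower, upper = np.percentile(boot_acc, [2.5, 97.5])
print(f"95% CI for accuracy: [{lower:.3f}, {upper:.3f}]")
```

The same resampling loop could instead refit the SVC on each bootstrap sample and collect its coefficients, which would give intervals for the feature weights rather than for accuracy.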


Feature Importance

Once the linear SVM is fitted, the classifier coefficients are available through .coef_ on the trained model.
The weights in svm.coef_ are the coordinates of a vector orthogonal to the separating hyperplane, and the sign of each weight indicates which class the feature pushes the prediction toward. The absolute size of the coefficients relative to each other can then be used to rank feature importance for the data separation task, provided the features are on comparable scales.
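A short scikit-learn sketch of reading importance from a linear SVM is given below. The synthetic data, the C value, and the standardization step are assumptions for the example; scaling is added here so that coefficient magnitudes are comparable and may not match the project's preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the questionnaire data.
X, y = make_classification(
    n_samples=348, n_features=16, n_informative=5, random_state=0
)

# Assumption: standardize features so that weights are on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# coef_ is only available when kernel="linear".
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_scaled, y)

# For a binary problem coef_ has shape (1, n_features).
weights = svm.coef_.ravel()

# Rank features by absolute weight; the sign shows the direction of the effect.
ranking = np.argsort(np.abs(weights))[::-1]
for idx in ranking[:5]:
    direction = svm.classes_[1] if weights[idx] > 0 else svm.classes_[0]
    print(f"feature {idx}: weight={weights[idx]:+.3f} (pushes toward class {direction})")
```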

Screen Shot 2022-05-16 at 5 04 27 PM


Model comparison

| Model | Accuracy | F1 |
| --- | --- | --- |
| Naive | 0.62 (62.356%) | |
| Random Forest | 0.71 | 0.77 |
| SVC | 0.71 | 0.78 |