A project to help community college students transfer to well-known universities. Based on questionnaire data, I built Random Forest and Support Vector Classification models. The main goals are to achieve high prediction accuracy and to interpret feature importance.
Attribute | Data Type | Detail
---|---|---
Act_commu_ser_num | Numeric | number of community service activities the applicant had done
Hours_commu_ser | Numeric | number of hours of community service the applicant had done
Chall_rate | Categorical | chall_rate of the applicant
Sex | Categorical | gender of the applicant
House_income | Categorical | household income level of the applicant
State | Categorical | state the applicant lives in
Major | Categorical | major the applicant applied for
Declare_transfer_major_dummy | Categorical | whether the applicant had declared a transfer major
Credits_num | Numeric | number of credits the applicant earned
Courses_major_num | Numeric | number of courses the applicant had taken in their major
GPA | Numeric | GPA of the applicant
Part_time_dummy | Categorical | whether the applicant had held a part-time job before
Work_hours | Categorical | number of hours the applicant had worked
Residency | Categorical | nationality of the applicant
Extra_curr_dummy | Categorical | whether the applicant had extracurricular activities
University_enc | Categorical | encoded ID of each university
Admission | Categorical | whether the applicant was admitted (target label)
The dataset contains:
- 348 unique rows
- 16 attributes, selected by the significance of univariate logistic regressions
- 217 rows with label 1 (admitted) and 131 rows with label 0
- 217/348 ≈ 0.624, so always predicting the majority class (label 1) yields 62.4% accuracy. This is our naive benchmark.
Missing Values
For attributes that can legitimately be zero, such as Act_commu_ser_num and Hours_commu_ser, I filled missing values with 0. For attributes that cannot be zero, such as GPA and Credits_num, I filled missing values with the mean of that attribute among applicants with the same major. If an applicant has a unique major, the missing value is filled with the global mean.
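As a minimal sketch, this is how the imputation could look in pandas, assuming the data sit in a DataFrame `df` with the column names from the table above (the function name `impute` is just illustrative):

```python
import pandas as pd

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values using the strategy described above."""
    df = df.copy()

    # Attributes that can legitimately be zero: missing means "none".
    for col in ["Act_commu_ser_num", "Hours_commu_ser"]:
        df[col] = df[col].fillna(0)

    # Attributes that cannot be zero: fill with the mean within the
    # applicant's major; fall back to the global mean when the major is
    # unique (its group mean over non-missing values is then NaN).
    for col in ["GPA", "Credits_num"]:
        major_mean = df.groupby("Major")[col].transform("mean")
        df[col] = df[col].fillna(major_mean).fillna(df[col].mean())

    return df
```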
I trained a Random Forest classifier and a bootstrapped (bagged) SVC. Both are ensemble models; here is a brief introduction.
Random forest is a supervised machine learning algorithm constructed from decision trees: one random forest consists of many decision trees, and the "forest" is trained through bagging, or bootstrap aggregating. A forest is difficult to interpret (unlike a single tree), but one piece of information that is still available is feature importance.
The main objective of SVC is to find the optimal hyperplane that separates data points of different classes. For an SVM with a non-linear kernel, it is not possible to interpret feature importance in terms of the original attributes: the data are mapped into a higher-dimensional space that is quite different from the original feature space, so the learned weights no longer correspond to individual input features. For this reason I did not use a non-linear SVM in this project.
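For reference, a bagged linear SVC can be assembled with scikit-learn's `BaggingClassifier`. This is only a sketch on stand-in data of the same shape as the dataset (348 rows, 16 features), not the exact configuration used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data with the same shape as the questionnaire set (348 x 16).
X, y = make_classification(n_samples=348, n_features=16, random_state=0)

# Bagging (bootstrap aggregating): each linear SVC is trained on a
# bootstrap sample, and predictions are aggregated by majority vote.
# NOTE: `estimator` is called `base_estimator` in scikit-learn < 1.2.
bagged_svc = BaggingClassifier(
    estimator=make_pipeline(StandardScaler(), LinearSVC(C=1.0)),
    n_estimators=50,
    bootstrap=True,
    random_state=42,
)
bagged_svc.fit(X, y)
```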
I used accuracy and the F1 score to evaluate the trained models.
Confusion Matrix | Actually Positive (1) | Actually Negative (0)
---|---|---
Predicted Positive (1) | TP | FP
Predicted Negative (0) | FN | TN
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 / (1/precision + 1/recall)
F1 is the harmonic mean of precision and recall, and it gives a better measure of the incorrectly classified cases than accuracy alone, especially when the classes are imbalanced.
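A small worked example with made-up labels, just to show how the three quantities relate:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # TP=4, FP=1, FN=1, TN=2

precision = precision_score(y_true, y_pred)  # 4 / (4 + 1) = 0.80
recall = recall_score(y_true, y_pred)        # 4 / (4 + 1) = 0.80
f1 = f1_score(y_true, y_pred)                # harmonic mean = 0.80
print(precision, recall, f1)
```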
I first trained a random forest model with default hyperparameters.
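A minimal sketch of this step, again using stand-in data of the dataset's shape rather than the real questionnaire features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the real project uses the 16 questionnaire attributes.
X, y = make_classification(n_samples=348, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(random_state=42)  # default hyperparameters
rf.fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))

# Feature importance: mean decrease in impurity, averaged over all trees.
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```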
Feature importance
Once we have fitted a linear SVM, we can access the classifier coefficients via the `.coef_` attribute of the trained model. The weights in `svm.coef_` are the coordinates of a vector orthogonal to the separating hyperplane, and their signs indicate the predicted class direction. The absolute sizes of the coefficients relative to each other can then be used to gauge feature importance for the separation task.
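A sketch of reading feature weights off a fitted linear SVM (again on stand-in data; scaling the features first makes the coefficient magnitudes comparable):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data with the dataset's shape.
X, y = make_classification(n_samples=348, n_features=16, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

svm = LinearSVC(C=1.0).fit(X_scaled, y)

coefs = svm.coef_.ravel()                # one weight per input feature
order = np.argsort(np.abs(coefs))[::-1]  # rank by absolute magnitude
for i in order[:5]:
    print(f"feature {i}: weight {coefs[i]:+.3f}")
```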
Model | Accuracy | F1 |
---|---|---|
Naive | 0.62 | – |
Random Forest | 0.71 | 0.77 |
SVC | 0.71 | 0.78 |