CS205 Final Project - Spring 2019

The Task:

In this project we will build a predictive model that predicts whether a participant has been told they have cancer or a malignancy.

Using the features available at CDC NHANES, investigate which features you would like to use in your model.
Under the tab "Data, Documentation, Codebooks, SAS Code" choose one of the categories: "Demographics Data", "Dietary Data", "Examination Data", "Laboratory Data", or "Questionnaire Data".
For each of these categories there is a list of sub-categories available. Clicking the "Doc File" link will provide the information about features in that sub-category and feature names.
Using these features, construct a predictive model that will be able to predict having cancer or malignancy.

Example of finding a feature: Our target (cancer or malignancy) is found under Questionnaire Data -> Medical Conditions -> MCQ220 - Ever told you had cancer or malignancy. The variable name for this question, MCQ220, is what we will use as the identifier in our code.
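As a minimal sketch of how the MCQ220 answers might be turned into a binary target, the snippet below maps raw response codes to labels. The coding used here (1 = Yes, 2 = No, 7 = Refused, 9 = Don't know) is an assumption that should be checked against the Doc File for the survey cycle you download, and the helper name `mcq220_to_label` is hypothetical, not part of nhanes.py.

```python
# Assumed MCQ220 coding (verify against the Doc File):
#   1 = Yes, 2 = No, 7 = Refused, 9 = Don't know, missing = not asked.
MCQ220_TO_LABEL = {1: 1, 2: 0}


def mcq220_to_label(raw_value):
    """Return 1 (yes), 0 (no), or None when the answer is unusable."""
    if raw_value is None:
        return None
    try:
        code = int(raw_value)  # handles 2.0 and "1"; NaN raises ValueError
    except (TypeError, ValueError, OverflowError):
        return None
    # Refused / Don't know / unknown codes fall through to None.
    return MCQ220_TO_LABEL.get(code)


# Made-up raw answers for illustration only (not NHANES records).
raw_answers = [1, 2, 2.0, 9, None, 7, "1"]
labels = [mcq220_to_label(v) for v in raw_answers]
print(labels)
```

Rows whose label is None would typically be dropped before training, since they carry no usable target value.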

Tip: If you are looking for a feature, another option is to google "NHANES <feature name>". This is effective if you have a feature in mind but don't want to go over all possible categories to find it.

Important: When selecting features to process, remember to input the correct category where the feature can be found, so that the processing code can find it.

What you are scored on

The grade for this task will be separated into 3 categories:

Feature selection and evaluation. You will be required to explain why certain features were incorporated, using either an empirical evaluation (e.g. mutual information) or a logical explanation for a group of features (e.g. age and gender are included as demographics, as certain types of cancer affect certain demographics with higher probability, as can be seen here).
Preprocessing and feature engineering. Write your own preprocessing code and explain why you chose a specific imputation technique or changed the features in a certain way. You can support your argument with previous research on the topic or with your own experimentation.
Predictive modeling. Build an ML model to predict having cancer. Construct a model to fit the data, explain why the chosen model was selected, and describe the experiments done with the model (e.g. hyperparameter tuning).
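The first two parts above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it fills missing values with the median (one of many possible imputation techniques), bins the result at an arbitrary threshold, and scores the binned feature against the target with mutual information. The function names, the threshold of 60, and the toy values are all made up for this sketch; they are not NHANES data and not part of nhanes.py. It is written in plain Python so that it runs without extra packages.

```python
import math
from collections import Counter
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed values.

    Also returns a 0/1 flag per row marking which values were missing,
    which can itself be used as a feature.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return imputed, was_missing


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi


# Toy values for illustration only.
age = [34, None, 71, 65, 29, None, 80, 45]
target = [0, 0, 1, 1, 0, 0, 1, 0]

age_imputed, age_missing = impute_median(age)
age_binned = [int(a >= 60) for a in age_imputed]  # arbitrary threshold

print("MI(age_binned, target):", mutual_information(age_binned, target))
print("MI(age_missing, target):", mutual_information(age_missing, target))
```

Comparing scores like these across candidate features, and across imputation choices, is one way to back up the written explanation with experiments.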

How you are scored

The grade will be based on the quality, the strength of the supporting arguments, and the clarity of evaluation in each of the 3 parts. Each part is worth one third (33.33...%) of your final project grade. Bonus points are given for originality and "out of the box" thinking in approaching each of these 3 parts.

File list:

nhanes.py: implementation of the data preprocessing logic, as well as the definition of an example dataset.
Demo_Dataset.ipynb: Jupyter notebook demonstrating the basic usage of the sample dataset.

How to use:

Download raw data files and decompress them.
Install Python 3 and the following packages: joblib, numpy, pandas, matplotlib, scipy, sklearn, jupyter, pytorch.
Use Demo_Dataset.ipynb to see an example of how to use the predefined task.
Expand nhanes.py to define new tasks by following the implementation logic of the provided sample.

For a detailed explanation of the methods used here for the cost-sensitive health dataset, please refer to: "Nutrition and Health Data for Cost-Sensitive Learning"

apashch/CancerPrediction
