# Multitask Machine Learning on Clinical Data
This report describes the data preprocessing steps performed to improve the quality of the dataset and its suitability for the analysis and modeling tasks that follow. These steps are critical for model accuracy, robustness, and reliability.
- Removing Unary Features: Features with a single unique value across the entire dataset were removed; they contribute nothing to differentiating samples but add complexity to model fitting.
- Removing Features and Samples with High Missing Rates: Features and samples with high rates of missing data were dropped, dampening noise and the potential for overfitting and making the data more efficient to analyze.
- Impact Comparison of Different Missing-Rate Thresholds: We explored the impact of different missing-rate thresholds on model training to arrive at an appropriate data-retention strategy (a minimal sketch of these steps follows this list).
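The sketch below shows one way these filtering steps can be implemented, assuming the data sits in a pandas DataFrame; the 0.5 and 0.8 thresholds and the toy frame are illustrative placeholders, not the report's actual values.

```python
import pandas as pd

def drop_unary_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns with a single unique value (missing values ignored)."""
    unary = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    return df.drop(columns=unary)

def drop_high_missing(df: pd.DataFrame, col_thresh: float = 0.5,
                      row_thresh: float = 0.8) -> pd.DataFrame:
    """Drop columns, then rows, whose missing rate exceeds a threshold."""
    df = df.loc[:, df.isna().mean() <= col_thresh]       # features
    return df.loc[df.isna().mean(axis=1) <= row_thresh]  # samples

# Compare data retention under different column thresholds.
df = pd.DataFrame({"a": [1, 1, 1], "b": [1, None, None], "c": [1, 2, 3]})
for t in (0.3, 0.5, 0.7):
    kept = drop_high_missing(drop_unary_features(df), col_thresh=t)
    print(f"threshold={t}: {kept.shape[1]} features, {kept.shape[0]} samples")
```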
- Continuous Features:
- Integers and Floats: Integer and floating-point continuous features were distinguished because each type calls for different processing methods.
- Categorical Features:
- Binary and Multinomial: Binary and multinomial categorical features were identified and handled separately so that each could be encoded appropriately (the typing sketch below covers both groups).
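As a rough illustration of this feature typing, the sketch below groups columns by dtype and cardinality; the cut-off of 10 unique values for treating integers as multinomial is an assumed heuristic, not a rule taken from the report.

```python
import pandas as pd

def type_features(df: pd.DataFrame) -> dict:
    """Group columns into int/float continuous and binary/multinomial categorical."""
    groups = {"int": [], "float": [], "binary": [], "multinomial": []}
    for col in df.columns:
        s = df[col]
        n = s.nunique(dropna=True)
        if pd.api.types.is_float_dtype(s):
            groups["float"].append(col)
        elif pd.api.types.is_integer_dtype(s):
            if n == 2:
                groups["binary"].append(col)
            elif n <= 10:            # assumed cut-off for categorical ints
                groups["multinomial"].append(col)
            else:
                groups["int"].append(col)
        else:                        # object/category dtypes
            groups["binary" if n == 2 else "multinomial"].append(col)
    return groups
```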
- Simple Imputation: Missing values were initially handled with simple statistical estimates (mean, median, or mode), which are suitable for simple or small datasets.
- KNN Imputation: KNN imputation assumes that similar samples lie close together in feature space, so missing values were imputed from each sample's nearest neighbors.
- Regression-Based Imputation:
- Linear Regression: Linear relationships were utilized to predict missing values.
- Reverse Regression: The target variables themselves were used to predict and impute missing features.
- Random Forest Regression: Leveraged for its robust predictive power to estimate missing values.
- Support Vector Regression (SVR): Applied to capture complex non-linear relationships when filling in values.

Comparison of Imputation Techniques' Effects on Training and Performance: We critically assessed how the different imputation techniques affect model training and performance (a comparison sketch follows).
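One way to run such a comparison with scikit-learn is sketched below; the dataset, injected missingness, and downstream classifier are placeholders, and the linear, random-forest, and SVR imputers are realized here via IterativeImputer (reverse regression, which uses the labels themselves, is omitted).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Stand-in dataset with 10% of entries masked out at random.
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "linear": IterativeImputer(estimator=BayesianRidge(),
                               max_iter=5, random_state=0),
    "random_forest": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        max_iter=5, random_state=0),
    "svr": IterativeImputer(estimator=SVR(), max_iter=5, random_state=0),
}

# Score each imputer by the cross-validated accuracy of the same classifier.
for name, imputer in imputers.items():
    pipe = make_pipeline(imputer, StandardScaler(),
                         LogisticRegression(max_iter=1000))
    print(f"{name}: accuracy = {cross_val_score(pipe, X, y, cv=3).mean():.3f}")
```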
Outlier Identification and Mitigation: Outliers were identified with statistical methods (the IQR rule, box plots), and their adverse impact on the models was reduced by trimming or replacing them (a minimal sketch follows).
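For instance, an IQR-based mitigation can cap values outside the box-plot whiskers rather than drop rows; the 1.5 multiplier below is the conventional box-plot rule, assumed here rather than taken from the report.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the whisker bounds."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 100])
print(cap_outliers_iqr(s).tolist())  # the extreme 100 is capped, not dropped
```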
Correlation Analysis: Using statistical measures such as mutual information, each label's correlation with the features was calculated so as to retain only the features most predictive for that specific task.

Data Reconstruction: Based on the correlation analysis, a feature subset was optimized for each label, constructing inputs that improve model precision on each task.

Independent Modeling: Each label was modeled independently, tailoring performance to its specific task and improving the precision of the multi-task models as a whole (a per-label sketch follows the conclusion).

### Conclusion

In summary, the outlined data preprocessing steps improved not only the quality and reliability of the data but also made the models for the different tasks more flexible and accurate, laying solid ground for deeper analysis and more complex model development, and supporting efficient and effective data-driven decision-making.
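To make the per-label pipeline described above concrete, the sketch below pairs mutual-information feature selection with an independent classifier per label; the synthetic two-task data and the top-10 feature choice are illustrative assumptions, not the report's setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the clinical data: one feature matrix, two labels.
X, y1 = make_classification(n_samples=500, n_features=30,
                            n_informative=8, random_state=0)
y2 = (X[:, 0] + X[:, 3] > 0).astype(int)  # a second, task-specific label

for task, y in {"task_1": y1, "task_2": y2}.items():
    # Keep the k features most informative (by mutual information) for this
    # label, then fit a model dedicated to this task alone.
    pipe = make_pipeline(
        SelectKBest(mutual_info_classif, k=10),
        LogisticRegression(max_iter=1000),
    )
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{task}: cv accuracy = {acc:.3f}")
```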