In this practice, we explore two fundamental machine learning algorithms:
- Decision Trees
- K-Nearest Neighbors (KNN)
We implement these algorithms, evaluate their performance, and explore techniques for hyperparameter tuning to improve model accuracy.
Decision Trees
- Decision Trees Implementation
- Model Evaluation and Hyperparameter Tuning
- Random Forest Implementation and Evaluation
- Gradient Boosting Implementation and Evaluation
K-Nearest Neighbors (KNN)
- Data Preprocessing
- KNN Model Implementation
- Hyperparameter Tuning
- ROC Curve and Model Evaluation
- Conclusion
Decision Trees Practice
This repository contains code for implementing and evaluating Decision Trees for multi-class classification tasks. The practice is divided into several sections:
In this section, we implement a complete Decision Tree classifier from scratch using Python and NumPy. The implementation includes the following components:
- Node Class: Defines the structure of a node in the decision tree.
- DecisionTree Class: Represents the Decision Tree model with methods for fitting and predicting.
- Helper Functions: Includes functions for calculating entropy, information gain, finding the best split, and building the decision tree recursively.
class node:
    def __init__(self, parent, left, right, values, column, threshold):
        self.parent: node = parent
        self.left: node = left
        self.right: node = right
        self.values: np.ndarray = values    # sample values stored at a leaf node
        self.column: int = column           # feature index used for the split
        self.threshold: float = threshold   # threshold used for the split

    def is_leaf(self):
        # A node is a leaf when it stores values instead of a split.
        return self.values is not None
class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=5, n_features=None):
        """
        Initialize the Decision Tree model.
        """
        self.n_features = n_features
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = node(None, None, None, None, None, None)
        self.number_of_classes = -1
        self.classes = None
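The helper functions themselves are not reproduced here; the sketch below shows one plausible form for the entropy and information-gain helpers (names and signatures are illustrative, not necessarily the ones used in the notebook):

import numpy as np

def entropy(y):
    # Shannon entropy of an integer-encoded label vector.
    counts = np.bincount(y)
    probs = counts[counts > 0] / len(y)
    return -np.sum(probs * np.log2(probs))

def information_gain(y, left_idx, right_idx):
    # Reduction in entropy obtained by splitting y into two child nodes.
    n = len(y)
    if len(left_idx) == 0 or len(right_idx) == 0:
        return 0.0
    child_entropy = (len(left_idx) / n) * entropy(y[left_idx]) \
                  + (len(right_idx) / n) * entropy(y[right_idx])
    return entropy(y) - child_entropy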
- Model Training and Evaluation: We trained several Decision Tree classifiers using different hyperparameters and evaluated their performance on a given dataset. The hyperparameters explored include:
  - Split criterion (Gini impurity or Entropy)
  - Maximum depth of the tree
  - Minimum samples required to split a node
  - Minimum samples required to be at a leaf node
  We used the DecisionTreeClassifier from scikit-learn to create and train the models, and then measured their accuracy on the training data.
- Hyperparameter Tuning: To find the optimal hyperparameters, we performed a grid search over a range of values for max_depth and min_samples_leaf, using a validation set. We visualized the learning curves to understand the model's performance as the training set size increases.
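As a rough illustration of this workflow (the exact hyperparameter grid is not given in the text; X_train/y_train and X_val/y_val are assumed to be the training and validation splits):

from sklearn.tree import DecisionTreeClassifier

best_score, best_params = 0.0, None
for max_depth in [3, 5, 10, None]:           # illustrative grid values
    for min_samples_leaf in [1, 2, 5, 10]:
        clf = DecisionTreeClassifier(criterion="entropy",
                                     max_depth=max_depth,
                                     min_samples_leaf=min_samples_leaf,
                                     random_state=42)
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)      # accuracy on the validation set
        if score > best_score:
            best_score, best_params = score, (max_depth, min_samples_leaf)

print("Best validation accuracy:", best_score, "with", best_params)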
In this section, we implemented a Random Forest classifier using scikit-learn's RandomForestClassifier. We trained the Random Forest with 100 trees and evaluated its performance on a validation set.
We evaluated the model over a range of training set sizes using scikit-learn's learning_curve:

train_sizes, train_scores, val_scores = learning_curve(
    model, X_data, Y_data, cv=5, scoring='accuracy',
    train_sizes=[0.25, 0.5, 0.75, 1])
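For reference, a minimal sketch of how the Random Forest could be constructed and how the learning-curve scores might be summarized (plotting code omitted; variable names follow the snippet above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

model = RandomForestClassifier(n_estimators=100, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
    model, X_data, Y_data, cv=5, scoring='accuracy',
    train_sizes=[0.25, 0.5, 0.75, 1.0])

# Average over the cross-validation folds to get one score per training-set size.
print("Train accuracy:", train_scores.mean(axis=1))
print("Validation accuracy:", val_scores.mean(axis=1))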
We implemented a Gradient Boosting classifier using scikit-learn's GradientBoostingClassifier. We tuned hyperparameters such as max_depth and learning_rate using GridSearchCV to find the best combination for our model. We evaluated the final model's performance on a test set and visualized the results using a confusion matrix.
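A compact sketch of this tuning and evaluation step, assuming X_train/y_train and X_test/y_test splits and an illustrative parameter grid (the actual grid values are not listed in the text):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

param_grid = {"max_depth": [2, 3, 5], "learning_rate": [0.01, 0.1, 0.2]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best parameters:", search.best_params_)
print(confusion_matrix(y_test, y_pred))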
We transform categorical data into one-hot encoding using the pd.get_dummies function. Binary "Yes/No" columns are encoded with LabelEncoder, and the age column is standardized with StandardScaler.
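A minimal sketch of this preprocessing, assuming a raw DataFrame df and placeholder column lists categorical_columns and yes_no_columns (the actual column names are not shown here):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# One-hot encode the categorical columns (column lists are placeholders).
X = pd.get_dummies(df, columns=categorical_columns)

# Label-encode the binary "Yes/No" columns.
le = LabelEncoder()
for col in yes_no_columns:
    X[col] = le.fit_transform(X[col])

# Standardize the age column.
scaler = StandardScaler()
X["age"] = scaler.fit_transform(X[["age"]]).ravel()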
The dataset is split into training, validation, and test sets using train_test_split from scikit-learn.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
X_train.shape: (209, 125)
X_val.shape: (70, 125)
X_test.shape: (70, 125)
y_train.shape: (209,)
y_test.shape: (70,)
y_val.shape: (70,)
A function euclidean_dist is defined to calculate the Euclidean distance between two data points.
def euclidean_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))
The KNN class is implemented with methods for fitting and predicting. The fit method initializes the model with the training data, while the predict method predicts the class labels for new data points based on the K nearest neighbors.
class KNN:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X_train, y_train):
        # KNN is a lazy learner: fitting simply stores the training data.
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X):
        predictions = [self.single_predict(x) for x in X]
        return np.array(predictions)

    def single_predict(self, x):
        # Distances from x to every training point.
        distances = [self.euclidean_dist(x, x_train) for x_train in self.X_train]
        k_near_neighbors_indices = np.argsort(distances)[:self.k]
        k_near_neighbor_labels = [self.y_train[i] for i in k_near_neighbors_indices]
        # Majority vote among the k nearest neighbors.
        vote = np.bincount(k_near_neighbor_labels).argmax()
        return vote

    def euclidean_dist(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))
The hyperparameter K is tuned by testing values from 1 to 20 and selecting the one with the highest accuracy on the validation set.
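A sketch of this search, assuming the splits have been converted to NumPy arrays (names such as X_train_np are placeholders):

# e.g. X_train_np = X_train.to_numpy(), y_train_np = y_train.to_numpy(), etc.
best_k, best_acc = 1, 0.0
for k in range(1, 21):
    knn = KNN(k=k)
    knn.fit(X_train_np, y_train_np)
    accuracy = np.mean(knn.predict(X_val_np) == y_val_np)
    if accuracy > best_acc:
        best_k, best_acc = k, accuracy
print("Best K:", best_k, "with validation accuracy:", best_acc)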
A variant of the prediction method, single_predict_proba, is implemented to generate probabilities (the fraction of positive neighbors) for ROC curve plotting.
class KNN:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X):
        predictions = [self.single_predict(x) for x in X]
        return np.array(predictions)

    def single_predict(self, x):
        distances = [self.euclidean_dist(x, x_train) for x_train in self.X_train]
        k_near_neighbors_indices = np.argsort(distances)[:self.k]
        k_near_neighbor_labels = [self.y_train[i] for i in k_near_neighbors_indices]
        vote = np.bincount(k_near_neighbor_labels).argmax()
        return vote

    def euclidean_dist(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def single_predict_proba(self, x):
        distances = [self.euclidean_dist(x, x_train) for x_train in self.X_train]
        k_near_neighbors_indices = np.argsort(distances)[:self.k]
        k_near_neighbor_labels = [self.y_train[i] for i in k_near_neighbors_indices]
        # Fraction of positive neighbors serves as a probability estimate.
        return np.mean(k_near_neighbor_labels)
The ROC curve and the area under the curve (AUC) are computed for each K value using the roc_curve and auc functions from scikit-learn.
from sklearn.metrics import roc_curve, auc
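A rough sketch of how the per-K ROC/AUC computation can be wired together (the K values shown and the *_np array names are illustrative; plotting code is omitted):

for k in [1, 3, 5, 7]:                     # illustrative K values
    knn = KNN(k=k)
    knn.fit(X_train_np, y_train_np)
    # Probability-like scores for the positive class on the validation set
    # (assumes binary 0/1 labels).
    y_scores = np.array([knn.single_predict_proba(x) for x in X_val_np])
    fpr, tpr, _ = roc_curve(y_val_np, y_scores)
    print(f"K={k}: AUC={auc(fpr, tpr):.3f}")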
The KNN model with K=1 demonstrates the highest AUC on the validation set, indicating its superior performance in this classification task. However, the choice of K may vary depending on the specific requirements and characteristics of the dataset.