eda_and_data_visualization_use_case.py
# -*- coding: utf-8 -*-
"""EDA and Data Visualization - Use Case.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1Z_mQdRuJIR7bJtgo1bzNue6XeuKnyl_A
# Exploratory Data Analysis (EDA):
Exploratory Data Analysis is all about analyzing the dataset and summarizing the key insights and characteristics of the data.
**EDA checklist:**
1. Understanding the dataset, and its shape
2. Checking the data type of each column
3. Categorical & Numerical columns
4. Checking for missing values
5. Descriptive summary of the dataset
6. Groupby for classification problems
Importing Libraries
"""
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
"""Data Collection
"""
# loading the breast cancer dataset from csv file to pandas data frame
cancer_data = pd.read_csv('/content/data.csv')
"""# **Exploratory Data Analysis**
"""
# printing the first five rows of the dataframe
cancer_data.head()
# removing the unnamed column
cancer_data.drop(columns='Unnamed: 32', inplace=True)
cancer_data.head()
# checking the data types
cancer_data.info()
"""The diagnosis column is categorical, whereas the remaining columns are continuous values
"""
# removing the id column
cancer_data.drop(columns='id', inplace=True)
cancer_data.describe()
cancer_data.diagnosis.unique()
cancer_data.shape
# checking for missing values
cancer_data.isnull().sum()
# statistical summary of the data - descriptive statistics
cancer_data.describe()
# checking the distribution of the target variable
cancer_data['diagnosis'].value_counts()
# encoding the target column
label_encode = LabelEncoder()
labels = label_encode.fit_transform(cancer_data['diagnosis'])
cancer_data['target'] = labels
cancer_data.head()
cancer_data.drop(columns='diagnosis', inplace=True)
# diagnosis column removed
cancer_data
cancer_data['target'].value_counts()
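"""A minimal sketch for checking which integer each diagnosis label received, rather than assuming it. LabelEncoder assigns codes in the sorted order of the class labels, and the fitted `classes_` attribute exposes that order. The `demo_diagnosis` list is a made-up stand-in for the diagnosis column, assuming its labels are 'B' and 'M'.
"""

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the diagnosis column (assumed labels: 'B' and 'M').
demo_diagnosis = ["M", "B", "B", "M", "B"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(demo_diagnosis)

# classes_ holds the original labels; the position of each label is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```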
"""Benign --> 0
Malignant --> 1
"""
# grouping the data based on the target
cancer_data.groupby('target').mean()
"""For most of the features, the mean values are higher for Malignant (1) cases and lower for Benign (0) cases.
# **Summary from EDA:**
1. No missing values
2. All columns are continuous numerical values except for the target column
3. The mean is slightly higher than the median for most of the features, which suggests their distributions are right-skewed
4. Slight imbalance in the dataset: Benign (0) cases outnumber Malignant (1) cases
5. The mean of most features is clearly larger for Malignant cases than for Benign cases (groupby)
# **Data Visualization**
"""
# countplot for the target column, for checking the distribution of the target
sns.countplot(x='target', data=cancer_data)
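"""The right skew noted in the EDA summary (mean above median) can also be checked numerically. The sketch below compares mean, median and sample skewness per column. The `demo` frame is synthetic data invented for illustration, not the breast cancer dataset; the same summary could be built from the feature columns of `cancer_data`.
"""

```python
import numpy as np
import pandas as pd

# Synthetic columns (hypothetical data): a log-normal sample is right-skewed
# by construction, a normal sample is roughly symmetric.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.75, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})

# One row per column: a positive skew and mean > median both point to a right tail.
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),
})
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```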
# this is how we can get all the column names of the dataframe
for column in cancer_data:
    print(column)
# creating a for loop to get the distribution plot for all columns
for column in cancer_data:
    sns.displot(x=column, data=cancer_data)
# distribution of a single feature (histplot replaces the deprecated sns.distplot)
plt.figure()
sns.histplot(x=cancer_data.radius_mean, kde=True)
"""We could also use a pairplot to check the relationships between features, but it plots every pair of columns: with 30 features that is 30 x 30 = 900 panels. So we plot only two features here.
**Scatter plot of first 2 columns**
"""
# Select first column of the dataframe as a series
first_column = cancer_data.iloc[:, 0]
# Select second column of the dataframe as a series
second_column = cancer_data.iloc[:, 1]
print(first_column)
print('-----')
print(second_column)
plt.scatter(x=first_column, y=second_column)
"""**Outlier Detection**
Box plots for visualizing the outliers in the dataset
"""
for column in cancer_data:
    plt.figure()
    cancer_data.boxplot([column])
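"""Box plots conventionally draw points beyond 1.5 x IQR from the quartiles as outliers. As a rough numeric companion, the sketch below counts such points per column. The helper name `count_iqr_outliers`, the factor `k` and the `demo` frame are assumptions made for illustration; they are not part of the original notebook.
"""

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own fences (Series aligned on column names).
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return outside.sum()


# Made-up data: column "a" has one extreme value, column "b" has none.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
outlier_counts = count_iqr_outliers(demo)
print(outlier_counts)
```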
"""# Correlation Matrix"""
correlation_matrix = cancer_data.corr()
# constructing a heat map to visualize the correlation matrix
plt.figure(figsize=(20,20))
sns.heatmap(correlation_matrix, cbar=True, fmt='.1f', annot=True, cmap='Blues')
plt.savefig('Correlation Heat map')
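"""One common way to act on a correlation matrix is to drop one feature from each highly correlated pair. The sketch below is one possible approach, not the method used in this notebook: the helper name `correlated_features_to_drop`, the 0.9 threshold and the synthetic `demo` columns are all assumptions. On the real data it would be applied to the feature columns only, with the target column excluded.
"""

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(features: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic features (hypothetical data): perimeter is almost a linear
# function of radius, texture is independent of both.
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.0, size=200)
demo = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200),
    "texture": rng.normal(19.0, 4.0, size=200),
})

to_drop = correlated_features_to_drop(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop)
```

Which member of a correlated pair is kept depends on column order here; a domain-driven choice may be preferable.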
"""Multicollinearity problem:
Multicollinearity exists when an independent variable is highly correlated with one or more other independent variables.
We can remove features that have a high positive or negative correlation with another feature.
**Inference from EDA & Data Visualization:**
1. No missing values
2. All columns are continuous numerical values except for the target column
3. The mean is slightly higher than the median for most of the features, which suggests their distributions are right-skewed
4. Slight imbalance in the dataset: Benign (0) cases outnumber Malignant (1) cases
5. The mean of most features is clearly larger for Malignant cases than for Benign cases (groupby)
6. Most of the features have outliers
7. The correlation matrix reveals that most of the features are highly correlated, so we can remove certain features during feature selection
"""