- 📘 Day 9: Exploratory Data Analysis (EDA) with Python
- 1️⃣ Introduction to Exploratory Data Analysis (EDA)
- 2️⃣ Data Overview
- 3️⃣ Measures of Central Tendency
- 4️⃣ Measures of Dispersion (Spread)
- 5️⃣ Distribution Analysis
- 6️⃣ Quantiles and Percentiles
- 7️⃣ Categorical Data Analysis
- 8️⃣ Outlier Detection
- 9️⃣ Visualizations for Descriptive Statistics
- 1️⃣0️⃣ Correlation and Relationships
- 1️⃣1️⃣ Missing Values Analysis
- 1️⃣2️⃣ Data Cleaning Insights
- 🧠 Practice Exercises
- 🌟 Summary
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It helps in understanding the data structure, spotting patterns, detecting anomalies, and deriving actionable insights.
- Understanding Rows and Columns: Use the `.shape` attribute to get the number of rows and columns.

import pandas as pd
# Load the sample dataset
data = pd.read_csv('sample_data.csv')
print("Shape of dataset:", data.shape)
- Inspecting Data: Use `.head()`, `.info()`, and `.describe()` for initial exploration.

# Display the first 5 rows
print(data.head())
# Dataset information (info() prints its report directly and returns None)
data.info()
# Summary statistics for numerical columns
print(data.describe())
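
Since `sample_data.csv` is not included here, the following minimal sketch builds a small in-memory DataFrame to show what the overview calls report. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for sample_data.csv so the snippet runs on its own
data = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "income": [42000.0, 58000.0, None, 51000.0],
})

n_rows, n_cols = data.shape      # (number of rows, number of columns)
non_null = data.notna().sum()    # non-missing count per column
dtypes = data.dtypes             # storage type of each column

print(f"{n_rows} rows x {n_cols} columns")
print(non_null)
print(dtypes)
```

Comparing the non-null counts against the row count is a quick way to spot columns with missing values before any deeper analysis.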
- Numerical Variables:
  - Continuous: e.g., height, weight.
  - Discrete: e.g., number of children.
- Categorical Variables:
  - Nominal: No inherent order (e.g., gender, color).
  - Ordinal: Ordered categories (e.g., education level).
- Date/Time: Useful for time-series analysis.
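
One way to make these variable types explicit in pandas is sketched below. The DataFrame, its column names, and the ordering of education levels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical columns, one per variable type described above
df = pd.DataFrame({
    "height_cm": [172.5, 160.2, 181.0],                    # continuous
    "children": [0, 2, 1],                                 # discrete
    "color": ["red", "blue", "red"],                       # nominal
    "education": ["High School", "PhD", "Bachelor"],       # ordinal
    "signup": ["2024-01-05", "2024-02-17", "2024-03-02"],  # date/time
})

# Nominal: an unordered category
df["color"] = df["color"].astype("category")

# Ordinal: a category with an explicit order, so max() and sorting are meaningful
levels = ["High School", "Bachelor", "PhD"]
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)

# Date/time: parse strings into datetime values
df["signup"] = pd.to_datetime(df["signup"])

numeric_cols = list(df.select_dtypes("number").columns)

print(df.dtypes)
print("Numeric columns:", numeric_cols)
print("Highest education level:", df["education"].max())
print("Signup months:", df["signup"].dt.month.tolist())
```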
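# Mean: arithmetic average of the column
mean_value = data['column_name'].mean()
print("Mean:", mean_value)
# Median: middle value of the sorted data, robust to outliers
median_value = data['column_name'].median()
print("Median:", median_value)
# Mode: most frequent value(s); mode() returns a Series because ties are possible
mode_value = data['column_name'].mode()
print("Mode:", mode_value.tolist())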
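
To see why it helps to look at more than one of these measures, here is a small sketch on a made-up salary series (the values are assumptions, not taken from any real dataset). A single extreme value pulls the mean far from the bulk of the data while the median barely moves.

```python
import pandas as pd

# Hypothetical salaries in thousands; the last value is an extreme outlier
salaries = pd.Series([30, 32, 32, 35, 38, 400])

mean_value = salaries.mean()            # pulled upward by the outlier
median_value = salaries.median()        # middle of the sorted values
mode_values = salaries.mode().tolist()  # most frequent value(s)

print("Mean:", mean_value)      # 94.5
print("Median:", median_value)  # 33.5
print("Mode:", mode_values)     # [32]
```

A large gap between mean and median, as here, is often a first hint of skew or outliers.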
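# Range: difference between the largest and smallest value
range_value = data['column_name'].max() - data['column_name'].min()
print("Range:", range_value)
# Variance and standard deviation (pandas uses the sample formula, ddof=1, by default)
variance = data['column_name'].var()
std_dev = data['column_name'].std()
print("Variance:", variance)
print("Standard Deviation:", std_dev)
# Interquartile range (IQR): spread of the middle 50% of the data
q1 = data['column_name'].quantile(0.25)
q3 = data['column_name'].quantile(0.75)
iqr = q3 - q1
print("IQR:", iqr)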
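
The following sketch computes the same dispersion measures on a short made-up series, chosen so the arithmetic can be checked by hand. It also shows the difference between the sample variance (pandas default, `ddof=1`) and the population variance (`ddof=0`).

```python
import pandas as pd

# Hypothetical measurements with mean 5 and squared deviations summing to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

value_range = values.max() - values.min()
sample_var = values.var()            # divides by n - 1 (pandas default)
population_var = values.var(ddof=0)  # divides by n
population_std = values.std(ddof=0)
iqr = values.quantile(0.75) - values.quantile(0.25)

print("Range:", value_range)
print("Sample variance:", sample_var)
print("Population variance:", population_var)
print("Population std:", population_std)
print("IQR:", iqr)
```

With only 8 values the two variance formulas differ noticeably (32/7 versus 32/8); on large datasets the gap becomes negligible.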
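# Skewness: asymmetry of the distribution (about 0 for symmetric data, positive for a long right tail)
skewness = data['column_name'].skew()
# Kurtosis: tail heaviness; pandas reports excess kurtosis (0 for a normal distribution)
kurtosis = data['column_name'].kurt()
print("Skewness:", skewness)
print("Kurtosis:", kurtosis)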
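
As a rough illustration of how to read these numbers, the sketch below compares a symmetric series with a right-skewed one. Both series are made up for this example.

```python
import pandas as pd

# Hypothetical data: one symmetric series, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])

sym_skew = symmetric.skew()       # approximately 0 for symmetric data
right_skew = right_skewed.skew()  # positive: the tail extends to the right
sym_kurt = symmetric.kurt()       # negative: flatter than a normal distribution

print("Symmetric skewness:", sym_skew)
print("Right-skewed skewness:", right_skew)
print("Symmetric excess kurtosis:", sym_kurt)
print("Mean vs median (skewed):", right_skewed.mean(), right_skewed.median())
```

In right-skewed data the mean usually sits above the median, which matches the positive skewness value.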
# Quartiles
q1 = data['column_name'].quantile(0.25)
q2 = data['column_name'].quantile(0.50) # Median
q3 = data['column_name'].quantile(0.75)
print("Quartiles:", q1, q2, q3)
# Percentile
percentile_90 = data['column_name'].quantile(0.90)
print("90th Percentile:", percentile_90)
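
A short sketch of the same quantile calls on a made-up series of scores from 1 to 100. By default pandas interpolates linearly between neighbouring values, so the results need not be values that appear in the data.

```python
import pandas as pd

# Hypothetical scores: the integers 1 through 100
scores = pd.Series(range(1, 101))

quartiles = scores.quantile([0.25, 0.50, 0.75])   # several quantiles in one call
p90 = scores.quantile(0.90)                       # 90th percentile
summary = scores.describe(percentiles=[0.1, 0.9]) # custom percentiles in describe()

print(quartiles)
print("90th percentile:", p90)
print(summary)
```

Passing a list to `quantile()` avoids repeating the call, and `describe(percentiles=...)` places custom percentiles alongside the usual summary statistics.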
# Frequency of each category
print(data['categorical_column'].value_counts())
# Cross-tabulation of two categorical columns
print(pd.crosstab(data['column1'], data['column2']))
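
Raw counts are often easier to compare as proportions. The sketch below uses a made-up survey table (column names and values are assumptions) to show `normalize` in both `value_counts` and `pd.crosstab`.

```python
import pandas as pd

# Hypothetical survey responses
survey = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "M"],
    "plan": ["basic", "basic", "pro", "pro", "pro", "basic"],
})

# Share of each category instead of raw counts
shares = survey["plan"].value_counts(normalize=True)

# Row-wise proportions: within each gender, the fraction choosing each plan
table = pd.crosstab(survey["gender"], survey["plan"], normalize="index")

print(shares)
print(table)
```

With `normalize="index"` every row of the cross-tabulation sums to 1, which makes groups of different sizes directly comparable.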
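# IQR method: flag values more than 1.5 * IQR beyond the quartiles (q1, q3, and iqr as computed earlier)
outliers = data[(data['column_name'] < (q1 - 1.5 * iqr)) |
                (data['column_name'] > (q3 + 1.5 * iqr))]
print(outliers)
# Z-score method: flag values more than 3 standard deviations from the mean
from scipy.stats import zscore
# nan_policy='omit' keeps missing values from turning every z-score into NaN
data['z_score'] = zscore(data['column_name'], nan_policy='omit')
outliers = data[data['z_score'].abs() > 3]
print(outliers)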
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
data['column_name'].hist()
plt.show()  # render each chart separately so the plots do not overlap
# Boxplot
sns.boxplot(x=data['column_name'])
plt.show()
# Bar Chart
data['categorical_column'].value_counts().plot(kind='bar')
plt.show()
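# Pairwise Pearson correlation; numeric_only=True skips text columns, which raise an error in pandas 2.x
correlation = data.corr(numeric_only=True)
print(correlation)
# Scatter plot of two numeric columns
sns.scatterplot(x='column1', y='column2', data=data)
plt.show()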
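
A correlation matrix lists every pair twice and includes the trivial diagonal of 1s. The sketch below, on a made-up student table (column names and values are assumptions), keeps only the upper triangle and picks out the most strongly related pair.

```python
import numpy as np
import pandas as pd

# Hypothetical student data; "name" is a text column and is excluded from the correlation
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "absences": [9, 7, 6, 4, 1],
    "name": ["a", "b", "c", "d", "e"],
})

corr = df.corr(numeric_only=True)

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair with the largest absolute correlation
strongest = pairs.abs().idxmax()

print(corr.round(3))
print("Strongest pair:", strongest, round(pairs[strongest], 3))
```

Ranking by absolute value matters because a strong negative relationship (such as absences versus score here) is just as informative as a strong positive one.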
# Proportion of missing values
missing = data.isnull().mean()
print(missing)
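# Impute missing values with the column mean
# (assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas and may not update the DataFrame)
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())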
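
Mean imputation is only one option. The sketch below shows three alternatives on a made-up table (column names and values are assumptions): the median for skewed numeric data, a group-wise mean, and the most frequent value for a categorical column.

```python
import pandas as pd

# Hypothetical employee data with gaps
df = pd.DataFrame({
    "dept": ["A", "A", "B", "B", "B"],
    "salary": [50.0, None, 80.0, 90.0, None],
    "level": ["junior", None, "senior", "senior", "senior"],
})

# Median imputation: less sensitive to outliers than the mean
median_filled = df["salary"].fillna(df["salary"].median())

# Group-wise imputation: fill each gap with the mean of its own department
group_means = df.groupby("dept")["salary"].transform("mean")
group_filled = df["salary"].fillna(group_means)

# Categorical imputation: fill with the most frequent value
level_filled = df["level"].fillna(df["level"].mode().iloc[0])

print(median_filled.tolist())
print(group_filled.tolist())
print(level_filled.tolist())
```

Which technique is appropriate depends on why the values are missing; none of them recovers the true values, so it is worth recording which cells were imputed.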
- Identify inconsistencies like out-of-range values or incorrect types.
- Remove duplicates.
# Remove duplicates
data = data.drop_duplicates()
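
The inconsistency checks listed above can be sketched as follows. The table, the column names, and the valid age range of 0 to 120 are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one non-numeric age, one impossible age
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "age": ["34", "29", "29", "abc", "-5"],
})

# Remove exact duplicate rows
clean = raw.drop_duplicates().copy()

# Fix the column type: values that cannot be parsed become NaN
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
unparseable = clean[clean["age"].isna()]

# Flag parsed values outside the assumed valid range
out_of_range = clean[clean["age"].notna() & ~clean["age"].between(0, 120)]

print(clean)
print("Unparseable ids:", unparseable["id"].tolist())
print("Out-of-range ids:", out_of_range["id"].tolist())
```

Flagging problem rows before deleting or correcting them keeps the cleaning step auditable.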
- Load a dataset of your choice and summarize its structure.
- Compute measures of central tendency and dispersion for a numerical column.
- Identify and visualize outliers using boxplots.
- Analyze correlations between numerical variables and plot a heatmap.
- Handle missing values using different imputation techniques.
Exploratory Data Analysis (EDA) helps in gaining a comprehensive understanding of datasets by summarizing their structure, detecting outliers, analyzing distributions, and visualizing relationships. Mastering EDA is a critical step for preparing data for advanced analytics and machine learning.