Data Cleaning.

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

Data cleaning is one of those things that everyone does but no one really talks about. Sure, it’s not the "sexiest" part of machine learning. And no, there aren’t hidden tricks and secrets to uncover.

Types of Errors

1. Missing Values
2. Bad Values
3. Duplicates
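
As a rough illustration of these three error types, the sketch below builds a tiny made-up DataFrame and counts each kind of problem. The column names and values are hypothetical and are not taken from property data.csv; treating any non-numeric entry in `rooms` as a "bad value" is also just an assumption for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value, one bad value, one duplicated row
df = pd.DataFrame({
    "street": ["PUTNAM", "LEXINGTON", "LEXINGTON", "BERKELEY", "TREMONT"],
    "rooms": [3, 2, 2, np.nan, "twelve"],
})

# 1. Missing values: cells that pandas already sees as NaN
missing_count = int(df["rooms"].isnull().sum())

# 2. Bad values: present, but not convertible to a number
as_numbers = pd.to_numeric(df["rooms"], errors="coerce")
bad_count = int((as_numbers.isnull() & df["rooms"].notnull()).sum())

# 3. Duplicates: rows that repeat an earlier row exactly
duplicate_count = int(df.duplicated().sum())

print(missing_count, bad_count, duplicate_count)
```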

Added a data cleaning example using the Python pandas library (2 October 2019), based on the property data.csv data set.

The data set is small compared to the data sets used for real machine learning models. I kept it simple to ease coding.

Useful Functions

Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter/IPython magics: these lines only work inside a notebook
%matplotlib inline
%load_ext autoreload
%autoreload 2

Loading dataset with pandas

# CSV file: we will use the property data.csv file
data = pd.read_csv("property data.csv")

# Loading an Excel file
data_from_excel = pd.read_excel("path to your file")

# Loading a JSON file
data_from_json = pd.read_json("Path where you saved the JSON file")

Viewing data

#Viewing the first 5 rows
data.head()

#viewing the last 5 rows
data.tail()

Inspecting the dataset

#Dataset shape
data.shape

#Dataset basic analysis
data.describe()
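
Beyond shape and describe(), a few other inspection calls are often handy. The sketch below uses a small hypothetical DataFrame in place of the real data set; the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data set
data = pd.DataFrame({
    "price": [100.0, 250.0, 175.0],
    "city": ["A", "B", "A"],
})

n_rows, n_cols = data.shape              # (number of rows, number of columns)
column_types = data.dtypes               # data type of each column
data.info()                              # prints non-null counts and dtypes
summary = data.describe(include="all")   # also summarizes non-numeric columns

print(n_rows, n_cols)
print(summary)
```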

Removing NAN, N/A & na

Remember that pandas' read_csv only treats a fixed set of default markers (for example NaN, NA, and N/A) as missing values, so a value recorded with another marker, such as na, is read as an ordinary string. The steps below solve that problem.

#Define a list holding every representation of a missing value

missing_values = [np.nan, 'N/A', 'na']

#Pass the list through the na_values parameter so each marker is read as NaN
data = pd.read_csv("sample_data.csv", na_values=missing_values)
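
To see the effect of na_values without needing a file on disk, here is a minimal sketch that parses a small in-memory CSV twice. The CSV content is hypothetical and only stands in for sample_data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for sample_data.csv
csv_text = "street,rooms\nPUTNAM,3\nLEXINGTON,na\nBERKELEY,N/A\n"

# Default parsing: "N/A" becomes NaN, but lowercase "na" stays a string
default = pd.read_csv(io.StringIO(csv_text))

# With na_values, the extra marker "na" is converted to NaN as well
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["na"])

default_missing = int(default["rooms"].isnull().sum())
cleaned_missing = int(cleaned["rooms"].isnull().sum())
print(default_missing, cleaned_missing)
```

Values listed in na_values are added to the default markers, so the defaults keep working unless keep_default_na=False is passed.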

Checking for any missing value:

You can check for missing values in different ways:

data.isnull()
#or
data.isnull().sum() 
#or 
data.isnull().any()
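
The three calls answer slightly different questions. The sketch below runs them on a small hypothetical DataFrame (the columns are made up for illustration) to show what each one returns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing cells
data = pd.DataFrame({
    "street": ["PUTNAM", np.nan, "BERKELEY"],
    "rooms": [3.0, np.nan, 1.0],
    "owner": ["Y", "N", "Y"],
})

mask = data.isnull()                       # True/False for every single cell
missing_per_column = data.isnull().sum()   # number of missing cells per column
has_missing = data.isnull().any()          # True if a column has any missing cell
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(has_missing)
```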

Visualizing the missing value with seaborn

sns.heatmap(data.isnull(), yticklabels=False, annot=True)

Removing missing values from the data set:

data = data.dropna(axis=0, how='any')

The how parameter controls which rows are removed: when how is set to 'any', a row is dropped if at least one of its values is missing; when how is set to 'all', a row is dropped only if all of its values are missing.
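
The difference between the two settings of how is easiest to see side by side. This is a minimal sketch on a hypothetical three-row DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is partly missing, row 2 is empty
data = pd.DataFrame({
    "street": ["PUTNAM", np.nan, np.nan],
    "rooms": [3.0, 2.0, np.nan],
})

# how="any": drop a row if at least one value is missing (keeps row 0 only)
dropped_any = data.dropna(axis=0, how="any")

# how="all": drop a row only if every value is missing (drops row 2 only)
dropped_all = data.dropna(axis=0, how="all")

print(len(dropped_any), len(dropped_all))
```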

Filling the missing values:

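#Forward fill: fills each missing value with the value above it.
#Equivalent to the older data.fillna(method="ffill"). Returns a new DataFrame.

data.ffill()

#Back fill: fills each missing value with the value below it.
#Equivalent to the older data.fillna(method="bfill").

data.bfill()

#Interpolation: estimates each missing value from its neighbours. With the default
#linear method, a single gap is filled with the average of the values above and below it.

data.interpolate()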
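
To compare the three strategies on the same input, here is a small sketch using a hypothetical Series with one gap. It uses the ffill() and bfill() methods, which behave like fillna with method="ffill" and method="bfill".

```python
import numpy as np
import pandas as pd

# Hypothetical series with a single gap in the middle
s = pd.Series([1.0, np.nan, 3.0])

forward = s.ffill()             # gap takes the value above it
backward = s.bfill()            # gap takes the value below it
interpolated = s.interpolate()  # gap takes the linear midpoint of its neighbours

print(forward.tolist(), backward.tolist(), interpolated.tolist())
```

None of these calls changes s itself; each returns a new object, so assign the result if you want to keep it.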

Filling the missing values with a specific known value:

data.fillna({
 'Column_to_substitute' : TheValue
 })
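
A concrete version of the snippet above might look like the sketch below. The column names and fill values are hypothetical; the point is that a dictionary lets each column get its own replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
data = pd.DataFrame({
    "rooms": [3.0, np.nan],
    "owner": ["Y", np.nan],
})

# Each key is a column name, each value is the fill value for that column
filled = data.fillna({"rooms": 0, "owner": "unknown"})

print(filled)
```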

Note that when a column or row has 80% or more of its values missing, dropping that row or column is usually the simplest way to treat it.
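
One possible way to apply this rule to columns is sketched below. The DataFrame, the threshold variable, and the other names are hypothetical; the 0.8 cutoff simply mirrors the 80% figure mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes" is entirely missing, "rooms" has one gap
data = pd.DataFrame({
    "street": ["A", "B", "C", "D", "E"],
    "notes": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "rooms": [1.0, np.nan, 3.0, 4.0, 5.0],
})

threshold = 0.8                         # assumed cutoff: 80% missing
missing_share = data.isnull().mean()    # fraction of missing cells per column

# Select and drop the columns whose missing share reaches the cutoff
columns_to_drop = missing_share[missing_share >= threshold].index
trimmed = data.drop(columns=columns_to_drop)

print(list(trimmed.columns))
```

The same idea works for rows by computing data.isnull().mean(axis=1) and filtering on that instead.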

Viewing columns

In pandas we use the code below to view all the columns in our dataset

#Viewing columns in data dataframe
data.columns

Changing the letter casing of our columns

#to lowercase (assign the result back, otherwise the column names are unchanged)
data.columns = data.columns.str.lower()

#to uppercase
data.columns = data.columns.str.upper()
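
The string methods on data.columns return a new Index rather than modifying the DataFrame, which is easy to miss. The sketch below shows this on a hypothetical DataFrame with made-up column names.

```python
import pandas as pd

# Hypothetical data with mixed-case column names
data = pd.DataFrame({"Street": ["PUTNAM"], "Num Rooms": [3]})

lowered = data.columns.str.lower()       # new Index; data is not changed yet
names_before = list(data.columns)

data.columns = data.columns.str.lower()  # assigning back applies the change
names_after = list(data.columns)

print(names_before, names_after)
```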

Renaming the columns

For example, to rename a column called Duration to Time, use the snippet below. The columns= keyword is needed; without it, rename looks for the name among the row labels instead.

data = data.rename(columns={"Duration": "Time"})
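
The sketch below shows rename on a small hypothetical DataFrame, including what happens when the columns= keyword is left out. The Distance column is made up for illustration.

```python
import pandas as pd

# Hypothetical data with the Duration column from the example above
data = pd.DataFrame({"Duration": [10, 20], "Distance": [1.5, 3.0]})

# With columns=, the column label is renamed
renamed = data.rename(columns={"Duration": "Time"})

# Without columns=, the mapping is applied to the row labels, so nothing matches
unchanged = data.rename({"Duration": "Time"})

print(list(renamed.columns), list(unchanged.columns))
```

rename returns a new DataFrame, so the result has to be assigned if the new name should stick.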

Get the latest snippets: https://colab.research.google.com/drive/18pYbCHhTQkjBGCYF2qM0-M0_pA6DfUso#scrollTo=h5RHxT3x4A6Y

