"Preparation is everything."~David Robinson
It's commonly said that data scientists spend 80% of their time preprocessing, cleaning, and manipulating data, and only 20% of their time analyzing it. The time spent cleaning is vital, since analyzing dirty data can lead you to draw inaccurate conclusions. Data cleaning is an essential task in data science: without properly preprocessed, cleaned data, the results of any data analysis or machine learning model could be inaccurate. In this repo, you will learn how to identify, diagnose, and treat a variety of data preprocessing and data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!
- dependent variable (the outcome; it is not determined by a column in which all values are unique): the height of the plant
- independent variables (responsible for determining the dependent variable): amount of water, amount of fertilizer
- dependent variable: Purshased_Any_Item
- independent variables: State, Profession, Age, Monthly_income
- Name does not determine the dependent variable (as all values in this column are unique), so it is not an independent variable and should be removed.
```python
independent = df[['State', 'Profession', 'Age', 'Monthly_income']]
dependent = df[['Purshased_Any_Item']]

# another way to split the data
X = df.drop(columns=['Name', 'Purshased_Any_Item'])
Y = df['Purshased_Any_Item']
```
```python
newdf = df.drop(columns=['Name'])
# or, to modify df itself:
df.drop(columns=['Name'], inplace=True)
```
```python
# display the number of rows and columns of the data set
df.shape

# display the number of non-null values for each column;
# from this step I can tell whether a column has a lot of nulls
# compared with the original number of rows
df.info()

# print the count of nulls for each column and their percentage
missing_data = pd.DataFrame({'total_missing': df.isnull().sum(),
                             'perc_missing': df.isnull().mean() * 100})
missing_data

# statistical description of the numerical variables
df.describe()
```
Check this Notebook
```python
column_mean = df['col'].mean()  # mean of a single column
columns_mean = df.mean(axis=0)  # mean of every column
columns_mean
```
Check those Notebooks
What are Data Types?
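A minimal sketch of inspecting and converting column data types in pandas (the file name and the column names here are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# inspect the data type of every column
print(df.dtypes)

# convert columns to more appropriate types
df['Quantity'] = df['Quantity'].astype(float)  # hypothetical numeric column
df['Date'] = pd.to_datetime(df['Date'])        # hypothetical date column
```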
```python
most_freq_items = df['col'].mode()       # all of the most frequent values (a Series)
most_freq = df['col'].mode()[0]          # the single most frequent value
df['col'] = df['col'].fillna(most_freq)  # fill missing values with the mode
```
Check this Notebook
Data cleaning means fixing bad data in your data set.
Bad data could be:
- empty cells
- data in the wrong format
- wrong data
- duplicates
What if you want certain values to be read as nulls? Assume that some cells contain '?' or 'UNDEFINED' and you want pandas to read them as null values:
```python
df = pd.read_csv('data.csv', na_values=['?', 'UNDEFINED'])
```
Deal with empty cells:
❱ Calculate the total number of rows that have missing values, then compute the percentage they represent.
❱ If the percentage of missing rows is very small, we can drop those rows; otherwise, we can fill the empty cells with a value (see the sketch after this list).
❱ Remove rows that contain empty cells.
❱ Replace all empty cells with values.
```python
# does not affect the original dataframe
newdf = df.dropna()
# affects the original dataframe
df.dropna(inplace=True)

# fill all empty cells in the dataframe with the value 130 (in the original dataframe)
df.fillna(130, inplace=True)
# fill all empty cells in a specific column with the value 130 (in the original dataframe)
df["Quantity"].fillna(130, inplace=True)

# replacing using the mean, median, or mode of one column
column_mean = df['Quantity'].mean()  # make sure that Quantity is a numeric column
df["Quantity"].fillna(column_mean, inplace=True)

# replacing using the mean of all columns at once
columns_mean = df.mean(axis=0)  # make sure the columns are numeric
df.fillna(columns_mean, inplace=True)
```
❱ Check this Notebook
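A minimal sketch of the drop-or-fill rule above; the 5% threshold is an assumption for illustration, not a fixed rule:

```python
# percentage of rows that contain at least one missing value
perc_missing_rows = df.isnull().any(axis=1).mean() * 100

# assumed threshold: if few rows are affected, drop them; otherwise fill
if perc_missing_rows < 5:
    df.dropna(inplace=True)
else:
    df.fillna(df.mean(numeric_only=True), inplace=True)
```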
Cells with data in the wrong format can make it difficult, or even impossible, to analyze the data. For example, the record "20201226" in a date column is wrongly formatted data that should be 2020-12-26.
Ways to fix wrong format data:
❱ convert all cells in the columns into the same format
❱ remove the rows that have the wrong format
```python
import pandas as pd

df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
```
- Converting the column to datetime gave us a NaT value, which can be handled as a NULL value, and we can remove the row by using the dropna() method.
```python
df.dropna(subset=['Date'], inplace=True)
```
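Note that pd.to_datetime raises an error by default on strings it cannot parse at all; passing errors='coerce' converts such values to NaT instead, so that dropna() can remove them:

```python
# unparseable dates become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
```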
Wrong data does not have to be empty cells or wrong format, it can just be wrong,
like if someone registered "199" instead of "1.99".
Suppose you have a data set of college courses where class durations are 2 or 3 hours. While checking the data set, you find that there are classes with a duration of 30 hours!
Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.
Ways to fix wrong data:
❱ Replacing Values
❱ Removing Rows
One way to fix wrong values is to replace them with something else that is suitable.
Set "Duration" = 45 in row 7:
```python
df.loc[7, 'Duration'] = 45
```
For small data sets you might be able to replace the wrong data one by one, but not for big data sets. To replace wrong data in larger data sets you can create some rules, e.g. set some boundaries for legal values, and replace any values that are outside of the boundaries.
```python
# set all values that are greater than 120 to 120 in the Duration column
df.loc[df['Duration'] > 120, 'Duration'] = 120

# the same operation using a loop
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
```
Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good chance you do not need them to do your analyses.
```python
# delete rows where Duration is greater than 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)
```
Discovering Duplicates
```python
print(df.duplicated())  # returns True for every row that is a duplicate, otherwise False
```
Removing Duplicates
```python
df.drop_duplicates(inplace=True)
```
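If rows should only count as duplicates when specific columns match, drop_duplicates accepts a subset; a small sketch (the column names are assumptions):

```python
# treat rows as duplicates when Name and Date match, keeping the first occurrence
df.drop_duplicates(subset=['Name', 'Date'], keep='first', inplace=True)
```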
Words that start with the symbol '@', e.g., @AnnaMedaris, are removed.
Hashtag symbols are removed, e.g., #depression is converted to depression.
```python
df['name'] = df['name'].str.strip('@')      # removes '@' characters from the start and end of each value
df['titles'] = df['titles'].str.strip('#')  # the .str accessor treats the column's values as strings
```
Delete any URL
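A minimal sketch for deleting URLs, assuming the text lives in a hypothetical 'text' column; str.replace with a regular expression strips anything that looks like a URL:

```python
# remove http(s):// and www. URLs from a hypothetical 'text' column
df['text'] = df['text'].str.replace(r'(https?://\S+|www\.\S+)', '', regex=True)
```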