Description: This SQL-based project delves into an important step of data analysis: data cleaning.

Data Cleaning: Data cleaning is the process of detecting and correcting faults or inconsistencies in a dataset by removing or modifying them so the data meets the quality standard required for analysis. It is an essential pre-processing activity because it determines how the data can be used in downstream modelling.
Here are the steps taken in this comprehensive data cleaning project to ensure data quality and consistency:
- **Standardizing Date Formats:** I used the `CONVERT` function to ensure consistent date formats across the dataset, improving data accuracy and analysis.
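  A minimal T-SQL sketch of this step; the table and column names (`NashvilleHousing`, `SaleDate`) are illustrative assumptions, not taken from the source:

  ```sql
  -- Add a new DATE column and populate it from the original datetime field.
  -- Table/column names are hypothetical.
  ALTER TABLE NashvilleHousing ADD SaleDateConverted DATE;

  UPDATE NashvilleHousing
  SET SaleDateConverted = CONVERT(DATE, SaleDate);
  ```

  Adding a new column rather than updating `SaleDate` in place keeps the original values available if the conversion needs to be verified.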
- **Handling Null Values:** Leveraging self-joins, I populated missing values in the Property Address field, enhancing data completeness.
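  One way the self-join could look; the join key `ParcelID` and the `UniqueID` column are assumptions for illustration:

  ```sql
  -- Fill NULL PropertyAddress values from other rows that share the
  -- same parcel identifier. Table/column names are hypothetical.
  UPDATE a
  SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
  FROM NashvilleHousing a
  JOIN NashvilleHousing b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
  WHERE a.PropertyAddress IS NULL;
  ```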
- **Address Breakdown:** I dissected the address field into separate components (Address, City, State) using functions like `PARSENAME`, `CHARINDEX`, and `SUBSTRING`. This allowed for better geospatial analysis.
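  A sketch of both parsing approaches, assuming a `PropertyAddress` of the form `street, city` and an `OwnerAddress` of the form `street, city, state` (column names are assumptions):

  ```sql
  -- SUBSTRING/CHARINDEX split on the first comma;
  -- PARSENAME splits on '.', so commas are replaced first.
  SELECT
      SUBSTRING(PropertyAddress, 1,
                CHARINDEX(',', PropertyAddress) - 1)      AS StreetAddress,
      SUBSTRING(PropertyAddress,
                CHARINDEX(',', PropertyAddress) + 1,
                LEN(PropertyAddress))                      AS City,
      PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1)        AS OwnerState
  FROM NashvilleHousing;
  ```

  `PARSENAME` numbers parts from right to left, so part `1` is the last component (here, the state).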
- **Boolean Value Transformation:** To enhance readability, I converted boolean values (1 and 0) to user-friendly labels (Yes and No).
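  This kind of transformation is typically done with a `CASE` expression; the column name `SoldAsVacant` is an assumption:

  ```sql
  -- Map 1/0 flags to readable labels, leaving any other values untouched.
  -- Column name is hypothetical.
  UPDATE NashvilleHousing
  SET SoldAsVacant = CASE WHEN SoldAsVacant = '1' THEN 'Yes'
                          WHEN SoldAsVacant = '0' THEN 'No'
                          ELSE SoldAsVacant
                     END;
  ```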
- **Duplicate Removal:** By utilizing the `ROW_NUMBER` function, I efficiently removed duplicate records, ensuring data integrity.
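  A common pattern is to number rows within groups of identical records and delete everything past the first row. The partition columns below are assumptions about what defines a duplicate:

  ```sql
  -- Rows sharing the same parcel, address, sale date, and price are
  -- treated as duplicates; column names are hypothetical.
  WITH RowNumCTE AS (
      SELECT *,
             ROW_NUMBER() OVER (
                 PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice
                 ORDER BY UniqueID
             ) AS row_num
      FROM NashvilleHousing
  )
  DELETE FROM RowNumCTE
  WHERE row_num > 1;
  ```

  Deleting through the CTE removes the underlying rows while keeping one copy (`row_num = 1`) of each duplicate group.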
- **Column Pruning:** I deleted unnecessary columns, streamlining the dataset for analysis. (Deleting data is not a common practice, so be absolutely sure before dropping anything.)
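  The pruning step would look something like the following; which columns are dropped depends entirely on the dataset, and the names here are placeholders:

  ```sql
  -- Drop columns that are no longer needed after cleaning.
  -- Column names are hypothetical; this is irreversible, so verify first.
  ALTER TABLE NashvilleHousing
  DROP COLUMN OwnerAddress, TaxDistrict;
  ```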