This project focuses on cleaning a dataset related to layoffs. The goal is to prepare the data for further analysis by removing duplicates, standardizing values, handling nulls, and removing unnecessary columns or rows.
-
Create a Staging Table:
- A staging table (
layoffs_staging
) is created to work with a copy of the original dataset (layoffs
).
- A staging table (
-
Identify Duplicates:
- Duplicates are identified using a
ROW_NUMBER()
function based on key fields such as company, location, industry, total laid off, percentage laid off, and date.
- Duplicates are identified using a
-
Remove Duplicates:
- Rows identified as duplicates are removed.
-
Trim Whitespaces:
- Whitespaces from fields such as
company
,industry
, andcountry
are trimmed.
- Whitespaces from fields such as
-
Standardize Industry and Location Names:
- Common names and formats are applied to fields like
industry
andcountry
for consistency. - For example, various forms of
Crypto
are standardized toCrypto
.
- Common names and formats are applied to fields like
-
Date Standardization:
- The date format is standardized and the data type of the
date
field is modified toDATE
.
- The date format is standardized and the data type of the
-
Identify and Fill Nulls:
- Identify records where critical fields (
total_laid_off
,percentage_laid_off
,industry
) are null or blank. - Industry values are filled in based on matching records.
- Identify records where critical fields (
-
Remove Records with Unresolvable Nulls:
- Records with critical fields that remain null after attempts to fill them are removed.
- Drop Unused Columns:
- The
row_numb
column, used to manage duplicates, is dropped after duplicates are removed.
- The
WITH duplicate_cte AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY
company, location, industry, total_laid_off, percentage_laid_off, `date`, stage,
country, funds_raised_millions) AS row_numb
FROM layoffs_staging
)
DELETE
FROM layoffs_staging2
WHERE row_numb > 1;