This repository contains data analyst portfolio projects completed by me for academic, self-learning, and hobby purposes, presented in the form of iPython Notebooks.
The objective of this project was to extract data from websites and available APIs. The resulting datasets were then transformed by cleaning, joining, and filtering them into nine tables, which were loaded into PostgreSQL, an object-relational database, using pgAdmin, completing a functional ETL pipeline.
The following Data Sources were used:
-
- Method: Webscraping extraction
- Used for: Collecting the IMDb Top 250 rated movie list
-
- Method: API Extraction
- Used for: Collecting the IMDb ID and other movie-related details such as actors, directors, etc.
-
- Method: API Extraction
- Used For: Collecting streaming options for Top 250 IMDb movies
-
- Method: API Extraction
- Used For: Collecting movies on Netflix released in the United States which have an IMDb rating between 7 and 10
-
- Method: Webscraping extraction
- Used for: Collecting streaming service availability and price
- The extracted data were formatted as CSV and JSON files
- The resulting datasets were then transformed by cleaning, joining, and filtering them into nine tables
- The datasets were loaded into PostgreSQL, an object-relational database, using pgAdmin.
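The API extraction steps above return JSON, from which a few movie fields are kept for the downstream tables. A minimal sketch of that parsing, using an OMDb-style response (the sample values and the exact field set kept are illustrative, not the project's actual code):

```python
import json

# Sample OMDb-style JSON response (hypothetical values; a real call
# to the OMDb API requires an API key and a network request).
sample_response = """{
    "Title": "The Shawshank Redemption",
    "Year": "1994",
    "imdbID": "tt0111161",
    "imdbRating": "9.3",
    "Director": "Frank Darabont",
    "Actors": "Tim Robbins, Morgan Freeman"
}"""

def parse_movie(raw):
    """Pull the fields used downstream out of one API response."""
    data = json.loads(raw)
    return {
        "imdb_id": data["imdbID"],
        "title": data["Title"],
        "year": int(data["Year"]),
        "rating": float(data["imdbRating"]),
        "director": data["Director"],
        "actors": [a.strip() for a in data["Actors"].split(",")],
    }

movie = parse_movie(sample_response)
```

Normalizing each response into a flat record like this is what makes the later joining and filtering steps straightforward.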
-
Extract:
Google scraping.ipynb
:- contains IMDb website and Google Search Engine webscraping
netflix_high_imdb_rated(uNoGS api).ipynb
:- contains IMDb website webscraping, OMDb API, and uNoGS API extraction
streaming_options(utelly api).ipynb
:- contains Utelly API extraction
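The webscraping notebooks above parse movie titles and ratings out of page HTML. A runnable sketch of the idea using only the standard library (the sample markup is invented for illustration; IMDb's real pages are richer, and the notebooks may use a library such as BeautifulSoup instead):

```python
from html.parser import HTMLParser

# Minimal markup standing in for a Top 250 listing page.
SAMPLE_HTML = """
<table>
  <tr><td class="title">The Shawshank Redemption</td><td class="rating">9.3</td></tr>
  <tr><td class="title">The Godfather</td><td class="rating">9.2</td></tr>
</table>
"""

class TopMoviesParser(HTMLParser):
    """Collect (title, rating) pairs from td.title / td.rating cells."""
    def __init__(self):
        super().__init__()
        self.current = None    # class of the <td> we are inside, if any
        self.rows = []         # accumulated (title, rating) tuples
        self._title = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current is None:
            return
        if self.current == "title":
            self._title = text
        elif self.current == "rating" and self._title is not None:
            self.rows.append((self._title, float(text)))
            self._title = None

    def handle_endtag(self, tag):
        if tag == "td":
            self.current = None

parser = TopMoviesParser()
parser.feed(SAMPLE_HTML)
```

On a live page, the same parser would be fed the HTML fetched from the URL rather than an inline string.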
-
Transform:
Transform.ipynb
:- contains the cleaning, joining, and filtering of all datasets into nine tables
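The transform step joins movie records with their streaming options and filters by rating. A toy sketch of that logic in plain Python (the records and the 7-to-10 rating window mirror the project description; the field names are illustrative):

```python
# Toy records standing in for the extracted CSV/JSON rows (values are made up).
movies = [
    {"imdb_id": "tt0111161", "title": "The Shawshank Redemption", "rating": 9.3},
    {"imdb_id": "tt0000001", "title": "Some Low-Rated Film", "rating": 4.1},
]
streaming = [
    {"imdb_id": "tt0111161", "service": "Netflix"},
    {"imdb_id": "tt0111161", "service": "Hulu"},
]

def transform(movies, streaming, lo=7.0, hi=10.0):
    """Filter movies by rating, then join streaming options on imdb_id."""
    by_id = {}
    for opt in streaming:
        by_id.setdefault(opt["imdb_id"], []).append(opt["service"])
    return [
        {**m, "services": by_id.get(m["imdb_id"], [])}
        for m in movies
        if lo <= m["rating"] <= hi
    ]

result = transform(movies, streaming)
```

In the notebook this kind of merge would typically be done with pandas, but the join-on-key-then-filter shape is the same.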
-
Load:
SQL folder
:- contains ERD and schema
SQL_Table folder
:- contains the SQL that creates all nine tables in PostgreSQL via pgAdmin