This is the final project of the Data Engineer Nanodegree Program at Udacity, putting into practice everything taught during the degree.
Immigration affects the receiving country on many fronts, from the labor market to cultural change. It is essential for policymakers to understand immigration trends in order to better address the needs of an ever-changing population, as well as to foster international cooperation with the countries immigrants come from.
The goal of this project is to extract insights from U.S. immigration data. Additional information can be obtained by correlating immigration data with U.S. city demographics, the airports available in the receiving cities, and even average temperatures throughout the year.
To achieve this goal, I will be using Amazon Web Services (AWS) S3 buckets, Apache Airflow and Apache Spark to populate a Data Lake residing on S3. Given the reasonable size of the data (available in this repository through Git LFS), the ETL pipeline, represented as an Airflow DAG, will be run locally, as will the subsequent analytic queries.
Raw data will be cleaned and uploaded to S3. Then, the STAR dimensional tables will be extracted and stored on S3 to form the Data Lake. Finally, these tables will be queried for different analytic purposes. This pipeline is meant to be run on demand rather than on a schedule, since the datasets are static and do not come from any streaming or updating source. All data operations are performed with Spark for best performance. The following image represents the final DAG:
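Each task in the final DAG ultimately runs a Spark job. As a flavour of what one cleaning step might look like, here is a minimal sketch; the bucket name, output path and filter column are illustrative assumptions rather than the project's actual code:

```python
# Minimal sketch of a Spark cleaning step, similar in spirit to the DAG's tasks.
# The bucket name, output prefix and filter column are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean_immigration_sample").getOrCreate()

# Read the bundled immigration sample (Spark handles the bz2 compression transparently)
raw = spark.read.csv(
    "data/i94_inmigration_data_2016/data_sample.csv.bz2",
    header=True,
    inferSchema=True,
)

# Drop duplicates and rows missing a key column (column name is hypothetical)
cleaned = raw.dropDuplicates().filter(F.col("i94yr").isNotNull())

# Write the cleaned data as parquet to the S3 bucket configured in dl.cfg
cleaned.write.mode("overwrite").parquet("s3a://<your-bucket>/clean/i94_immigration/")
```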
It is however worth considering whether this action plan would also be appropriate in different scenarios:
- The data is 100x larger:
- The size of the data would make it infeasible to process on a standard consumer PC or laptop, or even on a medium-sized compute server. It would therefore make sense to use cloud computing services such as AWS EC2 machines or AWS EMR clusters for all data-heavy operations. In addition, all data (from raw to cleaned) would be stored on S3 or, alternatively, on HDFS partitions for quicker access from Spark.
- The data populates a dashboard that must be updated on a daily basis by 7am every day:
- If the dashboard must be updated daily, the data is also growing daily. Therefore, similar to the previous point, more computing resources would be needed to handle this additional requirement. The Airflow DAG would then need to be scheduled so that it finishes by 7am every day. This could be achieved by scheduling the pipeline to start a few hours earlier (depending on the estimated total compute time) and by setting a Service Level Agreement (SLA), as sketched after this list.
- The database needed to be accessed by 100+ people:
- With the data stored in S3 buckets, this should not be an issue, as long as each user has the necessary permissions to access them. Amazon S3 is designed to support very high request rates.
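For the daily-by-7am scenario, the on-demand DAG could be given a schedule and an SLA roughly as follows. This is only a sketch assuming Airflow 2.x; the DAG id, start date and placeholder tasks are not the project's actual ones:

```python
# Sketch: adapting an on-demand DAG to a daily 7am deadline (Airflow 2.x assumed).
# DAG id, start date and the placeholder tasks are illustrative, not the project's.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="capstone_etl_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 3 * * *",              # start at 3am, well before the 7am deadline
    default_args={"sla": timedelta(hours=4)},   # alert if tasks are still running near 7am
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    finish = EmptyOperator(task_id="finish")
    start >> finish
```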
The project uses the following data sets:
- I94 Immigration Data: This data comes from the US National Tourism and Trade Office and covers the year 2016. A data dictionary is available in `data/i94_inmigration_data_2016/schema.json`. Since this data set contains many records, a smaller data sample is available in `data/i94_inmigration_data_2016/data_sample.csv.bz2`.
- World Temperature Data: Global land and ocean-and-land temperatures (source).
- U.S. City Demographic Data: This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000 (source).
- Airport Code Table: This is a simple table of airport codes and corresponding cities (source).
The Data Lake follows a STAR schema, where tables can be joined on a `city_id` field that uniquely identifies a city in a US state. A total of five tables were extracted: `dim_cities`, `dim_airports`, `fact_temps`, `fact_us_demogr` and `fact_immigration`. A brief description of each table, its source datasets and the available columns can be found in `data/star_schema.json`.
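Because every table shares the `city_id` key, analytic queries reduce to straightforward joins. The sketch below shows what such a query could look like with Spark; the S3 prefix and the grouping columns are assumptions, not the project's actual layout:

```python
# Sketch of an analytic query over the star schema: join a fact table to dim_cities
# on the shared city_id key. The S3 prefix and grouping columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star_schema_analytics").getOrCreate()

dim_cities = spark.read.parquet("s3a://<your-bucket>/star/dim_cities/")
fact_immigration = spark.read.parquet("s3a://<your-bucket>/star/fact_immigration/")

# Top 10 cities by number of immigration records
arrivals_per_city = (
    fact_immigration.join(dim_cities, on="city_id", how="inner")
    .groupBy("city", "state_code")  # hypothetical dim_cities columns
    .count()
    .orderBy("count", ascending=False)
)
arrivals_per_city.show(10)
```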
After running the whole project, the objects in S3 should match those listed in `data/s3_inventory.txt`. Data profiling reports for all tables except `fact_immigration` are available in `data/profiling_reports`.
The following diagram shows how the resulting dimensional tables relate to each other:
To make use of this project, I recommend managing the required dependencies with Anaconda.
Install miniconda:
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
Install mamba:
```bash
conda install -n base -c conda-forge mamba
```
Install environment using provided file:
```bash
mamba env create -f environment.yml  # alternatively use environment_core.yml if base system is not debian
mamba activate de_capstone
```
To start a local Apache Airflow server for the purposes of this project, simply run the following:
```bash
bash initialize_airflow.sh
```
Enter your desired password when prompted, then access the UI at `localhost:8080` with user `admin` and the password you just created.
Create an IAM user:
- IAM service is a global service, meaning newly created IAM users are not restricted to a specific region by default.
- Go to AWS IAM service and click on the "Add user" button to create a new IAM user in your AWS account.
- Choose a name of your choice.
- Select "Programmatic access" as the access type. Click Next.
- Choose the "Attach existing policies directly" tab and select the "AdministratorAccess" policy. This is solely for the purposes of this project and is not recommended in a production environment. Click Next.
- Skip adding any tags. Click Next.
- Review and create the user. You will be shown an access key ID and a secret access key.
- Take note of both; this pair is collectively known as an access key.
Save access key and secret locally:
- Create a new file, `_user.cfg`, and add the following:

  ```
  AWS_ACCESS_KEY_ID = <YOUR_AWS_KEY>
  AWS_SECRET_ACCESS_KEY = <YOUR_AWS_SECRET>
  ```

- This file will be loaded internally to connect to AWS and perform various operations (see the sketch after this list).

- DO NOT SHARE THIS FILE WITH ANYONE! I recommend adding it to `.gitignore` to avoid accidentally pushing it to a git repository:

  ```bash
  printf "\n_user.cfg\n" >> .gitignore
  ```
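For reference, one way a sectionless key=value file like `_user.cfg` could be read and exposed as environment variables is sketched below; this is an assumption about the loading mechanism, not necessarily how the project's code does it:

```python
# Hedged sketch: load _user.cfg (plain KEY = VALUE lines, no [section] header) and
# export the credentials as environment variables so boto3/Spark can pick them up.
# The project's actual loading code may differ.
import configparser
import os

parser = configparser.ConfigParser()
with open("_user.cfg") as f:
    # Prepend a dummy section header because configparser requires one
    parser.read_string("[default]\n" + f.read())

os.environ["AWS_ACCESS_KEY_ID"] = parser["default"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = parser["default"]["AWS_SECRET_ACCESS_KEY"]
```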
Set configuration values:
Fill in the `dl.cfg` configuration file. It is needed to create the S3 bucket that will hold all the data and to specify the region where this bucket should reside.
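As a reference for what happens behind the scenes, creating a bucket in a given region with boto3 looks roughly like the sketch below; the `dl.cfg` section and key names and the bucket value are assumptions, not the project's actual ones:

```python
# Hedged sketch: create the project S3 bucket with boto3 using values from dl.cfg.
# The [S3] section, key names and bucket value below are illustrative assumptions.
import configparser

import boto3

config = configparser.ConfigParser()
config.read("dl.cfg")

region = config["S3"]["REGION"]       # assumed key layout
bucket = config["S3"]["BUCKET_NAME"]  # assumed key layout

s3 = boto3.client("s3", region_name=region)
if region == "us-east-1":
    # us-east-1 must not receive a LocationConstraint
    s3.create_bucket(Bucket=bucket)
else:
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
```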
DO NOT FORGET TO DELETE YOUR S3 BUCKETS WHEN FINISHED WORKING ON THE PROJECT TO AVOID UNWANTED COSTS!
Simply follow along with the main notebook of this project: `notebooks/main.ipynb`.
Source files were formatted using the following commands:

```bash
isort .
autoflake -r --in-place --remove-unused-variable --remove-all-unused-imports --ignore-init-module-imports .
black .
```
Distributed under the MIT License. See `LICENSE` for more information.
GitHub - Google Scholar - LinkedIn - Twitter
This README includes a summary of the official project description provided to the students of the Data Engineer Nanodegree Program at Udacity.