The source data comes from Amazon Data Exchange and the US Harmonized Tariff Schedule (HTS). It covers US imports from 2018 to 2020, and I assume 2021 data will be added once it becomes available.
Each year's data contains 11 datasets: header, bill, cargo description, hazmat, hazmat class, tariff, consignee, notified party, shipper, container, and mark number. We will also use the latest HTS data.
For information on these datasets and their columns, refer to this notebook. You can also refer to the Amazon Data Exchange overview page.
This data pipeline outputs four datasets as CSV files, each aggregating one year of import data. Here are the table schemas (from Spark) and how they are assembled; a rough sketch of one of the joins follows the list.
Header table -
From joining the header and bill datasets
Refer to the notebook and script for more details
Cargo table -
From joining the cargo description, hazmat, hazmat class, tariff, and HTS datasets. Each row in each dataset describes a cargo in a shipment.
Refer to the notebook and script for more details
Contact table -
From joining the consignee, notified party, and shipper datasets, which all describe contact parties for a shipment
Refer to the notebook and script for more details
Container table -
From joining the container and mark datasets.
Refer to the notebook and script for more details
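For illustration, here is a minimal PySpark sketch of how the header table could be assembled. The file paths, formats, and the join key ("identifier") are assumptions, not the exact schema; the notebook and script are the source of truth.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("us-import-header").getOrCreate()

# Paths, file format, and column names here are hypothetical; the notebook and
# script define the real schema and locations.
header = spark.read.csv("s3://us-import/raw/2020/header/", header=True)
bill = spark.read.csv("s3://us-import/raw/2020/bill/", header=True)

# Join the header and bill datasets on the shipment identifier and write one
# aggregated CSV dataset for the year.
header_table = header.join(bill, on="identifier", how="left")
header_table.write.mode("overwrite").csv("s3://us-import/output/2020/header/", header=True)
```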
Summary:
- Check that the specified buckets (one for data storage, and one for logs) are available
- Load the data into the storage bucket using the Data Exchange API and scripts
- Create an EMR cluster and submit the necessary Spark jobs to it
- Terminate the cluster when everything is done
See the DAG file for details; a simplified sketch of the DAG follows.
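Task names, operator choices, and import paths in this sketch are illustrative assumptions (operator module paths vary between Airflow versions); the actual DAG file defines the real workflow.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

# Hypothetical callables; the real project keeps this logic in its own scripts.
def check_buckets(**context):
    """Verify the storage and log buckets exist before anything else runs."""

def load_data_exchange_assets(**context):
    """Export the Data Exchange assets into the storage bucket."""

with DAG(
    dag_id="us_import_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually; see the scheduling note below
    catchup=False,
) as dag:
    check = PythonOperator(task_id="check_buckets", python_callable=check_buckets)
    load = PythonOperator(task_id="load_data", python_callable=load_data_exchange_assets)
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides={},  # cluster configuration goes here
    )
    # EMR step operators/sensors that submit and wait for the Spark jobs would sit here.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
    )

    check >> load >> create_cluster >> terminate_cluster
```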
- Set up AWS credentials on your local machine or wherever you will host the Airflow application
- Fill in the variables JSON file (the latest HTS Excel link can be found at the source); see the example after this list
- Create the storage and log buckets (if your storage bucket name is "us-import", you also need a "us-import-logs" bucket to store EMR logs)
- Run Airflow and load the JSON file
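For reference, a hypothetical variables JSON might look like the snippet below; the key names and values are placeholders, so match them to whatever the DAG actually reads.

```json
{
    "storage_bucket": "us-import",
    "log_bucket": "us-import-logs",
    "hts_excel_url": "<link to the latest HTS Excel file from the source>"
}
```

On Airflow 2.x the file can then be loaded with something like `airflow variables import variables.json` (Airflow 1.10 used `airflow variables -i variables.json`), or through the web UI.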
After the setup, feel free to tweak your Airflow settings and the DAGs.
Make sure Airflow knows where to find the Airflow folder (for example, via the AIRFLOW_HOME environment variable), then:
To start the scheduler:
airflow scheduler
To start the web server:
airflow webserver -p 8080
For more information on Airflow, go to its site
- What is the goal? Why did I choose the data model?
The goal is to aggregate the import data so that it is easier to analyze further or export to data warehouses. I chose these models because they simplify the original datasets and are suitable for further analysis or data modelling.
- The reasoning behind the choice of technologies
Airflow is suitable for orchestrating an ETL workflow, especially one for a data lake. Spark is suitable for handling big data, and these datasets all have millions of rows. AWS services integrate well with Amazon Data Exchange, where the original data resides. EMR is a powerful, managed tool for running Spark jobs.
- How often should the data be updated?
All of the previous years' data can be retrieved right now and will not be updated further. Once 2021 data comes in, it is conventionally updated once a week.
- What if the data were increased by 100 times?
If the data were increased by 100 times, the output could no longer be stored in a single partition and should be split across several partitions, as sketched below. Spark should still be able to handle the data, but you might want to add more nodes to speed up the processing if it becomes too slow.
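For instance, the Spark write could repartition the output (using the header_table DataFrame from the earlier sketch) instead of producing a single file per table; the partition column and paths here are assumptions.

```python
# Split the output across many files so a 100x larger dataset can be written and
# read in parallel; "arrival_year" is a hypothetical partition column.
header_table.repartition(200) \
    .write.mode("overwrite") \
    .partitionBy("arrival_year") \
    .csv("s3://us-import/output/header/", header=True)
```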
- What if the pipeline were run daily at 7 a.m.?
Airflow could be set up to do that, but the data won't get updated every day, so a daily run is unnecessary.
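If a daily 7 a.m. run were ever wanted, the DAG's schedule could simply be switched to a cron expression, for example:

```python
from datetime import datetime
from airflow import DAG

# "0 7 * * *" runs the DAG every day at 07:00.
with DAG(
    dag_id="us_import_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 7 * * *",
    catchup=False,
) as dag:
    ...  # same tasks as in the sketch above
```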
- What if the datasets needed to be accessed by 100+ people?
If the datasets needed to be accessed by 100+ people, an S3 bucket should be able to handle that. If the output data were to be copied to tables in a cloud data warehouse, it would depend on the warehouse's specifications; both Redshift and BigQuery should be able to handle that, but cost could be a concern.
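If the outputs were copied into Redshift, for example, the usual route is a COPY from S3. A rough sketch is below; the connection details, target table, and IAM role are all placeholders, not part of this project.

```python
import psycopg2

# All connection details, the target table, and the IAM role ARN are placeholders.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="imports",
    user="analyst",
    password="...",
)
with conn, conn.cursor() as cur:
    # Load the aggregated header CSVs for one year into a Redshift table.
    cur.execute("""
        COPY header
        FROM 's3://us-import/output/2020/header/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV IGNOREHEADER 1;
    """)
```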