
US import data pipeline

The sources are Amazon Data Exchange and the US Harmonized Tariff Schedule (HTS). The import data covers US imports from 2018 to 2020, and I assume it will also include 2021 data when that becomes available.

Input data

Each year's data contains 11 datasets: header, bill, cargo description, hazmat, hazmat class, tariff, consignee, notified party, shipper, container, and mark number. We will also use the latest HTS data.

For information on these datasets and their columns, refer to this notebook. You can also refer to the Amazon Data Exchange overview page.
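As a rough illustration of how the raw inputs can be loaded, the sketch below reads each dataset for one year into Spark. The bucket name, prefix layout, and dataset folder names are assumptions; the actual paths are defined in the notebook and scripts.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("us-import-raw-load").getOrCreate()

# Hypothetical S3 layout; the real locations come from the Data Exchange export step.
BUCKET = "s3://us-import"
YEAR = 2020
DATASETS = [
    "header", "bill", "cargo_description", "hazmat", "hazmat_class",
    "tariff", "consignee", "notified_party", "shipper", "container", "mark_number",
]

# Read every raw dataset for the year into a dictionary of DataFrames.
raw = {
    name: spark.read.csv(f"{BUCKET}/raw/{YEAR}/{name}/", header=True, inferSchema=True)
    for name in DATASETS
}

raw["header"].printSchema()  # the schemas below can be printed this way
```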

Output data

This data pipeline outputs four datasets as CSV files that aggregate each year of import data. Here are the table schemas (printed from Spark) and how the tables are assembled.

Fact

Header table -

From joining the header and bill datasets

bill table schema

Refer to the notebook and script for more details
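Continuing the loading sketch above, the header/bill join could look roughly like this. The join key ("identifier") and the output path are assumptions; the authoritative logic is in the notebook and script.

```python
# Sketch only: join the header and bill datasets on a shared shipment identifier
# and write the result out as the yearly header (fact) dataset.
fact_header = raw["header"].join(raw["bill"], on="identifier", how="left")

(fact_header
 .write
 .mode("overwrite")
 .csv(f"{BUCKET}/output/{YEAR}/header/", header=True))
```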

Dimension

Cargo table -

From joining the cargo description, hazmat, hazmat class, tariff, and HTS datasets. Each row in each of these datasets describes one cargo item in a shipment.

cargo table schema

Refer to the notebook and script for more details
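A rough sketch of the multi-way join is below, assuming the cargo-related datasets share the shipment identifier (plus a container number) and that tariff rows carry an HTS number matching the HTS reference data; all of these column names are assumptions.

```python
# Sketch only: chain the cargo-related datasets together and enrich tariff rows
# with the HTS reference data. Join keys and column names are assumptions.
hts_df = spark.read.csv(f"{BUCKET}/raw/hts/", header=True, inferSchema=True)

cargo = (raw["cargo_description"]
         .join(raw["hazmat"], ["identifier", "container_number"], "left")
         .join(raw["hazmat_class"], ["identifier", "container_number"], "left")
         .join(raw["tariff"], ["identifier", "container_number"], "left")
         .join(hts_df, "hts_number", "left"))
```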

Contact table -

From joining the consignee, notified party, and shipper datasets, which are all contact parties for a shipment

contact table schema

Refer to the notebook and script for more details
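One plausible way to assemble a single contact table, shown below as an assumption rather than the script's actual approach, is to tag each party dataset with its role and stack them, so every row records one contact party for a shipment.

```python
from functools import reduce
from pyspark.sql import functions as F

# Sketch: tag each party dataset with a contact_type column and stack them.
# Assumes the three datasets have compatible columns (Spark 3.1+ for
# allowMissingColumns).
parties = {
    "consignee": raw["consignee"],
    "notified_party": raw["notified_party"],
    "shipper": raw["shipper"],
}

tagged = [df.withColumn("contact_type", F.lit(role)) for role, df in parties.items()]
contact = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), tagged)
```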

Container table -

From joining the container and mark datasets.

container table schema

Refer to the notebook and script for more details

Pipeline

US import ETL

dag

Summary:

  1. Check that the specified buckets (one for data storage, and one for logs) are available
  2. Load the data into the storage bucket using the Data Exchange API and scripts
  3. Create an EMR cluster and submit the necessary jobs to it
  4. Terminate the cluster when everything is done

Details are in the dag file
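The snippet below is not the actual DAG, just a minimal sketch of the same shape, assuming Airflow 2 with the Amazon provider; the operator choices, task IDs, and parameters are assumptions, and the EMR cluster/step configuration is omitted.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)


def check_buckets(**_):
    """Placeholder: verify the data and log buckets exist (e.g. with boto3)."""


def load_raw_data(**_):
    """Placeholder: export the Data Exchange datasets into the data bucket."""


with DAG(
    dag_id="us_import_etl_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    check = PythonOperator(task_id="check_buckets", python_callable=check_buckets)
    load = PythonOperator(task_id="load_raw_data", python_callable=load_raw_data)
    create_cluster = EmrCreateJobFlowOperator(task_id="create_emr_cluster")
    add_steps = EmrAddStepsOperator(
        task_id="add_spark_steps",
        job_flow_id=create_cluster.output,
        steps=[],  # the real Spark steps live in the repo's scripts
    )
    # In practice a step sensor would wait for the jobs before terminating.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
    )

    check >> load >> create_cluster >> add_steps >> terminate_cluster
```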

What you need to run the pipeline

  • Set up AWS credentials on your local machine or wherever you will host the Airflow application
  • Fill in the variable JSON file (the latest HTS Excel link can be found at the source); see the sketch below for how a DAG reads these variables
  • Create the storage and log buckets (if your storage bucket name is "us-import", you also need a "us-import-logs" bucket to store the EMR logs)
  • Run Airflow and load the JSON file

After the setup, feel free to tweak your Airflow settings and the DAGs.
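The keys in the repo's variable JSON file are not reproduced here, so the names below are hypothetical; once the file is imported through the Airflow UI or CLI, a DAG can read the values like this:

```python
from airflow.models import Variable

# Hypothetical variable names; the real keys are defined in the repo's JSON file.
storage_bucket = Variable.get("us_import_storage_bucket")
hts_excel_link = Variable.get("us_import_hts_excel_link")

# Variables can also hold JSON; deserialize_json parses the value into a dict.
emr_config = Variable.get("us_import_emr_config", deserialize_json=True, default_var={})
```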

Running Airflow

Make sure Airflow knows where to find the Airflow folder (for example, by setting the AIRFLOW_HOME environment variable), then

to start the scheduler:

airflow scheduler

to start the web server:

airflow webserver -p 8080

For more information on Airflow, go to its site

Project write-up

  1. What is the goal? Why did I choose the data model?

The goal is to aggregate the import data so that it is easier to analyze further or export to data warehouses. I chose these models because they simplify the original datasets and are suitable for further analysis or data modelling.

  2. The reasoning behind the choice of technologies

Airflow is well suited to orchestrating an ETL workflow, especially one for a data lake. Spark is suitable for handling big data, and these datasets all have millions of rows. AWS services integrate well with Amazon Data Exchange, where the original data resides. EMR is a powerful, managed service for running Spark jobs.

  3. How often should the data be updated?

All the previous years' data can be retrieved right now and will not be updated further. Once the 2021 data starts coming in, it is conventionally updated once a week.

  4. If the data were increased 100 times, the output could not be stored in one partition and should therefore be split into several partitions (see the sketch after this list). Although Spark should be able to handle the data, you might want to add more nodes to speed up processing if it becomes too slow.

  5. If the pipeline were run daily at 7 a.m., Airflow could be scheduled to do that, but the data would not actually change every day.

  6. If the datasets needed to be accessed by 100+ people, an S3 bucket should be able to handle that. If the output data were to be copied over to tables in a cloud warehouse, it would depend on the warehouse's specifications. Both Redshift and BigQuery should be able to handle that, but cost could be a concern.
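Picking up the earlier Spark sketches, point 4 might translate into a partitioned write along these lines; the partition column and count are assumptions.

```python
# Sketch: at 100x the volume, write the table across many partitions instead of
# a single CSV. The partition column and count are assumptions.
(fact_header
 .repartition(200, "arrival_date")   # spread the work across the cluster
 .write
 .mode("overwrite")
 .partitionBy("arrival_date")        # one output folder per arrival date
 .csv(f"{BUCKET}/output/{YEAR}/header/", header=True))
```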
