The source data comes from Amazon Data Exchange and the US Harmonized Tariff Schedule (HTS). It covers US imports from 2018 to 2020, and I assume 2021 data will be added once it becomes available.
Each year's data contains 11 datasets: header, bill, cargo description, hazmat, hazmat class, tariff, consignee, notified party, shipper, container, and mark number. We will also use the latest HTS data.
For information on these datasets and their columns, refer to this notebook. You can also refer to the Amazon Data Exchange overview page.
This data pipeline outputs four datasets as CSV files, each aggregating one year of import data. Here are the table schemas (from Spark) and how they are assembled; a rough sketch of one of the joins follows the list.
Header table -
From joining the header and bill datasets
Refer to the notebook and script for more details
Cargo table -
From joining the cargo description, hazmat, hazmat class, tariff, and HTS datasets. Each row in each dataset describes a cargo in a shipment.
Refer to the notebook and script for more details
Contact table -
From joining the consignee, notified party, and shipper datasets, which all describe contact parties for a shipment
Refer to the notebook and script for more details
Container table -
From joining the container and mark datasets.
Refer to the notebook and script for more details
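For illustration, here is a minimal PySpark sketch of how the header table could be assembled. The file paths, formats, and the join key ("identifier") are assumptions, not the exact schema; the notebook and script are the source of truth.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("us-import-header").getOrCreate()

# Paths, file format, and column names here are hypothetical; the notebook and
# script define the real schema and locations.
header = spark.read.csv("s3://us-import/raw/2020/header/", header=True)
bill = spark.read.csv("s3://us-import/raw/2020/bill/", header=True)

# Join the header and bill datasets on the shipment identifier and write one
# aggregated CSV dataset for the year.
header_table = header.join(bill, on="identifier", how="left")
header_table.write.mode("overwrite").csv("s3://us-import/output/2020/header/", header=True)
```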
Summary:
- Check that the specified buckets (one for data storage, and one for logs) are available
- Load the data into the storage bucket using the Data Exchange API and scripts
- Create an EMR cluster and submit the necessary Spark jobs to it
- Terminate the cluster when everything is done
See the DAG file for details; a simplified sketch of the DAG follows.
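Task names, operator choices, and import paths in this sketch are illustrative assumptions (operator module paths vary between Airflow versions); the actual DAG file defines the real workflow.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

# Hypothetical callables; the real project keeps this logic in its own scripts.
def check_buckets(**context):
    """Verify the storage and log buckets exist before anything else runs."""

def load_data_exchange_assets(**context):
    """Export the Data Exchange assets into the storage bucket."""

with DAG(
    dag_id="us_import_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually; see the scheduling note below
    catchup=False,
) as dag:
    check = PythonOperator(task_id="check_buckets", python_callable=check_buckets)
    load = PythonOperator(task_id="load_data", python_callable=load_data_exchange_assets)
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides={},  # cluster configuration goes here
    )
    # EMR step operators/sensors that submit and wait for the Spark jobs would sit here.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
    )

    check >> load >> create_cluster >> terminate_cluster
```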
- Set up AWS credentials on your local machine or wherever you will host the Airflow application
- Fill in the variables JSON file (the latest HTS Excel link can be found at the source); see the example after this list
- Create the storage and log buckets (if your storage bucket name is "us-import", you also need a "us-import-logs" bucket to store EMR logs)
- Run Airflow and load the JSON file
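For reference, a hypothetical variables JSON might look like the snippet below; the key names and values are placeholders, so match them to whatever the DAG actually reads.

```json
{
    "storage_bucket": "us-import",
    "log_bucket": "us-import-logs",
    "hts_excel_url": "<link to the latest HTS Excel file from the source>"
}
```

On Airflow 2.x the file can then be loaded with something like `airflow variables import variables.json` (Airflow 1.10 used `airflow variables -i variables.json`), or through the web UI.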
After the setup, feel free to tweak your Airflow settings and the DAGs.
Make sure Airflow knows where to find the Airflow folder (for example, via the AIRFLOW_HOME environment variable), then:
To start the scheduler:
airflow scheduler
To start the web server:
airflow webserver -p 8080
For more information on Airflow, go to its site
- What is the goal? Why did I choose the data model?
The goal is to aggregate the import data so that it is easier to analyze further or export to data warehouses. I chose these models because they simplify the original datasets and are suitable for further analysis or data modelling.
- The reasoning behind the choice of technologies
Airflow is suitable for orchestrating an ETL workflow, especially one for a data lake. Spark is suitable for handling big data, and these datasets all have millions of rows. AWS services integrate well with Amazon Data Exchange, where the original data resides. EMR is a powerful, managed tool for running Spark jobs.
- How often should the data be updated?
All of the previous years' data can be retrieved right now and will not be updated further. Once 2021 data comes in, it is conventionally updated once a week.
- What if the data were increased by 100 times?
If the data were increased by 100 times, the output could no longer be stored in a single partition and should be split across several partitions, as sketched below. Spark should still be able to handle the data, but you might want to add more nodes to speed up the processing if it becomes too slow.
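For instance, the Spark write could repartition the output (using the header_table DataFrame from the earlier sketch) instead of producing a single file per table; the partition column and paths here are assumptions.

```python
# Split the output across many files so a 100x larger dataset can be written and
# read in parallel; "arrival_year" is a hypothetical partition column.
header_table.repartition(200) \
    .write.mode("overwrite") \
    .partitionBy("arrival_year") \
    .csv("s3://us-import/output/header/", header=True)
```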
- What if the pipeline were run daily at 7 a.m.?
Airflow could be set up to do that, but the data won't get updated every day, so a daily run is unnecessary.
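If a daily 7 a.m. run were ever wanted, the DAG's schedule could simply be switched to a cron expression, for example:

```python
from datetime import datetime
from airflow import DAG

# "0 7 * * *" runs the DAG every day at 07:00.
with DAG(
    dag_id="us_import_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 7 * * *",
    catchup=False,
) as dag:
    ...  # same tasks as in the sketch above
```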
- What if the datasets needed to be accessed by 100+ people?
If the datasets needed to be accessed by 100+ people, an S3 bucket should be able to handle that. If the output data were to be copied to tables in a cloud data warehouse, it would depend on the warehouse's specifications; both Redshift and BigQuery should be able to handle that, but cost could be a concern.
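If the outputs were copied into Redshift, for example, the usual route is a COPY from S3. A rough sketch is below; the connection details, target table, and IAM role are all placeholders, not part of this project.

```python
import psycopg2

# All connection details, the target table, and the IAM role ARN are placeholders.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="imports",
    user="analyst",
    password="...",
)
with conn, conn.cursor() as cur:
    # Load the aggregated header CSVs for one year into a Redshift table.
    cur.execute("""
        COPY header
        FROM 's3://us-import/output/2020/header/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV IGNOREHEADER 1;
    """)
```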