Develop a project that takes a dataset and builds either an ETL or ELT pipeline to produce a practical output from it. Examples would be an email triggered when a condition is met, or a dashboard that visualizes the data.
The primary goal of the data pipeline is to intake monthly data obtained from the weather data file, the airport code file, and the airlines database. Once the data is cleaned, processed, and merged, it is loaded into a database schema that is readily available for data scientists to work with. The data scientists can then use this schema to create time series forecasts that predict flight cancellations for the same month in the following year. The data for this pipeline covers only the month of January and serves as a snapshot of how data will be loaded as more monthly data becomes available.
For our output and visualization, we plan to load the schema and data into a data science database called domestic_flight_weather_database so the data science team can develop time series forecasting models or other predictive models to understand how date, weather type, weather severity, airline carrier, and airport location affect flight cancellations. Finally, an interactive heatmap of flight cancellations and top weather conditions per airport is provided as the pipeline's output.
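As a rough illustration (not part of the current pipeline code), the heatmap could be built from the joined flight/weather output and the USA shapefile with geopandas and folium; the file paths and column names below are assumptions:

```python
# Rough sketch of the interactive cancellation heatmap; the file paths and
# column names (cancelled, Origin, latitude, longitude) are assumptions about
# the joined table exported by the pipeline.
import pandas as pd
import geopandas as gpd
import folium
from folium.plugins import HeatMap

usa = gpd.read_file("shapefiles/continental_usa.shp")   # hypothetical shapefile path
flights = pd.read_csv("output/flight_weather.csv")      # hypothetical export of the joined table

# Count cancellations per origin airport.
cancelled = (
    flights[flights["cancelled"] == 1]
    .groupby(["Origin", "latitude", "longitude"])
    .size()
    .reset_index(name="num_cancellations")
)

m = folium.Map(location=[39.8, -98.6], zoom_start=4)     # centered on the continental USA
folium.GeoJson(usa).add_to(m)                            # state outlines for context
HeatMap(cancelled[["latitude", "longitude", "num_cancellations"]].values.tolist()).add_to(m)
m.save("cancellation_heatmap.html")
```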
- Gain access to the server containing the airline database
- Import that database to a local server
- Download the weather CSV file from the source website
- Confirm the directory of the CSV file used in the ETL pipeline
- Run the script with the proper credentials indicated in the first block of the code (see the sketch after this list)
- Manually monitor the outputs
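As a minimal sketch of what that first block might look like (assuming a MySQL server accessed through SQLAlchemy; all credentials, hostnames, and paths below are placeholders to replace):

```python
# Hypothetical configuration block at the top of the ETL script.
# All credentials, hostnames, and file paths are placeholders to replace;
# the connection string assumes a MySQL server reachable via SQLAlchemy.
from sqlalchemy import create_engine

DB_USER = "etl_user"        # placeholder credential
DB_PASSWORD = "change_me"   # placeholder credential
DB_HOST = "localhost"
SOURCE_DB = "airline"                            # imported Airline.sql database
TARGET_DB = "domestic_flight_weather_database"   # database the pipeline loads into
WEATHER_CSV = "data/WeatherEvents_Jan2016-Dec2021.csv"  # confirm this path before running

source_engine = create_engine(
    f"mysql+pymysql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/{SOURCE_DB}"
)
target_engine = create_engine(
    f"mysql+pymysql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/{TARGET_DB}"
)
```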
- Airline.sql database
- WeatherEvents_Jan2016-Dec2021.csv
- Shapefile of the continental USA for the output visualization
Extraction (see Airline_diagram.png):
- Extract the airline, airport, and cancellation code "look-up" style tables from the airlines database
- Extract the on_time_performance_2016 table, which contains flight information for domestic flights for January 2016
- Load the weather CSV file as a DataFrame in Python (a sketch of the extraction step follows this list)
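```python
# Sketch of the extraction step, reusing source_engine and WEATHER_CSV from
# the configuration sketch above. The look-up table names are assumptions
# based on the airlines database diagram and may differ from the real schema.
import pandas as pd

airlines = pd.read_sql("SELECT * FROM airlines", source_engine)
airports = pd.read_sql("SELECT * FROM airports", source_engine)
cancellation_codes = pd.read_sql("SELECT * FROM cancellation_codes", source_engine)

# January 2016 domestic flight records (table name taken from the project description).
flights = pd.read_sql("SELECT * FROM on_time_performance_2016", source_engine)

# Weather events CSV loaded as a DataFrame.
weather = pd.read_csv(WEATHER_CSV)
```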
Transform:
- Transform the weather DataFrame so its datetimes match the airline data; this is required for the final matches in the SQL JOIN of weather and flight data.
- Match airport codes between the weather and flight data by removing the leading "K" from the airport codes in the weather data.
- Generate a new feature called weather_code that points to a particular weather type and severity combination.
- Add this code to the matching weather type/severity rows in the weather data (a sketch of these transformations follows this list).
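```python
# Sketch of the transform step on the weather DataFrame from the extraction
# sketch above. Column names such as StartTime(UTC), EndTime(UTC),
# AirportCode, Type, and Severity are assumptions about the CSV layout.
import pandas as pd

# 1) Align weather timestamps with the airline data's date granularity.
weather["StartTime(UTC)"] = pd.to_datetime(weather["StartTime(UTC)"])
weather["EndTime(UTC)"] = pd.to_datetime(weather["EndTime(UTC)"])
weather["start_date"] = weather["StartTime(UTC)"].dt.date
weather["end_date"] = weather["EndTime(UTC)"].dt.date

# 2) Drop the leading "K" so weather airport codes match the flight data's
#    Origin codes (e.g. KORD -> ORD).
weather["AirportCode"] = weather["AirportCode"].str.replace(r"^K", "", regex=True)

# 3) Build weather_code: one integer per (Type, Severity) combination.
weather_codes = (
    weather[["Type", "Severity"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={"index": "weather_code"})
)

# 4) Attach the code to the matching weather type/severity rows.
weather = weather.merge(weather_codes, on=["Type", "Severity"], how="left")
```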
Load (see Final_project_loading_schema_diagram.png):
- Load the airport, cancellation code, and airline tables, with more meaningful column names, into the new database called domestic_flight_weather_database.
- JOIN the flight data and weather data on location (matching airport codes from the weather data to Origin airport codes from the flight data), keeping rows where the flight date lies between the start and end dates of the weather reported for that region. Only the most necessary fields are kept: FlightDate, latitude, longitude, Origin (airport code), weather_code, weather type, weather severity, airline code (referenced to a name in the airline table loaded in step 1), cancelled (indicates whether a flight was cancelled), and cancellation_code (which matches a reason in the cancellation_codes table loaded in step 1). A sketch of this join follows.
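```python
# Sketch of the load step, reusing target_engine and the DataFrames from the
# sketches above. Flight column names such as Reporting_Airline and
# CancellationCode are assumptions about the on_time_performance_2016 schema.
from sqlalchemy import text

# 1) Load the look-up tables and the cleaned flight/weather data into the new database.
airlines.to_sql("airlines", target_engine, if_exists="replace", index=False)
airports.to_sql("airports", target_engine, if_exists="replace", index=False)
cancellation_codes.to_sql("cancellation_codes", target_engine, if_exists="replace", index=False)
flights.to_sql("flights", target_engine, if_exists="replace", index=False)
weather.to_sql("weather_events", target_engine, if_exists="replace", index=False)

# 2) JOIN flights to weather on airport code, keeping rows where the flight
#    date lies between the weather event's start and end dates.
join_sql = """
CREATE TABLE flight_weather AS
SELECT f.FlightDate,
       w.LocationLat        AS latitude,
       w.LocationLng        AS longitude,
       f.Origin,
       w.weather_code,
       w.Type               AS weather_type,
       w.Severity           AS weather_severity,
       f.Reporting_Airline  AS airline_code,
       f.Cancelled          AS cancelled,
       f.CancellationCode   AS cancellation_code
FROM flights AS f
JOIN weather_events AS w
  ON f.Origin = w.AirportCode
 AND f.FlightDate BETWEEN w.start_date AND w.end_date
"""
with target_engine.begin() as conn:
    conn.execute(text(join_sql))
```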
- The ETL pipeline is currently deployed and monitored manually.
- More monthly data, covering the full year and beyond, is needed to collect better training data for future forecasts.
- Queries may have room for further optimization to improve performance.
- Ideally, the code would take the CSV file directory for the read_csv function and the necessary login credentials as inputs, rather than having them hard-coded (a sketch of this follows the list).
- Future steps involve automation, perhaps by implementing Airflow to orchestrate and monitor the ETL process.
- The output dashboard could include more tiles containing statistical results to assist Business Intelligence teams with quick, in-the-moment decision-making. This would involve more communication with the BI team.
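As a hedged illustration of the parameterization point above, the hard-coded CSV path and credentials could be replaced with command-line arguments; the argument names here are illustrative only:

```python
# Hypothetical command-line interface replacing the hard-coded CSV path and
# credentials; the argument names are illustrative only.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Domestic flight/weather ETL pipeline")
    parser.add_argument("--weather-csv", required=True, help="Path to the WeatherEvents CSV file")
    parser.add_argument("--db-host", default="localhost")
    parser.add_argument("--db-user", required=True)
    parser.add_argument("--db-password", required=True)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # args.weather_csv, args.db_user, etc. would then be passed to the existing
    # read_csv and database connection calls instead of hard-coded values.
```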