Developing an ETL Pipeline for loading, modeling, and visualizing Disaster Data from Figure Eight

Table of Contents

  1. Installation
  2. Project Motivation
  3. File Descriptions
  4. Results
  5. Licensing, Authors, and Acknowledgements

Installation

This project uses Python 3, along with Jupyter Notebook. The following libraries are necessary for running the notebook:

  • pandas
  • NumPy
  • Matplotlib
  • Plotly
  • scikit-learn
  • SQLAlchemy
  • NLTK
  • wordcloud

Packages used by this project can also be installed into a Conda environment using the provided Requirements.txt file.
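For example, assuming Requirements.txt contains Conda-compatible package specs (the environment name below is only an illustration):

    conda create --name disaster-response --file Requirements.txt
    conda activate disaster-response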

To run this project, three steps are required.

  1. Run the following commands in the project's root directory to set up your database and model (a sketch of the ETL step follows this list).

    • To run the ETL pipeline, which cleans the data and stores it in the database:

      python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db

    • To run the ML pipeline, which trains the classifier and saves it:

      python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl

  2. Run the following command in the app's directory to start the web app:

      python run.py

  3. Go to http://0.0.0.0:3001/
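As a rough illustration of what the ETL step does, here is a minimal sketch of the cleaning logic in data/process_data.py. It assumes the raw categories column uses the "label-0;label-1;..." string format from Figure Eight; the table name 'Messages' is also an assumption, not necessarily the one the script uses:

    import pandas as pd
    from sqlalchemy import create_engine

    # Load the raw CSVs and merge them on their shared id column.
    messages = pd.read_csv('data/disaster_messages.csv')
    categories = pd.read_csv('data/disaster_categories.csv')
    df = messages.merge(categories, on='id')

    # Expand the single 'categories' string ("related-1;request-0;...")
    # into one binary column per label.
    cats = df['categories'].str.split(';', expand=True)
    cats.columns = [value.split('-')[0] for value in cats.iloc[0]]
    cats = cats.apply(lambda col: col.str[-1].astype(int))

    # Recombine, drop duplicate rows, and persist to SQLite.
    df = pd.concat([df.drop(columns='categories'), cats], axis=1).drop_duplicates()
    engine = create_engine('sqlite:///data/DisasterResponse.db')
    df.to_sql('Messages', engine, index=False, if_exists='replace')

The real script wraps this logic behind the command-line arguments shown above, taking the two CSV paths and the database path as inputs.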

Project Motivation

For this project, I was interested in exploring Disaster Relief data from Figure Eight by building an end-to-end ETL pipeline to do the following:

  1. Pre-processing the dataset by organizing the labels into one-hot encodings and saving the result to a database file
  2. Tokenizing and cleaning the natural-language data using NLP techniques
  3. Building, evaluating, and tuning a machine learning model to predict the categories a disaster message corresponds to (a sketch of steps 2 and 3 follows this list)
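As a minimal sketch of what steps 2 and 3 look like, assuming an NLTK tokenizer feeding a scikit-learn TF-IDF plus multi-output classifier pipeline (the exact estimators and tuning parameters in models/train_classifier.py may differ):

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.pipeline import Pipeline

    # NLTK corpora needed by the tokenizer.
    nltk.download('punkt', quiet=True)
    nltk.download('wordnet', quiet=True)
    nltk.download('stopwords', quiet=True)

    def tokenize(text):
        """Lower-case, strip punctuation, drop stop words, and lemmatize."""
        text = re.sub(r'[^a-z0-9]', ' ', text.lower())
        lemmatizer = WordNetLemmatizer()
        stops = set(stopwords.words('english'))
        return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text) if tok not in stops]

    # MultiOutputClassifier fits one classifier per category column.
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
        ('clf', MultiOutputClassifier(RandomForestClassifier())),
    ])
    # pipeline.fit(X_train, Y_train) would then train on the message text and
    # the one-hot category columns loaded from DisasterResponse.db.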

File Descriptions

The main code for this project is included in three files: 'data/process_data.py', 'models/train_classifier.py', and 'app/run.py'. Code for processing and modeling is also in the notebooks 'ETL Pipeline Preparation.ipynb' and 'ML Pipeline Preparation.ipynb', which walk through the steps involved in preparing the data for modeling and obtaining the final results.

  • Starting data is included in the data folder as 'disaster_categories.csv' and 'disaster_messages.csv'.
  • Processed data is stored as a database in the data folder as DisasterResponse.db.
  • The trained model is stored in the models folder as classifier.pkl (see the loading sketch below).
  • The word-cloud visual is drawn by wordcloud-plotly.py; the code is adapted from a reference on GitHub.
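To use the saved model outside the web app, something like the following should work; this is a sketch that assumes train_classifier.py pickles the full scikit-learn pipeline, tokenizer included:

    import pickle

    # Unpickling a pipeline with a custom tokenize function requires that
    # function to be importable, e.g. from models/train_classifier.py.
    with open('models/classifier.pkl', 'rb') as f:
        model = pickle.load(f)

    # The pipeline vectorizes raw text itself, so predict on plain strings.
    predictions = model.predict(['We need water and medical supplies after the storm'])
    print(predictions)  # one binary flag per category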

Results

The average F1 score across all categories was around 0.93; a chart showing the score for each category is included, along with other data visualizations of the results, in the charts/ directory (a sketch of the per-category evaluation follows below).

By running the Flask app, you can submit your own messages and see predictions from the trained model.
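For reference, per-category F1 scores for a multi-output classifier can be computed column by column with scikit-learn. The snippet below is a toy illustration with made-up labels, not the project's actual evaluation code:

    import numpy as np
    from sklearn.metrics import f1_score

    # Toy data: 4 messages, 2 categories ('related', 'request'); illustrative only.
    y_true = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
    y_pred = np.array([[1, 0], [1, 0], [0, 0], [1, 1]])

    for i, name in enumerate(['related', 'request']):
        print(name, f1_score(y_true[:, i], y_pred[:, i]))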

Licensing, Authors, and Acknowledgements

Credit to Figure Eight for providing the data. You can find the licensing for the data and other descriptive information at their website. This code is free to use.
