A machine learning web application that classifies Twitter messages sent during disasters so that response organizations can direct aid efficiently. The system runs an Extract, Transform, Load (ETL) pipeline over the messages and assigns each one to the relevant emergency response categories.
- Python 3.6+
- pip package manager
- Create and activate a virtual environment

      # Create the virtual environment
      python3 -m venv myenv

      # Activate it on Unix/macOS
      source myenv/bin/activate

      # Activate it on Windows
      myenv\Scripts\activate
- Install dependencies

      pip install -r requirements.txt
- Set up the database and train the model

      # Process data and create the database
      python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db

      # Train and save the classifier
      python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
- Launch the web application

      python app/run.py
- Access the application
  - Open your browser and navigate to http://127.0.0.1:3001/ or http://0.0.0.0:3001/
The ETL pipeline (process_data.py) handles:
- Loading data from CSV files
- Merging messages and categories datasets
- Cleaning and transforming data
- Storing processed data in SQLite database
Key functions:
- load_data(): Data extraction from CSV
- save_data(): Database storage operations
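The load, merge, clean and store steps above can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the project's actual `process_data.py`. The shared `id` column, the `clean_rows` helper, the `messages` table name and the sample rows are all assumptions made for the example.

```python
import csv
import io
import sqlite3


def load_data(messages_file, categories_file):
    """Read both CSV sources and merge them on the (assumed) shared `id` column."""
    messages = {row["id"]: row for row in csv.DictReader(messages_file)}
    merged = []
    for row in csv.DictReader(categories_file):
        if row["id"] in messages:  # inner join: keep ids present in both files
            merged.append({**messages[row["id"]], **row})
    return merged


def clean_rows(rows):
    """Hypothetical cleaning step: trim whitespace and drop duplicate ids."""
    seen, cleaned = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        cleaned.append({key: value.strip() for key, value in row.items()})
    return cleaned


def save_data(rows, connection, table="messages"):
    """Store the cleaned rows in a SQLite table (table name is an assumption)."""
    connection.execute(f"DROP TABLE IF EXISTS {table}")
    connection.execute(
        f"CREATE TABLE {table} (id TEXT PRIMARY KEY, message TEXT, categories TEXT)"
    )
    connection.executemany(
        f"INSERT INTO {table} VALUES (:id, :message, :categories)", rows
    )
    connection.commit()


# In-memory stand-ins for the two CSV files and the database.
messages_csv = io.StringIO("id,message\n1, We need water \n2,Road is blocked\n")
categories_csv = io.StringIO("id,categories\n1,water\n2,transport\n2,transport\n")

connection = sqlite3.connect(":memory:")
rows = clean_rows(load_data(messages_csv, categories_csv))
save_data(rows, connection)
stored = connection.execute(
    "SELECT id, message, categories FROM messages ORDER BY id"
).fetchall()
```

Keeping extraction, cleaning and storage in separate functions makes each step easy to test on small in-memory inputs like the ones above.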
The ML pipeline (train_classifier.py) includes:
- Data loading from SQLite database
- Text processing and feature engineering
- Model training and evaluation
- Model persistence (pickle format)
Key components:
- Custom tokenizer with NLTK
- StartingVerbExtractor feature
- Multi-output classification pipeline
- GridSearchCV for hyperparameter tuning
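A multi-output pipeline tuned with GridSearchCV might look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `train_classifier.py`: a regex tokenizer stands in for the NLTK tokenizer, the StartingVerbExtractor feature is left out, and the choice of `RandomForestClassifier`, the parameter grid, the toy messages and the two label columns are all made up for the example.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def tokenize(text):
    """Stand-in tokenizer: lowercase alphanumeric tokens (the project uses NLTK)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Toy data: two label columns, assumed here to mean [water, medical_help].
messages = [
    "we need clean water", "send a doctor please",
    "no water in the village", "the road is flooded",
    "children are thirsty, bring water", "injured people need a doctor",
    "water supply is cut", "bridge collapsed near town",
]
labels = np.array([
    [1, 0], [0, 1], [1, 0], [0, 0],
    [1, 0], [0, 1], [1, 0], [0, 0],
])

# Text features feed one classifier per output column.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=tokenize, token_pattern=None)),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Nested parameter names reach through the pipeline and the multi-output wrapper.
param_grid = {"clf__estimator__n_estimators": [10, 20]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(messages, labels)

# One row per message, one 0/1 column per category.
predictions = search.predict(["please bring water", "we need a doctor"])
```

`MultiOutputClassifier` fits one copy of the base estimator per category, which is why the grid key is prefixed with `clf__estimator__`.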
Flask-based web interface providing:
- Interactive message classification
- Data visualizations
- Real-time prediction results
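As a rough sketch of how a classification endpoint in such an interface could be wired up, consider the example below. The `/classify` route, the `query` parameter, the `classify_message` helper and the keyword rules are hypothetical stand-ins; the real app would call the trained model loaded from the pickle file instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical keyword rules standing in for the trained classifier.
KEYWORDS = {"water": ["water", "thirsty"], "medical_help": ["doctor", "injured"]}


def classify_message(text):
    """Return a 0/1 flag per category; a real app would call the loaded model."""
    lowered = text.lower()
    return {
        category: int(any(word in lowered for word in words))
        for category, words in KEYWORDS.items()
    }


@app.route("/classify")
def classify():
    """Classify the message passed in the (assumed) `query` URL parameter."""
    query = request.args.get("query", "")
    return jsonify({"query": query, "categories": classify_message(query)})


# Exercise the route without starting a server.
with app.test_client() as client:
    response = client.get("/classify", query_string={"query": "We need water"})
    result = response.get_json()
```

To serve this sketch locally, calling `app.run(host="0.0.0.0", port=3001)` would match the port used in the access step.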
The dataset exhibits significant class imbalance, particularly in categories such as 'water' and 'child alone', whose labels are nearly or entirely zero. This presents several challenges:
- Training Impact: Underrepresented classes may have lower prediction accuracy
- Metric Selection: F1-score provides a balanced measure for imbalanced classes
- Strategy: Model evaluation emphasizes:
  - High recall for critical categories (e.g., medical help)
  - High precision for resource allocation categories
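To illustrate why F1 is more informative than accuracy for a rare category, the sketch below computes both on made-up labels. The label lists and the `precision_recall_f1` helper are assumptions for the example, not project code or project results.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for a rare category: 2 positives out of 20 messages.
y_true = [1, 1] + [0] * 18
always_negative = [0] * 20   # a model that never predicts the category
one_hit = [1, 0] + [0] * 18  # a model that finds one of the two positives

# Accuracy rewards the useless model; F1 does not.
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
baseline_scores = precision_recall_f1(y_true, always_negative)
one_hit_scores = precision_recall_f1(y_true, one_hit)
```

Here the model that never predicts the category reaches 90% accuracy but an F1 of 0, while the model that finds one of the two positives has precision 1.0, recall 0.5 and an F1 of about 0.67.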
Planned improvements:
- Additional web app visualizations
- Organization recommendation system
- UI/UX improvements
- Cloud deployment
- Pipeline optimization
- Enhanced handling of class imbalance, e.g., using class weights in the ML training pipeline
- Integration with disaster response organizations
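For the class-weight idea in the list above, one common option is to weight each class inversely to its frequency, which is the heuristic scikit-learn applies for `class_weight="balanced"`. The sketch below computes those weights by hand on made-up labels; the `balanced_class_weights` helper and the data are assumptions for illustration only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels for one rare category: 2 positives among 20 messages.
labels = [1, 1] + [0] * 18
weights = balanced_class_weights(labels)
```

With these labels the positive class gets weight 5.0 and the negative class about 0.56. Weighting cannot help a category that has no positive examples at all, since there is nothing to up-weight.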
Run the test suite (in development) from the repository root:
python -m tests.test_data_processing
python -m tests.test_train_classifier
The workspace/ directory contains Jupyter notebooks used for:
- Experimental feature development
- Pipeline prototyping
- Model evaluation
- Visualization testing
This project is actively maintained and welcomes contributions.