Performing basic data cleaning, EDA and prediction of the number of collisions in Montreal on a given day, knowing the lighting conditions, day of the week, month and weather conditions.
The purpose of this project is to apply the techniques we acquired for creating Docker images, working with volumes, containers and stacks, creating clusters, and performing deployment inside a cluster both locally and on GCP. We also use HDFS, MongoDB and Parquet files in this project. As we work in Spark, we use PySpark and work with RDDs.
Contributors: @Arwasheraky, @Nastazya and @Clercy

Link to Presentation
Code:

Scripts:

- data_cleaning.ipynb - Data cleaning.
- data_EDA.ipynb - Exploratory data analysis.
- csv-to-parquet.py - Making use of Parquet and Hadoop.
- Write_features_MongoDB.ipynb - Making use of MongoDB.
- Prediction_Subset.ipynb - Restructuring the dataset for further prediction.
- Prediction_and_model_export.ipynb - Prediction and model export.
- pred.py - Script that runs the prediction with specific features.
- model.joblib - Exported prediction model used in the script.
- model.py - Script that builds the prediction model; we run it and save the results to HDFS (optional, used for training).
docker-compose:

- jupyter-compose.yml - Running Jupyter, using the external volume jupyter-data.
- jupyter_bind-compose.yml - Running Jupyter, using volume mounting and an external network.
- jupyter_local-compose.yml - Running Jupyter, using volume mounting.
- spark-compose.yml - Running Spark based on the jupyter/pyspark-notebook image, no volumes, using an external network.
- spark_bind-compose.yml - Running Spark based on the jupyter/pyspark-notebook image, using volume mounting and an external network.
- spark_bind_hdfs-compose.yml - Running Spark based on the mjhea0/spark:2.4.1 image, using volume mounting and an external network.
- spark_hdfs-compose.yml - Running Spark based on the mjhea0/spark:2.4.1 image, no volumes, using an external network.
- spark_mongo-compose.yml - Running Spark and MongoDB.
- Dockerfile - Used to create a custom image that copies the script and model, then runs the script.
Data:

- accidents.csv - Municipalities dataset, merged with the original.
- accidents_new.csv - Clean data, ready for EDA.
- pred-1.csv - Restructured temporary dataset, not ready for prediction.
- final_part_1.csv - Restructured dataset, ready for prediction.
- Clone this folder locally.
- Make sure you have the latest version of Docker installed.
- Deploy any of the code files locally, in a Spark cluster, or on GCP.

Locally:

    env token=cebd1261 docker-compose -f jupyter_local-compose.yml up

Spark cluster:

    docker network create spark-network
    docker-compose -f spark_bind-compose.yml up --scale spark-worker=2
    env token=cebd1261 docker-compose -f jupyter_bind-compose.yml up

GCP:

- Create a bucket with the files you need.
- Create a Dataproc cluster that includes Jupyter and Anaconda.
- Run Jupyter from the cluster.
To run the prediction script pred.py, which takes a number of features as input and calculates the predicted number of collisions, follow these steps:

- Run the environment using these commands from the code folder:

    docker network create spark-network
    docker-compose --file spark-compose.yml up

- Run the script on the cluster using this command:

    docker run -ti --network=spark-network -v /${PWD}/script://app jupyter/pyspark-notebook:latest //usr/local/spark/bin/spark-submit --master spark://master:7077 //app/pred.py <input features>

  where <input features> is a sequence of 4 parameters, divided by commas. Example: day,DI,2,neige

  Possible parameters:
  - lighting conditions (day or night)
  - day of the week (DI, LU, MA, ME, JU, VE, SA)
  - month (1 to 12)
  - weather conditions (normal, pluie, neige, verglas)

- You can also do the same thing using a custom image:

    docker build -t <image_name> .

  or use the existing image nastazya/image_for_pred:latest

- Run the image using this format:

    docker run -ti --network=spark-network <image name> <input features>

  or

    docker run -ti --network=spark-network <image name>

  If no features are given, this input is used by default: day,DI,2,neige
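For reference, here is a minimal, hypothetical sketch of what a script like pred.py might do: parse the comma-separated input, build a one-hot feature row and predict with the exported model.joblib. The column layout and default handling below are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of a pred.py-style entry point: parse "day,DI,2,neige",
# build a one-hot feature row matching the training columns, and predict.
import sys

import joblib
import pandas as pd

# Illustrative column layout -- the real model.joblib defines its own feature order.
FEATURE_COLUMNS = (
    ["day", "night"]                                   # lighting conditions
    + ["DI", "LU", "MA", "ME", "JU", "VE", "SA"]       # day of the week
    + [str(m) for m in range(1, 13)]                   # month
    + ["normal", "pluie", "neige", "verglas"]          # weather conditions
)


def build_row(light, weekday, month, weather):
    """Turn the four input values into a single 0/1 feature row."""
    row = {col: 0 for col in FEATURE_COLUMNS}
    for key in (light, weekday, str(month), weather):
        if key not in row:
            raise ValueError(f"Unknown feature value: {key}")
        row[key] = 1
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


if __name__ == "__main__":
    # Default input when no features are passed, as described above.
    raw = sys.argv[1] if len(sys.argv) > 1 else "day,DI,2,neige"
    light, weekday, month, weather = [p.strip().strip("'") for p in raw.split(",")]

    model = joblib.load("model.joblib")   # exported Random Forest model
    features = build_row(light, weekday, month, weather)
    print(f"Predicted number of collisions: {model.predict(features)[0]:.1f}")
```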
In this section, we will briefly describe some of the techniques we used for packaging and deploying the project.
Below is an approximate diagram of all the processes used in this project (for clarity, we presented it as a team of two co-workers).
This schema was used the most in this project:
- Run this command from the code folder:

    docker-compose -f spark_bind-compose.yml up --scale spark-worker=2

- Run Jupyter from another terminal using this command:

    env token=cebd1261 docker-compose -f jupyter_bind-compose.yml up

- Proceed with coding in one of the .ipynb files.
In the middle of the project, we needed to access data_cleaning.ipynb and run some additional code without changing the existing file and its output. For this purpose, we ran jupyter-compose.yml, which uses the volume jupyter-data. When we opened the Jupyter notebook there were already some files, but not the ones we needed:

So we created a temporary folder with two files and copied them into the jupyter-data volume through a busybox image:

Now we have a script folder with the files we need inside our volume, and whatever changes we make won't affect the original files. Later we will remove the jupyter-data volume.
To execute a Jupyter-Spark-MongoDB session:

- Run spark_mongo-compose.yml in the following format:

    docker-compose -f spark_mongo-compose.yml up --scale worker=2

- In another terminal, run jupyter-compose.yml in the following format:

    env TOKEN=cebd1261 docker-compose --file jupyter-compose.yml up

The session should now be available. We used Write_features_MongoDB.ipynb.

- Once the session is available, you can upload a test .csv file by using the Upload function.
- The database can be viewed at http://localhost:8181
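As an illustration of the MongoDB step, here is a hedged sketch of writing a features DataFrame from Spark to MongoDB. The "mongo" host name, the collisions.features namespace and the connector version are assumptions, not the notebook's actual code.

```python
# Hedged sketch of writing a features DataFrame from Spark to MongoDB, roughly
# what Write_features_MongoDB.ipynb does. Host, namespace and connector
# version are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("write-features-to-mongo")
    # The mongo-spark connector must be on the classpath; the version is an assumption.
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
    .config("spark.mongodb.output.uri", "mongodb://mongo:27017/collisions.features")
    .getOrCreate()
)

features = spark.read.csv("final_part_1.csv", header=True, inferSchema=True)

# With connector 2.x the source is "com.mongodb.spark.sql.DefaultSource";
# newer versions also accept the short name "mongo".
(features.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .save())
```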
- Run spark_hdfs-compose.yml with 2 workers.
- From another terminal, run the PySpark script csv-to-parquet.py, mounting the current directory:

    docker run -t --rm \
      -v "$(pwd)":/script \
      --network=spark-network \
      mjhea0/spark:2.4.1 \
      bin/spark-submit \
      --master spark://master:7077 \
      --class endpoint \
      /script/csv-to-parquet.py

The script reads the CSV file, writes it to HDFS in Parquet format, reads the file back from HDFS and tries some SQL queries:
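An approximate outline of what such a script does is sketched below; the namenode host/port, file paths and column names are assumptions based on the compose setup, and the authoritative version is csv-to-parquet.py itself.

```python
# Approximate outline of csv-to-parquet.py: read the CSV, write it to HDFS as
# Parquet, read it back and run a sample SQL query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# The CSV is assumed to sit in the /script directory mounted by the docker run command.
accidents = spark.read.csv("/script/accidents_new.csv", header=True, inferSchema=True)

# Write to HDFS in Parquet format (namenode host/port are assumptions).
parquet_path = "hdfs://namenode:9000/data/accidents.parquet"
accidents.write.mode("overwrite").parquet(parquet_path)

# Read the Parquet data back and try a SQL query.
from_hdfs = spark.read.parquet(parquet_path)
from_hdfs.createOrReplaceTempView("accidents")
spark.sql(
    "SELECT MUNICIPALITY, COUNT(*) AS collisions "
    "FROM accidents GROUP BY MUNICIPALITY ORDER BY collisions DESC"
).show(10)

spark.stop()
```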
Now that we have our data in Parquet format uploaded to HDFS, we can browse it at http://localhost:50070/explorer.html
Running data cleaning and EDA on a local cluster was fairly easy. Once we reached the stage where we needed to restructure the data for prediction, it became impossible due to insufficient resources, so we had to switch to GCP.

- Create a bucket.
- Create a Dataproc cluster and attach our bucket, plus the Anaconda and Jupyter components, to it.
- Go to the cluster and open the Jupyter window.
- Upload your files.
- Run the file without creating a Spark session (in Dataproc it is created automatically); see the note below.
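To illustrate the last point: in a Dataproc Jupyter notebook with the PySpark kernel, the spark session already exists, so the notebook can go straight to reading data. The bucket name below is a placeholder.

```python
# In a Dataproc Jupyter notebook with the PySpark kernel, `spark` is already
# created, so there is no SparkSession.builder block. The bucket name is a placeholder.
accidents = spark.read.csv("gs://<your-bucket>/accidents_new.csv",
                           header=True, inferSchema=True)
accidents.printSchema()
```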
Our script takes a number of features as input and outputs the predicted number of collisions based on the information provided.

- Creating the Spark environment:

    docker network create spark-network
    docker-compose --file spark-compose.yml up

- In another terminal, run this command:

    docker run -ti --network=spark-network -v /${PWD}/script://app jupyter/pyspark-notebook:latest //usr/local/spark/bin/spark-submit --master spark://master:7077 //app/pred.py 'day','DI',2,'neige'

Alternatively, using the custom image:

- Creating the Spark environment:

    docker network create spark-network
    docker-compose --file spark-compose.yml up

- Building the image:

    docker build -t <image_name> .

  or use the existing image nastazya/image_for_pred:latest from Docker Hub.

- Run the image using this format:

    docker run -ti --network=spark-network <image name> <input features>

  or without features (they will be taken from the image by default: day,DI,2,neige):

    docker run -ti --network=spark-network <image name>
The following steps were done:
- Reading collisions dataset and checking its structure.
- Choosing some columns and renaming them.
- Adjusting the types.
- Reading municipalities dataset and merging it with collisions.
- Exploring collisions in each municipality and doing some other basic explorations to better understand the data.
- Dealing with nulls: rows with null streets were removed; rows with empty codes were removed too (we could have replaced them with 99, but we remove all rows with unknown categories anyway).
- Writing the "clean" data to another file.

Please check this file, which contains the code, results and comments.
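A rough PySpark sketch of these cleaning steps is shown below; the column names, codes and join key are illustrative assumptions, and the authoritative version is in data_cleaning.ipynb.

```python
# Rough PySpark sketch of the cleaning steps above; column names, codes and the
# join key are illustrative assumptions -- see data_cleaning.ipynb for the real code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-cleaning").getOrCreate()

collisions = spark.read.csv("accidents.csv", header=True, inferSchema=True)

# Keep and rename a subset of columns, adjusting the types along the way.
clean = collisions.select(
    F.col("DT_ACCDN").cast("date").alias("DATE"),
    F.col("CD_ECLRM").cast("int").alias("LIGHT"),
    F.col("CD_COND_METEO").cast("int").alias("METEO"),
    F.col("RUE_ACCDN").alias("STREET"),
    F.col("GRAVITE").alias("SEVERITY"),
    F.col("CD_MUNCP").cast("int").alias("MUNICIPALITY_CODE"),
)

# Merge with the municipalities dataset (join key assumed).
municipalities = spark.read.csv("municipalities.csv", header=True, inferSchema=True)
clean = clean.join(municipalities, on="MUNICIPALITY_CODE", how="left")

# Drop rows with missing streets or empty codes, then write the "clean" data.
clean = clean.dropna(subset=["STREET", "LIGHT", "METEO"])
clean.write.mode("overwrite").csv("accidents_new.csv", header=True)
```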
The Exploratory Data Analysis summarizes a number of values from our dataset to help isolate the values required for our prediction model. The exploratory analysis was conducted after refining the original data to its present form. To move towards a prediction model, the data was filtered and aggregated in several forms, allowing for various views and insights.

The analysis includes:
- A data schema is provided outlining the fields and number of records that were available for our investigation.
- Aggregates have been made available outlining the number of accidents, victims by method of transportation, severity, municipality, street, days of the week and weather conditions.
- Charts to visualize and facilitate the understanding of the data for decision making.
Based on our initial analysis, what does the data tell us?

- The majority of accidents took place in the City of Montreal area. Our first thought is that this can be attributed to population density, and to Montreal being the business hub of the region, which raises the probability of accidents.
- We can also see that weather, surface, time of day and day of the week, among other factors, contribute to accidents as well. Friday has the largest number of accidents among the weekdays. In addition, most accidents happen in clear weather!
- Fortunately, material damage is the most common outcome. In contrast, fatal accidents are the least common.
We want to predict the number of collisions in Montreal based on lighting conditions, day of the week, month and weather conditions.
The following steps were done:
- Restructuring the dataset in order to be able to make a prediction:

The original dataset has a collision-oriented structure (one row = one collision):

We have to find a way to restructure it to be day-oriented (one day = one row), with the number of collisions per day. On top of that, we want to split each day into day and night in terms of lighting.

After some manipulations we have an intermediate result where the data is day-oriented:

The feature LIGHT has 4 possible values. We will group them into 2 (day and night) and create a new column for each option, with possible values of 1 or 0.

We will do the same with METEO, which has 9 states: we will group them into broader categories (normal, rain, snow, ice) and create the discrete columns.

We will also create discrete columns for the weekdays and months; a rough sketch of this restructuring is shown below.
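The following pandas sketch illustrates the idea: count collisions per date and day/night, then expand light, weather, weekday and month into 0/1 columns. Column names and code mappings are assumptions; the real logic is in Prediction_Subset.ipynb.

```python
# Illustrative pandas sketch of the restructuring: count collisions per date and
# day/night, then expand light, weather, weekday and month into 0/1 columns.
# Column names and code mappings are assumptions; see Prediction_Subset.ipynb.
import pandas as pd

df = pd.read_csv("accidents_new.csv", parse_dates=["DATE"])

# Group the 4 LIGHT codes into day/night and the 9 METEO codes into broader groups.
df["LIGHT"] = df["LIGHT_CODE"].map({1: "day", 2: "day", 3: "night", 4: "night"})
df["METEO"] = df["METEO_CODE"].map(
    {11: "normal", 12: "normal", 13: "pluie", 14: "neige", 15: "verglas"})

# One row per (date, day/night, weather) with the number of collisions.
daily = (df.groupby(["DATE", "LIGHT", "METEO"])
           .size()
           .reset_index(name="NB_COLLISIONS"))

weekday_names = {0: "LU", 1: "MA", 2: "ME", 3: "JU", 4: "VE", 5: "SA", 6: "DI"}
daily["WEEKDAY"] = daily["DATE"].dt.dayofweek.map(weekday_names)
daily["MONTH"] = daily["DATE"].dt.month

# Discrete (one-hot) columns for light, weather, weekday and month.
final = pd.get_dummies(daily, columns=["LIGHT", "METEO", "WEEKDAY", "MONTH"])
final.to_csv("final_part_1.csv", index=False)
```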
This is the final structure of the prediction dataset:
Please check this complete file, which contains the code, results and comments.
- Splitting the dataset into test and train.

Please check this complete file, which contains the code, results and comments.
- Building and running a model.

We used a Random Forest Regressor, which is known to perform well on discrete features. We didn't try other methods or Grid Search, as the prediction itself wasn't the core objective of this project.
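A minimal training sketch along these lines is shown below, assuming the restructured final_part_1.csv with one-hot feature columns and an NB_COLLISIONS target; the column names and hyperparameters are assumptions, not the notebook's exact code.

```python
# Minimal training sketch: split the restructured dataset and fit a Random
# Forest Regressor. Column names and hyperparameters are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv("final_part_1.csv")
X = data.drop(columns=["DATE", "NB_COLLISIONS"])   # one-hot feature columns (names assumed)
y = data["NB_COLLISIONS"]                          # collisions per day/night

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```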
- Checking the results.

As we expected, the results are not great (cross-validation score = 0.6 and MAE = 6.68), as the model still has to be optimized. Ideally, we would add some features using external sources or develop new features based on ratios or dependencies between existing features.

Strongest feature importances: Day, Night, Sunday, Saturday, Snow, January, December, Friday:
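Continuing the training sketch above (reusing model, X, X_train, X_test, y_train and y_test), the evaluation step might look roughly like this; the scoring choices are assumptions.

```python
# Evaluation along the lines described above, continuing from the training
# sketch: cross-validated score, MAE on the held-out set and feature importances.
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"CV score: {cv_score:.2f}, MAE: {mae:.2f}")

# Rank the one-hot features by importance.
importances = (pd.Series(model.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
print(importances.head(8))
```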
- Exporting the model and running the script.

We exported the model using the joblib library, so that we can run a script that takes a number of features as input and calculates the predicted number of collisions based on the information provided (you can check the code here and the instructions on how to run it in the RUNME section).
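For completeness, a short export snippet continuing from the training sketch above; the file name matches the model.joblib listed in the Scripts section.

```python
# Export the trained model with joblib so pred.py can reload it later
# (continuing from the training sketch above).
import joblib

joblib.dump(model, "model.joblib")

# Later, in the prediction script:
reloaded_model = joblib.load("model.joblib")
```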