Performing basic data cleaning, EDA and prediction of the number of collisions in Montreal on a given day, knowing the lighting conditions, day of the week, month and weather conditions.
The purpose of this project is to apply the techniques we acquired for creating Docker images, working with volumes, containers and stacks, creating clusters, and performing deployment inside a cluster both locally and on GCP. We also use HDFS, MongoDB and Parquet files in this project. As we work in Spark, we use PySpark and work with RDDs.
Contributors: @Arwasheraky, @Nastazya and @Clercy

Link to Presentation
Code:

Scripts:

- data_cleaning.ipynb - Data cleaning.
- data_EDA.ipynb - Exploratory data analysis.
- csv-to-parquet.py - Making use of Parquet and Hadoop.
- Write_features_MongoDB.ipynb - Making use of MongoDB.
- Prediction_Subset.ipynb - Restructuring the dataset for further prediction.
- Prediction_and_model_export.ipynb - Prediction and model export.
- pred.py - Script that runs the prediction with specific features.
- model.joblib - Exported prediction model used in the script.
- model.py - Script that builds the prediction model; we run it and save the results to HDFS (optional, used for training).
docker-compose:

- jupyter-compose.yml - Running Jupyter, using the external volume jupyter-data.
- jupyter_bind-compose.yml - Running Jupyter, using volume mounting and an external network.
- jupyter_local-compose.yml - Running Jupyter, using volume mounting.
- spark-compose.yml - Running Spark based on the jupyter/pyspark-notebook image, no volumes, using an external network.
- spark_bind-compose.yml - Running Spark based on the jupyter/pyspark-notebook image, using volume mounting and an external network.
- spark_bind_hdfs-compose.yml - Running Spark based on the mjhea0/spark:2.4.1 image, using volume mounting and an external network.
- spark_hdfs-compose.yml - Running Spark based on the mjhea0/spark:2.4.1 image, no volumes, using an external network.
- spark_mongo-compose.yml - Running Spark and MongoDB.
- Dockerfile - Used to create a custom image that copies the script and model, then runs the script.
Data:

- accidents.csv - Municipalities dataset, merged with the original.
- accidents_new.csv - Clean data, ready for EDA.
- pred-1.csv - Restructured temporary dataset, not ready for prediction.
- final_part_1.csv - Restructured dataset, ready for prediction.
- Clone this folder locally.
- Make sure you have the latest version of Docker installed.
- Deploy any of the code files locally, in a Spark cluster, or on GCP.

Locally:

    env token=cebd1261 docker-compose -f jupyter_local-compose.yml up

Spark cluster:

    docker network create spark-network
    docker-compose -f spark_bind-compose.yml up --scale spark-worker=2
    env token=cebd1261 docker-compose -f jupyter_bind-compose.yml up

GCP:

- Create a bucket with the files you need.
- Create a Dataproc cluster that includes Jupyter and Anaconda.
- Run Jupyter from the cluster.
To run the prediction script pred.py, which takes a number of features as input and calculates the predicted number of collisions, follow these steps:

- Run the environment using these commands from the code folder:

    docker network create spark-network
    docker-compose --file spark-compose.yml up

- Run the script on the cluster using this command:

    docker run -ti --network=spark-network -v /${PWD}/script://app jupyter/pyspark-notebook:latest //usr/local/spark/bin/spark-submit --master spark://master:7077 //app/pred.py <input features>

  where <input features> is a sequence of 4 parameters, divided by commas. Example: day,DI,2,neige

  Possible parameters:
  - lighting conditions (day or night)
  - day of the week (DI, LU, MA, ME, JU, VE, SA)
  - month (1 to 12)
  - weather conditions (normal, pluie, neige, verglas)

- You can also do the same thing using a custom image:

    docker build -t <image_name> .

  or use the existing image nastazya/image_for_pred:latest

- Run the image using this format:

    docker run -ti --network=spark-network <image name> <input features>

  or

    docker run -ti --network=spark-network <image name>

  If no features are given, this input is used by default: day,DI,2,neige
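For reference, here is a minimal, hypothetical sketch of what a script like pred.py might do: parse the comma-separated input, build a one-hot feature row and predict with the exported model.joblib. The column layout and default handling below are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of a pred.py-style entry point: parse "day,DI,2,neige",
# build a one-hot feature row matching the training columns, and predict.
import sys

import joblib
import pandas as pd

# Illustrative column layout -- the real model.joblib defines its own feature order.
FEATURE_COLUMNS = (
    ["day", "night"]                                   # lighting conditions
    + ["DI", "LU", "MA", "ME", "JU", "VE", "SA"]       # day of the week
    + [str(m) for m in range(1, 13)]                   # month
    + ["normal", "pluie", "neige", "verglas"]          # weather conditions
)


def build_row(light, weekday, month, weather):
    """Turn the four input values into a single 0/1 feature row."""
    row = {col: 0 for col in FEATURE_COLUMNS}
    for key in (light, weekday, str(month), weather):
        if key not in row:
            raise ValueError(f"Unknown feature value: {key}")
        row[key] = 1
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


if __name__ == "__main__":
    # Default input when no features are passed, as described above.
    raw = sys.argv[1] if len(sys.argv) > 1 else "day,DI,2,neige"
    light, weekday, month, weather = [p.strip().strip("'") for p in raw.split(",")]

    model = joblib.load("model.joblib")   # exported Random Forest model
    features = build_row(light, weekday, month, weather)
    print(f"Predicted number of collisions: {model.predict(features)[0]:.1f}")
```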
In this section, we will briefly describe some of the techniques we used for packaging and deploying the project.
Below is an approximate diagram of all the processes used in this project (for clarity, we presented it as a team of two co-workers).
This schema was used the most in this project:
- Run this command from the code folder:

    docker-compose -f spark_bind-compose.yml up --scale spark-worker=2

- Run Jupyter from another terminal using this command:

    env token=cebd1261 docker-compose -f jupyter_bind-compose.yml up

- Proceed with coding in one of the .ipynb files.
In the middle of the project, we needed to access data_cleaning.ipynb and run some additional code without changing the existing file and its output. For this purpose, we ran jupyter-compose.yml, which uses the volume jupyter-data. When we opened the Jupyter notebook there were already some files, but not the ones we needed:

So we created a temporary folder with two files and copied them into the jupyter-data volume through a busybox image:

Now we have a script folder with the files we need inside our volume, and whatever changes we make won't affect the original files. Later we will remove the jupyter-data volume.
To execute a Jupyter-Spark-MongoDB session:

- Run spark_mongo-compose.yml in the following format:

    docker-compose -f spark_mongo-compose.yml up --scale worker=2

- In another terminal, run jupyter-compose.yml in the following format:

    env TOKEN=cebd1261 docker-compose --file jupyter-compose.yml up

The session should now be available. We used Write_features_MongoDB.ipynb.

- Once the session is available, you can upload a test .csv file by using the Upload function.
- The database can be viewed at http://localhost:8181
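As an illustration of the MongoDB step, here is a hedged sketch of writing a features DataFrame from Spark to MongoDB. The "mongo" host name, the collisions.features namespace and the connector version are assumptions, not the notebook's actual code.

```python
# Hedged sketch of writing a features DataFrame from Spark to MongoDB, roughly
# what Write_features_MongoDB.ipynb does. Host, namespace and connector
# version are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("write-features-to-mongo")
    # The mongo-spark connector must be on the classpath; the version is an assumption.
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
    .config("spark.mongodb.output.uri", "mongodb://mongo:27017/collisions.features")
    .getOrCreate()
)

features = spark.read.csv("final_part_1.csv", header=True, inferSchema=True)

# With connector 2.x the source is "com.mongodb.spark.sql.DefaultSource";
# newer versions also accept the short name "mongo".
(features.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .save())
```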
- Run spark_hdfs-compose.yml with 2 workers.
- From another terminal, run the PySpark script csv-to-parquet.py, mounting the current directory:

    docker run -t --rm \
      -v "$(pwd)":/script \
      --network=spark-network \
      mjhea0/spark:2.4.1 \
      bin/spark-submit \
      --master spark://master:7077 \
      --class endpoint \
      /script/csv-to-parquet.py

The script reads the CSV file, writes it to HDFS in Parquet format, reads the file back from HDFS and tries some SQL queries:
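An approximate outline of what such a script does is sketched below; the namenode host/port, file paths and column names are assumptions based on the compose setup, and the authoritative version is csv-to-parquet.py itself.

```python
# Approximate outline of csv-to-parquet.py: read the CSV, write it to HDFS as
# Parquet, read it back and run a sample SQL query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# The CSV is assumed to sit in the /script directory mounted by the docker run command.
accidents = spark.read.csv("/script/accidents_new.csv", header=True, inferSchema=True)

# Write to HDFS in Parquet format (namenode host/port are assumptions).
parquet_path = "hdfs://namenode:9000/data/accidents.parquet"
accidents.write.mode("overwrite").parquet(parquet_path)

# Read the Parquet data back and try a SQL query.
from_hdfs = spark.read.parquet(parquet_path)
from_hdfs.createOrReplaceTempView("accidents")
spark.sql(
    "SELECT MUNICIPALITY, COUNT(*) AS collisions "
    "FROM accidents GROUP BY MUNICIPALITY ORDER BY collisions DESC"
).show(10)

spark.stop()
```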
Now that we have our data in Parquet format uploaded to HDFS, we can browse it at http://localhost:50070/explorer.html
Running data cleaning and EDA on a local cluster was fairly easy. Once we reached the stage where we needed to restructure the data for prediction, it became impossible due to insufficient resources, so we had to switch to GCP.

- Create a bucket.
- Create a Dataproc cluster and attach our bucket, plus the Anaconda and Jupyter components, to it.
- Go to the cluster and open the Jupyter window.
- Upload your files.
- Run the file without creating a Spark session (in Dataproc it is created automatically); see the note below.
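To illustrate the last point: in a Dataproc Jupyter notebook with the PySpark kernel, the spark session already exists, so the notebook can go straight to reading data. The bucket name below is a placeholder.

```python
# In a Dataproc Jupyter notebook with the PySpark kernel, `spark` is already
# created, so there is no SparkSession.builder block. The bucket name is a placeholder.
accidents = spark.read.csv("gs://<your-bucket>/accidents_new.csv",
                           header=True, inferSchema=True)
accidents.printSchema()
```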
Our script takes a number of features as input and outputs the predicted number of collisions based on the information provided.

- Creating the Spark environment:

    docker network create spark-network
    docker-compose --file spark-compose.yml up

- In another terminal, run this command:

    docker run -ti --network=spark-network -v /${PWD}/script://app jupyter/pyspark-notebook:latest //usr/local/spark/bin/spark-submit --master spark://master:7077 //app/pred.py 'day','DI',2,'neige'

Alternatively, using the custom image:

- Creating the Spark environment:

    docker network create spark-network
    docker-compose --file spark-compose.yml up

- Building the image:

    docker build -t <image_name> .

  or use the existing image nastazya/image_for_pred:latest from Docker Hub.

- Run the image using this format:

    docker run -ti --network=spark-network <image name> <input features>

  or without features (they will be taken from the image by default: day,DI,2,neige):

    docker run -ti --network=spark-network <image name>
The following steps were done:
- Reading collisions dataset and checking its structure.
- Choosing some columns and renaming them.
- Adjusting the types.
- Reading municipalities dataset and merging it with collisions.
- Exploring collisions in each municipality and doing some other basic explorations to better understand the data.
- Dealing with nulls: rows with null streets were removed; rows with empty codes were removed too (we could have replaced them with 99, but we remove all rows with unknown categories anyway).
- Writing the "clean" data to another file.

Please check this file, which contains the code, results and comments.
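A rough PySpark sketch of these cleaning steps is shown below; the column names, codes and join key are illustrative assumptions, and the authoritative version is in data_cleaning.ipynb.

```python
# Rough PySpark sketch of the cleaning steps above; column names, codes and the
# join key are illustrative assumptions -- see data_cleaning.ipynb for the real code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-cleaning").getOrCreate()

collisions = spark.read.csv("accidents.csv", header=True, inferSchema=True)

# Keep and rename a subset of columns, adjusting the types along the way.
clean = collisions.select(
    F.col("DT_ACCDN").cast("date").alias("DATE"),
    F.col("CD_ECLRM").cast("int").alias("LIGHT"),
    F.col("CD_COND_METEO").cast("int").alias("METEO"),
    F.col("RUE_ACCDN").alias("STREET"),
    F.col("GRAVITE").alias("SEVERITY"),
    F.col("CD_MUNCP").cast("int").alias("MUNICIPALITY_CODE"),
)

# Merge with the municipalities dataset (join key assumed).
municipalities = spark.read.csv("municipalities.csv", header=True, inferSchema=True)
clean = clean.join(municipalities, on="MUNICIPALITY_CODE", how="left")

# Drop rows with missing streets or empty codes, then write the "clean" data.
clean = clean.dropna(subset=["STREET", "LIGHT", "METEO"])
clean.write.mode("overwrite").csv("accidents_new.csv", header=True)
```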
The Exploratory Data Analysis summarizes a number of values from our dataset to help isolate the values required for our prediction model. The exploratory analysis was conducted after refining the original data to its present form. To move towards a prediction model, the data was filtered and aggregated in several forms, allowing for various views and insights.

The analysis includes:
- A data schema is provided outlining the fields and number of records that were available for our investigation.
- Aggregates have been made available outlining the number of accidents, victims by method of transportation, severity, municipality, street, days of the week and weather conditions.
- Charts to visualize and facilitate the understanding of the data for decision making.
Based on our initial analysis, what does the data tell us?

- The majority of accidents took place in the City of Montreal area. Our first thought is that this can be attributed to population density, and to Montreal being the business hub of the region, which raises the probability of accidents.
- We can also see that weather, surface, time of day and day of the week, among other factors, contribute to accidents as well. Friday has the largest number of accidents among the weekdays. In addition, most accidents happen in clear weather!
- Fortunately, material damage is the most common outcome. In contrast, fatal accidents are the least common.
We want to predict the number of collisions in Montreal based on lighting conditions, day of the week, month and weather conditions.
The following steps were done:
- Restructuring the dataset in order to be able to make a prediction:

The original dataset has a collision-oriented structure (one row = one collision):

We have to find a way to restructure it to be day-oriented (one day = one row), with the number of collisions per day. On top of that, we want to split each day into day and night in terms of lighting.

After some manipulations we have an intermediate result where the data is day-oriented:

The feature LIGHT has 4 possible values. We will group them into 2 (day and night) and create a new column for each option, with possible values of 1 or 0.

We will do the same with METEO, which has 9 states: we will group them into broader categories (normal, rain, snow, ice) and create the discrete columns.

We will also create discrete columns for the weekdays and months; a rough sketch of this restructuring is shown below.
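The following pandas sketch illustrates the idea: count collisions per date and day/night, then expand light, weather, weekday and month into 0/1 columns. Column names and code mappings are assumptions; the real logic is in Prediction_Subset.ipynb.

```python
# Illustrative pandas sketch of the restructuring: count collisions per date and
# day/night, then expand light, weather, weekday and month into 0/1 columns.
# Column names and code mappings are assumptions; see Prediction_Subset.ipynb.
import pandas as pd

df = pd.read_csv("accidents_new.csv", parse_dates=["DATE"])

# Group the 4 LIGHT codes into day/night and the 9 METEO codes into broader groups.
df["LIGHT"] = df["LIGHT_CODE"].map({1: "day", 2: "day", 3: "night", 4: "night"})
df["METEO"] = df["METEO_CODE"].map(
    {11: "normal", 12: "normal", 13: "pluie", 14: "neige", 15: "verglas"})

# One row per (date, day/night, weather) with the number of collisions.
daily = (df.groupby(["DATE", "LIGHT", "METEO"])
           .size()
           .reset_index(name="NB_COLLISIONS"))

weekday_names = {0: "LU", 1: "MA", 2: "ME", 3: "JU", 4: "VE", 5: "SA", 6: "DI"}
daily["WEEKDAY"] = daily["DATE"].dt.dayofweek.map(weekday_names)
daily["MONTH"] = daily["DATE"].dt.month

# Discrete (one-hot) columns for light, weather, weekday and month.
final = pd.get_dummies(daily, columns=["LIGHT", "METEO", "WEEKDAY", "MONTH"])
final.to_csv("final_part_1.csv", index=False)
```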
This is the final structure of the prediction dataset:
Please check this complete file, which contains the code, results and comments.
- Splitting the dataset into test and train.

Please check this complete file, which contains the code, results and comments.
- Building and running a model.

We used a Random Forest Regressor, which is known to perform well on discrete features. We didn't try other methods or Grid Search, as the prediction itself wasn't the core objective of this project.
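A minimal training sketch along these lines is shown below, assuming the restructured final_part_1.csv with one-hot feature columns and an NB_COLLISIONS target; the column names and hyperparameters are assumptions, not the notebook's exact code.

```python
# Minimal training sketch: split the restructured dataset and fit a Random
# Forest Regressor. Column names and hyperparameters are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv("final_part_1.csv")
X = data.drop(columns=["DATE", "NB_COLLISIONS"])   # one-hot feature columns (names assumed)
y = data["NB_COLLISIONS"]                          # collisions per day/night

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```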
- Checking the results.

As we expected, the results are not great (cross-validation score = 0.6 and MAE = 6.68), as the model still has to be optimized. Ideally, we would add some features using external sources or develop new features based on ratios or dependencies between existing features.

Strongest feature importances: Day, Night, Sunday, Saturday, Snow, January, December, Friday:
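Continuing the training sketch above (reusing model, X, X_train, X_test, y_train and y_test), the evaluation step might look roughly like this; the scoring choices are assumptions.

```python
# Evaluation along the lines described above, continuing from the training
# sketch: cross-validated score, MAE on the held-out set and feature importances.
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"CV score: {cv_score:.2f}, MAE: {mae:.2f}")

# Rank the one-hot features by importance.
importances = (pd.Series(model.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
print(importances.head(8))
```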
- Exporting the model and running the script.

We exported the model using the joblib library, so that we can run a script that takes a number of features as input and calculates the predicted number of collisions based on the information provided (you can check the code here and the instructions on how to run it in the RUNME section).
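For completeness, a short export snippet continuing from the training sketch above; the file name matches the model.joblib listed in the Scripts section.

```python
# Export the trained model with joblib so pred.py can reload it later
# (continuing from the training sketch above).
import joblib

joblib.dump(model, "model.joblib")

# Later, in the prediction script:
reloaded_model = joblib.load("model.joblib")
```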