Hi!
In this repository, we will get our hands dirty with Spark.
This assignment has the same tasks and goals as the previous assignment, but this time we will utilize Apache Spark and its ecosystem.
Apache Spark is a general-purpose processing engine for analytics. It is generally used for large datasets, typically in the terabytes or petabytes. It has a wide range of APIs and can be used for processing batches of data, real-time streams, machine learning, and ad-hoc queries. Processing tasks are distributed over a cluster of nodes, and data is cached in memory to reduce computation time. We will be using Spark on our previous airline dataset, which is 1.6 GB with about 120M records; this should be an easy job for Spark to handle.
We will look into Spark and PySpark, the Python API for Spark.
Spark is a processing engine for large-scale datasets. It handles parallel processing operations so that you don't have to build your own.
Spark's architecture consists of two main components:
- Master Daemon (Driver Process)
- Worker Daemon (Slave Process)
A DAG (Directed Acyclic Graph) is a graph that keeps track of the operations applied to an RDD.
DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan (i.e. RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).
When a function is passed to a Spark operation, it is executed on a remote cluster node and works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to them are propagated back to the driver program. Supporting general, read-write shared variables across tasks would therefore be inefficient. However, for two common usage patterns, Spark provides two types of shared variables (a brief sketch follows below):
- Broadcast Variables
- Accumulators
Content taken from techvidvan.com.
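As a brief, hedged illustration of both kinds of shared variables (the airline lookup table and the counter below are made-up examples, not part of the assignment):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("shared-variables-demo").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table shipped once to every executor.
airline_names = sc.broadcast({"AA": "American Airlines", "DL": "Delta Air Lines"})

# Accumulator: workers can only add to it; the driver reads the total afterwards.
unknown_codes = sc.accumulator(0)

def resolve(code):
    if code not in airline_names.value:
        unknown_codes.add(1)              # counted on the workers
        return None
    return airline_names.value[code]

codes = sc.parallelize(["AA", "DL", "XX", "AA"])
print(codes.map(resolve).collect())       # the action triggers the computation
print(unknown_codes.value)                # 1, read on the driver after the action
```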
An RDD (Resilient Distributed Dataset) is an immutable distributed collection of objects. An RDD is a logical reference to a dataset that is partitioned across many server machines in the cluster. RDDs are immutable and recover themselves in case of failure. An RDD can come from any data source, e.g. text files, a database via JDBC, etc.
RDDs have two sets of operations.
- Transformations
- Actions
A transformation applies some function to an RDD and creates a new RDD; it does not modify the RDD that the function is applied to (remember that RDDs are immutable). The new RDD also keeps a pointer to its parent RDD.
Transformations are lazy operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit.
Narrow transformations don't require the data to be shuffled across partitions, for example map and filter. Wide transformations require the data to be shuffled, for example reduceByKey.
An action is used to either save a result to some location or to display it. You can also print the RDD lineage information by using the command filtered.toDebugString() (filtered is the RDD here).
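A minimal sketch tying transformations, actions, and lineage together, assuming sc is a live SparkContext (see the environment setup below) and that column index 8 of the CSV holds the airline code (an assumption; check the real header):

```python
# Narrow transformations (map, filter) stay within partitions; reduceByKey is wide and shuffles.
lines = sc.textFile("data.csv")                               # lazy: nothing runs yet
pairs = lines.map(lambda line: (line.split(",")[8], 1))       # narrow transformation
filtered = pairs.filter(lambda kv: kv[0] != "UniqueCarrier")  # narrow: drops the header row
counts = filtered.reduceByKey(lambda a, b: a + b)             # wide: requires a shuffle

print(counts.take(5))                    # action: triggers the whole lineage
print(filtered.toDebugString())          # prints the RDD lineage / DAG
```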
In Spark, DataFrames are distributed collections of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables, which are structured and concise. We can say that DataFrames are like relational database tables with better optimization techniques.
Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.
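As a hedged sketch of creating a DataFrame from a file, assuming spark is a SparkSession (created in the setup section below); the path and options are assumptions, not part of the assignment:

```python
# Read a CSV file into a DataFrame; "data/2008.csv" is a hypothetical path.
df = (
    spark.read
         .option("header", True)        # first row holds the column names
         .option("inferSchema", True)   # let Spark guess the column types
         .csv("data/2008.csv")
)
df.printSchema()
df.show(5)
```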
Create your Spark environment using Google Colab with the script we had. After creating the environment, you can do the following to get your sc object.
After loading Spark, you can initiate a Spark session as shown below.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("My Awesome Spark App!") \
    .getOrCreate()
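The sc object mentioned above is then available from the session:

```python
sc = spark.sparkContext
```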
Read from a text file and create an RDD.
sc.textFile('data.csv', use_unicode=True).take(4)
Creating a DataFrame.
df = spark.createDataFrame(
    data=[('a', 4), ('b', 2)],
    schema=['Column A', 'Column B']
)
Apply a transformation to the data.
df2 = df[df['Column A'] != 'a']
Apply an action to get a result.
df2.filter(df2['Column B'] > 1).count()
Content taken from Spark Basics: RDDs, Stages, Tasks and DAG.
There are four ways of using Spark:
- Using Google Colab (Recommended)
- Using Databricks
- Install it on your own system, either using Docker or directly on your host system.
- Using Saint Peter's University Data Science Lab Notebooks.*
To make this part easy, you don't have to install it on your own computer; however, you will have to use one of the other tools.
Databricks works on the fly; you just need to create an account. The SPU data science lab is similar to Databricks: after you log in, you will be able to create notebooks. For Google Colab, you have to run the following script first in order to be able to use Spark.
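If you don't have that script at hand, a minimal alternative that usually works in Colab (an assumption, not the exact script referenced above) is to install PySpark from pip inside the notebook:

```python
# Minimal Colab setup sketch; this is not the course-provided script.
!pip install -q pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("colab-spark").getOrCreate()
sc = spark.sparkContext
```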
Installing Spark on your local system is another option. You can install everything directly on your own computer, or you can use Docker containers and a docker-compose file to configure master and worker nodes. Check out this marvelous article on Medium by @marcovillarreal_40011 on how to create a Spark standalone cluster with Docker and docker-compose. You can try docker-compose to build your own standalone Spark instance.
Not available due to COVID-19 and VPN restrictions.
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008.
Each row represents an individual flight record with details of that flight. The information includes:
- Time and date of arrival
- Originating and destination airports
- Amount of time a plane takes from taxi to takeoff
You can find more information about this dataset on the Statistical Computing website.
Find the # of flights each airline made using Spark.
Try to find the count for the entire dataset.
Follow the instructions below to set up your assignment repository.
- Download the data from My Google Drive. (Only SPU emails are allowed to download.)
- Create a folder named data in this directory. Put the data files in this folder.
- Load the entire dataset into a DataFrame.
Not required, but can you do the following in Spark?
- Create a new timestamp field using the columns Year, Month, DayofMonth, DayOfWeek, DepTime, and CRSDepTime. Note that CRSDepTime is in HHMM format. (A sketch of one possible approach follows this list.)
- How many rows does your dataset have?
- How many flights that were not cancelled took place?
- What is the average departure delay from each airport?
- On which day are the delays the worst?
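A hedged sketch for the first item above, assuming df is the loaded dataset and that Year, Month, DayofMonth, and CRSDepTime were read as integer columns (CRSDepTime in HHMM):

```python
from pyspark.sql import functions as F

# Build a timestamp string from the date parts and the HHMM departure time,
# then parse it with to_timestamp. Column types are assumptions; adjust as needed.
df_ts = (
    df.withColumn("dep_hour",   (F.col("CRSDepTime") / 100).cast("int"))
      .withColumn("dep_minute", (F.col("CRSDepTime") % 100).cast("int"))
      .withColumn(
          "dep_timestamp",
          F.to_timestamp(
              F.format_string(
                  "%04d-%02d-%02d %02d:%02d",
                  F.col("Year"), F.col("Month"), F.col("DayofMonth"),
                  F.col("dep_hour"), F.col("dep_minute"),
              ),
              "yyyy-MM-dd HH:mm",
          ),
      )
)
df_ts.select("Year", "Month", "DayofMonth", "CRSDepTime", "dep_timestamp").show(5)
```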
Use the MapReduce algorithm to find the answers to the following questions. (A sketch of the per-key aggregation pattern follows the list.)
- Find the counts of all airlines using the MapReduce algorithm. Use Spark and the PySpark API to complete this task.
- Find the mean departure delay per origination airport.
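A hedged sketch of the per-key sum-and-count pattern behind the second question; the file path is hypothetical, and the 0-based column indices (15 = DepDelay, 16 = Origin) are assumptions you should verify against the header:

```python
# MapReduce-style mean: emit (key, (value, 1)), reduce by summing both, then divide.
lines = sc.textFile("data/2008.csv")
header = lines.first()

mean_delay = (
    lines.filter(lambda line: line != header)
         .map(lambda line: line.split(","))
         .filter(lambda r: r[15] != "NA")                       # drop missing delays
         .map(lambda r: (r[16], (float(r[15]), 1)))             # (origin, (delay, 1))
         .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum delays and counts
         .mapValues(lambda s: s[0] / s[1])                      # mean = sum / count
)
print(mean_delay.take(5))
```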
The following table describes each file in this repository.
File | Description |
---|---|
README.md | A descriptive file that gives an introduction to the current project/assignment. Includes a todo list that you have to edit. |
LICENCE | The license for the project, which every project should have. |
.gitignore | The file to control which files should be ignored by Git. |
.gitkeep | An empty file to keep folders under git. |
requirements.txt | A list of python packages you may need for the assignment. |
*.ipynb | Sample notebook as a reference for how your notebooks should be organized. |
- I have completed all the tasks in the tasks section.
- I edited this README file and checkmarked the things I've completed in the tasks section.
- My notebook(s) are well organized, with headings and comments that make them visually appealing.
- My notebook(s) have the results of my execution.
- My notebook(s) are reproducible.
- I downloaded the final version of my repository and uploaded it to Blackboard!