Solution: https://www.youtube.com/watch?v=YtddC7vJOgQ
In this homework we'll put what we learned about Spark into practice.
For this homework we will be using the FHV 2019-10 data (FHV Data).
Install Spark and PySpark
- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.
What's the output?
Note
To install PySpark, follow this guide.
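The steps above can be sketched as follows. This is a minimal sketch, assuming PySpark is already installed; the function name create_local_session and the application name "homework5" are placeholders chosen for illustration, not names from the course material.

```python
def create_local_session(app_name="homework5"):
    """Create (or reuse) a Spark session that runs locally on all cores."""
    # Imported lazily so this file still loads on a machine where
    # PySpark has not been installed yet.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")   # local mode, one worker thread per core
        .appName(app_name)    # hypothetical name, shown in the Spark UI
        .getOrCreate()
    )
```

Calling create_local_session().version then returns the version string the question asks for.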
FHV October 2019
Read the October 2019 FHV data into a Spark DataFrame with a schema, as we did in the lessons.
Repartition the DataFrame to 6 partitions and save it to Parquet.
What is the average size, in MB, of the Parquet files that were created (the files ending with the .parquet extension)? Select the answer that most closely matches.
- 1MB
- 6MB
- 25MB
- 87MB
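One possible way to approach this question is sketched below. The column names and their order in FHV_SCHEMA are an assumption about the CSV file and should be checked against its header row. The function names and all paths are hypothetical. The size helper treats 1 MB as 1024 * 1024 bytes, which is also an assumption.

```python
from pathlib import Path

# Assumed column names and order; verify against the CSV header.
FHV_SCHEMA = (
    "dispatching_base_num STRING, pickup_datetime TIMESTAMP, "
    "dropOff_datetime TIMESTAMP, PUlocationID INT, DOlocationID INT, "
    "SR_Flag STRING, Affiliated_base_number STRING"
)


def write_repartitioned(spark, csv_path, output_dir, partitions=6):
    """Read the CSV with an explicit schema and save it as Parquet."""
    df = spark.read.option("header", "true").schema(FHV_SCHEMA).csv(csv_path)
    df.repartition(partitions).write.mode("overwrite").parquet(output_dir)


def average_parquet_size_mb(output_dir):
    """Mean size in MB of the files ending in .parquet under output_dir."""
    # Spark also writes _SUCCESS and .crc files; the pattern skips them.
    sizes = [p.stat().st_size for p in Path(output_dir).glob("*.parquet")]
    if not sizes:
        raise FileNotFoundError(f"no .parquet files found in {output_dir}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

Listing the output directory with ls -lh gives the same information without any code.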
Count records
How many taxi trips were there on the 15th of October?
Consider only trips that started on the 15th of October.
- 108,164
- 12,856
- 452,470
- 62,610
Important
Be aware of the column order when defining the schema.
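Counting the trips that started on a given day can be sketched with Spark SQL. This is a rough sketch, assuming the DataFrame was registered as a temp view named fhv_2019_10 (a hypothetical name) and that the pickup timestamp column is called pickup_datetime (an assumption about the dataset).

```python
def trips_started_on(spark, day, view_name="fhv_2019_10"):
    """Count the rows whose pickup timestamp falls on `day` (YYYY-MM-DD)."""
    # `day` and `view_name` are interpolated directly, so pass only
    # trusted literals, never user input.
    query = f"""
        SELECT COUNT(*) AS trips
        FROM {view_name}
        WHERE to_date(pickup_datetime) = DATE '{day}'
    """
    return spark.sql(query).first()["trips"]
```

For this question the call would be trips_started_on(spark, "2019-10-15"). Filtering on the pickup column only is what restricts the count to trips that started on that day.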
Longest trip
What is the length of the longest trip in the dataset in hours?
- 631,152.50 Hours
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours
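A trip's length in hours is the difference between its drop-off and pickup timestamps in seconds, divided by 3600. The sketch below expresses that in Spark SQL. It assumes a temp view named fhv_2019_10 (hypothetical) and the column names pickup_datetime and dropOff_datetime (an assumption about the dataset).

```python
def longest_trip_hours(spark, view_name="fhv_2019_10"):
    """Return the duration in hours of the longest trip in the view."""
    # unix_timestamp() gives seconds since the epoch, so the difference
    # is the trip length in seconds; dividing by 3600 converts to hours.
    query = f"""
        SELECT MAX(
            (unix_timestamp(dropOff_datetime) - unix_timestamp(pickup_datetime))
            / 3600
        ) AS longest_hours
        FROM {view_name}
    """
    return spark.sql(query).first()["longest_hours"]
```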
User Interface
On which local port does Spark's User Interface, which shows the application's dashboard, run?
- 80
- 443
- 4040
- 8080
Least frequent pickup location zone
Load the zone lookup data (Zone Data) into a temp view in Spark.
Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?
- East Chelsea
- Jamaica Bay
- Union Sq
- Crown Heights North
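One way to answer this is to join the trips to the zone lookup, count the pickups per zone, and sort in ascending order. The sketch below assumes the lookup CSV has LocationID and Zone columns and that the trips have a PUlocationID column. The view names fhv_2019_10 and zones, and both function names, are hypothetical.

```python
def register_zones_view(spark, csv_path, view_name="zones"):
    """Load the zone lookup CSV and expose it to Spark SQL as a temp view."""
    zones = spark.read.csv(csv_path, header=True, inferSchema=True)
    zones.createOrReplaceTempView(view_name)
    return zones


def least_frequent_pickup_zone(spark, trips_view="fhv_2019_10", zones_view="zones"):
    """Return (zone, trips) for the zone with the fewest pickups."""
    # Inner join: zones with no pickups at all do not appear in the result.
    query = f"""
        SELECT z.Zone AS zone, COUNT(*) AS trips
        FROM {trips_view} AS t
        JOIN {zones_view} AS z
          ON t.PUlocationID = z.LocationID
        GROUP BY z.Zone
        ORDER BY trips ASC
        LIMIT 1
    """
    row = spark.sql(query).first()
    return row["zone"], row["trips"]
```

If several zones tie for the lowest count, LIMIT 1 returns only one of them, so it can be worth raising the limit to inspect the first few rows.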
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw5
- Deadline: See the website