Big Data coursework

This coursework focuses on parallelisation and scalability in the cloud with Spark and TensorFlow/Keras. We start with code based on lessons 3 and 4 of the Fast and Lean Data Science course by Martin Gorner. We will parallelise pre-processing, measurement and machine learning in the cloud, evaluate and analyse cloud performance, and carry out a theoretical discussion.

This coursework contains 5 sections.
This section just contains some necessary code for setting up the environment. It has no tasks for you (but do read the code and comments).
Section 1 is about preprocessing a set of image files. We will work with a public dataset, "Flowers" (3600 images, 5 classes). This is not a vast dataset, but it keeps the tasks manageable during development; you can scale up later if you like.
In 'Getting Started' we will work through the data preprocessing code from Fast and Lean Data Science, which uses TensorFlow's tf.data package.
There is no task for you here, but you will need to re-use some of this code later.
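As a preview, the tf.data preprocessing boils down to mapping a decode-and-resize step over a list of image filenames. The sketch below assumes a 192x192 target size and a directory-per-class layout; the exact sizes and bucket paths come from the 'Getting Started' code, and `decode_and_resize`/`make_dataset` are illustrative names, not the course API.

```python
import tensorflow as tf

TARGET_SIZE = [192, 192]  # assumed target resolution

def decode_and_resize(filename):
    # Read raw bytes, decode the JPEG and resize to a fixed shape.
    bits = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(bits, channels=3)
    image = tf.image.resize(image, TARGET_SIZE)
    # The label is the parent directory name (one of the 5 flower classes).
    label = tf.strings.split(filename, sep="/")[-2]
    return image, label

def make_dataset(filenames):
    # Map the decode step over the filenames, letting tf.data parallelise it.
    ds = tf.data.Dataset.from_tensor_slices(filenames)
    return ds.map(decode_and_resize, num_parallel_calls=tf.data.AUTOTUNE)
```

In the real code the filenames are globbed from a GCS bucket; the same `map` pattern applies unchanged.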
In Task 1 you will parallelise the data preprocessing in Spark, using Google Cloud (GC) Dataproc. This involves adapting the code from 'Getting Started' to use Spark and running it in the cloud.
In Section 2 we are going to measure the speed of reading data in the cloud. In Task 2 we will parallelise the measuring of different configurations using Spark.
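The measurement itself is simple: time how long reading a batch of files takes and derive a throughput figure. A minimal sketch, using local temporary files as a stand-in for the GCS bucket (in the task itself, this timing runs inside Spark tasks against cloud storage):

```python
import os
import tempfile
import time

def read_throughput(paths):
    # Time reading every file fully and return the rate in MB/s.
    start = time.perf_counter()
    total = 0
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / elapsed

# Create a few 1 MiB dummy files and measure the read rate.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmpdir, "part_%d.bin" % i)
    with open(p, "wb") as f:
        f.write(b"\0" * 1024 * 1024)
    paths.append(p)
rate = read_throughput(paths)
```

Repeating this for different configurations (file sizes, batch sizes, numbers of parallel readers) is what Task 2 parallelises with Spark.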
In Section 3, we will use the pre-processed data in TensorFlow/Keras. We will use the GC AI Platform (formerly Cloud ML) in Task 3 and test different parallelisation approaches for multiple GPUs.
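One of the approaches tested is synchronous data parallelism via `tf.distribute.MirroredStrategy`, sketched below with a deliberately tiny placeholder model (the real model comes from the course code). The strategy mirrors model variables across all visible GPUs; on a CPU-only machine it falls back to a single replica, so the same code runs everywhere.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and keeps the
# replicas in sync; with no GPUs it runs a single CPU replica.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across replicas.
    # This toy model is a placeholder for the real flowers classifier.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(192, 192, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(5, activation="softmax"),  # 5 flower classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

A subsequent `model.fit(...)` call then splits each batch across the replicas automatically.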
Section 4 is about the theoretical discussion, based on two papers, in Task 4. The answers should be given in the PDF report.