Big Data & Cloud Computing
Use Spark and Google Cloud Platform to analyse samples of the MovieLens dataset of increasing size and difficulty.
Example of PySpark use:
def recommendByTag(singleTag, TFIDF_tags, movies, min_fmax=10, numberOfResults=10, debug=False):
    # start with the most complexity-reducing operation: filter
    # keep only rows for the requested tag
    # drop entries with f_max below min_fmax
    df_tag = TFIDF_tags.filter(TFIDF_tags.tag == singleTag)\
                       .filter(TFIDF_tags.f_max >= min_fmax)
    # join with movies to get the title
    # order by descending TF-IDF, then ascending lexicographic title
    # keep only the relevant columns
    # limit results to numberOfResults
    df = df_tag.join(movies, 'movieId')\
               .orderBy(['TF_IDF', 'title'], ascending=[0, 1])\
               .select('movieId', 'title', 'TF_IDF')\
               .limit(numberOfResults)
    return df
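A minimal usage sketch, assuming the MovieLens movies table and a precomputed per-tag TF-IDF table (with movieId, tag, f_max and TF_IDF columns) are available; the bucket paths and file names below are illustrative, not the actual project layout:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('movielens-tags').getOrCreate()

# illustrative paths on Cloud Storage; the real bucket and file names depend on the GCP setup
movies = spark.read.csv('gs://my-bucket/movielens/movies.csv', header=True, inferSchema=True)
TFIDF_tags = spark.read.parquet('gs://my-bucket/movielens/tfidf_tags.parquet')

# top 10 movies for the tag 'time travel', keeping only tags with f_max >= 10
recommendByTag('time travel', TFIDF_tags, movies).show(truncate=False)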
Open-ended problem: use Big Data tools and techniques to analyse a 32 GB+ dataset of hospital events. Besides GCP, we used Dask and dask-ml.
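A minimal sketch of the kind of out-of-core workflow this involves, assuming the hospital events are stored as partitioned CSV files; the path, feature and label columns, and model choice are illustrative placeholders, not the actual pipeline:

import dask.dataframe as dd
from dask_ml.preprocessing import StandardScaler
from dask_ml.linear_model import LogisticRegression

# lazily read the partitioned CSVs; nothing is loaded into memory yet
events = dd.read_csv('gs://my-bucket/hospital-events/*.csv', assume_missing=True)

# illustrative feature and label columns
X = events[['age', 'length_of_stay', 'num_procedures']].to_dask_array(lengths=True)
y = events['readmitted'].to_dask_array(lengths=True)

# scale the features and fit a model out of core with dask-ml
X = StandardScaler().fit_transform(X)
model = LogisticRegression()
model.fit(X, y)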