This project focuses on data analysis using Apache Hadoop and Apache Spark.
The goal is to gain familiarity with distributed systems and modern data science techniques.
The project utilizes large datasets related to crime data in Los Angeles.
- Apache Hadoop 3.3.6: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. It enables the distributed processing of large datasets across clusters of computers using simple programming models.
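The "simple programming model" Hadoop is built around is MapReduce. A minimal sketch in plain Python of the map and reduce phases, using a word-count over made-up crime-description strings (on the cluster, the same logic would run distributed over data stored in HDFS):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every record.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Toy input; the real job would read records from HDFS.
lines = ["robbery downtown", "burglary downtown", "robbery hollywood"]
result = reduce_phase(map_phase(lines))
# result maps each word to its total count, e.g. "robbery" -> 2
```

In Hadoop the map tasks run in parallel across the cluster, the framework shuffles intermediate pairs by key, and the reduce tasks aggregate each key's values; this sketch collapses all of that into two in-process functions.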
- Apache Spark: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It supports:
  - Batch/streaming data
  - SQL analytics
  - Data science at scale
  - Machine learning
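The project's SQL-style analytics boil down to group-by/count aggregations over the crime records. A sketch of that query shape in plain Python (on the cluster it would be a Spark DataFrame `groupBy(...).count()` or an equivalent Spark SQL query; the `area` and `crime` field names below are hypothetical stand-ins, not the dataset's actual schema):

```python
from collections import Counter

# Hypothetical rows standing in for the LA crime dataset.
crimes = [
    {"area": "Central", "crime": "ROBBERY"},
    {"area": "Central", "crime": "BURGLARY"},
    {"area": "Hollywood", "crime": "ROBBERY"},
]

# Group by area and count records per group.
per_area = Counter(row["area"] for row in crimes)
# per_area["Central"] -> 2, per_area["Hollywood"] -> 1
```

Spark distributes exactly this kind of aggregation: each partition computes partial counts, and the results are combined across executors.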
The project uses virtual machines from the ~Okeanos-knossos public cloud.
A detailed setup guide for the installation of the tools used is available in the files/documents folder.
A detailed report with the execution and interpretation of the queries is also available in the files/documents folder.