- Jeffrey Magina
- Jay Patel
- Jeremy Tran
- Pejal Rath
The main focus of this project is to become familiar with the data pipeline process. The project uses Java, an AWS EMR cluster, an AWS S3 bucket, AWS RDS, Tomcat, Apache Spark, PostgreSQL, HttpServlet, and HTML. The S3 bucket contains a CSV file of raw hotel booking data.
- The program loads the CSV file from the S3 bucket, performs Spark transformations on the raw data, and saves the results back to the S3 bucket in CSV format. This step runs on an EMR instance (see the Spark sketch after this list).
- The program loads each CSV file generated by the Spark transformations, builds a SQL table from it, and saves the table to a remote AWS RDS PostgreSQL database (see the S3-to-RDS sketch after this list).
- The program hosts a Tomcat server where users can query each table in the AWS RDS database (see the servlet sketch after this list).
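
Below is a minimal sketch of what the Spark transformation stage could look like. The bucket paths, column names (`is_canceled`, `hotel`), and the aggregation itself are placeholders for illustration, not the project's actual job.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkJobSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hotel-booking-etl")
                .getOrCreate();

        // Load the raw hotel-booking CSV from the S3 bucket (path is a placeholder).
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("s3://example-bucket/raw/hotel_bookings.csv");

        // Example transformation: keep non-cancelled bookings and count them per hotel.
        Dataset<Row> bookingsByHotel = raw
                .filter(col("is_canceled").equalTo(0))
                .groupBy(col("hotel"))
                .count();

        // Save the transformed data back to the S3 bucket in CSV format.
        bookingsByHotel.write()
                .option("header", "true")
                .mode("overwrite")
                .csv("s3://example-bucket/transformed/bookings_by_hotel");

        spark.stop();
    }
}
```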
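
The S3-to-RDS stage can be sketched with Spark's JDBC writer. The RDS endpoint, credentials, and table name below are placeholders; the real values come from the project's configuration.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class S3ToRdsSketch {
    public static void main(String[] args) {
        // Placeholder connection details for the AWS RDS PostgreSQL database.
        String jdbcUrl = "jdbc:postgresql://example-rds-endpoint:5432/hotel_db";
        Properties props = new Properties();
        props.setProperty("user", "example_user");
        props.setProperty("password", "example_password");
        props.setProperty("driver", "org.postgresql.Driver");

        SparkSession spark = SparkSession.builder()
                .appName("s3-to-rds")
                .getOrCreate();

        // Read one of the transformed CSV outputs from S3.
        Dataset<Row> bookingsByHotel = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("s3://example-bucket/transformed/bookings_by_hotel");

        // Build/overwrite a table in the RDS PostgreSQL database with the same data.
        bookingsByHotel.write()
                .mode(SaveMode.Overwrite)
                .jdbc(jdbcUrl, "bookings_by_hotel", props);

        spark.stop();
    }
}
```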
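
A hypothetical `HttpServlet` showing how a Tomcat endpoint could serve a table from the RDS database as HTML; the URL pattern, table name, and credentials are assumed for the example and are not the project's actual values.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet; the project's actual URL pattern and table names may differ.
@WebServlet("/tables")
public class TableQueryServlet extends HttpServlet {

    private static final String JDBC_URL =
            "jdbc:postgresql://example-rds-endpoint:5432/hotel_db";

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String table = req.getParameter("name");
        // Only allow a known table name, to avoid SQL injection.
        if (!"bookings_by_hotel".equals(table)) {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND, "Unknown table");
            return;
        }

        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body><table border='1'>");

        try (Connection conn = DriverManager.getConnection(
                     JDBC_URL, "example_user", "example_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM " + table)) {
            ResultSetMetaData meta = rs.getMetaData();
            // Render every row of the result set as an HTML table row.
            while (rs.next()) {
                out.print("<tr>");
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    out.print("<td>" + rs.getString(i) + "</td>");
                }
                out.println("</tr>");
            }
        } catch (SQLException e) {
            throw new IOException("Database query failed", e);
        }

        out.println("</table></body></html>");
    }
}
```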
Run the pipeline scripts in order, substituting the name of your EC2 key pair:

```sh
./copyAndRunSparkJobJar.sh ~/.ssh/"keypair".pem spark-job-1.0-SNAPSHOT.jar
./sendFromS3toDB.sh
./runServer.sh
```
For example, with a key pair named spark-demo.pem:

```sh
./copyAndRunSparkJobJar.sh ~/.ssh/spark-demo.pem spark-job-1.0-SNAPSHOT.jar
./sendFromS3toDB.sh
./runServer.sh
```