Project Contributors:
- Abhinivesh Palusa
- Nikhil Prabhu
Project Goals:
IMDb does not provide an official, documented API. The main goal of our project is to create an API for describing shows (titles), describing the people who work on those shows, and providing analytics over the available data (sourced from the IMDb interfaces website [1]); the analytics are updated whenever new data is inserted or existing data is modified.
Software and Hardware Components:
- AWS services:
  - S3 buckets
  - API Gateway
  - Serverless Lambda functions
  - SQS (Simple Queue Service)
  - DynamoDB
  - EC2 instances
  - CloudWatch logs
- Elasticsearch
- Kibana
- Nginx
Architectural Diagram:
The pipeline, separated into four flows:
- Flow 1 ⇒ S3 → SQS → DynamoDB
  - A batch of data pushed into S3 by a developer triggers a λ.
  - This λ pushes the data record by record into SQS, which in turn triggers a second λ.
  - The second λ writes each message/record into the relevant DynamoDB table (both λ handlers are sketched after this flow).
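The two λ handlers in this flow might look like the following Python sketch using boto3. The queue URL, the TSV input format, and the convention that the S3 key prefix names the target table are all assumptions made for illustration, not the project's actual identifiers.

```python
import csv
import io
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")

# Hypothetical queue URL; the real queue name is not specified in the report.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/imdb-records"


def s3_handler(event, context):
    """First λ: triggered by an S3 PUT, fans the uploaded batch out record-wise to SQS."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        # IMDb dataset files are tab-separated; parse each row into a dict.
        for row in csv.DictReader(io.StringIO(body), delimiter="\t"):
            # Assumed key layout "<table>/<file>", so the target table name
            # travels along with each record.
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"table": key.split("/")[0], "item": row}),
            )


def sqs_handler(event, context):
    """Second λ: triggered by SQS, writes each message into the relevant DynamoDB table."""
    for record in event["Records"]:
        message = json.loads(record["body"])
        dynamodb.Table(message["table"]).put_item(Item=message["item"])
```

Placing SQS between the two λs decouples the batch upload from the per-record writes, so DynamoDB throughput limits only slow the queue drain instead of failing the whole upload.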
- Flow 2 ⇒ API Gateway → DynamoDB → S3
  - A user requests data via API Gateway, which triggers a λ against the DynamoDB database.
  - This λ collects data from the required tables and transforms it into the relevant format.
  - The result is stored as an HTML document in S3, whose public URL is sent back as the response (a sketch follows this flow).
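A minimal sketch of the Flow 2 λ, assuming a titles table keyed on tconst and a hypothetical publicly readable bucket named imdb-api-responses; the project's real table names and bucket are not given in this report.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

BUCKET = "imdb-api-responses"  # hypothetical bucket with public read access


def handler(event, context):
    """Triggered by API Gateway; renders a title as HTML in S3 and returns its URL."""
    tconst = event["pathParameters"]["id"]  # e.g. "tt0111161"
    item = dynamodb.Table("titles").get_item(Key={"tconst": tconst}).get("Item")
    if not item:
        return {"statusCode": 404, "body": "title not found"}

    # Render the collected fields into a simple HTML document.
    rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in item.items())
    html = f"<html><body><table>{rows}</table></body></html>"

    key = f"responses/{tconst}.html"
    s3.put_object(Bucket=BUCKET, Key=key, Body=html, ContentType="text/html")

    # Respond with the object's public URL.
    return {"statusCode": 200, "body": f"https://{BUCKET}.s3.amazonaws.com/{key}"}
```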
- Flow 3 ⇒ API Gateway → DynamoDB
  - A record to be inserted or updated is posted via API Gateway, triggering a λ.
  - This λ updates the relevant record, or inserts a new one, in the corresponding DynamoDB table (a sketch follows this flow).
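The Flow 3 λ can be as small as a single upsert; the payload shape below (table and item fields) is an assumed request format, not the project's documented contract.

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")


def handler(event, context):
    """Triggered by an API Gateway POST; upserts one record into DynamoDB."""
    payload = json.loads(event["body"])
    # put_item overwrites an existing item with the same key, so one call
    # covers both the insert and the update case.
    dynamodb.Table(payload["table"]).put_item(Item=payload["item"])
    return {"statusCode": 200, "body": json.dumps({"status": "ok"})}
```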
- Flow 4 ⇒ DynamoDB → ES → Kibana [NGINX]
  - Whenever data is inserted or updated in DynamoDB, a λ is triggered.
  - This λ transforms the change into a JSON document and pushes it into one of the two available Elasticsearch indices (titles and people), as sketched after this flow.
  - These two indices back the visualizations built in Kibana.
  - NGINX forwards port 80 to Kibana's port 5601.
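A sketch of the Flow 4 λ using the elasticsearch-py 7.x client, assuming DynamoDB Streams as the trigger and a hypothetical ES_HOST; the heuristic that routes records by the presence of a tconst key is illustrative only.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://ES_HOST:9200")  # hypothetical EC2 endpoint


def handler(event, context):
    """Triggered by a DynamoDB Stream; mirrors each change into Elasticsearch."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            # Flatten DynamoDB's attribute-value format ({"S": "..."} etc.)
            # into a plain JSON document.
            doc = {k: list(v.values())[0] for k, v in new_image.items()}
            # Assumed routing rule: title records carry a tconst, people an nconst.
            index = "titles" if "tconst" in doc else "people"
            es.index(index=index, id=doc.get("tconst") or doc.get("nconst"), body=doc)
```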
Testing & Debugging:
- Designed a wrapper class for each service (S3, SQS, DynamoDB, Elasticsearch) to ease development and testing.
- Created unit tests around these service classes and API Gateway using a mini-batch of data (5 records from each of the 7 tables); one such test is sketched at the end of this list.
- Wrote integration tests for each of the 4 flows using the same kind of mini-batch.
- Wrote integration tests for the entire 4-flow architecture.
- Debugged data formats between components using CloudWatch logs.
- Used the Kibana Dev Tools console to check whether data landed in the expected format, via the Elasticsearch Index, Delete, Count, Bulk, and similar APIs.
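One of the unit tests might look like the following, using the moto library (here its 5.x @mock_aws decorator) to mock S3 locally so the service wrappers can be exercised against the mini-batch without real AWS calls; the bucket and key names are placeholders.

```python
import boto3
from moto import mock_aws


@mock_aws
def test_s3_roundtrip():
    """Store one mini-batch record in mocked S3 and read it back unchanged."""
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-bucket")

    s3.put_object(Bucket="test-bucket", Key="titles/row1", Body=b"tt0000001\tshort")
    body = s3.get_object(Bucket="test-bucket", Key="titles/row1")["Body"].read()
    assert body == b"tt0000001\tshort"
```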
Limitations & Future Work:
- Limited AWS credits (100) and limited time per session (3 hours).
- We could not store the entire 10 GB of data, nor batch-upload all tables into the AWS services and Elasticsearch on the EC2 instance; at most one table could be loaded per session.
- Perform more advanced data transformations while moving data between components, inside the corresponding λ functions.
- Add more visualizations to the Kibana dashboard.
- The data is currently limited to the 'en' language; this could be extended to all languages available on IMDb.
- Adopt container-based services (Docker & Kubernetes).
Video Links (YouTube):
- CSCI 5253 - DCSC - Final Project and Architecture Explanation
- CSCI 5253 - DCSC - Final Project Working and Implementation
References:
[1] IMDb Datasets (IMDb interfaces), https://www.imdb.com/interfaces/