A demonstration of an IoT project using AWS. I use data from IMDb to predict the best combination of director, genre, and year for a new movie.
There is a presentation and a YouTube video describing the project.
I begin with data from two sources:
- The IMDb online database. I started with the `ratings.list` file, which provides a breakdown of movie ratings for all of the films in their database.
- I enhanced this dataset using the Open Movie Database (OMDb), adding information about the director and genre of each film (a minimal example of such a query is sketched after this list).
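For illustration, here is a minimal sketch of an OMDb lookup. The response field names match the public OMDb API, but the `fetch_movie` helper and the API-key placeholder are illustrative, not part of the project:

```python
import requests

OMDB_API_KEY = "your-api-key"  # placeholder; OMDb requires a (free) API key

def fetch_movie(title):
    """Query the OMDb API for a single title and return director/genre info."""
    resp = requests.get(
        "http://www.omdbapi.com/",
        params={"t": title, "apikey": OMDB_API_KEY},
        timeout=10,
    )
    data = resp.json()
    if data.get("Response") != "True":
        return None  # title not found in OMDb
    return {
        "title": data["Title"],
        "year": data["Year"],
        "director": data["Director"],
        # OMDb returns genres as a comma-separated string, e.g. "Drama, Crime"
        "genres": [g.strip() for g in data["Genre"].split(",")],
    }

print(fetch_movie("The Godfather"))
```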
The GetMovieData notebook steps through the acquisition of the OMDb data using their API. It also combines that data with a cleaned version of the `ratings.list` data from IMDb; the cleaned version removes the header and footer from the file and turns it into a CSV. This notebook outputs two additional files containing the OMDb data: one for films with more than 5000 ratings and one for films with 1500 to 5000 ratings. It then combines all the datasets and outputs a simplified dataset containing the director, year, genre, and star rating (`/data/movie_ratings_simple.csv`).
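The combining step might look something like the following sketch; the input file names and column labels are assumptions based on the description above, not the notebook's actual code:

```python
import pandas as pd

# Hypothetical names for the two OMDb-enriched subsets described above
popular = pd.read_csv("data/movies_over_5000_ratings.csv")
midrange = pd.read_csv("data/movies_1500_to_5000_ratings.csv")

# Stack the subsets and keep only the columns used downstream
combined = pd.concat([popular, midrange], ignore_index=True)
simple = combined[["Director1", "year", "Genre1", "stars"]]
simple.to_csv("data/movie_ratings_simple.csv", index=False)
```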
I present a second notebook, MovieAnalysis that looks at the data and examines some of its features.
Finally, a third notebook, MoviePredictionTests does a first-pass attempt at predicting the rating of a movie given its director, genre, and year.
The next part of the project was to upload the data to an AWS DynamoDB database. This consists of first uploading the data to an S3 bucket, then moving it from the bucket into DynamoDB. DynamoDB requires the input data to follow this schema and format:
{"title":{"s":"Test Movie Title"},"Director1":{"s":"Director name"},"Genre1":{"s":"Genre1 Name"},"Genre2":{"s":"Genre2 Name"},"stars":{"n":"9"},"year":{"n":"2002"},"Genre3":{"s":"Genre3 Name"}}
I followed this documentation to import the data into DynamoDB.
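Individual records can also be written directly through the DynamoDB API. A minimal boto3 sketch (the table name is an assumption; note that the live API expects uppercase type keys, unlike the import file format shown above):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Write one film in DynamoDB's typed-attribute format
# ("MovieRatings" is an illustrative table name)
dynamodb.put_item(
    TableName="MovieRatings",
    Item={
        "title": {"S": "Test Movie Title"},
        "Director1": {"S": "Director name"},
        "Genre1": {"S": "Genre1 Name"},
        "Genre2": {"S": "Genre2 Name"},
        "Genre3": {"S": "Genre3 Name"},
        "stars": {"N": "9"},   # numeric values are passed as strings
        "year": {"N": "2002"},
    },
)
```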
I used the data stored in the S3 bucket to train the machine learning model, following this tutorial. I established an endpoint for the trained model so that I could query it from my IoT server node.
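As one possibility, assuming a SageMaker-style endpoint, a query from the server side might look like this (the endpoint name and the CSV feature encoding are assumptions, not the project's exact schema):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Ask the trained model to score one director/genre/year combination
response = runtime.invoke_endpoint(
    EndpointName="movie-rating-predictor",  # illustrative endpoint name
    ContentType="text/csv",
    Body="Director name,Drama,1999",
)
predicted_rating = response["Body"].read().decode("utf-8")
print(predicted_rating)
```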
I built an AWS IoT portal following the SDK tutorial. I registered and connected my Raspberry Pi 3 to the IoT portal. This involved downloading the device keys and certificates and installing them on the Raspberry Pi.
The IoT hub is set to look for prediction requests from the IoT device on the `filmrequest` topic.
The IoT device is set to listen on the MQTT topic `$aws/things/MovieSelector/shadow/update` for predictions from the AWS machine learning model.
The IoT hub is looking for new "scored" films on the `filmupdate` topic.
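Using the AWS IoT Python SDK, wiring up these topics might look like the following sketch (the endpoint hostname, credential file names, and payloads are placeholders):

```python
import json
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient

# Client ID matches the registered thing name
client = AWSIoTMQTTClient("MovieSelector")
client.configureEndpoint("your-endpoint.iot.us-east-1.amazonaws.com", 8883)
client.configureCredentials("root-CA.crt", "private.pem.key", "certificate.pem.crt")
client.connect()

def on_prediction(client_, userdata, message):
    # Predictions from the model arrive on the shadow update topic
    print("Prediction received:", message.payload)

client.subscribe("$aws/things/MovieSelector/shadow/update", 1, on_prediction)

# Ask for a prediction for one candidate combination
client.publish("filmrequest",
               json.dumps({"Director1": "Director name", "Genre1": "Drama", "year": 1999}), 1)

# Report a "scored" film back to the hub
client.publish("filmupdate",
               json.dumps({"title": "Test Movie Title", "stars": 9}), 1)
```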
The data path on AWS looks like this:
- The IoT device sends a request for a prediction. This is directed to a Lambda script that queries the trained model (a sketch of such a handler follows this list).
- The trained model returns a JSON object with its prediction. The prediction is then returned to the IoT device.
- The IoT device repeats this process several times for different combinations of inputs, looking for the best possible output.
- The IoT device then makes a decision about which set to use, then gets feedback on the "real" rating for that set of inputs. It sends the final movie information, with its rating, back to the AWS IoT hub. The hub then directs that data to two places, storing it in DynamoDB and appending it to the TSV file in S3 for future improvements to the machine learning model.
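A Lambda handler along these lines could implement the first two steps; the endpoint name, field names, and shadow payload shape are assumptions, not the project's exact code:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")  # assuming a SageMaker-style endpoint
iot = boto3.client("iot-data")

def lambda_handler(event, context):
    """Triggered by an IoT rule on the filmrequest topic; `event` carries one
    candidate combination (field names are illustrative)."""
    features = "{Director1},{Genre1},{year}".format(**event)
    response = runtime.invoke_endpoint(
        EndpointName="movie-rating-predictor",  # illustrative endpoint name
        ContentType="text/csv",
        Body=features,
    )
    prediction = response["Body"].read().decode("utf-8")

    # Route the prediction back to the device via its shadow update topic
    iot.publish(
        topic="$aws/things/MovieSelector/shadow/update",
        qos=1,
        payload=json.dumps({"state": {"desired": {"prediction": prediction}}}),
    )
```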
The Raspberry Pi uses the AWS Python SDK to interface with the AWS IoT hub. This requires gathering the certificate and private key files and saving them in the source directory on the Pi.
The Pi also needs a copy of the `movie_ratings_simple.csv` data file, from which it creates random combinations of film titles, directors, years, and genres. This is tested in the device movie generator notebook. We then run through a complete test of the system (including communication with the AWS endpoint) in the last notebook.
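The random-combination step might look like this sketch (column names are assumed to match the simplified CSV described above):

```python
import random
import pandas as pd

ratings = pd.read_csv("movie_ratings_simple.csv")

def random_combination():
    """Draw an independent director/genre/year combination from the dataset."""
    return {
        "Director1": random.choice(ratings["Director1"].dropna().tolist()),
        "Genre1": random.choice(ratings["Genre1"].dropna().tolist()),
        "year": int(random.choice(ratings["year"].dropna().tolist())),
    }

# Generate a handful of candidate films to send to the prediction endpoint
candidates = [random_combination() for _ in range(5)]
print(candidates)
```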
The final version is compiled into a single device-simulator Python script.