This project explores one way Microsoft might use limited data to guide a potential entry into the movie industry. Analyzing the runtime and production budget of highly rated movies yielded recommendations for how long an ideal movie should be and how much to set aside for its production budget.
Data Visualizations Presentation Notebook
- Business Understanding
- Data Understanding
- Analysis and Visualizations
- Conclusion
- Recommendations for the Future
- Project Info
Given the project proposed by Microsoft, dozens of questions were drafted in the brainstorming process, covering scope, accuracy, delivery, consumer, content, and cost. This analysis was narrowed down to two content and cost questions:
- What budget range should Microsoft expect to set for a highly rated movie?
- What runtime range should Microsoft expect to set for a highly rated movie?
The data used for this project was provided by Flatiron School and included several datasets from IMDB, BOM, Rotten Tomatoes, The Numbers, and the Movie Database. To narrow the scope of the analysis, only three of the provided datasets were used.
Focusing on rating data from IMDB and budget data from The Numbers (TN), we used "IMDB Title Basics", "IMDB Title Ratings", and "TN Movie Budgets". Within those datasets, we focused on these specific items:
- Movie Title
- Runtime in Minutes
- Ratings
- Budget
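As a rough illustration of how the three datasets could be combined into one table with those four items, here is a minimal pandas sketch. The column names (`tconst`, `primary_title`, `runtime_minutes`, `averagerating`, `movie`, `production_budget`), the dollar-string budget format, and the toy rows are all assumptions for the example, not necessarily what the notebook uses.

```python
import pandas as pd

# Toy stand-ins for the three tables. Column names and rows are assumed
# for illustration only; the real files may differ.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primary_title": ["Movie A", "Movie B", "Movie C"],
    "runtime_minutes": [120.0, 95.0, None],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averagerating": [7.4, 6.1, 8.2],
})
budgets = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie D"],
    "production_budget": ["$40,000,000", "$5,500,000", "$90,000,000"],
})


def combine(basics, ratings, budgets):
    """Join the three tables and keep title, runtime, rating, and budget."""
    # The two IMDB tables share an ID column (assumed to be `tconst`).
    merged = basics.merge(ratings, on="tconst", how="inner")
    # The budget table has no IMDB ID here, so match on the title instead.
    merged = merged.merge(
        budgets, left_on="primary_title", right_on="movie", how="inner"
    )
    # Turn "$40,000,000" into the integer 40000000.
    merged["production_budget"] = (
        merged["production_budget"]
        .str.replace(r"[$,]", "", regex=True)
        .astype("int64")
    )
    # Drop rows with no runtime, since runtime is one of the two questions.
    merged = merged.dropna(subset=["runtime_minutes"])
    return merged[
        ["primary_title", "runtime_minutes", "averagerating", "production_budget"]
    ]


movies = combine(basics, ratings, budgets)
```

Matching on title alone can pair up different movies that share a name, which is one of the ways narrowing the data can affect the results.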
I started with several visualizations to gain a general sense of what the data demonstrated:
This first visualization allowed us to narrow down our analysis to movies with an average rating of 7 or higher.
Seaborn in Python required cleaner data in order to generate effective visualizations, so I created bins of average ratings at intervals of 0.5, and then produced overall visualizations of the data to see if we could glean any information from them:
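The 0.5-wide rating bins could be built along these lines. This is only a sketch: the series name `averagerating` and the sample values are assumed, and labeling each bin by its lower edge is one choice among several (`pandas.cut` would be another).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of average ratings, for illustration only.
ratings = pd.Series([6.9, 7.0, 7.4, 8.0, 8.5, 9.1], name="averagerating")

# Round each rating down to the nearest 0.5, so 7.4 lands in the 7.0 bin
# and 8.5 lands in the 8.5 bin.
rating_bin = np.floor(ratings * 2) / 2

# Count how many movies fall in each bin, ordered by bin.
bin_counts = rating_bin.value_counts().sort_index()
```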
After a touch more cleanup, I noticed more trends in the data which drew me back to the analysis of the business questions: optimal runtime and optimal budget based on ratings. Using boxplots was the most effective way to demonstrate range:
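The box in a boxplot spans the first to third quartile, so the ranges read off the plots can also be checked numerically. The sketch below assumes a table with a `rating_bin` column and a `runtime_minutes` column; both names and all values are made up for the example.

```python
import pandas as pd

# Hypothetical rows: rating bin (lower edge) and runtime in minutes.
movies = pd.DataFrame({
    "rating_bin": [7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0],
    "runtime_minutes": [90, 100, 110, 120, 130, 120, 140, 160],
})


def box_range(df, column, by="rating_bin"):
    """Return the quartiles a boxplot would draw for `column`, per group."""
    quartiles = df.groupby(by)[column].quantile([0.25, 0.5, 0.75]).unstack()
    quartiles.columns = ["q1", "median", "q3"]
    return quartiles


summary = box_range(movies, "runtime_minutes")
```

The same helper could be pointed at a budget column to get the budget range for each rating bin.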
I initially anticipated this to be a quick, one-number recommendation. However, analyzing these two visualizations showed patterns by certain groupings. Given the data left out and the myriad caveats along the way, I decided to give the Microsoft decision-making team more flexibility by providing them with options.
Based on the images above, I made the following recommendations to Microsoft:
- "Good Enough" (ratings between 7-8)
  - Runtime: 100 - 125 Minutes
  - Budget: $10 - $50 million
- "Just Right" (ratings between 8-8.5)
  - Runtime: 110 - 150 Minutes
  - Budget: $25 - $100 million
- "Go for Gold" (ratings between 8.5-9)
  - Runtime: 140 - 160 Minutes
  - Budget: $100 - $150 million
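For anyone who wants to use these options programmatically, the three tiers can be written as a small lookup. The function name `recommend` is hypothetical, and because the stated rating ranges share their endpoints, the sketch assumes each tier includes its lower bound and excludes its upper bound, except the top tier, which also includes 9.

```python
# Recommended ranges copied from the three options above.
# Budgets are in millions of US dollars.
TIERS = [
    # (label, min_rating, max_rating, runtime_minutes, budget_millions)
    ("Good Enough", 7.0, 8.0, (100, 125), (10, 50)),
    ("Just Right", 8.0, 8.5, (110, 150), (25, 100)),
    ("Go for Gold", 8.5, 9.0, (140, 160), (100, 150)),
]


def recommend(target_rating):
    """Return the tier covering `target_rating`, or None if no tier does."""
    for i, (label, low, high, runtime, budget) in enumerate(TIERS):
        is_last = i == len(TIERS) - 1
        # Assumption: tiers are half-open [low, high); the last one is closed.
        if low <= target_rating < high or (is_last and target_rating == high):
            return {
                "tier": label,
                "runtime_minutes": runtime,
                "budget_millions": budget,
            }
    return None
```

For example, `recommend(8.2)` returns the "Just Right" ranges, while a target below 7 returns `None` because no recommendation was made for that range.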
Within the data given, I recommend doing further research into the following:
- Data scope - every step of narrowing down for cleaner data has an impact on our analysis outcomes.
- Nuance in data - the relationship between the number of votes and higher ratings could warrant further exploration.
- Quality control through multiple sources - more of the given data should be analyzed for a more complete picture, e.g., using Rotten Tomatoes ratings in addition to IMDB ratings.
- Data accuracy - more up-to-date data is needed on current box office income and projections, given the impact of Covid-19 and physical distancing measures on our data points.
In terms of the project itself, I recommend Microsoft do further research on the following:
- Content & Delivery - e.g., genre, platform, app
- Competition - e.g., Netflix, HBO
- Competitive Advantages - e.g., Xbox, Office
- Threats - e.g., pandemics, protests
Contributor: Alexander Newton
Languages : Python
Tools/IDE : PowerShell (Windows), Anaconda, Jupyter Notebook, Google Slides
Libraries : numpy, pandas, matplotlib, seaborn
Duration : June 2020
Last Update: 06.22.2020