Exploring the World of Cinema through Data: an ADAnalysis

Web Story : https://epfl-ada.github.io/ada-2023-project-dromadaire/

Abstract

Main characters often drive the plot and central themes of a movie, making them crucial to the audience's experience and perception. This data analysis project focuses on these pivotal figures, aiming to unravel the intricate relationship between casting choices and audience preferences for main actors within various movie genres. Utilizing datasets like the CMU Movie Corpus and IMDb, we delve into a comprehensive analysis of actor attributes such as age, gender, and career accomplishments.

Our research is guided by key questions: Do specific actor attributes correlate with higher movie ratings? How do these attributes and their impact on film success evolve over time? By filtering our dataset based on time period and primary character features, we aim to uncover patterns and trends that influence cinematic success.

This README documents our journey through this analytical process. It begins with a description of the data sources and an overview of our methodological approach, including data preprocessing, exploratory analysis, and advanced statistical techniques. We then present our findings, discussing their implications for the film industry and potential future research directions. The structure of this document is designed to provide a clear, cohesive narrative of our analytical exploration, offering insights into the fascinating world of cinema through the lens of data.

2. Research Questions

Can an actor's success in terms of awards and nominations impact the ratings of the movies in which they appear?
Do viewers tend to give higher ratings to movies in which prolific actors appear?
How do the connections between actors influence the ratings of the movies they are in?
Do physical attributes of actors, in terms of age and gender, influence the ratings of the movies they play roles in?

3. Additional Datasets

IMDb Non-Commercial Datasets: The datasets include various aspects of movie and TV shows data like titles, crew, ratings, and episode details. In this project we use datasets title.basics (basic title information), title.principals (main participants), title.ratings (user ratings), and name.basics (personnel details). Source: https://datasets.imdbws.com/
Kaggle Awards Dataset: This dataset is a scraping of the official Academy Awards, listing winners and nominees between 1927 and 2023. A typical row indicates that a given actor was nominated in a given year for a given movie and whether an oscar was won or not. Source: https://www.kaggle.com/datasets/unanimad/the-oscar-award

4. Methods

Data Preparation

In the data preparation stage, we inspected the data type of each column and converted release dates from dd/mm/yyyy to a year-only format. We dropped the columns that were not needed and renamed the remaining ones for uniformity.
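The date conversion and column cleanup could look like the following minimal sketch. The sample rows and the column names (`Movie name`, `Movie release date`, `Movie runtime`, `title`, `release_year`) are illustrative assumptions, not the project's actual schema.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions for illustration.
movies = pd.DataFrame({
    "Movie name": ["MovieName1", "MovieName2", "MovieName3"],
    "Movie release date": ["14/02/1994", "03/11/2001", None],
    "Movie runtime": [101.0, 95.0, 120.0],
})

# Parse dd/mm/yyyy strings; missing or unparseable values become NaT.
parsed = pd.to_datetime(
    movies["Movie release date"], format="%d/%m/%Y", errors="coerce"
)

# Keep only the year. The nullable Int64 dtype lets missing years stay missing
# instead of forcing the whole column to float.
movies["release_year"] = parsed.dt.year.astype("Int64")

# Drop columns that are not needed and rename the rest consistently.
movies = (
    movies.drop(columns=["Movie release date", "Movie runtime"])
          .rename(columns={"Movie name": "title"})
)
```

Using `errors="coerce"` keeps rows with unusable dates visible as missing values, so they can be counted or filtered explicitly later.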

We enriched our dataset by performing an inner join between IMDb and CMU movies datasets, using movie titles and release dates as key identifiers. This process integrated additional details such as ratings and IMDb IDs into our data.
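As a rough illustration of this join, the sketch below merges two small hypothetical tables on title and release year. The frames `cmu` and `imdb` and all column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical CMU-style movie table.
cmu = pd.DataFrame({
    "wiki_id": ["W_ID1", "W_ID2", "W_ID3"],
    "title": ["MovieName1", "MovieName1", "MovieName2"],
    "release_year": [1990, 2005, 1990],
})

# Hypothetical IMDb-style table carrying the IMDb ID and the rating.
imdb = pd.DataFrame({
    "imdb_id": ["I_ID1", "I_ID2", "I_ID4"],
    "title": ["MovieName1", "MovieName1", "MovieName9"],
    "release_year": [1990, 2005, 1990],
    "rating": [7.1, 6.4, 5.0],
})

# Inner join: only movies present in both tables survive.
# validate="one_to_one" raises if a (title, year) key is duplicated on
# either side, which would otherwise silently multiply rows.
merged = cmu.merge(
    imdb,
    on=["title", "release_year"],
    how="inner",
    validate="one_to_one",
)
```

Here `W_ID3` is dropped because no IMDb row shares its title and year, which is the expected cost of an inner join on these keys.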

Furthermore, we extracted a list of lead actors from the IMDb principals dataset and merged this with our combined movies dataframe. To enhance the data further, we also incorporated information from the Kaggle Awards dataset, adding vital features like nominations and awards.
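One possible way to extract lead actors is sketched below. The columns `tconst`, `ordering`, `nconst`, and `category` follow the IMDb title.principals file; the helper `lead_actors`, its `n_leads` parameter, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical excerpt of title.principals.
principals = pd.DataFrame({
    "tconst": ["I_ID1", "I_ID1", "I_ID1", "I_ID2"],
    "ordering": [1, 2, 3, 1],
    "nconst": ["A_ID1", "D_ID1", "A_ID2", "A_ID3"],
    "category": ["actor", "director", "actress", "actor"],
})

# Hypothetical combined movies table (result of the CMU/IMDb join).
movies = pd.DataFrame({
    "wiki_id": ["W_ID1", "W_ID2"],
    "imdb_id": ["I_ID1", "I_ID2"],
    "release_year": [1990, 2005],
})


def lead_actors(principals: pd.DataFrame, n_leads: int = 2) -> pd.DataFrame:
    """Return the first `n_leads` acting credits of each title, by billing order."""
    cast = principals[principals["category"].isin(["actor", "actress"])]
    cast = cast.sort_values(["tconst", "ordering"])
    return cast.groupby("tconst").head(n_leads)


# Attach the selected leads to the movies they appear in.
leads = lead_actors(principals, n_leads=2)
main_actors = (
    movies.merge(leads, left_on="imdb_id", right_on="tconst", how="inner")
          .drop(columns="tconst")
)
```

Raising `n_leads` widens the definition of "main actor" and therefore grows the resulting table.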

Feature Engineering and Extraction

In our datasets, we selectively retain specific features. For actors, we include an ID (from IMDb), name, age, and gender. Movie-related features include the Wikipedia ID, IMDb ID, title, release year, genres, and rating.

Additional actor features are engineered through simple statistical methods:

Awards Won: Count of Academy Awards the actor won before the role.
Nominations: Count of Academy Award nominations prior to the role.
Previous Roles: Total number of roles played by the actor beforehand.
Genre Diversity: Number of distinct genres the actor has appeared in previously.
Age at Release: Actor's age at the time of the movie's release.
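The features listed above can be sketched as follows. This is an illustration under stated assumptions: the tables `roles` and `awards`, their column names, and the helper `engineer_features` are hypothetical, and "before the role" is interpreted as a strictly earlier film year.

```python
import pandas as pd

# Hypothetical role history: one row per (actor, movie).
roles = pd.DataFrame({
    "actor_id": ["A_ID1", "A_ID1", "A_ID1", "A_ID2"],
    "release_year": [1990, 1995, 2000, 1995],
    "genres": [["Drama"], ["Drama", "Comedy"], ["Action"], ["Comedy"]],
    "birth_year": [1960, 1960, 1960, 1970],
})

# Hypothetical awards table: one row per nomination, `winner` marks a win.
awards = pd.DataFrame({
    "actor_id": ["A_ID1", "A_ID1"],
    "year_film": [1990, 1995],
    "winner": [False, True],
})


def engineer_features(roles: pd.DataFrame, awards: pd.DataFrame) -> pd.DataFrame:
    """Compute per-role features using only information from earlier years."""
    rows = []
    for role in roles.itertuples(index=False):
        same_actor_roles = roles["actor_id"] == role.actor_id
        earlier_roles = roles[same_actor_roles & (roles["release_year"] < role.release_year)]

        same_actor_awards = awards["actor_id"] == role.actor_id
        earlier_awards = awards[same_actor_awards & (awards["year_film"] < role.release_year)]

        # Distinct genres across all earlier movies of this actor.
        seen_genres = {g for genres in earlier_roles["genres"] for g in genres}

        rows.append({
            "actor_id": role.actor_id,
            "release_year": role.release_year,
            "previous_roles": len(earlier_roles),
            "genre_diversity": len(seen_genres),
            "nominations": len(earlier_awards),
            "awards_won": int(earlier_awards["winner"].sum()),
            "age_at_release": role.release_year - role.birth_year,
        })
    return pd.DataFrame(rows)


features = engineer_features(roles, awards)
```

The explicit loop favors readability over speed; on a large dataset a sorted cumulative count per actor would be a faster alternative.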

Dataframe 1: Movies

Each movie is uniquely identified by its Wiki ID. Movie names and release years can each repeat across entries, but every "name + year" combination is unique.

Table 1: Example of Movies Dataframe

Wiki ID	Movie Name	Release Year	Genres	Rating
W_ID1	MovieName1	Year1	Genre1	Rating1
W_ID2	MovieName1	Year2	Genre2	Rating2
W_ID3	MovieName2	Year1	Genre3	Rating3
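These uniqueness properties can be checked with a short sketch like the one below. It mirrors the example table, with hypothetical year values and assumed column names.

```python
import pandas as pd

# Mirrors Table 1, with hypothetical years standing in for Year1 / Year2.
movies = pd.DataFrame({
    "wiki_id": ["W_ID1", "W_ID2", "W_ID3"],
    "title": ["MovieName1", "MovieName1", "MovieName2"],
    "release_year": [1990, 2005, 1990],
})

# The Wiki ID alone must identify a movie.
wiki_id_is_unique = movies["wiki_id"].is_unique

# Titles may repeat on their own: count rows whose title appears more than once.
repeated_titles = int(movies["title"].duplicated(keep=False).sum())

# The (title, year) pair should never repeat.
duplicate_pairs = movies.duplicated(subset=["title", "release_year"], keep=False)
```

Running such a check before the join is useful because duplicated "name + year" keys would multiply rows in the merged table.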

Dataframe 2: Main Actors

Entries are uniquely identified by the combination of Wiki ID / IMDb ID and Actor ID.

Table 2: Example of Main Actors Dataframe

Wiki ID	IMDb ID	Actor ID	Ordering	Release Year	Age	Gender	Roles in Movies	Awards	Nominations	Genre Diversity	Age at Release
W_ID1	I_ID1	A_ID1	1	Year1	Age1	Gender1	NumRoles1	Award1	Nomination1	4	30
W_ID1	I_ID1	A_ID2	2	Year1	Age2	Gender2	NumRoles2	Award2	Nomination2	2	25
W_ID2	I_ID2	A_ID3	3	Year2	Age3	Gender3	NumRoles3	Award3	Nomination3	1	40

Feasibility Analysis

Upon reviewing the final datasets, we confirm the feasibility of our analysis. The required features are present, and the data volume is sufficient. We've also introduced flexibility by allowing the selection of a varying number of main actors, enabling us to expand our dataset as necessary.

Exploratory Data Analysis

In the Exploratory Data Analysis section, we first examine the actor and movie features that are essential to our analysis, including age, gender, awards, nominations, and genre diversity. This step is crucial for grasping the fundamental characteristics of our dataset. We then run a time series analysis to track how these actor features evolve over time, which offers insight into shifting trends within the film industry.
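A yearly aggregation of actor features might be sketched as follows. The frame `main_actors`, its columns, and the chosen summary statistics are assumptions for illustration.

```python
import pandas as pd

# Hypothetical main-actor rows.
main_actors = pd.DataFrame({
    "release_year": [1990, 1990, 1991, 1991, 1991],
    "gender": ["F", "M", "M", "M", "F"],
    "age_at_release": [30, 40, 26, 36, 46],
})

# One row per year: mean age, share of female leads, and number of roles.
yearly = (
    main_actors.assign(is_female=main_actors["gender"].eq("F"))
    .groupby("release_year")
    .agg(
        mean_age=("age_at_release", "mean"),
        female_share=("is_female", "mean"),
        n_roles=("gender", "size"),
    )
)
```

Keeping `n_roles` next to the averages makes it easy to spot years where a trend rests on very few observations.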

Diving into the Research Questions

In this section, we track actor feature trends over time and distinguish between high-rated and low-rated films to identify key attributes linked to film success. Our analysis includes correlating actor features with movie ratings, conducting a temporal examination to observe how these correlations evolve, and comparing actor data from differently rated movies to discern patterns that signal cinematic success.
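The three steps can be illustrated with a minimal sketch. The data are made up, Pearson correlation and a median split are assumed choices rather than the project's confirmed method, and all column names are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one lead actor feature and the movie rating.
data = pd.DataFrame({
    "release_year": [1985, 1987, 1989, 1992, 1994, 1998],
    "nominations": [0, 1, 3, 2, 0, 1],
    "rating": [5.0, 6.0, 8.0, 5.5, 7.5, 6.5],
})

# 1) Overall correlation between the feature and the rating (Pearson by default).
overall = data["nominations"].corr(data["rating"])

# 2) Temporal view: the same correlation computed within each decade.
data["decade"] = data["release_year"] // 10 * 10
by_decade = data.groupby("decade")[["nominations", "rating"]].apply(
    lambda g: g["nominations"].corr(g["rating"])
)

# 3) High- vs low-rated movies: split at the median rating and compare means.
threshold = data["rating"].median()
data["group"] = data["rating"].gt(threshold).map({True: "high", False: "low"})
group_means = data.groupby("group")["nominations"].mean()
```

In this toy data the sign of the correlation flips between decades, which shows why a single overall coefficient can hide changes over time.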

6. Project Overview

7. Team Organization

Armance Novel: Feature extraction, Data Visualisation, Feasibility analysis, README
Emeline Debalme: Explore analysis, Cross-analysis, Machine Learning, Data-story
Théo Houle: Explore analysis, Cross-analysis, Data-story, README
Kelan Solomon: Feature extraction, Data Visualisation, Machine Learning, Feasibility analysis
Dimitri Jacquemont: Explore analysis, Cross-analysis, Machine Learning, Data-story
