Web Story: https://epfl-ada.github.io/ada-2023-project-dromadaire/
- 1. Abstract
- 2. Research Questions
- 3. Additional Datasets
- 4. Methods
- 5. Further Analysis
- 6. Project Overview
- 7. Team Organization
Main characters often drive the plot and central themes of a movie, making them crucial to the audience's experience and perception. This data analysis project focuses on these pivotal figures, aiming to unravel the intricate relationship between casting choices and audience preferences for main actors within various movie genres. Utilizing datasets like the CMU Movie Corpus and IMDb, we delve into a comprehensive analysis of actor attributes such as age, gender, and career accomplishments.
Our research is guided by key questions: Do specific actor attributes correlate with higher movie ratings? How do these attributes and their impact on film success evolve over time? By filtering our dataset based on time period and primary character features, we aim to uncover patterns and trends that influence cinematic success.
This README documents our journey through this analytical process. It begins with a description of the data sources and an overview of our methodological approach, including data preprocessing, exploratory analysis, and advanced statistical techniques. We then present our findings, discussing their implications for the film industry and potential future research directions. The structure of this document is designed to provide a clear, cohesive narrative of our analytical exploration, offering insights into the fascinating world of cinema through the lens of data.
- Can an actor's success in terms of awards and nominations impact the ratings of the movies in which they appear?
- Do viewers tend to rate higher movies in which prolific actors appear?
- How do the connections between actors influence the ratings of the movies they are in?
- Do physical attributes of actors, in terms of age and gender, influence the ratings of the movies they play roles in?
- IMDb Non-Commercial Datasets: These datasets cover various aspects of movie and TV show data, such as titles, crew, ratings, and episode details. In this project we use title.basics (basic title information), title.principals (main participants), title.ratings (user ratings), and name.basics (personnel details). Source: https://datasets.imdbws.com/
- Kaggle Awards Dataset: This dataset was scraped from the official Academy Awards data and lists winners and nominees between 1927 and 2023. A typical row indicates that a given actor was nominated in a given year for a given movie, and whether an Oscar was won or not. Source: https://www.kaggle.com/datasets/unanimad/the-oscar-award
In the data preparation stage, our team meticulously analyzed and transformed various data types, converting date formats from dd/mm/yyyy to year-only format. We eliminated columns that were not needed and renamed the remaining ones for uniformity.
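The date conversion described above could look roughly like the following pandas sketch. The column names (`movie_name`, `release_date`, `release_year`) and the helper `to_release_year` are hypothetical and chosen for illustration; the dd/mm/yyyy input format is taken from the description above.

```python
import pandas as pd


def to_release_year(dates: pd.Series) -> pd.Series:
    """Convert dd/mm/yyyy strings to a nullable integer year column."""
    # Unparseable or missing dates become NaT instead of raising an error.
    parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
    # "Int64" (capital I) is pandas' nullable integer dtype, so missing years stay <NA>.
    return parsed.dt.year.astype("Int64")


# Toy data with hypothetical column names.
movies = pd.DataFrame({
    "movie_name": ["Film A", "Film B", "Film C"],
    "release_date": ["15/06/1994", "01/12/2001", None],
})
movies["release_year"] = to_release_year(movies["release_date"])
# Drop the column that is no longer needed.
movies = movies.drop(columns=["release_date"])
```

Using a nullable integer type keeps years as integers even when some release dates are missing, which matters later when the year is used as a join key.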
We enriched our dataset by performing an inner join between IMDb and CMU movies datasets, using movie titles and release dates as key identifiers. This process integrated additional details such as ratings and IMDb IDs into our data.
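As an illustration of this join, here is a minimal sketch on toy data. The frame and column names (`cmu`, `imdb`, `wiki_id`, `imdb_id`, `movie_name`, `release_year`, `rating`) are assumptions, not the project's actual schema.

```python
import pandas as pd

# Toy stand-in for the CMU movie metadata (hypothetical columns).
cmu = pd.DataFrame({
    "wiki_id": [1, 2, 3],
    "movie_name": ["Film A", "Film A", "Film B"],
    "release_year": [1994, 2010, 2001],
})
# Toy stand-in for IMDb titles already joined with their ratings.
imdb = pd.DataFrame({
    "imdb_id": ["tt01", "tt02", "tt03"],
    "movie_name": ["Film A", "Film B", "Film C"],
    "release_year": [1994, 2001, 2005],
    "rating": [7.8, 6.1, 5.4],
})

# Inner join on (title, year): only movies present in both sources survive.
# validate="one_to_one" raises if a (title, year) pair is duplicated on either side.
merged = cmu.merge(
    imdb,
    on=["movie_name", "release_year"],
    how="inner",
    validate="one_to_one",
)
```

Joining on both title and year keeps remakes that share a title apart, and the `validate` check surfaces ambiguous matches early rather than silently duplicating rows.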
Furthermore, we extracted a list of lead actors from the IMDb principals dataset and merged this with our combined movies dataframe. To enhance the data further, we also incorporated information from the Kaggle Awards dataset, adding vital features like nominations and awards.
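One plausible way to pick lead actors from title.principals is to keep the first acting credits by billing order. The sketch below assumes that approach; the source does not spell out the exact selection rule. `tconst`, `ordering`, `nconst`, and `category` are columns of the IMDb principals file, while `lead_actors` and `n_main` are hypothetical names.

```python
import pandas as pd

# Toy excerpt shaped like title.principals.
principals = pd.DataFrame({
    "tconst": ["tt01", "tt01", "tt01", "tt02", "tt02"],
    "ordering": [1, 2, 3, 1, 2],
    "nconst": ["nm1", "nm2", "nm3", "nm4", "nm1"],
    "category": ["actor", "director", "actress", "actress", "actor"],
})


def lead_actors(principals: pd.DataFrame, n_main: int = 1) -> pd.DataFrame:
    """Keep the first n_main acting credits of each title, by billing order."""
    cast = principals[principals["category"].isin(["actor", "actress"])]
    cast = cast.sort_values(["tconst", "ordering"])
    # head(n) on a groupby keeps the first n rows of every group.
    return cast.groupby("tconst").head(n_main).reset_index(drop=True)


top1 = lead_actors(principals, n_main=1)
top2 = lead_actors(principals, n_main=2)
```

Exposing the number of actors as a parameter makes it easy to widen the selection later if more data is needed.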
In our datasets, we selectively retain specific features. For actors, we include an ID (from IMDb), name, age, and gender. Movie-related features include the Wikipedia ID, IMDb ID, title, release year, genres, and rating.
Additional actor features are engineered through simple statistical methods:
- Awards Won: Count of Academy Awards the actor won before the role.
- Nominations: Count of Academy Award nominations prior to the role.
- Previous Roles: Total number of roles played by the actor beforehand.
- Genre Diversity: Variety of genres the actor has appeared in previously.
- Age at Release: Actor's age at the time of the movie's release.
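The features above could be derived along the following lines. This is a sketch on toy data with hypothetical column names (`actor_id`, `wiki_id`, `release_year`, `birth_year`, `ceremony_year`, `winner`). It covers four of the five features and leaves out genre diversity for brevity.

```python
import pandas as pd

# One row per (actor, movie) pair; hypothetical columns.
roles = pd.DataFrame({
    "actor_id": ["a1", "a1", "a1", "a2"],
    "wiki_id": [10, 11, 12, 12],
    "release_year": [1990, 1995, 2000, 2000],
    "birth_year": [1960, 1960, 1960, 1975],
})
# One row per nomination; winner marks whether the award was won.
oscars = pd.DataFrame({
    "actor_id": ["a1", "a1", "a2"],
    "ceremony_year": [1991, 1996, 2001],
    "winner": [False, True, True],
})

roles = roles.sort_values(["actor_id", "release_year"]).reset_index(drop=True)
# Previous roles: position of the role in the actor's chronological filmography.
# Note: roles released in the same year are ordered arbitrarily by this count.
roles["previous_roles"] = roles.groupby("actor_id").cumcount()
roles["age_at_release"] = roles["release_year"] - roles["birth_year"]

# Pair each role with the actor's nominations, keeping only strictly earlier ones.
pairs = roles.merge(oscars, on="actor_id", how="inner")
earlier = pairs[pairs["ceremony_year"] < pairs["release_year"]]
counts = earlier.groupby(["actor_id", "wiki_id"]).agg(
    nominations=("winner", "size"),
    awards_won=("winner", "sum"),
).reset_index()

# Roles without any earlier nomination get zero for both counts.
roles = roles.merge(counts, on=["actor_id", "wiki_id"], how="left")
cols = ["nominations", "awards_won"]
roles[cols] = roles[cols].fillna(0).astype(int)
```

Counting only nominations dated strictly before the release year keeps each feature limited to information available at the time of the role, which avoids leaking later success into earlier movies.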
Dataframe 1: Movies
Each movie is uniquely identified by its WikiID but might have overlapping names or release years. Unique combinations are formed by "name + year."
Wiki ID | Movie Name | Release Year | Genres | Rating |
---|---|---|---|---|
W_ID1 | MovieName1 | Year1 | Genre1 | Rating1 |
W_ID2 | MovieName1 | Year2 | Genre2 | Rating2 |
W_ID3 | MovieName2 | Year1 | Genre3 | Rating3 |
Dataframe 2: Main Actors
Entries are uniquely identified by the combination of Wiki ID / IMDb ID and Actor ID.
**Wiki ID** | IMDb ID | Actor ID | Ordering | Release Year | Age | Gender | Roles in Movies | Awards | Nominations | Genre Diversity | Age at Release |
---|---|---|---|---|---|---|---|---|---|---|---|
W_ID1 | I_ID1 | A_ID1 | 1 | Year1 | Age1 | Gender1 | NumRoles1 | Award1 | Nomination1 | 4 | 30 |
W_ID1 | I_ID1 | A_ID2 | 2 | Year1 | Age2 | Gender2 | NumRoles2 | Award2 | Nomination2 | 2 | 25 |
W_ID2 | I_ID2 | A_ID3 | 3 | Year2 | Age3 | Gender3 | NumRoles3 | Award3 | Nomination3 | 1 | 40 |
Upon reviewing the final datasets, we confirm the feasibility of our analysis. The required features are present, and the data volume is sufficient. We've also introduced flexibility by allowing the selection of a varying number of main actors, enabling us to expand our dataset as necessary.
In the Exploratory Data Analysis section, our initial focus is on thoroughly examining the actor and movie features essential to our analysis, including age, gender, awards, nominations, and genre diversity. This step is crucial for grasping the fundamental characteristics of our dataset. Following this, we engage in a time series analysis to track the evolution of these actor features over time. This approach offers valuable insights into the shifting trends within the film industry.
In this section, we track actor feature trends over time and distinguish between high-rated and low-rated films to identify key attributes linked to film success. Our analysis includes correlating actor features with movie ratings, conducting a temporal examination to observe how these correlations evolve, and comparing actor data from differently rated movies to discern patterns that signal cinematic success.
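The two comparisons described here can be sketched as follows. The Pearson coefficient, the decade bins, and the median split into high-rated and low-rated movies are illustrative assumptions rather than the project's confirmed choices, and the data and column names are toy placeholders.

```python
import pandas as pd

# Toy actor-level table: one feature and the rating of the movie.
actors = pd.DataFrame({
    "release_year": [1985, 1987, 1989, 2005, 2006, 2008],
    "nominations": [0, 1, 3, 2, 1, 0],
    "rating": [5.0, 6.0, 8.0, 5.5, 6.5, 7.5],
})

# Temporal view: bin release years into decades.
actors["decade"] = (actors["release_year"] // 10) * 10

# Pearson correlation between the feature and the rating within each decade.
corr_by_decade = {
    decade: group["nominations"].corr(group["rating"])
    for decade, group in actors.groupby("decade")
}

# High- vs low-rated comparison: split at the median rating (an assumed threshold).
threshold = actors["rating"].median()
actors["high_rated"] = actors["rating"] > threshold
mean_by_group = actors.groupby("high_rated")["nominations"].mean()
```

Computing the correlation per period rather than once over the whole dataset shows whether a feature's link to ratings strengthens, weakens, or reverses over time.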
- Armance Novel: Feature extraction, Data Visualisation, Feasibility analysis, README
- Emeline Debalme: Exploratory analysis, Cross-analysis, Machine Learning, Data-story
- Théo Houle: Exploratory analysis, Cross-analysis, Data-story, README
- Kelan Solomon: Feature extraction, Data Visualisation, Machine Learning, Feasibility analysis
- Dimitri Jacquemont: Exploratory analysis, Cross-analysis, Machine Learning, Data-story