ada-2023-project-dromadaire created by GitHub Classroom

DJacquemont/ada-2023-project-dromadaire

Exploring the World of Cinema through Data: an ADAnalysis

Web Story : https://epfl-ada.github.io/ada-2023-project-dromadaire/

1. Abstract

Main characters often drive the plot and central themes of a movie, making them crucial to the audience's experience and perception. This data analysis project focuses on these pivotal figures and examines the relationship between casting choices and audience preferences for main actors across movie genres. Using datasets such as the CMU Movie Corpus and IMDb, we analyze actor attributes such as age, gender, and career accomplishments.

Our research is guided by key questions: Do specific actor attributes correlate with higher movie ratings? How do these attributes and their impact on film success evolve over time? By filtering our dataset based on time period and primary character features, we aim to uncover patterns and trends that influence cinematic success.

This README documents our analytical process. It begins with a description of the data sources and an overview of our methodological approach, including data preprocessing, exploratory analysis, and advanced statistical techniques. We then present our findings and discuss their implications for the film industry and potential future research directions.

2. Research Questions

  • Can an actor's success in terms of awards and nominations impact the ratings of the movies in which they appear?

  • Do viewers tend to give higher ratings to movies in which prolific actors appear?

  • How do the connections between actors influence the ratings of the movies they are in?

  • Do attributes of actors such as age and gender influence the ratings of the movies in which they appear?

3. Additional Datasets

  • IMDb Non-Commercial Datasets: These datasets cover various aspects of movie and TV show data, such as titles, crew, ratings, and episode details. In this project we use title.basics (basic title information), title.principals (main participants), title.ratings (user ratings), and name.basics (personnel details). Source: https://datasets.imdbws.com/

  • Kaggle Awards Dataset: This dataset is a scrape of the official Academy Awards data, listing winners and nominees between 1927 and 2023. A typical row indicates that a given actor was nominated in a given year for a given movie, and whether or not they won the Oscar. Source: https://www.kaggle.com/datasets/unanimad/the-oscar-award

4. Methods

Data Preparation

In the data preparation stage, we inspected the data type of each column and converted it where needed, for example reducing dates in dd/mm/yyyy format to the year only. We dropped columns that were not needed and renamed the remaining ones for consistency.
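The date conversion described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code: the function name `to_release_year` is hypothetical, and handling inputs other than dd/mm/yyyy (a bare year, a missing value) is an assumption about what the raw data may contain.

```python
import re
from typing import Optional


def to_release_year(raw: Optional[str]) -> Optional[int]:
    """Return the four-digit year found in a date string, or None.

    Assumption: the year is the only run of exactly four digits,
    as in "24/08/2001" or a bare "2001".
    """
    if not raw:
        return None
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", raw)
    return int(match.group(1)) if match else None


# Example: a full date and a year-only value both reduce to an integer year.
years = [to_release_year(value) for value in ["24/08/2001", "1999", ""]]
```

Returning None for unparseable values keeps missing dates explicit, so they can be filtered out before any join on the release year.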

We enriched our dataset by performing an inner join between the IMDb and CMU movie datasets, using movie titles and release dates as join keys. This step added details such as ratings and IMDb IDs to our data.
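A minimal sketch of such an inner join, written with plain dictionaries so it runs without dependencies. The field names (`name`, `release_year`, `title`, `year`, `imdb_id`, `rating`) are hypothetical, and the title normalization (case and whitespace) is an assumption added for illustration; the source does not say whether titles were normalized before matching.

```python
def normalize_title(title):
    """Lowercase a title and collapse whitespace (an assumed cleaning step)."""
    return " ".join(title.casefold().split())


def inner_join_movies(cmu_rows, imdb_rows):
    """Keep only CMU movies that have an IMDb match on (title, year)."""
    imdb_by_key = {}
    for row in imdb_rows:
        key = (normalize_title(row["title"]), row["year"])
        imdb_by_key.setdefault(key, []).append(row)

    merged = []
    for row in cmu_rows:
        key = (normalize_title(row["name"]), row["release_year"])
        # Rows without a match are dropped, which is what makes the join "inner".
        for match in imdb_by_key.get(key, []):
            merged.append(
                {**row, "imdb_id": match["imdb_id"], "rating": match["rating"]}
            )
    return merged
```

Because the join is inner, any movie missing from either source disappears from the result, so it is worth comparing row counts before and after the merge.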

Furthermore, we extracted a list of lead actors from the IMDb principals dataset and merged it with our combined movies dataframe. We also incorporated information from the Kaggle Awards dataset, adding features such as nominations and awards.
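One way to extract lead actors from title.principals is sketched below. The column names `tconst`, `ordering`, `nconst`, and `category` are those of the IMDb file; the function name `lead_actors`, the filter on acting categories, and the default of two leads per movie are assumptions made for illustration, not a description of the project's exact rule.

```python
from collections import defaultdict

# Assumption: only these credit categories count as acting roles.
ACTING_CATEGORIES = {"actor", "actress"}


def lead_actors(principals, n_main=2):
    """Return the first n_main acting credits of each movie, by billing order."""
    by_movie = defaultdict(list)
    for row in principals:
        if row["category"] in ACTING_CATEGORIES:
            by_movie[row["tconst"]].append(row)

    leads = []
    for rows in by_movie.values():
        # "ordering" is read from a text file, so compare it numerically.
        rows.sort(key=lambda r: int(r["ordering"]))
        leads.extend(rows[:n_main])
    return leads
```

Raising `n_main` widens the set of actors kept per movie, which enlarges the dataset at the cost of including less prominent roles.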

Feature Engineering and Extraction

In our datasets, we selectively retain specific features. For actors, we include an ID (from IMDb), name, age, and gender. Movie-related features include the Wikipedia ID, IMDb ID, title, release year, genres, and rating.

Additional actor features are engineered through simple statistical methods:

  • Awards Won: Count of Academy Awards the actor won before the role.
  • Nominations: Count of Academy Award nominations prior to the role.
  • Previous Roles: Total number of roles played by the actor beforehand.
  • Genre Diversity: Variety of genres the actor has appeared in previously.
  • Age at Release: Actor's age at the time of the movie's release.
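The engineered features listed above could be computed along these lines. This is a sketch under stated assumptions: the record layouts and the name `actor_features` are hypothetical, "before the role" is taken to mean a strictly earlier year, and genre diversity is taken to be the number of distinct genres in earlier roles.

```python
def actor_features(actor_id, birth_year, release_year, roles, nominations):
    """Compute career features for one actor as of one movie's release year.

    roles:       dicts with keys "actor_id", "year", "genres" (list of str)
    nominations: dicts with keys "actor_id", "year", "winner" (bool)
    """
    # Assumption: only events from strictly earlier years count as "prior".
    past_roles = [
        r for r in roles
        if r["actor_id"] == actor_id and r["year"] < release_year
    ]
    past_noms = [
        n for n in nominations
        if n["actor_id"] == actor_id and n["year"] < release_year
    ]
    return {
        "awards_won": sum(1 for n in past_noms if n["winner"]),
        "nominations": len(past_noms),
        "previous_roles": len(past_roles),
        "genre_diversity": len({g for r in past_roles for g in r["genres"]}),
        "age_at_release": release_year - birth_year,
    }
```

Restricting every count to earlier years avoids leaking information from the movie being rated, or from later in the actor's career, into its own features.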

Dataframe 1: Movies

Each movie is uniquely identified by its Wiki ID. Movie names and release years can each repeat across movies, but the combination "name + year" is unique.

Table 1: Example of Movies Dataframe
| Wiki ID | Movie Name | Release Year | Genres | Rating |
|---|---|---|---|---|
| W_ID1 | MovieName1 | Year1 | Genre1 | Rating1 |
| W_ID2 | MovieName1 | Year2 | Genre2 | Rating2 |
| W_ID3 | MovieName2 | Year1 | Genre3 | Rating3 |

Dataframe 2: Main Actors

Entries are uniquely identified by the combination of Wiki ID / IMDb ID and Actor ID.

Table 2: Example of Main Actors Dataframe
| Wiki ID | IMDb ID | Actor ID | Ordering | Release Year | Age | Gender | Roles in Movies | Awards | Nominations | Genre Diversity | Age at Release |
|---|---|---|---|---|---|---|---|---|---|---|---|
| W_ID1 | I_ID1 | A_ID1 | 1 | Year1 | Age1 | Gender1 | NumRoles1 | Award1 | Nomination1 | 4 | 30 |
| W_ID1 | I_ID1 | A_ID2 | 2 | Year1 | Age2 | Gender2 | NumRoles2 | Award2 | Nomination2 | 2 | 25 |
| W_ID2 | I_ID2 | A_ID3 | 3 | Year2 | Age3 | Gender3 | NumRoles3 | Award3 | Nomination3 | 1 | 40 |
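The uniqueness claims for both dataframes (name + year for movies, Wiki ID / IMDb ID + Actor ID for main actors) can be checked with a small helper. This is an illustrative sketch; the function name `duplicate_keys` and the field names in the example are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return every key combination that appears in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Example: the same actor listed twice for one movie is reported once.
actors = [
    {"wiki_id": "W_ID1", "actor_id": "A_ID1"},
    {"wiki_id": "W_ID1", "actor_id": "A_ID2"},
    {"wiki_id": "W_ID1", "actor_id": "A_ID1"},
]
clashes = duplicate_keys(actors, ["wiki_id", "actor_id"])
```

An empty result means the chosen columns form a valid key for the table.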

Feasibility Analysis

Upon reviewing the final datasets, we confirm the feasibility of our analysis. The required features are present, and the data volume is sufficient. We've also introduced flexibility by allowing the selection of a varying number of main actors, enabling us to expand our dataset as necessary.

Exploratory Data Analysis

In the Exploratory Data Analysis section, we first examine the actor and movie features that are central to our analysis: age, gender, awards, nominations, and genre diversity. This step establishes the basic characteristics of our dataset. We then perform a time series analysis to track how these actor features evolve over time, which reveals shifting trends within the film industry.
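As an illustration of the time series step, the sketch below averages one actor feature per release year. The aggregation by yearly mean, the function name `yearly_mean`, and the field names are assumptions; the source does not specify which statistic is tracked.

```python
from collections import defaultdict
from statistics import mean


def yearly_mean(rows, feature, year_field="release_year"):
    """Average a numeric feature per year, skipping missing values."""
    by_year = defaultdict(list)
    for row in rows:
        value = row.get(feature)
        if value is not None:
            by_year[row[year_field]].append(value)
    # Sorting by year gives a series that can be plotted directly.
    return {year: mean(values) for year, values in sorted(by_year.items())}
```

Years with very few movies produce noisy averages, so a minimum count per year or a rolling window may be worth adding before interpreting trends.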

Diving into the Research Questions

In this section, we track actor feature trends over time and distinguish between high-rated and low-rated films to identify key attributes linked to film success. Our analysis includes correlating actor features with movie ratings, conducting a temporal examination to observe how these correlations evolve, and comparing actor data from differently rated movies to discern patterns that signal cinematic success.
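The two comparisons described above can be sketched as follows. The choice of the Pearson coefficient, the rating threshold of 7.0 separating high-rated from low-rated films, and the names `pearson` and `compare_by_rating` are assumptions made for illustration; the source does not state which coefficient or cutoff is used.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    if len(xs) != len(ys) or len(xs) < 2:
        return None
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def compare_by_rating(rows, feature, threshold=7.0):
    """Mean of a feature in high-rated vs low-rated movies (assumed cutoff)."""
    high = [r[feature] for r in rows if r["rating"] >= threshold]
    low = [r[feature] for r in rows if r["rating"] < threshold]
    return {
        "high": mean(high) if high else None,
        "low": mean(low) if low else None,
    }
```

Applying `pearson` separately to the rows of each decade, for example, would give the temporal view of how a correlation evolves. A correlation found this way does not by itself show that an actor attribute causes higher ratings.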

6. Project Overview



7. Team Organization

  • Armance Novel: Feature extraction, Data Visualisation, Feasibility analysis, README

  • Emeline Debalme: Exploratory analysis, Cross-analysis, Machine Learning, Data-story

  • Théo Houle: Exploratory analysis, Cross-analysis, Data-story, README

  • Kelan Solomon: Feature extraction, Data Visualisation, Machine Learning, Feasibility analysis

  • Dimitri Jacquemont: Exploratory analysis, Cross-analysis, Machine Learning, Data-story
