Welcome to the Tokyo Olympics Data Analysis Project! This project dives deep into analyzing data from the 2021 (2020) Tokyo Olympics using SQL and Python. The dataset contains comprehensive details about athletes, coaches, teams, disciplines, and gender-based entries.
This project showcases:
- 📊 Relational database creation using SQLite.
- ⚖️ Advanced SQL queries to uncover insights.
- 🎨 Visualizations with Python's
matplotlib
andplotly
libraries.
The project is structured as follows:
Tokyo-Olympics-Data/
├── Data/ # Raw CSV files
├── Plots/ # Visualizations generated by the scripts
├── SQL/ # SQL scripts
│ ├── create_table.sql # SQL script to create database tables
├── src/ # Python scripts
│ ├── execute_sql.py # Create ddbb structure
│ ├── load_data.py # Script to load data into SQLite database
│ ├── test_Medals_table.py # Script to test Medals table or others
│ ├── analysis.py # Execute SQL queries and save results to CSV
│ ├── visualizations.py # Generate visualizations from analysis results
└── README.md # Project documentation (this file)
└── olympics.db # SQlite Database
-
Athletes
:- Columns:
PersonName
,Country
,Discipline
- Description: Contains information about the athletes and their disciplines.
- Columns:
-
Coaches
:- Columns:
Name
,Country
,Discipline
,Event
- Description: Contains information about the coaches.
- Columns:
-
EntriesGender
:- Columns:
Discipline
,Female
,Male
,Total
- Description: Records the number of entries by gender in each discipline.
- Columns:
-
Medals
:- Columns:
Rank
,TeamCountry
,Gold
,Silver
,Bronze
,Total
,Rank_by_Total
- Description: Tracks medal counts for each country.
- Columns:
-
Teams
:- Columns:
TeamName
,Discipline
,Country
,Event
- Description: Contains information about the teams and events.
- Columns:
create_table.sql
: Defines the schema for the above tables.queries.sql
: Includes advanced SQL queries for insights such as:- 🏅 Countries with the most medals.
- 🎈 Distribution of medals by type.
- 🏋️ Countries participating in the highest number of disciplines.
- 🏊 Most competitive sports (sports with the highest number of participating countries).
The load_data.py
script loads data from the raw CSV files into the SQLite database. Steps:
- Define the path to the dataset.
- Load the data into a relational SQLite database.
- Create tables as per the schema defined in
create_tables.sql
.
The analysis.py
script executes advanced SQL queries to extract insights and saves the results as CSV files in the output/
directory.
Examples of SQL queries included:
-
Top Countries by Total Medals:
SELECT TeamCountry, SUM(Gold + Silver + Bronze) AS TotalMedals FROM Medals GROUP BY TeamCountry ORDER BY TotalMedals DESC LIMIT 10;
-
Distribution of Medal Types:
SELECT SUM(Gold) AS TotalGold, SUM(Silver) AS TotalSilver, SUM(Bronze) AS TotalBronze FROM Medals;
-
Countries Participating in the Most Disciplines:
SELECT Country, COUNT(DISTINCT Discipline) AS TotalDisciplines FROM Teams GROUP BY Country ORDER BY TotalDisciplines DESC LIMIT 10;
The visualizations.py
script generates plots to visually represent the data and analysis results.
-
🏅 Top 10 Countries by Total Medals:
-
🎈 Medal Type Distribution:
-
🏋️ Most Successful Countries Rank:
-
🏊 Most Competitive Sports:
- SQLite:
- Relational database for data storage and querying.
- Python:
- Libraries:
pandas
,matplotlib
,plotly
,sqlite3
.
- Libraries:
- Matplotlib & Plotly:
- For generating visualizations.
- Clone this repository:
git clone <repository_url>
- Navigate to the project directory:
cd Tokyo-Olympics-Data
- Install dependencies:
pip install -r requirements.txt
- Run the scripts in the following order:
- Load the data:
python src/load_data.py
- Run the analysis:
python src/analysis.py
- Generate visualizations:
python src/visualizations.py
- Load the data:
- ⚙️ Incorporate machine learning models for predictive analytics (e.g., predicting medal counts).
- 🔢 Add more visualizations and interactive dashboards.
- 🔎 Expand the dataset with more detailed athlete and event data.
- Dataset Source: Kaggle.
- Author: ILyd Bautista.