Data_Wrangling_Assignment #411

Open
wants to merge 1 commit into
base: master
19 changes: 19 additions & 0 deletions Data Wrangling Assignment/.vscode/settings.json
@@ -0,0 +1,19 @@
{
    "sqltools.connections": [
        {
            "mysqlOptions": {
                "authProtocol": "xprotocol",
                "enableSsl": "Disabled"
            },
            "previewLimit": 50,
            "server": "localhost",
            "port": 33060,
            "driver": "MySQL",
            "name": "TMA_data",
            "database": "TMA_data",
            "username": "root",
            "password": "Toor",
            "connectionTimeout": 50
        }
    ]
}
5 changes: 5 additions & 0 deletions Data Wrangling Assignment/250justify.txt
@@ -0,0 +1,5 @@
Step 3: Justification for Structure (in 250 words)

The easy_data table is structured to facilitate further analysis and visualization by breaking down composite fields into distinct, interpretable columns. Each offer and acceptance metric (total, senior, and inclusive) is stored in its own column, which makes the data simpler to query. Because each column now holds a single value, the table is clearer, and statistical methods or visualizations such as bar charts, pie charts, or trend lines of offer and acceptance rates can be applied directly.

Additionally, storing Office, Department, and headcount as individual columns allows the data to be grouped and filtered by office location or department, which is essential for comparative analysis. Converting textual numbers into numeric data types enables efficient calculations and aggregations and reduces the risk of errors in later stages of analysis. This structure is convenient to use with tools like pandas and Matplotlib, or with BI platforms, for further insights.
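
As a rough illustration of the restructuring described above, the sketch below splits a composite field into separate integer columns and then aggregates by office. It assumes each composite value has the form `total|senior|inclusive`, which is inferred from the column names used elsewhere in this PR. The sample rows and the helper `split_metric` are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical sample rows; the real TMA_data values are not shown in this PR.
raw_rows = [
    {"Office": "Singapore", "Department": "IT Systems", "offers": "10|3|5"},
    {"Office": "Singapore", "Department": "Operations", "offers": "7|2|1"},
    {"Office": "Tokyo", "Department": "Customer Service", "offers": "4|1|0"},
]


def split_metric(value, prefix):
    """Split an assumed 'total|senior|inclusive' string into three integer fields."""
    total, senior, inclusive = (int(part) for part in value.split("|"))
    return {
        f"{prefix}_total": total,
        f"{prefix}_senior": senior,
        f"{prefix}_inclusive": inclusive,
    }


# One value per column: the composite string becomes three numeric fields.
easy_rows = []
for row in raw_rows:
    flat = {"Office": row["Office"], "Department": row["Department"]}
    flat.update(split_metric(row["offers"], "offers"))
    easy_rows.append(flat)

# With numeric columns, grouping by office is a plain sum.
offers_by_office = defaultdict(int)
for row in easy_rows:
    offers_by_office[row["Office"]] += row["offers_total"]

print(dict(offers_by_office))
```

The same idea carries over to pandas or SQL; plain Python is used here only to keep the sketch free of dependencies.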
Binary file not shown.
374 changes: 374 additions & 0 deletions Data Wrangling Assignment/Data_Wrangling.ipynb

Large diffs are not rendered by default.

146 changes: 146 additions & 0 deletions Data Wrangling Assignment/Data_Wrangling.py
@@ -0,0 +1,146 @@
# Import necessary libraries
import mysql.connector
import pandas as pd
import matplotlib.pyplot as plt

# Function to establish connection to MySQL database
def connect_to_db():
    try:
        connection = mysql.connector.connect(
            host='localhost',      # Replace with your host
            user='root',           # Replace with your MySQL username
            password='Toor',       # Replace with your MySQL password
            database='TMA_data'    # Replace with your database name
        )
        if connection.is_connected():
            print("Connected to MySQL Server")
            return connection
    except mysql.connector.Error as e:
        print(f"Error: {e}")
        return None

# Function to create the easy_data table based on TMA_data table
def create_easy_data_table():
    connection = connect_to_db()
    if connection:
        cursor = connection.cursor()
        # Drop the easy_data table if it already exists, then recreate it
        cursor.execute("DROP TABLE IF EXISTS easy_data;")
        # Creating easy_data table with necessary columns and derived data
        cursor.execute(
            '''
            CREATE TABLE easy_data AS
            SELECT
                Location AS Location,
                Department,
                headcount AS Total_Headcount,
                Offers_Recruitment_Firm1 AS Offers_Made_Company1,
                Offers_Recruitment_Firm2 AS Offers_Made_Company2,
                Offers_Recruitment_Firm3 AS Offers_Made_Company3,
                Offers_Total AS Total_Offers,
                Acceptance_Recruitment_Firm1 AS Offers_Accepted_Company1,
                Acceptance_Recruitment_Firm2 AS Offers_Accepted_Company2,
                Acceptance_Recruitment_Firm3 AS Offers_Accepted_Company3,
                Acceptance_Total AS Total_Accepted_Offers
            FROM TMA_data;
            '''
        )
        connection.commit()
        print("Table easy_data created successfully.")
        connection.close()

# Create the easy_data table
create_easy_data_table()

# Function to create the fig1 table based on TMA_data table
def create_fig1_table():
    connection = connect_to_db()
    if connection:
        cursor = connection.cursor()
        # Drop the fig1 table if it exists and recreate it
        cursor.execute("DROP TABLE IF EXISTS fig1;")
        cursor.execute(
            '''
            CREATE TABLE fig1 AS
            SELECT
                Location AS Location,
                Department,
                SUM(headcount) AS Total_Headcount,
                SUM(CAST(SUBSTRING_INDEX(Offers_Recruitment_Firm1, '|', 1) AS UNSIGNED)) AS Offers_Made_Firm1,
                SUM(CAST(SUBSTRING_INDEX(Offers_Recruitment_Firm2, '|', 1) AS UNSIGNED)) AS Offers_Made_Firm2,
                SUM(CAST(SUBSTRING_INDEX(Offers_Recruitment_Firm3, '|', 1) AS UNSIGNED)) AS Offers_Made_Firm3,
                SUM(CAST(SUBSTRING_INDEX(Offers_Total, '|', 1) AS UNSIGNED)) AS Total_Offers_Made,
                SUM(CAST(SUBSTRING_INDEX(Acceptance_Recruitment_Firm1, '|', 1) AS UNSIGNED)) AS Offers_Accepted_Firm1,
                SUM(CAST(SUBSTRING_INDEX(Acceptance_Recruitment_Firm2, '|', 1) AS UNSIGNED)) AS Offers_Accepted_Firm2,
                SUM(CAST(SUBSTRING_INDEX(Acceptance_Recruitment_Firm3, '|', 1) AS UNSIGNED)) AS Offers_Accepted_Firm3,
                SUM(CAST(SUBSTRING_INDEX(Acceptance_Total, '|', 1) AS UNSIGNED)) AS Total_Offers_Accepted
            FROM TMA_data
            GROUP BY Location, Department;
            '''
        )
        connection.commit()
        print("Table fig1 created successfully.")
        connection.close()

# Create the fig1 table
create_fig1_table()

# Function to fetch data from easy_data table and return as a DataFrame
def fetch_data():
    connection = connect_to_db()
    if connection:
        try:
            query = "SELECT * FROM easy_data;"
            df = pd.read_sql(query, connection)  # Fetch data into a Pandas DataFrame
            return df
        except (mysql.connector.Error, pd.errors.DatabaseError) as e:
            # pandas (1.5+) re-raises driver errors from the query as pd.errors.DatabaseError
            print(f"Error fetching data: {e}")
        finally:
            if connection.is_connected():
                connection.close()
                print("MySQL connection is closed")
    return None  # Connection or query failed

# Fetch data from the easy_data table
df = fetch_data()
if df is not None:
    print(df.head())  # Display the first few rows of the DataFrame

# Function to display the table similar to your uploaded image using Matplotlib
def display_custom_table(df):
    fig, ax = plt.subplots(figsize=(16, 8))  # Adjust the size according to the table

    # Two header rows; each needs one entry per DataFrame column
    headers = [
        ['Location', 'Department', 'Headcount Available', 'Number of Offers Made', '', '', '', 'Number of Offers Accepted', '', '', ''],
        ['', '', '', 'Recruitment Firm 1', 'Recruitment Firm 2', 'Recruitment Firm 3', 'Total', 'Recruitment Firm 1', 'Recruitment Firm 2', 'Recruitment Firm 3', 'Total']
    ]

    # Hide axes
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    ax.set_frame_on(False)

    # Create the table with the header rows placed above the data rows,
    # so that rows 0 and 1 really are the header rows styled below.
    # The caller's DataFrame is left unmodified.
    cell_text = headers + df.values.tolist()
    table = ax.table(cellText=cell_text, cellLoc='center', loc='center')

    # Adjust font size and scale to fit content
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1.2, 1.2)

    for (i, j), cell in table.get_celld().items():
        if i == 0 or i == 1:  # Header rows
            cell.set_text_props(weight='bold', color='white')
            cell.set_facecolor('#4B8BBE')
        elif i % 2 == 0:  # Alternating row colors
            cell.set_facecolor('#E8EAF6')
        else:
            cell.set_facecolor('white')

    plt.title('Recruitment Data Table', fontsize=16, pad=20)
    plt.show()

# Display the fetched data in a graphical table format
if df is not None:
    display_custom_table(df)
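
The connection settings in this PR are hard-coded both in `connect_to_db()` and in `.vscode/settings.json`. One possible alternative, sketched below, is to read them from environment variables. The `TMA_DB_*` variable names and the helper `db_config_from_env` are hypothetical and do not appear in the PR; the defaults simply mirror the values used in the script.

```python
import os


def db_config_from_env(environ=os.environ):
    """Build keyword arguments for mysql.connector.connect() from environment variables.

    The TMA_DB_* names are hypothetical. Only the password is required;
    the other settings fall back to the values used in the script.
    """
    password = environ.get("TMA_DB_PASSWORD")
    if password is None:
        raise RuntimeError("TMA_DB_PASSWORD is not set")
    return {
        "host": environ.get("TMA_DB_HOST", "localhost"),
        "user": environ.get("TMA_DB_USER", "root"),
        "password": password,
        "database": environ.get("TMA_DB_NAME", "TMA_data"),
    }


# Example with an explicit mapping so the sketch runs without touching the real environment.
config = db_config_from_env({"TMA_DB_PASSWORD": "example-only"})
print(config["host"], config["database"])
```

With something like this in place, `connect_to_db()` could call `mysql.connector.connect(**db_config_from_env())` and the password would no longer need to be committed.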
Binary file added Data Wrangling Assignment/Figure_1.png
50 changes: 50 additions & 0 deletions Data Wrangling Assignment/TMA_data.sql
@@ -0,0 +1,50 @@
-- Create the database
CREATE DATABASE IF NOT EXISTS TMA_data;

-- Use the newly created database
USE TMA_data;

-- Create the table
CREATE TABLE IF NOT EXISTS recruitment_data (
    Location VARCHAR(50),
    Department VARCHAR(50),
    Headcount_Available INT,
    Offers_Made_Firm1 INT,
    Offers_Made_Firm2 INT,
    Offers_Made_Firm3 INT,
    Total_Offers_Made INT,
    Offers_Accepted_Firm1 INT,
    Offers_Accepted_Firm2 INT,
    Offers_Accepted_Firm3 INT,
    Total_Offers_Accepted INT
);

-- Insert data into the table
INSERT INTO recruitment_data
(Location, Department, Headcount_Available, Offers_Made_Firm1, Offers_Made_Firm2, Offers_Made_Firm3, Total_Offers_Made,
Offers_Accepted_Firm1, Offers_Accepted_Firm2, Offers_Accepted_Firm3, Total_Offers_Accepted)
VALUES
('Singapore', 'IT Systems', 335, 183, 92, 30, 254, 67, 42, 20, 109),
('Singapore', 'Corporate Services', 130, 206, 41, 41, 288, 24, 10, 32, 66),
('Singapore', 'Customer Service', 118, 295, 57, 29, 381, 53, 12, 21, 86),
('Singapore', 'Operations', 290, 187, 22, 14, 223, 55, 4, 10, 69),
('Singapore', 'Customer Support', 150, 86, 21, 19, 126, 14, 1, 10, 25),
('Singapore', 'Total', 1023, 957, 233, 133, 1323, 213, 69, 93, 375),

('Hong Kong', 'IT Systems', 125, 123, 58, 12, 193, 43, 5, 1, 49),
('Hong Kong', 'Corporate Services', 125, 151, 21, 10, 182, 24, 1, 2, 27),
('Hong Kong', 'Customer Service', 170, 197, 41, 21, 259, 74, 12, 4, 90),
('Hong Kong', 'Operations', 160, 57, 43, 24, 124, 33, 9, 10, 52),
('Hong Kong', 'Customer Support', 90, 48, 12, 12, 72, 11, 2, 6, 19),
('Hong Kong', 'Total', 670, 576, 175, 79, 830, 185, 29, 23, 237),

('Tokyo', 'Customer Service', 110, 148, 39, 30, 217, 27, 2, 10, 39),
('Tokyo', 'Customer Support', 90, 43, 15, 12, 70, 21, 2, 9, 32),
('Tokyo', 'Total', 200, 191, 54, 42, 287, 48, 4, 19, 71),

('Overall', 'Total', 1562, 1824, 462, 254, 2538, 446, 102, 135, 683);
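
A quick way to sanity-check the inserted figures is to compare each stored total with the sum of the three firm columns. The sketch below does this for the offers-made columns of two rows copied from the INSERT statement above; the helper name `find_total_mismatches` is an assumption made for illustration.

```python
# (Location, Department, firm1, firm2, firm3, stored total) for offers made,
# copied from two rows of the INSERT statement.
offer_rows = [
    ("Singapore", "IT Systems", 183, 92, 30, 254),
    ("Singapore", "Corporate Services", 206, 41, 41, 288),
]


def find_total_mismatches(rows):
    """Return (location, department, stored_total, recomputed_total) for inconsistent rows."""
    mismatches = []
    for location, department, firm1, firm2, firm3, stored_total in rows:
        recomputed = firm1 + firm2 + firm3
        if recomputed != stored_total:
            mismatches.append((location, department, stored_total, recomputed))
    return mismatches


for mismatch in find_total_mismatches(offer_rows):
    print(mismatch)  # ('Singapore', 'IT Systems', 254, 305)
```

Whether such a mismatch is deliberate source data that the "Recompute the Totals" step is meant to correct, or a transcription slip, is not clear from the PR alone.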





120 changes: 120 additions & 0 deletions Data Wrangling Assignment/data2.txt
@@ -0,0 +1,120 @@

-- Step 1: Import the Dataset

CREATE DATABASE IF NOT EXISTS TMA_data;
USE TMA_data;

-- Step 2: Check the Data Types

DESCRIBE TMA_data;

-- Step 2: Optimize Data Types

-- Modify data types for optimization
ALTER TABLE TMA_data MODIFY COLUMN Office VARCHAR(255);
ALTER TABLE TMA_data MODIFY COLUMN Department VARCHAR(255);
ALTER TABLE TMA_data MODIFY COLUMN headcount INT;

ALTER TABLE TMA_data MODIFY COLUMN offers_recruitment_firm1 VARCHAR(50);
ALTER TABLE TMA_data MODIFY COLUMN offers_recruitment_firm2 VARCHAR(50);
ALTER TABLE TMA_data MODIFY COLUMN offers_recruitment_firm3 VARCHAR(50);
ALTER TABLE TMA_data MODIFY COLUMN offers_total VARCHAR(50);

ALTER TABLE TMA_data MODIFY COLUMN acceptance_recruitment_firm1 VARCHAR(50);
ALTER TABLE TMA_data MODIFY COLUMN acceptance_recruitment_firm2 VARCHAR(50);
ALTER TABLE TMA_data MODIFY COLUMN acceptance_recruitment_firm3 VARCHAR(50);
ALTER TABLE TMA_data MODIFY COLUMN acceptance_total VARCHAR(50);


-- Step 3: Recompute the Totals

-- Recompute offers_total
UPDATE TMA_data
SET offers_total = CAST(SUBSTRING_INDEX(offers_recruitment_firm1, '|', 1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(offers_recruitment_firm2, '|', 1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(offers_recruitment_firm3, '|', 1) AS UNSIGNED);

-- Recompute acceptance_total
UPDATE TMA_data
SET acceptance_total = CAST(SUBSTRING_INDEX(acceptance_recruitment_firm1, '|', 1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(acceptance_recruitment_firm2, '|', 1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(acceptance_recruitment_firm3, '|', 1) AS UNSIGNED);


-- Step 4: Create fig1 Table

-- Create fig1 table
CREATE TABLE fig1 (
Office VARCHAR(255),
Department VARCHAR(255),
headcount INT,
offers_total INT,
offers_senior INT,
offers_inclusive INT,
acceptance_total INT,
acceptance_senior INT,
acceptance_inclusive INT
);

-- Insert data into fig1
INSERT INTO fig1 (Office, Department, headcount, offers_total, offers_senior, offers_inclusive, acceptance_total, acceptance_senior, acceptance_inclusive)
SELECT
Office,
Department,
CAST(headcount AS UNSIGNED) AS headcount,
CAST(SUBSTRING_INDEX(offers_total, '|', 1) AS UNSIGNED) AS offers_total,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm1, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm2, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm3, '|', 2), '|', -1) AS UNSIGNED) AS offers_senior,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm1, '|', -1), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm2, '|', -1), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm3, '|', -1), '|', -1) AS UNSIGNED) AS offers_inclusive,
CAST(SUBSTRING_INDEX(acceptance_total, '|', 1) AS UNSIGNED) AS acceptance_total,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm1, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm2, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm3, '|', 2), '|', -1) AS UNSIGNED) AS acceptance_senior,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm1, '|', -1), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm2, '|', -1) , '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm3, '|', -1), '|', -1) AS UNSIGNED) AS acceptance_inclusive
FROM TMA_data;


-- Verify the contents of fig1
SELECT * FROM fig1;

-- Step 4: Create the easy_data Table

CREATE TABLE easy_data (
Office VARCHAR(255),
Department VARCHAR(255),
headcount INT,
offers_total INT,
offers_senior INT,
offers_inclusive INT,
acceptance_total INT,
acceptance_senior INT,
acceptance_inclusive INT
);


INSERT INTO easy_data (Office, Department, headcount, offers_total, offers_senior, offers_inclusive, acceptance_total, acceptance_senior, acceptance_inclusive)
SELECT
Office,
Department,
CAST(headcount AS UNSIGNED) AS headcount,
CAST(SUBSTRING_INDEX(offers_total, '|', 1) AS UNSIGNED) AS offers_total,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm1, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm2, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm3, '|', 2), '|', -1) AS UNSIGNED) AS offers_senior,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm1, '|', -1), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm2, '|', -1), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(offers_recruitment_firm3, '|', -1), '|', -1) AS UNSIGNED) AS offers_inclusive,
CAST(SUBSTRING_INDEX(acceptance_total, '|', 1) AS UNSIGNED) AS acceptance_total,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm1, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm2, '|', 2), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm3, '|', 2), '|', -1) AS UNSIGNED) AS acceptance_senior,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm1, '|', -1), '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm2, '|', -1) , '|', -1) AS UNSIGNED) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(acceptance_recruitment_firm3, '|', -1), '|', -1) AS UNSIGNED) AS acceptance_inclusive
FROM TMA_data;
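
The nested `SUBSTRING_INDEX` calls above can be hard to read at a glance. The sketch below emulates the function in Python to show which field each pattern extracts. The sample value `10|3|5` is hypothetical, and the `total|senior|inclusive` layout is an assumption inferred from the column names in this file.

```python
def substring_index(text, delim, count):
    """Emulate MySQL SUBSTRING_INDEX(text, delim, count) for a non-empty delimiter."""
    if count == 0:
        return ""
    parts = text.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter, counting from the left
        return delim.join(parts[:count])
    # Everything to the right of the count-th delimiter, counting from the right
    return delim.join(parts[count:])


value = "10|3|5"  # hypothetical 'total|senior|inclusive' value

total = int(substring_index(value, "|", 1))
senior = int(substring_index(substring_index(value, "|", 2), "|", -1))
inclusive = int(substring_index(value, "|", -1))

print(total, senior, inclusive)  # 10 3 5
```

Note that for the last field a single call with `-1` is enough; wrapping it in a second `SUBSTRING_INDEX(..., '|', -1)`, as the `*_inclusive` expressions do, returns the same value.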
