This project aims to analyze industry layoffs using a distributed analytical technique with Hadoop MapReduce and HDFS. The analysis focuses on identifying layoff trends across different industries in 2023. Python programming language is utilized for the implementation of the map and reduce programs.
- Introduction
- Problem Statement
- Dataset
- Project Structure
- Installation and Setup
- Implementation
- Results
- Usage
- Critical Reflection
- Contributing
In today's data-driven world, large-scale data processing is crucial for extracting meaningful insights. This project utilizes Hadoop MapReduce to analyze layoff data, providing insights into industry-specific layoff trends in 2023.
The project aims to analyze layoffs across different industries in 2023. Understanding the layoff patterns helps stakeholders identify economic trends, market demands, and workforce dynamics.
The dataset is sourced from Layoff.csv, which contains layoff information for major companies from 2020 to 2024. The dataset includes variables such as company name, location, industry, percentage of layoffs, date, funding raised, and others.
LayoffDataAnalysis
- layoff-reducer.py # Reducer code for the MapReduce job
- layoff-mapper.py # Mapper code for the MapReduce job
- layoffs_data.csv # Input dataset
- README.md # Project description and usage guide
- HPCI_ReportFinal.pdf # Project report explaining the methodology and results
- Python 3.x
- Hadoop installed and configured (or use Google Colab)
- Brunel Remote Access Services (VPN) for implementation (if using Brunel server)
- Clone the Repository:
- Upload the Files to the Hadoop Distributed File System (HDFS):
- Use the hadoop fs -copyFromLocal command to upload layoffs_data.csv, layoff-mapper.py, and layoff-reducer.py to HDFS.
- Running on Google Colab (Optional):
- Open Google Colab and set up the Hadoop environment.
- Upload the necessary files (layoffs_data.csv, layoff-mapper.py, layoff-reducer.py) to the Colab environment.
- The mapper script (layoff-mapper.py) processes the dataset, extracting relevant fields like industry and date to shuffle and sort records for the year 2023.
- The reducer script (layoff-reducer.py) receives the key-value pairs from the mapper and calculates the total layoffs for each industry.
- The analysis identifies key layoff trends in consumer sectors, finance, healthcare, IT, and others for 2023.
- The results provide insights into economic conditions, workforce dynamics, and industry health.
hadoop jar /path/to/hadoop-streaming.jar
-input /path/to/input/layoffs_data.csv
-mapper layoff-mapper.py
-reducer layoff-reducer.py
-output /path/to/output/directory
Copy the output from HDFS to your local system using: hadoop fs -copyToLocal /path/to/output /local/destination
The project demonstrates the power of Hadoop MapReduce for large-scale data analysis. Although the setup and implementation can be complex, especially for beginners, the framework's ability to process big data makes it a valuable tool for analyzing workforce trends.
If you want to contribute to this project, please fork the repository and submit a pull request with your changes.