In a competitive business environment, effective employee performance management is crucial for maintaining productivity and achieving organizational goals. However, HR departments often struggle to gather, manage, and analyze performance data from various sources, which can lead to delayed or inaccurate insights. The "People Performance Data Pipeline" project addresses this problem by providing a streamlined process for collecting and managing performance indicators, such as daily tasks, from multiple data sources (APIs, databases, and Google Sheets). The project aims to enhance the HR department's ability to report on, visualize, and predict employee performance, leading to better-informed decisions and improved productivity across the company.
The primary goals of the People Performance Data Pipeline project include:
- Centralized Data Management: Collect and consolidate employee performance data from diverse sources into a single Delta Lake storage solution. This centralized approach ensures that all relevant data is readily accessible for analysis.
- Data Transformation and Standardization: Convert raw performance data into meaningful metrics by assigning points to tasks and standardizing the data format. This transformation allows for easy calculation and comparison of employee performance across different departments.
- Data Quality Assurance: Implement rigorous data validation processes to ensure the accuracy and reliability of the data stored in the Delta Lake. High-quality data is essential for generating trustworthy insights and making sound business decisions.
- Enhanced Reporting and Analytics: Provide HR with the tools to generate detailed performance reports, create data visualizations, and leverage machine learning models to predict and boost departmental performance. This goal directly ties to improving key business metrics such as employee productivity, retention rates, and overall company efficiency.
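The transformation and validation goals above can be illustrated with a small sketch. It uses plain Python rather than PySpark so that it runs without a cluster, and it is not the repository's implementation: the task names, point values, and record fields are all hypothetical.

```python
from collections import defaultdict

# Hypothetical points table; in the project, points come from a .csv file.
TASK_POINTS = {"code_review": 3, "bug_fix": 5, "documentation": 2}

# Hypothetical fields that every raw task record is assumed to carry.
REQUIRED_FIELDS = ("employee_id", "department", "task", "date")


def validate(record):
    """Return a list of problems found in one raw record (empty if valid)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not record.get(field)]
    task = record.get("task")
    if task and task not in TASK_POINTS:
        problems.append(f"unknown task {task!r}")
    return problems


def score_by_employee(records):
    """Sum task points per employee, setting aside records that fail validation."""
    totals = defaultdict(int)
    rejected = []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
            continue
        totals[record["employee_id"]] += TASK_POINTS[record["task"]]
    return dict(totals), rejected


raw_records = [
    {"employee_id": "E1", "department": "IT", "task": "bug_fix", "date": "2024-01-02"},
    {"employee_id": "E1", "department": "IT", "task": "code_review", "date": "2024-01-02"},
    {"employee_id": "E2", "department": "HR", "task": "documentation", "date": "2024-01-02"},
    {"employee_id": "", "department": "HR", "task": "bug_fix", "date": "2024-01-02"},
]

totals, rejected = score_by_employee(raw_records)
print(totals)
print(len(rejected), "record(s) rejected")
```

Keeping rejected records alongside the reasons, rather than silently dropping them, makes it easier to trace data quality problems back to the source system.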
- Create a new resource group in Microsoft Azure and add Azure Key Vault and Databricks to it
- Store the secrets (API key, database link, Google Sheet URL, etc.) in Azure Key Vault
- Clone this repo:
git clone https://github.com/ArkanNibrastama/people_performance_data_pipeline.git
- Create a Delta Lake for points and store the .csv file in the points folder. Also create Delta Lakes for the bronze, silver, and gold stages
- Copy the data_ingestion, transformation, and data_validation folders into Databricks
- After that, create a job from the notebooks
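For the secrets step in the setup above, the notebooks need some way to read the stored values at run time. The following is a minimal sketch, not the repository's code: the secret names and the scope name are hypothetical, and the helper takes any lookup function so that it can be tried outside Databricks.

```python
# Hypothetical secret names; use whatever names were stored in Key Vault.
SECRET_KEYS = ("api-key", "database-url", "google-sheet-url")


def load_secrets(get_secret, keys=SECRET_KEYS):
    """Fetch each named secret through `get_secret` and fail early if any is empty."""
    secrets = {key: get_secret(key) for key in keys}
    missing = [key for key, value in secrets.items() if not value]
    if missing:
        raise ValueError(f"missing secrets: {', '.join(missing)}")
    return secrets


# In a Databricks notebook with a Key Vault-backed secret scope
# (the scope name "people-performance" is an assumption):
#   secrets = load_secrets(
#       lambda key: dbutils.secrets.get(scope="people-performance", key=key)
#   )

# Local stand-in so the sketch runs outside Databricks.
fake_vault = {
    "api-key": "abc",
    "database-url": "postgresql://example",
    "google-sheet-url": "https://example.com/sheet",
}
secrets = load_secrets(fake_vault.get)
```

Failing early on a missing secret surfaces configuration mistakes at the start of a job run instead of partway through ingestion.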
The People Performance Data Pipeline has significantly improved the HR department's ability to manage and analyze employee performance data. By centralizing and standardizing data from multiple sources, and processing more than 10,000 records using PySpark, the project has reduced the HR department's processing workload by approximately 75%, enabling faster and more accurate reporting. The data validation processes have ensured high data quality, leading to more reliable insights and predictions. As a result, the HR department has been able to identify underperforming areas and take proactive measures to boost productivity, contributing to an increase in overall company performance. This project demonstrates the value of leveraging data engineering and machine learning to drive business success.