This project is a Python-based tool designed to automate the aggregation and processing of multiple Excel files. It provides a user-friendly interface for selecting input and output directories, reads and concatenates data from multiple Excel sheets, standardizes column names, and removes duplicates based on specified criteria. The processed data is then saved into a new Excel file.
- Multi-File Aggregation: Reads multiple Excel files from a user-specified directory and aggregates data by sheet name.
- Column Standardization: Standardizes column names for consistency across different data sets.
- Duplicate Removal: Removes duplicate entries based on a unique identifier while retaining the entry with the highest specified value.
- User-Friendly Interface: Utilizes a graphical user interface (GUI) for easy selection of input and output directories.
- Output Generation: Saves the processed and cleaned data into a new Excel file in the user-specified output directory.
- Python 3.x
- pandas
- openpyxl
- tkinter
-
File Reading and Aggregation:
- Uses the
glob
module to identify and read all Excel files within the selected directory. - Utilizes
pandas
to read and parse each sheet within the Excel files. - Aggregates data from sheets with the same name across different files into a single DataFrame per sheet name.
- Uses the
-
Column Standardization:
- Standardizes variations of column names (e.g.,
SR
,sr
,SRID
toSR ID
, andAGECLASS
toAGE
) to ensure uniformity across all data sets.
- Standardizes variations of column names (e.g.,
-
Data Processing:
- Sorts the data by the unique identifier (
SR ID
) and a specified value (AGE
) to prioritize entries with the highest value. - Removes duplicate entries based on the unique identifier, retaining the entry with the highest specified value.
- Sorts the data by the unique identifier (
-
Output Generation:
- Saves the cleaned and processed data into a new Excel file using
pandas
'ExcelWriter
. - Each sheet name in the original data is preserved in the output file.
- Saves the cleaned and processed data into a new Excel file using
-
User Interaction:
- Utilizes
tkinter
to provide a GUI for folder selection, making it easy for users to specify input and output directories without manually entering paths.
- Utilizes
- Clone the repository:
git clone https://github.com/pbonotis15/aggregate-and-process-multiple-excel-files.git
- Navigate to the project directory:
cd to_your_folder
- Install the required packages:
pip install pandas openpyxl
- (OPTIONAL) Install
pyinstaller
for creating an executable.
- Run the script directly using Python to test and verify functionality.
python aggregate-tpbe.py
- (OPTIONAL) Use
pyinstaller
to create an executable file:pyinstaller --onefile aggregate-tpbe.py
- Launch the executable or run the script.
- Use the GUI to select the folder containing the Excel files for input.
- Select the output folder where the processed Excel file will be saved.
- The script will read, process, and save the data, displaying a confirmation message upon completion.
A company regularly receives Excel reports from multiple departments. Each report contains various sheets with task data, but column names may vary and some tasks are duplicated. This tool can automate the aggregation of these reports, standardize the column names, remove duplicates, and produce a clean, consolidated Excel file for further analysis.
This project simplifies data consolidation, enhances data consistency, and saves time, making it ideal for organizations dealing with large volumes of Excel data from multiple sources.