Data Pipeline

This repository contains a data pipeline for processing medical imaging data. It includes modules for anonymizing DICOM files, encrypting patient IDs, extracting metadata, and processing the data. The pipeline is modular, so users can customize and extend it to fit specific project requirements, and it is designed to handle large volumes of medical imaging data efficiently. The modular design also promotes code reuse and eases maintenance and future enhancements.

Below are the key functionalities encapsulated within the pipeline:

Anonymization Module: Anonymizes DICOM files by removing identifiable patient attributes, protecting patient privacy and supporting regulatory compliance.
Encryption Module: Encrypts patient IDs so that these identifiers stay confidential and inaccessible to unauthorized parties.
Metadata Extraction: Parses DICOM headers to retrieve metadata attributes, such as imaging parameters and acquisition details.
Data Processing: Orchestrates the sequential execution of preprocessing, analysis, and transformation steps on the medical imaging data.
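
As a rough illustration of the anonymization and encryption modules listed above, the following sketch works on a plain dictionary of DICOM-style attributes rather than on real files. The attribute list in SENSITIVE_KEYS, the function names, and the use of a keyed SHA-256 hash (HMAC) in place of reversible encryption are all assumptions made for this example; none of them is taken from anonymizer.py or encryption.py.

```python
import hashlib
import hmac

# Hypothetical set of identifying attributes; a real project would follow
# its own de-identification profile.
SENSITIVE_KEYS = {"PatientName", "PatientBirthDate", "PatientAddress"}


def pseudonymize_id(patient_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of the patient ID (one-way, not reversible)."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy without sensitive attributes and with a pseudonymized PatientID."""
    cleaned = {key: value for key, value in record.items() if key not in SENSITIVE_KEYS}
    if "PatientID" in cleaned:
        cleaned["PatientID"] = pseudonymize_id(str(cleaned["PatientID"]), secret_key)
    return cleaned


if __name__ == "__main__":
    # Made-up example values, not real patient data.
    example = {"PatientID": "12345", "PatientName": "Doe^Jane", "Modality": "MG"}
    print(anonymize_record(example, secret_key=b"demo-key"))
```

A keyed hash gives the same pseudonym for the same patient ID as long as the key is unchanged, which keeps files from one patient linkable without exposing the original identifier. If the original IDs must be recoverable, a reversible scheme would be needed instead.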

Together, these modules provide a framework for managing medical imaging data: anonymizing patient information, encrypting identifiers, extracting metadata, and processing images for medical and biomedical imaging workflows (10.1007/s10278-021-00522-6). The modular architecture allows the pipeline to be integrated into existing healthcare systems and customized for specific use cases and requirements.

Modules

identifier.py: This script processes DICOM files in the "checking" folder by extracting the SOP Instance UID and comparing it to files in the "raw" folder. If a match is found, it renames the file with the corresponding Anonymized Patient ID and moves it to the "identified" folder.
anonymizer.py: Module for anonymizing DICOM files by removing patient-related information and renaming them according to a specified format.
encryption.py: Module for encrypting patient IDs.
extractor.py: Module for extracting metadata from DICOM files.
main.py: Main script for executing the data processing pipeline.
processor.py: Module for processing medical imaging data.
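
The matching step described for identifier.py can be sketched as below. This is a minimal sketch under stated assumptions, not the script's actual code: the function name, the uid_to_anon_id mapping (assumed to be built beforehand from the "raw" folder), the injected read_sop_uid callable, the "*.dcm" pattern, and the "<anonymized ID>_<original name>" naming scheme are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, List


def identify_files(
    checking_dir: Path,
    identified_dir: Path,
    uid_to_anon_id: Dict[str, str],
    read_sop_uid: Callable[[Path], str],
) -> List[Path]:
    """Move files with a known SOP Instance UID, renamed by anonymized patient ID."""
    identified_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(checking_dir.glob("*.dcm")):
        anon_id = uid_to_anon_id.get(read_sop_uid(path))
        if anon_id is None:
            continue  # no match in the "raw" index: leave the file where it is
        # Hypothetical naming scheme: prefix the file name with the anonymized ID.
        target = identified_dir / f"{anon_id}_{path.name}"
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

The UID reader is passed in so the sketch stays independent of any DICOM library. With pydicom installed, one possible reader is lambda p: str(pydicom.dcmread(p, stop_before_pixels=True).SOPInstanceUID).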

Usage

To use the data pipeline, follow these steps:

Clone the repository:

git clone https://github.com/MIMBCD-UI/data-pipeline.git

Install the required dependencies by creating a virtual environment and installing the packages listed in requirements.txt:

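cd data-pipeline
python3 -m venv .venv  # any directory name works for the virtual environment
source .venv/bin/activate
pip3 install -r requirements.txt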

Run the main script to execute the data processing pipeline:

python3 src/main.py

DICOM Post-Processing Verification Pipeline

This section details the scripts involved in processing DICOM files within the MIMBCD-UI data pipeline. These scripts are responsible for handling various aspects of anonymization, metadata extraction, and file validation, ensuring the integrity and consistency of medical imaging data.

Data Post-Processing Curation Order

The data post-processing curation involves a series of steps to verify, anonymize, and validate DICOM files. It takes place inside the curation/ folder of the dataset-multimodal-breast repository, which contains DICOM files at different stages of processing.

The following sequence outlines the steps involved in the post-processing pipeline inside the curation/ folder:

If at any stage we find the file to be incorrect, we move it to the curation/unsolvable/ folder. If the file is correct, we move it to the dicom/ folder.
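
The routing rule above could be sketched as follows. The function name and the is_correct callable are hypothetical; what makes a file "correct" depends on the stage, so the check is passed in rather than hard-coded.

```python
import shutil
from pathlib import Path
from typing import Callable


def route_file(
    path: Path,
    curation_dir: Path,
    dicom_dir: Path,
    is_correct: Callable[[Path], bool],
) -> Path:
    """Move a file to dicom/ if it passes the check, else to curation/unsolvable/."""
    destination = dicom_dir if is_correct(path) else curation_dir / "unsolvable"
    destination.mkdir(parents=True, exist_ok=True)
    target = destination / path.name
    shutil.move(str(path), str(target))
    return target
```

Returning the final location makes it easy to log where each file ended up.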

Post-Processing Verification Workflow Overview

The following scripts should be executed in sequence as part of the data processing pipeline. Each script serves a specific purpose and contributes to the overall goal of maintaining high-quality, anonymized medical imaging data.

identifier.py - Initial DICOM File Identification
- Purpose: This script processes DICOM files in the "checking" folder by extracting the SOP Instance UID and comparing it to files in the "raw" folder. If a match is found, it renames the file with the corresponding Anonymized Patient ID and moves it to the "identified" folder.
- When to Run: Run this script first to identify and organize the DICOM files before any further processing.
- Outcome: The files are identified and renamed based on the SOP Instance UID, making them ready for further processing.
laterality.py - Initial Metadata Extraction and File Preparation
- Purpose: This script processes DICOM files by converting anonymized patient IDs to their corresponding real patient IDs. It extracts critical metadata such as laterality (which side of the body the image represents) and renames/moves the files accordingly.
- When to Run: Run this script after identifier.py to further organize and prepare the DICOM files.
- Outcome: The files are organized with accurate metadata, making them ready for comparison and validation.
compare.py - Verification of Anonymized and Non-Anonymized File Correspondence
- Purpose: This script compares anonymized and non-anonymized DICOM files to ensure they match based on metadata like InstanceNumber, ViewPosition, and ImageLaterality. It also renames the files and moves them to a "checked" directory for further processing.
- When to Run: Run this script after laterality.py to verify the correspondence between anonymized and non-anonymized files.
- Outcome: Matched files are confirmed and organized in the "checked" directory.
checker.py - File Comparison and Logging
- Purpose: This script provides an additional verification step by comparing anonymized and non-anonymized DICOM files based on InstanceNumber. It logs the paths of matching files to a CSV file for auditing and further analysis.
- When to Run: Execute this script after compare.py to ensure a documented trail of matched files.
- Outcome: A CSV file is generated, listing the paths of successfully matched files, ensuring traceability in the pipeline.
reanonimyzer.py - Final Correction and Re-Anonymization
- Purpose: The final script in the sequence, reanonimyzer.py, corrects any discrepancies in the anonymized patient IDs and metadata based on predefined mappings. It updates the filenames and DICOM metadata as necessary and moves the corrected files to the final "checked" directory.
- When to Run: This script should be run last, after checker.py, to finalize the anonymization and ensure data consistency.
- Outcome: The DICOM files are fully re-anonymized, with all metadata and filenames accurately reflecting the correct anonymized patient IDs, ensuring they are ready for secure storage or further analysis.
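
The comparison and logging steps performed by compare.py and checker.py can be illustrated with a small sketch. It assumes the metadata has already been extracted into dictionaries keyed by file path and that both inputs belong to the same patient; the function names and CSV column names are hypothetical, and the real scripts may match files differently.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

# Attributes named in the compare.py description above.
MATCH_KEYS = ("InstanceNumber", "ViewPosition", "ImageLaterality")


def match_key(meta: Dict[str, object]) -> Optional[Tuple[str, ...]]:
    """Build a normalized comparison key, or None if any attribute is missing."""
    values = [str(meta.get(key, "")).strip().upper() for key in MATCH_KEYS]
    return tuple(values) if all(values) else None


def match_files(
    anonymized: Dict[str, Dict[str, object]],
    originals: Dict[str, Dict[str, object]],
) -> List[Tuple[str, str]]:
    """Pair each anonymized path with the original path that shares its key."""
    index: Dict[Tuple[str, ...], str] = {}
    for path, meta in originals.items():
        key = match_key(meta)
        if key is not None:
            index.setdefault(key, path)  # the first original with a given key wins
    pairs = []
    for path, meta in anonymized.items():
        key = match_key(meta)
        if key is not None and key in index:
            pairs.append((path, index[key]))
    return pairs


def write_match_log(pairs: Iterable[Tuple[str, str]], csv_path: Path) -> None:
    """Write the matched path pairs to a CSV file for auditing."""
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["anonymized_path", "original_path"])
        writer.writerows(pairs)
```

Files with a missing attribute are skipped rather than matched on empty values, which avoids false pairings between incomplete records.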

How to Run the Scripts

To execute the pipeline, follow the order outlined above:

# Step 1: Run main.py
python3 src/main.py

# Step 2: Open the curation/verifynig/ folder and move the files to the curation/checking/ folder.

# Step 3: Run identifier.py
python3 src/identifier.py

# Step 4: Run laterality.py
python3 src/laterality.py

# Step 5: Run compare.py
python3 src/compare.py

# Step 6: Run checker.py
python3 src/checker.py

# Step 7: Run reanonimyzer.py
python3 src/reanonimyzer.py
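
For convenience, the verification scripts that follow the manual file move could be chained from a small Python helper. This is only a sketch: the helper is not part of the repository, the function names are hypothetical, and it assumes the scripts live in src/ and need no command-line arguments, as in the commands above.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, List

# Scripts in the order given in the workflow overview.
VERIFICATION_SCRIPTS = (
    "identifier.py",
    "laterality.py",
    "compare.py",
    "checker.py",
    "reanonimyzer.py",
)


def build_commands(src_dir: Path, python: str = sys.executable) -> List[List[str]]:
    """Return one command per verification script, in execution order."""
    return [[python, str(src_dir / name)] for name in VERIFICATION_SCRIPTS]


def run_verification(
    src_dir: Path = Path("src"),
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run each script in turn; with subprocess.run, a failure stops the sequence."""
    for command in build_commands(src_dir):
        print("Running:", " ".join(command))
        runner(command, check=True)  # check=True raises CalledProcessError on failure


if __name__ == "__main__":
    run_verification()
```

Stopping at the first failing script keeps later stages from running on files that an earlier stage did not finish processing.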

Contributing

Contributions are welcome! If you'd like to contribute to this project, please fork the repository and submit a pull request with your proposed changes.

License

This project is licensed under the MIT License.

Team

Our team works together, sharing ideas and a common purpose to produce better work. This section lists the people who are important to this repository, along with their respective links.

Authors

Francisco Maria Calisto [ Academic Website | ResearchGate | GitHub | Twitter | LinkedIn ]
Diogo Araújo
Carlos Santiago [ ResearchGate ]
Catarina Barata
Jacinto C. Nascimento [ ResearchGate ]

Promoters

João Fernandes [ ResearchGate ]
Margarida Morais [ ResearchGate ]
João Maria Abrantes [ ResearchGate ]
Nuno Nunes [ ResearchGate ]

Companions

Hugo Lencastre
Nádia Mourão
Miguel Bastos
Pedro Diogo
João Bernardo
Madalena Pedreira
Mauro Machado
Bruno Dias
Bruno Oliveira
Luís Ribeiro Gomes

Acknowledgements

This work was partially supported by national funds through FCT under both the UID/EEA/50009/2013 and LARSyS - FCT Project 2022.04485.PTDC (MIA-BREAST) projects hosted by IST, as well as both the BL89/2017-IST-ID and PD/BD/150629/2020 grants. We are indebted to those who gave their time and expertise to evaluate our work and who, among others, are giving us crucial information for the BreastScreening project.

Supporting

We are a non-profit organization, but our activity still has many needs, from infrastructure to services. We welcome contributions of time and other help to support our team and projects.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]
