Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Adds AUMCdb example #203

Draft
wants to merge 17 commits into
base: dev
Choose a base branch
from
Draft

Conversation

rvandewater
Copy link

@rvandewater rvandewater commented Sep 18, 2024

Additionally, some fixes for the MIMIC-IV example

Summary by CodeRabbit

  • New Features

    • Introduced comprehensive README files for AUMC and MIMIC-IV examples, detailing dataset extraction and processing instructions.
    • Added configuration files to manage patient data and streamline data processing workflows.
    • Implemented scripts for efficient data handling and parallel processing capabilities.
  • Bug Fixes

    • Clarified directory paths and variable usage in MIMIC-IV README to prevent confusion.
  • Documentation

    • Enhanced README files with installation instructions, usage guidelines, and notes on potential memory issues.
  • Chores

    • Added validation checks in scripts to improve robustness and user guidance.

Copy link
Contributor

coderabbitai bot commented Sep 18, 2024

Walkthrough

The changes introduce several new configuration files, scripts, and a README for the AUMC Example and MIMIC-IV Example projects. These updates include detailed instructions for dataset extraction, structured schemas for patient data management, and enhancements for data processing workflows. The scripts facilitate the conversion of raw data into structured formats, incorporating error handling and parallel processing capabilities. Additionally, modifications to existing files clarify directory paths and improve user guidance on handling data.

Changes

Files Change Summary
AUMC_Example/README.md Added comprehensive instructions for extracting MEDS datasets from AUMCdb, including installation, dataset downloading, and ETL process execution.
AUMC_Example/configs/*.yaml Introduced multiple configuration files defining schemas for patient data, managing input/output directories, and preprocessing steps for various medical data items.
AUMC_Example/joint_script.sh Added a script for processing AUMCdb data with error handling, help messages, and a defined workflow for converting raw data into a MEDS cohort.
AUMC_Example/local_parallelism_runner.yaml Added configuration options for parallel processing, specifying worker parameters and the use of the Joblib library.
AUMC_Example/pre_MEDS.py Introduced a Python script for data wrangling, utilizing Polars for data manipulation and Hydra for configuration management.
AUMC_Example/run.sh Added a script for processing MIMIC-IV data with error handling and unzipping options, ensuring proper execution of the data processing pipeline.
MIMIC-IV_Example/README.md Clarified directory paths for storing MIMIC-IV data and updated commands for running the MEDS ETL process.
MIMIC-IV_Example/run.sh Enhanced validation checks for positional arguments to prevent incorrect usage of the script.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script
    participant DataProcessor
    participant Config

    User->>Script: Run data processing command
    Script->>Config: Load configurations
    Script->>DataProcessor: Process raw data
    DataProcessor->>DataProcessor: Convert to pre-MEDS format
    DataProcessor->>DataProcessor: Shard and merge data
    DataProcessor->>Script: Return processed data
    Script->>User: Output results
Loading

🐇 In the meadow, where data flows,
AUMC and MIMIC, where knowledge grows.
With scripts and configs, we dance and play,
Transforming raw numbers, come what may!
Hopping through datasets, so bright and new,
A rabbit's delight in the work we do! 🐇


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    -- I pushed a fix in commit <commit_id>, please review it.
    -- Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    -- @coderabbitai generate unit testing code for this file.
    -- @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    -- @coderabbitai generate interesting stats about this repository and render them as a table.
    -- @coderabbitai read src/utils.ts and generate unit testing code.
    -- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    -- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@rvandewater rvandewater marked this pull request as draft September 18, 2024 11:11
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Outside diff range and nitpick comments (11)
AUMC_Example/configs/event_configs.yaml (6)

3-13: LGTM, but please fix the trailing spaces.

The patient structure is correctly defined. However, please remove the trailing spaces at lines 4 and 10 to adhere to the formatting conventions.

Apply this diff to fix the trailing spaces:

-  dob: 
+  dob:
   code: "DOB"
   time: col(dateofbirth)
 gender:
   code: ["GENDER", "col(gender)"]
   time: null
-  dod: 
+  dod:
   code: "DEATH"
   time: col(dateofdeath)
Tools
yamllint

[error] 4-4: trailing spaces

(trailing-spaces)


[error] 10-10: trailing spaces

(trailing-spaces)


14-39: LGTM, but please fix the trailing spaces.

The admissions structure is correctly defined. However, please remove the trailing spaces at lines 24 and 34 to adhere to the formatting conventions.

Apply this diff to fix the trailing spaces:

   code: 
     - "ICU_DISCHARGE"
     - col(destination)
   time: col(dischargedattime)
 weight:
   code:
     - "WEIGHT_AT_ADMISSION"
     - col(weightsource)
     - col(weightgroup)
   time: col(admittedattime)
-  height: 
+  height:
   code:
     - "HEIGHT_AT_ADMISSION"
     - col(heightsource)
     - col(heightgroup)
   time: col(admittedattime)
Tools
yamllint

[error] 24-24: trailing spaces

(trailing-spaces)


[error] 34-34: trailing spaces

(trailing-spaces)


50-58: LGTM, but please fix the trailing space.

The listitems structure is correctly defined. However, please remove the trailing space at line 58 to adhere to the formatting conventions.

Apply this diff to fix the trailing space:

   time: col(measuredattime)
-  
+
Tools
yamllint

[error] 58-58: trailing spaces

(trailing-spaces)


68-75: LGTM, but please fix the trailing space.

The procedureorderitems structure is correctly defined. However, please remove the trailing space at line 69 to adhere to the formatting conventions.

Apply this diff to fix the trailing space:

-  event: 
+  event:
   code:
     - PROCEDURE
     - col(ordercategoryname)
     - col(item)
   time: col(registeredattime)
Tools
yamllint

[error] 69-69: trailing spaces

(trailing-spaces)


76-88: LGTM, but please fix the trailing spaces.

The processitems structure is correctly defined. However, please remove the trailing spaces at lines 77 and 84 to adhere to the formatting conventions.

Apply this diff to fix the trailing spaces:

-  start: 
+  start:
   code:
     - PROCESS
     - START
     - col(item)
   time: col(starttime)
-  end: 
+  end:
   code: 
     - PROCESS
     - END
     - col(item)
   time: col(stoptime)
Tools
yamllint

[error] 77-77: trailing spaces

(trailing-spaces)


[error] 84-84: trailing spaces

(trailing-spaces)


90-123: LGTM, but please fix the trailing spaces.

The drugitems structure is correctly defined. However, please remove the trailing spaces at lines 90, 91, 99, and 108 to adhere to the formatting conventions.

Apply this diff to fix the trailing spaces:

-drugitems: 
-  start: 
+drugitems:
+  start:
   code:
     - DRUG
     - START
     - col(ordercategory)
     - col(item)
     - col(action)
   time: col(starttime)
-  rate: 
+  rate:
   code:
     - DRUG
     - RATE
     - col(ordercategory)
     - col(item)
     - col(rateunit)
   time: col(starttime)
   numeric_value: col(rate)
-  dose: 
+  dose:
   code:
     - DRUG
     - DOSE
     - col(ordercategory)
     - col(item)
     - col(doseunit)
   time: col(starttime)
   numeric_value: col(dose)
 end:
   code:
     - DRUG
     - END
     - col(ordercategory)
     - col(item)
   time: col(stoptime)
Tools
yamllint

[error] 90-90: trailing spaces

(trailing-spaces)


[error] 91-91: trailing spaces

(trailing-spaces)


[error] 99-99: trailing spaces

(trailing-spaces)


[error] 108-108: trailing spaces

(trailing-spaces)

AUMC_Example/run.sh (1)

73-74: Reminder: Address the TODO comment.

The TODO comment indicates that wget blocks need to be added once testing is validated. Please ensure that this functionality is implemented and the TODO comment is resolved before considering the script as complete.

Do you want me to open a GitHub issue to track this task?

AUMC_Example/joint_script.sh (1)

1-102: Suggestions for improvement:

  1. Consider using absolute paths or validating the existence of input files and directories to make the script more robust. Using relative paths assumes that the referenced files and directories exist in the expected locations, which could lead to errors if the script is run from a different directory or if the file structure changes.

  2. Ensure that the external scripts and configurations referenced by this script are well-documented and maintained. The script's functionality depends on those files, so any changes or issues there could affect the script's behavior.

  3. Consider providing a default value or a way to automatically determine the optimal number of parallel workers based on available system resources. While passing the number of workers as an argument provides flexibility, it also requires the user to specify an appropriate value. Automatically determining the optimal value could make the script easier to use and more efficient.

AUMC_Example/README.md (2)

115-123: Create GitHub issues for open questions.

The section raises important questions about handling data times and merging death times from different tables. These concerns require further investigation and resolution. Consider creating GitHub issues to track these open questions and facilitate discussion and implementation of appropriate solutions.

Do you want me to open GitHub issues for these questions?

Tools
LanguageTool

[uncategorized] ~116-~116: Possible missing comma found.
Context: ..._. How should those be slotted into the timeline which is otherwise stored at the _datet...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~121-~121: It seems likely that a singular genitive (’s) apostrophe is missing.
Context: ...athtimes between the hosp table and the patients table? 2. How to handle the dob nonsens...

(AI_HYDRA_LEO_APOSTROPHE_S_XS)


129-134: Provide more details or examples.

The section mentions the possibility of performing additional pre-MEDS processing but does not provide specific details or examples. To give users a better understanding of the potential future work, consider adding some concrete examples or a high-level description of the types of pre-MEDS processing steps that could be implemented.

AUMC_Example/pre_MEDS.py (1)

55-63: Consider improving the date of birth estimation logic.

The current logic assumes that the patient was born at the midpoint of the year, which may not be accurate. While this assumption is reasonable given the limited information, it's worth exploring if there are better ways to estimate the date of birth for patients with multiple admissions.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9588583 and dc30f46.

Files selected for processing (11)
  • AUMC_Example/README.md (1 hunks)
  • AUMC_Example/configs/event_configs.yaml (1 hunks)
  • AUMC_Example/configs/pre_MEDS.yaml (1 hunks)
  • AUMC_Example/configs/table_preprocessors.yaml (1 hunks)
  • AUMC_Example/joint_script.sh (1 hunks)
  • AUMC_Example/local_parallelism_runner.yaml (1 hunks)
  • AUMC_Example/pre_MEDS.py (1 hunks)
  • AUMC_Example/run.sh (1 hunks)
  • AUMC_Example/slurm_runner.yaml (1 hunks)
  • MIMIC-IV_Example/README.md (3 hunks)
  • MIMIC-IV_Example/run.sh (1 hunks)
Additional context used
yamllint
AUMC_Example/configs/event_configs.yaml

[error] 4-4: trailing spaces

(trailing-spaces)


[error] 10-10: trailing spaces

(trailing-spaces)


[error] 24-24: trailing spaces

(trailing-spaces)


[error] 34-34: trailing spaces

(trailing-spaces)


[error] 58-58: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 77-77: trailing spaces

(trailing-spaces)


[error] 84-84: trailing spaces

(trailing-spaces)


[error] 90-90: trailing spaces

(trailing-spaces)


[error] 91-91: trailing spaces

(trailing-spaces)


[error] 99-99: trailing spaces

(trailing-spaces)


[error] 108-108: trailing spaces

(trailing-spaces)

AUMC_Example/configs/table_preprocessors.yaml

[error] 5-5: trailing spaces

(trailing-spaces)


[error] 10-10: trailing spaces

(trailing-spaces)


[error] 20-20: trailing spaces

(trailing-spaces)


[error] 24-24: trailing spaces

(trailing-spaces)


[error] 38-38: trailing spaces

(trailing-spaces)


[error] 42-42: trailing spaces

(trailing-spaces)


[error] 56-56: trailing spaces

(trailing-spaces)


[error] 60-60: trailing spaces

(trailing-spaces)


[error] 78-78: trailing spaces

(trailing-spaces)


[error] 95-95: trailing spaces

(trailing-spaces)


[error] 112-112: trailing spaces

(trailing-spaces)


[error] 116-116: no new line character at the end of file

(new-line-at-end-of-file)

LanguageTool
AUMC_Example/README.md

[style] ~28-~28: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ... chmod +x pre_MEDS.py cd .. If you want to install locally, use: bash git clon...

(REP_WANT_TO_VB)


[uncategorized] ~65-~65: Possible missing comma found.
Context: ...original files will be written in a new directory which will be used as the input to the ...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~83-~83: Possible missing comma found.
Context: ...ctory we'll denote as $AUMC_MEDS_DIR. Note this is a different directory than the ...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~114-~114: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...including: INSERT TABLES IGNORED HERE Lots of questions remain about how to appropria...

(A_LOT_OF)


[uncategorized] ~116-~116: Possible missing comma found.
Context: ..._. How should those be slotted into the timeline which is otherwise stored at the _datet...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~121-~121: It seems likely that a singular genitive (’s) apostrophe is missing.
Context: ...athtimes between the hosp table and the patients table? 2. How to handle the dob nonsens...

(AI_HYDRA_LEO_APOSTROPHE_S_XS)


[uncategorized] ~126-~126: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...Notes Note: If you use the slurm system and you launch the hydra submitit jobs from...

(COMMA_COMPOUND_SENTENCE)

Markdownlint
AUMC_Example/README.md

109-109: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


3-3: null
Bare URL used

(MD034, no-bare-urls)


40-40: null
Bare URL used

(MD034, no-bare-urls)

Ruff
AUMC_Example/pre_MEDS.py

11-11: gzip imported but unused

Remove unused import: gzip

(F401)


105-109: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring.

(B021)

Additional comments not posted (41)
AUMC_Example/local_parallelism_runner.yaml (1)

1-3: Approve the parallel processing configuration.

The introduction of the parallelize section with n_workers and launcher parameters is a valuable addition to enable parallel processing capabilities. Here are some key points:

  • The dynamic n_workers configuration, retrieved from the N_WORKERS environment variable, allows for flexibility in adjusting the level of parallelism based on the environment. This is particularly useful when running the application in different environments with varying resource availability.

  • The explicit use of the Joblib library, specified by the launcher parameter, is a reliable choice for managing parallel execution. Joblib is a well-established library that provides efficient and easy-to-use tools for parallel processing in Python.

  • By leveraging parallel processing, the application can significantly speed up tasks that are suitable for concurrent execution. This can lead to improved performance and faster completion times for computationally intensive operations.

However, it's important to consider the following:

  • Ensure that the tasks being parallelized are independent and can be safely executed concurrently without introducing race conditions or data inconsistencies.

  • Monitor the resource utilization, especially when using a high number of worker processes, to avoid overloading the system and ensure optimal performance.

  • Consider providing documentation or guidelines on setting the N_WORKERS environment variable to help users determine an appropriate value based on their system's capabilities and the specific requirements of the application.

Overall, the parallel processing configuration looks good and can provide significant benefits in terms of performance and efficiency.

AUMC_Example/configs/pre_MEDS.yaml (1)

1-11: Excellent configuration setup!

The YAML configuration file demonstrates several best practices:

  1. Using environment variables for input_dir and cohort_dir allows for flexibility and portability across different environments.
  2. Dynamically generating the job name based on the current timestamp ensures unique identification for each execution instance, which is beneficial for tracking and debugging purposes.
  3. Specifying structured logging directories enhances the organization and traceability of logs.
  4. Leveraging Hydra's features for job and run management can simplify the overall application setup.

This configuration provides a clean and structured approach to manage input/output directories and logging while promoting flexibility and ease of monitoring.

AUMC_Example/slurm_runner.yaml (7)

5-12: LGTM!

The configuration parameters for the shard_events task seem appropriate. The allocated resources should be sufficient for most sharding operations, and the use of the "short" partition aligns with the expected task duration.


13-21: LGTM!

The configuration parameters for the split_and_shard_subjects task seem appropriate. The allocated resources should be sufficient for most splitting and sharding operations on subjects, and the use of the "short" partition aligns with the expected task duration.


22-29: LGTM!

The configuration parameters for the convert_to_sharded_events task seem appropriate. The allocated resources should be sufficient for most conversion operations, and the use of the "short" partition aligns with the expected task duration.


30-37: LGTM!

The configuration parameters for the merge_to_MEDS_cohort task seem appropriate. The allocated resources, particularly the high memory allocation, should be sufficient for most merging operations, and the use of the "short" partition aligns with the expected task duration.


38-45: LGTM!

The configuration parameters for the extract_code_metadata task seem appropriate. The allocated resources should be sufficient for most metadata extraction operations, and the use of the "short" partition aligns with the expected task duration.


46-54: LGTM!

The configuration parameters for the finalize_MEDS_metadata task seem appropriate. The allocated resources should be sufficient for most metadata finalization operations, and the use of the "short" partition aligns with the expected task duration. The lower resource allocation compared to other tasks suggests that this task is less computationally intensive.


55-61: LGTM!

The configuration parameters for the finalize_MEDS_data task seem appropriate. The allocated resources, particularly the high memory allocation, should be sufficient for most data finalization operations, and the use of the "short" partition aligns with the expected task duration.

AUMC_Example/configs/event_configs.yaml (3)

1-2: LGTM!

The patient ID column is correctly defined.


41-49: LGTM!

The numericitems structure is correctly defined.


59-67: LGTM!

The freetextitems structure is correctly defined.

AUMC_Example/configs/table_preprocessors.yaml (7)

1-17: LGTM!

The admissions configuration is well-structured and includes all the necessary fields for processing admission data.

Tools
yamllint

[error] 5-5: trailing spaces

(trailing-spaces)


[error] 10-10: trailing spaces

(trailing-spaces)


19-35: Configuration looks good!

The numericitems configuration includes all the necessary fields and follows the expected structure. The warning_items field raises a valid question about handling the registeredat and updatedat fields, which should be addressed.

Tools
yamllint

[error] 20-20: trailing spaces

(trailing-spaces)


[error] 24-24: trailing spaces

(trailing-spaces)


37-53: Configuration looks good!

The listitems configuration follows the same structure as numericitems and includes all the necessary fields. The warning_items field raises the same question about handling the registeredat and updatedat fields, which should be addressed consistently across configurations.

Tools
yamllint

[error] 38-38: trailing spaces

(trailing-spaces)


[error] 42-42: trailing spaces

(trailing-spaces)


55-72: Configuration looks good!

The freetextitems configuration follows the same structure as numericitems and listitems, with the addition of a comment field in output_data_cols. The warning_items field raises the same question about handling the registeredat and updatedat fields, which should be addressed consistently across configurations.

Tools
yamllint

[error] 56-56: trailing spaces

(trailing-spaces)


[error] 60-60: trailing spaces

(trailing-spaces)


74-97: Configuration looks good, but please address the warning items.

The drugitems configuration follows the expected structure and includes additional drug-related fields in output_data_cols. However, the warning_items field raises two important questions:

  1. Ignoring several flags may be a mistake and should be carefully reviewed.
  2. The timing of administered dose recording should be clarified to ensure accurate data processing.

Please address these warning items to ensure the configuration is complete and accurate.

Tools
yamllint

[error] 78-78: trailing spaces

(trailing-spaces)


[error] 95-95: trailing spaces

(trailing-spaces)


99-107: LGTM!

The procedureorderitems configuration follows the expected structure and includes all the necessary fields for processing procedure order data.


108-116: Configuration looks good!

The processitems configuration follows the expected structure and includes the necessary fields for processing process item data.

Tools
yamllint

[error] 112-112: trailing spaces

(trailing-spaces)


[error] 116-116: no new line character at the end of file

(new-line-at-end-of-file)

MIMIC-IV_Example/run.sh (1)

38-42: LGTM!

The validation check for the do_unzip flag in the positional arguments is a good addition. It improves the robustness of the script by ensuring that the script is executed with the correct parameters and prevents incorrect usage related to the do_unzip flag.

AUMC_Example/run.sh (4)

1-4: LGTM!

The shebang line and the set -e command are correctly used to ensure the script fails if any internal command fails. This is a good practice to prevent the script from continuing execution if any command encounters an error.


6-22: LGTM!

The display_help function provides clear and well-formatted usage instructions for the script. It includes details about the required and optional arguments, making it easy for users to understand how to use the script correctly.


24-25: LGTM!

Unsetting the SLURM_CPU_BIND environment variable is a good practice when running the script on a Slurm interactive node with Slurm parallelism. This ensures that the script can run correctly in such environments without encountering issues related to CPU binding.


27-105: LGTM!

The code segment for handling command-line arguments and setting up the environment is implemented correctly. It follows a logical flow and provides clear error messages when required arguments are missing. The handling of the optional do_unzip flag is done properly, with appropriate error handling for invalid values.

The file paths for configuration files and the pre-MEDS script are set using absolute paths, which is a good practice. Exporting the environment variables separately ensures that any errors during assignment are caught.

The unzipping of .csv.gz files is performed conditionally based on the do_unzip flag, and appropriate messages are displayed based on the presence of files to unzip. The pre-MEDS conversion is executed using the specified Python script and input/output directories.

Setting the N_WORKERS environment variable to 1 if it is not already set helps avoid issues with the runners. Finally, the extraction pipeline is run using the specified configuration file.

Overall, the code segment is well-structured and follows best practices for handling command-line arguments and setting up the environment.

AUMC_Example/joint_script.sh (4)

1-22: LGTM!

The script header and help function look good:

  • Using set -e is a good practice to fail fast on any error.
  • The display_help function provides clear usage instructions.
  • The help message includes all the necessary details about the script's purpose, arguments, and options.

24-41: LGTM!

The argument validation logic looks good:

  • Checking for the help flag and displaying the help message is a common pattern.
  • Validating the number of arguments ensures that the script is invoked correctly.
  • Assigning the arguments to variables improves code readability and maintainability.

42-49: LGTM!

The memory optimization note is helpful:

  • It provides guidance for users processing large datasets like AUMCdb.
  • The suggested optimizations help keep the memory burden reasonable.
  • Including example parameters makes it easier for users to apply the optimizations.

51-102: LGTM!

The data processing stages look good:

  • The script follows a logical sequence of steps to process the data.
  • Using separate commands or scripts for each stage makes the code modular and maintainable.
  • Running stages in parallel with multiple workers can significantly speed up processing.
  • Using the Hydra framework for job management simplifies the configuration of parallel stages.
MIMIC-IV_Example/README.md (5)

21-21: Improved clarity in environment variable description.

The updated description for MIMICIV_PRE_MEDS_DIR provides a clearer understanding of its purpose, distinguishing it from the directory for raw data.


22-22: Improved clarity in environment variable description.

The updated description for MIMICIV_MEDS_COHORT_DIR provides a clearer understanding of its purpose, specifying that it stores the final MEDS data.


49-49: Improved correctness of metadata file download instructions.

The addition of the cd command ensures that the metadata files are downloaded to the correct directory (MIMICIV_RAW_DIR), improving the clarity and correctness of the instructions.


68-68: Updated command to use the correct environment variable.

The modification to the MEDS ETL command ensures that it uses the MIMICIV_MEDS_COHORT_DIR environment variable, which aligns with the clarified purpose of storing final MEDS data.


70-72: Helpful note on memory usage and environment variable settings.

The added note provides valuable information to users about potential memory usage issues during the ETL process. The suggestion to reduce the shard size in the configuration file offers a practical way to manage memory better. Additionally, the emphasis on checking environment variable settings helps ensure a smooth ETL process.

AUMC_Example/README.md (4)

7-36: LGTM!

The installation instructions are clear, concise, and cover both PyPI and local cloning methods. The inclusion of conda environment setup and package installation steps ensures a complete setup for users.

Tools
LanguageTool

[style] ~28-~28: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ... chmod +x pre_MEDS.py cd .. If you want to install locally, use: bash git clon...

(REP_WANT_TO_VB)


38-41: LGTM!

The download instructions are clear and provide the necessary information for users to obtain the AUMC dataset. The mention of requiring raw .csv files and introducing the $AUMC_RAW_DIR variable adds clarity to the process.

Tools
Markdownlint

40-40: null
Bare URL used

(MD034, no-bare-urls)


42-107: LGTM!

The instructions for running the MEDS ETL process are well-structured and informative. The breakdown into data preparation and MEDS extraction sub-steps provides clarity. The mention of different parallelism options and the expected time and resource requirements is helpful for users.

Tools
LanguageTool

[uncategorized] ~65-~65: Possible missing comma found.
Context: ...original files will be written in a new directory which will be used as the input to the ...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~83-~83: Possible missing comma found.
Context: ...ctory we'll denote as $AUMC_MEDS_DIR. Note this is a different directory than the ...

(AI_HYDRA_LEO_MISSING_COMMA)


126-127: LGTM!

The note about running unset SLURM_CPU_BIND when launching hydra submitit jobs from an interactive slurm node is a helpful tip for users who may encounter errors in that scenario.

Tools
LanguageTool

[uncategorized] ~126-~126: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...Notes Note: If you use the slurm system and you launch the hydra submitit jobs from...

(COMMA_COMPOUND_SENTENCE)

AUMC_Example/pre_MEDS.py (4)

26-36: LGTM!

The function correctly loads the raw AUMCdb file into a Polars DataFrame using pl.scan_csv. The large infer_schema_length value and utf8-lossy encoding ensure that the schema is inferred correctly and encoding errors are handled gracefully.


67-73: LGTM!

The filtering and column selection logic is correct. It keeps only the first admission for each patient and selects the relevant columns for the patient and admission DataFrames.


76-129: LGTM!

The function correctly returns a function that joins a DataFrame to the patient table and adds pseudotimes. It handles various input types, validates the arguments, and selects the relevant columns for the output DataFrame.

Tools
Ruff

105-109: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring.

(B021)


132-207: LGTM!

The main function correctly orchestrates the overall workflow. It loads the preprocessor configurations, processes the patient table, and iterates through the raw cohort files to apply the appropriate preprocessor functions. The output DataFrames are written to Parquet files, and the function handles skipping already processed or unsupported files.

- "starttime"
- "stoptime"
output_data_cols:
- "item"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line at the end of the file.

The static analysis tool has identified a missing new line at the end of the file. Please add one to adhere to formatting best practices.

Tools
yamllint

[error] 116-116: no new line character at the end of file

(new-line-at-end-of-file)

- "measuredat"
- "registeredat"
- "updatedat"
pseudotime_col:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing spaces.

The static analysis tool has identified trailing spaces on this line. Please remove them to adhere to formatting best practices.

-  pseudotime_col: 
+  pseudotime_col:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pseudotime_col:
pseudotime_col:
Tools
yamllint

[error] 60-60: trailing spaces

(trailing-spaces)

- "specialty"

numericitems:
offset_col:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing spaces.

The static analysis tool has identified trailing spaces on this line. Please remove them to adhere to formatting best practices.

-  offset_col: 
+  offset_col:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
offset_col:
offset_col:
Tools
yamllint

[error] 20-20: trailing spaces

(trailing-spaces)

offset_col:
- "start"
- "stop"
pseudotime_col:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing spaces.

The static analysis tool has identified trailing spaces on this line. Please remove them to adhere to formatting best practices.

-  pseudotime_col: 
+  pseudotime_col:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pseudotime_col:
pseudotime_col:
Tools
yamllint

[error] 78-78: trailing spaces

(trailing-spaces)

offset_col:
- "start"
- "stop"
pseudotime_col:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing spaces.

The static analysis tool has identified trailing spaces on this line. Please remove them to adhere to formatting best practices.

-  pseudotime_col: 
+  pseudotime_col:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pseudotime_col:
pseudotime_col:
Tools
yamllint

[error] 112-112: trailing spaces

(trailing-spaces)

- "measuredat"
- "registeredat"
- "updatedat"
pseudotime_col:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing spaces.

The static analysis tool has identified trailing spaces on this line. Please remove them to adhere to formatting best practices.

-  pseudotime_col: 
+  pseudotime_col:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pseudotime_col:
pseudotime_col:
Tools
yamllint

[error] 42-42: trailing spaces

(trailing-spaces)

- "How should we deal with `registeredat` and `updatedat`?"

listitems:
offset_col:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing spaces.

The static analysis tool has identified trailing spaces on this line. Please remove them to adhere to formatting best practices.

-  offset_col: 
+  offset_col:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
offset_col:
offset_col:
Tools
yamllint

[error] 38-38: trailing spaces

(trailing-spaces)

Comment on lines +111 to +113
Currently, some tables are ignored, including:

INSERT TABLES IGNORED HERE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the placeholder for ignored tables.

The section mentions that some tables are currently ignored but does not provide the specific table names. Please update the placeholder INSERT TABLES IGNORED HERE with the actual names of the ignored tables to provide clarity to users.


root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

import gzip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unused import.

The gzip module is imported but not used in the script. Please remove the unused import to keep the code clean.

-import gzip
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import gzip
Tools
Ruff

11-11: gzip imported but unused

Remove unused import: gzip

(F401)

Comment on lines +105 to +109
f"""Takes the {table_name} table and converts it to a form that includes pseudo-timestamps.

The output of this process is ultimately converted to events via the `{table_name}` key in the
`configs/event_configs.yaml` file.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the docstring for the returned function.

The docstring for the returned function is an f-string, which may not be interpreted correctly as a docstring by Python. Please use a regular docstring instead.

-        f"""Takes the {table_name} table and converts it to a form that includes pseudo-timestamps.
-
-        The output of this process is ultimately converted to events via the `{table_name}` key in the
-        `configs/event_configs.yaml` file.
-        """
+        """Takes the {table_name} table and converts it to a form that includes pseudo-timestamps.
+
+        The output of this process is ultimately converted to events via the `{table_name}` key in the
+        `configs/event_configs.yaml` file.
+        """
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
f"""Takes the {table_name} table and converts it to a form that includes pseudo-timestamps.
The output of this process is ultimately converted to events via the `{table_name}` key in the
`configs/event_configs.yaml` file.
"""
"""Takes the {table_name} table and converts it to a form that includes pseudo-timestamps.
The output of this process is ultimately converted to events via the `{table_name}` key in the
`configs/event_configs.yaml` file.
"""
Tools
Ruff

105-109: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring.

(B021)

@rvandewater rvandewater changed the base branch from main to dev September 18, 2024 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants