Release candidate for 0.0.4 #138
Conversation
…an example of loading MEDS_transforms configs as defaults dynamically.
…low for pushing to pypi. The pypi push workflow requires local tagging via 'git tag 0.0.3' then 'git push origin 0.0.3'
Updated workflows and README.
Caution: Review failed. The pull request is closed.

Walkthrough

The recent changes significantly enhance coding standards, automation, and data processing capabilities within the project. Key updates include the introduction of an …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CI_CD as "CI/CD System"
    participant PackageRepo as "Package Repository"
    participant TestResults as "Testing Framework"
    User->>+CI_CD: Push code changes
    CI_CD->>+PackageRepo: Build and publish package
    CI_CD->>+TestResults: Run automated tests
    TestResults-->>-CI_CD: Return test results
    CI_CD-->>-User: Notify completion of workflow
```
Codecov Report

Attention: Patch coverage is …

✅ All tests successful. No failed tests found.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
+ Coverage   92.48%   92.91%   +0.42%
==========================================
  Files          24       25       +1
  Lines        1664     1863     +199
==========================================
+ Hits         1539     1731     +192
- Misses        125      132       +7
```

☔ View full report in Codecov by Sentry.
Adds a reshard-by-split stage as a transform.
Actionable comments posted: 12
Outside diff range, codebase verification and nitpick comments (19)
docs/javascripts/mathjax.js (1)
Lines 14-19: Ensure MathJax is initialized correctly on document events.

The subscription to the document event to clear and reset MathJax is a good practice to ensure that mathematical expressions are rendered correctly after dynamic content changes. Consider adding comments to explain why these specific MathJax methods are called during the event subscription for better maintainability.

```diff
+// Clear and reset MathJax to ensure proper rendering of dynamic content
```
docs/gen_ref_pages.py (2)
Lines 1-2: Add module-level docstring.

The file starts with a docstring, but it could be expanded to explain the purpose and usage of the script in more detail.

```diff
- """Generate the code reference pages."""
+ """Generate the code reference pages for the project.
+
+ This script uses mkdocs_gen_files to create Markdown documentation files for each
+ Python module in the src directory. It handles special cases for __init__.py and
+ __main__.py files.
+ """
```

Lines 34-35: Consider adding more context to identifiers.

The identifier `ident` is constructed from module path parts. Consider adding more context or comments to explain its purpose.

```diff
- ident = ".".join(parts)
+ # Construct the module identifier for mkdocs.
+ ident = ".".join(parts)
```

.github/workflows/python-build.yaml (2)
Lines 1-4: Clarify event trigger.

The workflow is triggered on every push. Consider specifying branches or tags to avoid unnecessary runs.

```diff
- on: push
+ on:
+   push:
+     branches:
+       - main
+     tags:
+       - 'v*.*.*'
```
Lines 16-21: Recommendation: Use a virtual environment for secure installation of build tools.

The current workflow does not utilize a virtual environment for installing the `build` tool, which could lead to conflicts with other packages. It is recommended to modify the workflow to create and activate a virtual environment before running the `pip install` command. This ensures the installation is isolated and secure.

- Modify the workflow to include `python3 -m venv venv && source venv/bin/activate` before installing packages.

Analysis chain

Ensure secure installation of build tools.

The installation of build tools should be secure. Consider using a virtual environment or ensuring the installation command is safe.

Scripts executed:

```bash
#!/bin/bash
# Description: Verify the security of build tool installation.
# Test: Check if the build tool installation is performed in a secure manner.
# Expect: Installation should be isolated or verified to be secure.
python3 -m venv venv && source venv/bin/activate && pip install build --user
```

Length of output: 97

```bash
#!/bin/bash
# Search for virtual environment setup or secure installation practices in the workflow file.
rg 'python3 -m venv|pip install' .github/workflows/python-build.yaml -A 3
```

Length of output: 186
docs/pipeline_configuration.md (3)
Lines 3-7: Clarify the design philosophy.

The introduction is clear but could benefit from a more concise explanation of the design philosophy to enhance readability.

```diff
- MEDS Transform's key design philosophy is that you should be able to design the necessary data pre-processing
- and transformation pipeline you need simply by composing a series of simple, reusable, efficient, and
- configurable local transformations into a larger pipeline. This can allow researchers to balance having highly
- flexible, customizable pipelines with ensuring that all operations in the pipeline are simple, shareable,
- efficient, and easy to validate.
+ MEDS Transform's design philosophy emphasizes creating data pre-processing and transformation pipelines by
+ composing simple, reusable, and efficient local transformations. This approach balances flexibility and
+ customization with simplicity, shareability, efficiency, and ease of validation.
```

Lines 16-29: Address loose punctuation in stage descriptions.

Some stage descriptions have loose punctuation marks that can be tightened for better readability.

```diff
- 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the
-    data).
+ 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the data).
- 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes
-    that occur more than 10 times).
+ 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes that occur more than 10 times).
- 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code that is in the
-    current running metadata file, which, due to `stage_2`, are those codes that occur more than 10 times).
+ 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code in the current running metadata file, which, due to `stage_2`, are those codes that occur more than 10 times).
```

Tools
LanguageTool
[uncategorized] ~16-~16: Loose punctuation mark.
Context: ...s of the following stages: 1.stage_1
: A metadata map-reduce stage (e.g., coun...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~18-~18: Loose punctuation mark.
Context: ...the codes in the data). 2.stage_2
: A metadata-only processing stage (e.g.,...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~20-~20: Loose punctuation mark.
Context: ... occur more than 10 times). 3.stage_3
: A data processing stage (e.g., filterin...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... occur more than 10 times). 4.stage_4
: A metadata map-reduce stage (e.g., comp...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~24-~24: Loose punctuation mark.
Context: ...al values in the data). 5.stage_5
: A data processing stage (e.g., occludin...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~26-~26: Loose punctuation mark.
Context: ... deviations from the mean). 6.stage_6
: A metadata map-reduce stage (e.g., comp...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~28-~28: Loose punctuation mark.
Context: ...al values in the data). 7.stage_7
: A data processing stage (e.g., normaliz...(UNLIKELY_OPENING_PUNCTUATION)
Lines 33-81: Improve clarity and grammar in file management descriptions.

Some descriptions can be made clearer, and a missing comma in stage 4 needs to be addressed.

```diff
- 1. `stage_1`:
-    - As there is no preceding data stage, this stage will read the data in from `$INPUT_DIR/data`
-      (the `data` suffix is the default data directory for MEDS datasets).
+ 1. `stage_1`:
+    - This stage reads data from `$INPUT_DIR/data` (the `data` suffix is the default data directory
+      for MEDS datasets) as there is no preceding data stage.
- 4. `stage_4`:
-    - This stage will read in the data from the `$COHORT_DIR/stage_3` directory as that is the prior data
-      processing stage.
+ 4. `stage_4`:
+    - This stage reads data from the `$COHORT_DIR/stage_3` directory, as that is the prior data
+      processing stage.
```

Tools
LanguageTool
[uncategorized] ~33-~33: Loose punctuation mark.
Context: ...s in the following manner. 1.stage_1
: - As there is no preceding data sta...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...1/codes.parquetdirectory. 2.
stage_2`: - This stage will read in the metad...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~45-~45: Loose punctuation mark.
Context: ...2/codes.parquetdirectory. 3.
stage_3`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~51-~51: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 4.
stage_4`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~52-~52: Possible missing comma found.
Context: ...the data from the$COHORT_DIR/stage_3
directory as that is the prior data proce...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~60-~60: Loose punctuation mark.
Context: ...tage_4/codes.parquetfile. 5.
stage_5`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 6.
stage_6`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...n to the stage directories. 7.stage_7
: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
tests/utils.py (1)
Line 107: Inconsistent usage of `do_use_config_yaml` in test cases.

The `do_use_config_yaml` parameter is used in `tests/transform_tester_base.py` but is missing in `tests/test_extract.py`, where `run_command` is invoked multiple times. Please review the test cases to ensure consistent usage of this parameter where applicable.

- `tests/transform_tester_base.py`: `do_use_config_yaml` is explicitly set.
- `tests/test_extract.py`: `do_use_config_yaml` is not used.

Analysis chain

Ensure correct usage of `do_use_config_yaml`.

The new parameter `do_use_config_yaml` adds flexibility to the function. Ensure that its usage is consistent across all test cases.

Scripts executed:

```bash
#!/bin/bash
# Description: Verify consistent usage of `do_use_config_yaml` in test cases.
# Test: Search for test cases using `run_command`. Expect: Proper usage of `do_use_config_yaml`.
rg --type python 'run_command' -A 3
```

Length of output: 66

```bash
#!/bin/bash
# Description: Verify consistent usage of `do_use_config_yaml` in test cases.
# Test: Search for test cases using `run_command`. Expect: Proper usage of `do_use_config_yaml`.
rg --type py 'run_command' -A 3
```

Length of output: 3436
src/MEDS_transforms/reshard_to_split.py (2)
Lines 45-55: Handle file existence more gracefully.

The code raises a `FileExistsError` if a file already exists and `do_overwrite` is not set. Consider logging this event before raising the error for better traceability.

```diff
+ logger.error(f"{str(shards_fp.resolve())} already exists and overwrite is not allowed.")
  raise FileExistsError(f"{str(shards_fp.resolve())} already exists.")
```

Lines 137-144: Clarify exception handling and cleanup logic.

The cleanup logic in the `try` block could benefit from more detailed logging. Consider adding logs for each step in the cleanup process to help diagnose potential issues.

```diff
+ logger.debug(f"Attempting to remove directory {str(subshard_dir.resolve())}.")
  subshard_dir.rmdir()
```
tests/test_reshard_to_split.py (1)
Lines 1-5: Clarify environment variable usage in comments.

The comment mentions setting an environment variable to use local scripts. Ensure this is clear and accurate for users running the tests.

```diff
- Set the bash env variable `DO_USE_LOCAL_SCRIPTS=1` to use the local py files, rather than the installed
- scripts.
+ Set the environment variable `DO_USE_LOCAL_SCRIPTS=1` to use local Python files instead of installed scripts.
```

eICU_Example/README.md (1)

Line 13: Fix unordered list indentation.

Adjust the indentation of the unordered list items to meet markdownlint standards and improve readability.

```diff
-     - [ ] Identifying the pre-MEDS steps for eICU
+   - [ ] Identifying the pre-MEDS steps for eICU
-     - [ ] Test that it runs at all.
-     - [ ] Test that the output is as expected.
+   - [ ] Test that it runs at all.
+   - [ ] Test that the output is as expected.
-     - [ ] Sub-sharding
-     - [ ] Patient split gathering
-     - [ ] Event extraction
-     - [ ] Merging
+   - [ ] Sub-sharding
+   - [ ] Patient split gathering
+   - [ ] Event extraction
+   - [ ] Merging
-     - [ ] Basic validation
-     - [ ] Detailed validation
+   - [ ] Basic validation
+   - [ ] Detailed validation
```

Also applies to: 15-16, 21-24, 26-27
Tools
Markdownlint
13-13: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
src/MEDS_transforms/transforms/tokenization.py (1)
Lines 237-241: Ensure necessary directories are created before writing files.

The directories "schemas" and "event_seqs" do not exist, which could lead to runtime errors when writing files. Ensure that the code creates these directories if they do not already exist before writing to them.

- Consider adding logic to create the directories before file operations.

Analysis chain

Simplify output file path management.

The direct handling of `shards_single_output` simplifies file path management, reducing complexity. Ensure that all necessary directories exist before writing to prevent runtime errors.

Scripts executed:

```bash
#!/bin/bash
# Description: Verify that the necessary directories for output files exist.
# Test: Check for the existence of required directories. Expect: Directories should exist or be created.
for dir in "schemas" "event_seqs"; do
  if [ ! -d "$dir" ]; then
    echo "Directory $dir does not exist."
  fi
done
```

Length of output: 284
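The directory-creation guard the comment asks for is one call in `pathlib`. A minimal sketch, assuming a hypothetical `write_output` helper; only the `schemas` and `event_seqs` names come from the review:

```python
from pathlib import Path


def write_output(out_fp: Path, payload: bytes) -> None:
    """Create the parent directory if needed, then write the file."""
    # `parents=True` creates intermediate directories; `exist_ok=True` makes the
    # call idempotent, so repeated writes to the same shard directory are safe.
    out_fp.parent.mkdir(parents=True, exist_ok=True)
    out_fp.write_bytes(payload)
```

Placing the `mkdir` next to the write (rather than once at startup) keeps the guard correct even when shard directories are computed dynamically per output file.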
docs/preprocessing_operation_prototypes.md (1)
Line 80: Consider simplifying phrasing.

The phrase "prior to" could be simplified to "before" for conciseness.

```diff
- patient data filters should be applied prior to aggregation.
+ patient data filters should be applied before aggregation.
```

Tools
LanguageTool
[style] ~80-~80: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ... patient data filters should be applied prior to aggregation. 2. What aggregation functi...(EN_WORDINESS_PREMIUM_PRIOR_TO)
src/MEDS_transforms/extract/README.md (1)
Line 255: Consider revising for professionalism.

The phrase "feel free to" is common and can be replaced with a more formal alternative.

```diff
- Feel free to reach out if you have any questions or concerns about this.
+ Please reach out if you have any questions or concerns about this.
```

Tools
LanguageTool
[style] ~255-~255: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... extraction tool for your raw data. Feel free to reach out if you have any questions or ...(FEEL_FREE_TO_STYLE_ME)
src/MEDS_transforms/mapreduce/mapper.py (1)
Line range hint 447-569: Avoid using `partial` in default arguments.

The `compute_fn` argument in `bind_compute_fn` uses `partial` in its default. It's recommended to avoid function calls in default arguments; define the `partial` outside the function instead.

```python
# Define the default compute function outside the signature.
default_compute_fn = partial(pl.scan_parquet, glob=False)

def bind_compute_fn(
    cfg: DictConfig, stage_cfg: DictConfig, compute_fn: ANY_COMPUTE_FN_T = default_compute_fn
) -> COMPUTE_FN_T:
    ...
```

README.md (2)
Lines 25-33: Improve consistency in installation instructions.

The installation instructions are clear, but ensure consistency in formatting and phrasing across bullet points for better readability.

```diff
- For a pypi installation, install with `pip install MEDS-transforms`.
+ For a PyPI installation, use `pip install MEDS-transforms`.
```

Lines 40-78: Enhance clarity in the design philosophy section.

The design philosophy section is well-structured. Consider adding examples or links to relevant sections for each philosophy point to enhance understanding.

```diff
- This representation is the Medical Event Data Standard (MEDS) format, and in this repository we use it in the "flat" format, where data is organized as rows of `patient_id`, `time`, `code`, `numeric_value` columns.
+ This representation is the Medical Event Data Standard (MEDS) format. In this repository, we use the "flat" format, where data is organized as rows of `patient_id`, `time`, `code`, and `numeric_value` columns.
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (42)
- .editorconfig (1 hunks)
- .github/workflows/python-build.yaml (1 hunks)
- .github/workflows/tests.yaml (1 hunks)
- .pre-commit-config.yaml (3 hunks)
- .readthedocs.yaml (1 hunks)
- MIMIC-IV_Example/README.md (2 hunks)
- README.md (8 hunks)
- docs/gen_ref_pages.py (1 hunks)
- docs/index.md (1 hunks)
- docs/javascripts/mathjax.js (1 hunks)
- docs/pipeline_configuration.md (1 hunks)
- docs/preprocessing_operation_prototypes.md (7 hunks)
- eICU_Example/README.md (10 hunks)
- eICU_Example/configs/table_preprocessors.yaml (1 hunks)
- mkdocs.yml (1 hunks)
- pyproject.toml (2 hunks)
- src/MEDS_transforms/init.py (2 hunks)
- src/MEDS_transforms/aggregate_code_metadata.py (10 hunks)
- src/MEDS_transforms/configs/extract.yaml (1 hunks)
- src/MEDS_transforms/configs/preprocess.yaml (2 hunks)
- src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml (1 hunks)
- src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml (1 hunks)
- src/MEDS_transforms/extract/README.md (6 hunks)
- src/MEDS_transforms/extract/init.py (2 hunks)
- src/MEDS_transforms/extract/convert_to_sharded_events.py (2 hunks)
- src/MEDS_transforms/extract/extract_code_metadata.py (2 hunks)
- src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2 hunks)
- src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2 hunks)
- src/MEDS_transforms/extract/split_and_shard_patients.py (7 hunks)
- src/MEDS_transforms/mapreduce/mapper.py (4 hunks)
- src/MEDS_transforms/mapreduce/utils.py (3 hunks)
- src/MEDS_transforms/reshard_to_split.py (1 hunks)
- src/MEDS_transforms/transforms/tokenization.py (3 hunks)
- src/MEDS_transforms/utils.py (8 hunks)
- tests/test_add_time_derived_measurements.py (1 hunks)
- tests/test_extract.py (1 hunks)
- tests/test_filter_measurements.py (2 hunks)
- tests/test_occlude_outliers.py (1 hunks)
- tests/test_reorder_measurements.py (1 hunks)
- tests/test_reshard_to_split.py (1 hunks)
- tests/transform_tester_base.py (7 hunks)
- tests/utils.py (2 hunks)
Files skipped from review due to trivial changes (11)
- .editorconfig
- .readthedocs.yaml
- MIMIC-IV_Example/README.md
- docs/index.md
- eICU_Example/configs/table_preprocessors.yaml
- src/MEDS_transforms/configs/extract.yaml
- src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml
- src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml
- tests/test_add_time_derived_measurements.py
- tests/test_occlude_outliers.py
- tests/test_reorder_measurements.py
Additional context used
LanguageTool
docs/pipeline_configuration.md
[grammar] ~82-~82: The verb form ‘reduce’ does not seem to be suitable in this context.
Context: ...ssing stages (which do not have reduce stages).(HAVE_VB)
src/MEDS_transforms/extract/README.md
[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...(EN_A_VS_AN)
README.md
[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...(MD_BASEFORM)
[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydrajoblib
l...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~181-~181: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ... and external ontologies? Will they need to specify select key-event concepts to id...(REP_NEED_TO_VB)
[uncategorized] ~207-~207: Possible missing comma found.
Context: ...achine or across many worker nodes on a cluster provided the worker nodes have acce...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~306-~306: This phrase is redundant. Consider using “outside”.
Context: ... - Removing numeric values that fall outside of pre-specified, per-code ranges (e.g., f...(OUTSIDE_OF)
[style] ~315-~315: Consider using “who” when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...(THAT_WHO)
[style] ~324-~324: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... each code occurs in the dataset. - Computing appropriate numeric bins for each code ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[uncategorized] ~328-~328: Possible missing preposition found.
Context: ...ure that the only columns that need be processed going into the pre-processing...(AI_HYDRA_LEO_MISSING_TO)
[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...(KIND_OF_A)
Ruff
src/MEDS_transforms/reshard_to_split.py
116-116: Function definition does not bind loop variable `in_fps` (B023)
src/MEDS_transforms/extract/split_and_shard_patients.py
22-22: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within the function. (B006)
src/MEDS_transforms/mapreduce/utils.py
247-250: Use a ternary operator instead of an `if`-`else` block (SIM108).

Replace the `if`-`else` block with:

```python
add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())
```
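The SIM108 one-liner above can be exercised on its own; the `worker_suffix` wrapper here is hypothetical scaffolding around the suggested expression, not the project's function:

```python
from datetime import datetime


def worker_suffix(cfg: dict) -> str:
    """Pick a per-worker suffix, falling back to a timestamp."""
    # Conditional expression replacing the original four-line if/else:
    #   if "worker" in cfg: add_str = str(cfg["worker"])
    #   else:               add_str = str(datetime.now())
    add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())
    return add_str
```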
src/MEDS_transforms/mapreduce/mapper.py
574-574: Do not perform function call `partial` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable. (B008)
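The module-level-singleton fix for B008 can be sketched without polars; `scan` below is a stand-in for `pl.scan_parquet` so the example is self-contained, and the names are illustrative:

```python
from functools import partial


def scan(path: str, glob: bool) -> str:
    """Stand-in for `pl.scan_parquet`; records how it was called."""
    return f"scan({path}, glob={glob})"


# Build the partial once at module level and reference it as the default.
# Defaults are evaluated a single time at function-definition time, so the flagged
# `compute_fn=partial(...)` in the signature hides that single eager evaluation;
# naming the singleton makes it explicit and keeps the signature call-free.
DEFAULT_COMPUTE_FN = partial(scan, glob=False)


def bind_compute_fn(path: str, compute_fn=DEFAULT_COMPUTE_FN) -> str:
    return compute_fn(path)
```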
Markdownlint
eICU_Example/README.md
13-13: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
15-15: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
16-16: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
21-21: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
22-22: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
23-23: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
24-24: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
26-26: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
27-27: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
Additional comments not posted (61)
docs/javascripts/mathjax.js (1)
Lines 1-12: MathJax configuration looks good.

The configuration for inline and display math is standard, and the use of `processEscapes` and `processEnvironments` is appropriate for handling TeX expressions.

src/MEDS_transforms/extract/__init__.py (2)
Line 3: Imports are appropriate for the new functionality.

The import of `MANDATORY_COLUMNS` and `MANDATORY_TYPES` is necessary for constructing `MEDS_DATA_MANDATORY_TYPES` dynamically.

Line 16: Dynamic construction of `MEDS_DATA_MANDATORY_TYPES` improves maintainability.

Using dictionary comprehension to construct `MEDS_DATA_MANDATORY_TYPES` from `MANDATORY_COLUMNS` and `MANDATORY_TYPES` enhances flexibility and reduces the risk of errors when updating mandatory fields.

src/MEDS_transforms/__init__.py (4)
Line 4: Import of `polars` is appropriate for data type handling.

The inclusion of `polars` is necessary for defining data types in `MANDATORY_TYPES`, which enhances type safety.

Line 15: Definition of `MANDATORY_COLUMNS` ensures data integrity.

Listing essential fields in `MANDATORY_COLUMNS` helps maintain data integrity by enforcing the presence of these fields in the dataset.

Lines 17-24: Use of `polars` types in `MANDATORY_TYPES` enhances type safety.

Defining expected data types with `polars` types ensures that data processing is consistent and reliable.
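The derive-the-schema-from-shared-constants pattern these comments describe can be sketched without polars installed. The string "dtypes" and the exact column set below are stand-ins for the real `pl` types and module contents (assumed, not taken from the source):

```python
# Plain-string stand-ins for the polars dtypes the real module uses (e.g. pl.Int64).
MANDATORY_TYPES = {
    "patient_id": "Int64",
    "time": "Datetime",
    "code": "Utf8",
    "numeric_value": "Float32",
}
MANDATORY_COLUMNS = ["patient_id", "time", "code", "numeric_value"]

# Build the data schema by comprehension over the shared column list, so adding a
# mandatory field in one place automatically propagates to the derived schema.
MEDS_DATA_MANDATORY_TYPES = {col: MANDATORY_TYPES[col] for col in MANDATORY_COLUMNS}
```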
Lines 26-33: Mapping deprecated names maintains backward compatibility.

The `DEPRECATED_NAMES` dictionary helps transition to new naming conventions while maintaining backward compatibility with older datasets.

src/MEDS_transforms/configs/preprocess.yaml (1)
Line 4: Ensure compatibility with the new stage `reshard_to_split`.

The addition of `reshard_to_split` in the `stage_configs` section modifies the data processing pipeline. Verify that all necessary components are in place to support this new stage.

Verification successful

The `reshard_to_split` stage is supported and configured properly. It is referenced in various parts of the codebase, including configuration files, test scripts, and as an entry point in `pyproject.toml`. This indicates that the necessary components are in place to support this new stage.

- Entry point: Defined in `pyproject.toml` as `MEDS_transform-reshard_to_split`.
- Tests: Present in `tests/test_reshard_to_split.py`.
- Configuration: Parameters set in `src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml`.

Scripts executed:

```bash
#!/bin/bash
# Description: Ensure compatibility with the new stage `reshard_to_split`.
# Test: Search for the `reshard_to_split` function or class definition. Expect: At least one definition.
ast-grep --lang python --pattern $'def reshard_to_split($_):\n  $$$'
```

Length of output: 68

```bash
#!/bin/bash
# Broaden the search for any occurrences of `reshard_to_split` in the codebase.
rg 'reshard_to_split' -A 3
```

Length of output: 1869
.github/workflows/tests.yaml (2)
Line 36: Enhance test reporting with JUnit XML.

The addition of `--junitxml=junit.xml` in the `pytest` command enhances test reporting by generating a JUnit XML report, which is useful for CI/CD integration.

Lines 42-46: Robust test result upload to Codecov.

The new step to upload test results to Codecov, conditioned on the previous steps not being cancelled, adds robustness to the workflow. Ensure the `CODECOV_TOKEN` is securely managed.

mkdocs.yml (1)
Lines 1-45: Comprehensive MkDocs configuration.

The new `mkdocs.yml` file is well-structured, utilizing the Material theme and various extensions to enhance the documentation. Ensure that all referenced files and scripts, like `docs/gen_ref_pages.py`, exist and are correctly configured.

docs/gen_ref_pages.py (3)
12-30: Ensure correct handling of special cases. The logic correctly skips `__main__.py` files and adjusts paths for `__init__.py` files. Ensure that this behavior aligns with your documentation requirements.

Verification successful

Handling of `__init__.py` and `__main__.py` files is correct. The code correctly processes `__init__.py` files by converting them to `index.md` and skips `__main__.py` files as intended. The current implementation aligns with the documentation requirements.

- `__init__.py` files found in the `src` directory will be converted to `index.md`.
- No `__main__.py` files were found, confirming that none are incorrectly processed.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the handling of __init__.py and __main__.py files in the documentation generation.
# Test: List all __init__.py and __main__.py files in the src directory.
# Expect: __main__.py files are excluded from documentation, __init__.py files are converted to index.md.
fd --type f --name '__init__.py' src/
fd --type f --name '__main__.py' src/
```

Length of output: 577

Script:

```shell
#!/bin/bash
# Description: Verify the handling of __init__.py and __main__.py files in the documentation generation.
# Test: List all __init__.py and __main__.py files in the src directory.
# Expect: __main__.py files are excluded from documentation, __init__.py files are converted to index.md.
fd '__init__.py' src/
fd '__main__.py' src/
```

Length of output: 334
32-32: Verify navigation structure. The navigation structure is built using `mkdocs_gen_files.Nav`. Ensure that the generated structure meets the project's documentation needs.

37-40: Check for file path correctness. Ensure that the paths used for writing files and setting edit paths are correct and that they align with the project's structure.
pyproject.toml (2)

50-50: Verify new script entry. The new script `MEDS_transform-reshard_to_split` is added. Ensure that the function it points to exists and is correctly implemented.

30-34: Verify documentation dependencies. The new documentation dependencies enhance the project's documentation capabilities. Verify that the specified versions are compatible and meet the project's needs.

Verification successful

Documentation Dependencies Verified

All specified documentation dependencies are available on PyPI and compatible with the project's requirements.

- `mkdocs==1.6.0`
- `mkdocs-material==9.5.31`
- `mkdocstrings==0.25.2`
- `mkdocs-gen-files==0.5.0`
- `mkdocs-literate-nav==0.6.1`
- `mkdocs-section-index==0.3.9`
- `mkdocs-git-authors-plugin==0.9.0`
- `mkdocs-git-revision-date-localized-plugin==1.2.6`
Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the compatibility of new documentation dependencies.
# Test: Check if the specified versions of mkdocs and related plugins are available.
# Expect: All specified versions should be available for installation.
pip install mkdocs==1.6.0 mkdocs-material==9.5.31 mkdocstrings[python,shell]==0.25.2 mkdocs-gen-files==0.5.0 mkdocs-literate-nav==0.6.1 mkdocs-section-index==0.3.9 mkdocs-git-authors-plugin==0.9.0 mkdocs-git-revision-date-localized-plugin==1.2.6
```

Length of output: 320

Script:

```shell
#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0" "mkdocs-material==9.5.31" "mkdocstrings[python,shell]==0.25.2"
  "mkdocs-gen-files==0.5.0" "mkdocs-literate-nav==0.6.1" "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0" "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  name_version=$(echo $pkg | sed 's/==/-/g' | sed 's/\[.*\]//g')
  curl -s "https://pypi.org/pypi/${name_version}/json" | jq '.info.version'
done
```

Length of output: 2130

Script:

```shell
#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0" "mkdocs-material==9.5.31" "mkdocstrings==0.25.2"
  "mkdocs-gen-files==0.5.0" "mkdocs-literate-nav==0.6.1" "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0" "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  name_version=$(echo $pkg | sed 's/==/-/g')
  package_name=$(echo $name_version | cut -d'-' -f1)
  version=$(echo $name_version | cut -d'-' -f2)
  curl -s "https://pypi.org/pypi/${package_name}/${version}/json" | jq '.info.version'
done
```

Length of output: 2945

Script:

```shell
#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0" "mkdocs-material==9.5.31" "mkdocstrings==0.25.2"
  "mkdocs-gen-files==0.5.0" "mkdocs-literate-nav==0.6.1" "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0" "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  package_name=$(echo $pkg | cut -d'=' -f1)
  version=$(echo $pkg | cut -d'=' -f3)
  curl -s "https://pypi.org/pypi/${package_name}/${version}/json" | jq '.info.version'
done
```

Length of output: 2434
.github/workflows/python-build.yaml (4)

72-78: Verify Sigstore signing configuration. The Sigstore signing step should be correctly configured and secure. Verify the inputs and ensure that the signing process is reliable.

78-95: Check GitHub Release creation. Ensure that the GitHub Release creation step is correctly configured and that all necessary artifacts are uploaded.
Verification successful

GitHub Release creation is correctly configured. The workflow configuration for creating a GitHub Release and uploading artifacts is properly set up. The `dist/**` pattern ensures that all necessary distribution files and signatures are uploaded to the release.

- The `github-release` job downloads artifacts, signs them, and uploads them to the GitHub release.
- The use of `${{ github.ref_name }}` and `${{ github.repository }}` in the `gh` commands is appropriate within the GitHub Actions environment.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the configuration for GitHub Release creation.
# Test: Check if the GitHub Release step uploads all necessary artifacts.
# Expect: All distribution files and signatures should be uploaded.
gh release view '${{ github.ref_name }}' --repo '${{ github.repository }}'
```

Length of output: 148

Script:

```shell
#!/bin/bash
# Search for the GitHub Release creation and artifact upload steps in the workflow file
# to verify if all necessary artifacts are included.

# Find the workflow file
fd 'python-build.yaml' --type f --search-path .github/workflows --exec cat
```

Length of output: 2813
30-33: Confirm tag-based publishing. The workflow publishes to PyPI only on tag pushes. Ensure that this condition is correctly implemented and tested.

Verification successful

Tag-based publishing condition is correctly implemented. The `publish-to-pypi` job in the GitHub Actions workflow is configured to trigger only on tag pushes using the condition `startsWith(github.ref, 'refs/tags/')`. This ensures that publishing to PyPI occurs as expected when a tag is pushed.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify tag-based publishing condition.
# Test: Check if the workflow correctly identifies tag pushes.
# Expect: Workflow should trigger publishing only on tag pushes.
git tag v0.0.4 && git push origin v0.0.4
```

Length of output: 132

Script:

```shell
#!/bin/bash
# Verify the 'publish-to-pypi' job's condition in the GitHub Actions workflow file.
# We are looking for the 'if' condition that should trigger the job only on tag pushes.
rg -A 3 'publish-to-pypi:' .github/workflows/python-build.yaml
```

Length of output: 224
11-15: Verify Python version setup. Ensure that the specified Python version (3.12) is compatible with all dependencies and the project requirements.

.pre-commit-config.yaml (3)

129-129: LGTM! Improved readability of `nbqa-flake8` arguments. The reformatting of arguments into a single line enhances readability without affecting functionality.
4-4: Verify the new exclusion pattern. Ensure that the updated exclusion pattern aligns with the project's requirements and does not exclude important files from pre-commit checks.

Verification successful

Exclusion pattern verified successfully. The exclusion pattern in `.pre-commit-config.yaml` correctly excludes the specified files (`docs/index.md`, `MIMIC-IV_Example/README.md`, and `eICU_Example/README.md`) without affecting other critical files. No issues found with the current configuration.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the exclusion pattern in `.pre-commit-config.yaml`.
# Test: List files that match the exclusion pattern. Expect: No critical files should be excluded.
rg --files --glob '!docs/index.md' --glob '!MIMIC-IV_Example/README.md' --glob '!eICU_Example/README.md'
```

Length of output: 3835
101-103: Verify compatibility of new markdown hooks. Ensure that the newly added hooks (`mdformat-mkdocs`, `mdformat-toc`, `mdformat-admon`) are compatible with the project's markdown files and do not introduce formatting issues.

tests/utils.py (2)

7-7: LGTM! Necessary import for configuration handling. The import of `OmegaConf` is necessary for handling YAML configuration files in the updated `run_command` function.

114-166: Review error handling and resource cleanup. The error handling and resource cleanup logic have been improved. Ensure that temporary directories are always cleaned up to prevent resource leaks.
src/MEDS_transforms/reshard_to_split.py (4)

4-17: Consider removing unused imports. The imports from `hydra`, `loguru`, and `omegaconf` are appropriate for configuration management and logging. However, ensure that all imported modules are used in the file.

87-95: Verify the filtering logic within `rwlock_wrap`. Ensure that the filtering logic applied with `read_and_filter_fntr` is correct and that it efficiently filters the data based on patient IDs.

Verification successful

Filtering logic within `rwlock_wrap` is correctly implemented. The function `read_and_filter_fntr` is used to apply a filter expression to data read from a file. In the context of `rwlock_wrap`, it effectively filters the data based on the specified patient IDs using the expression `pl.col("patient_id").is_in(patients)`. The implementation aligns with the intended functionality.

- `src/MEDS_transforms/reshard_to_split.py`: Usage of `read_and_filter_fntr` with patient ID filtering.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the filtering logic applied in rwlock_wrap.
# Test: Search for read_and_filter_fntr usage to confirm filtering logic.
rg --type python $'read_and_filter_fntr'
```

Length of output: 70

Script:

```shell
#!/bin/bash
# Search for the definition and usage of read_and_filter_fntr to verify the filtering logic.
rg 'read_and_filter_fntr' -A 5
```

Length of output: 2990
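The "fntr" (functor) pattern discussed above — a function that captures a filter and returns a reader applying it — can be sketched in plain Python. This is an illustrative stand-in, not the project's actual implementation: the real `read_and_filter_fntr` builds a polars expression such as `pl.col("patient_id").is_in(patients)`, while here rows are modeled as dicts to keep the sketch dependency-free, and the name `read_and_filter_fntr_sketch` is hypothetical.

```python
def read_and_filter_fntr_sketch(keep_ids: set[int]):
    """Return a reader that keeps only rows whose patient_id is in keep_ids.

    Stand-in for the polars-based functor: the returned function plays the
    role of the wrapped read function handed to rwlock_wrap.
    """

    def read_fn(rows: list[dict]) -> list[dict]:
        # Apply the captured filter to whatever the base reader produced.
        return [r for r in rows if r["patient_id"] in keep_ids]

    return read_fn


rows = [
    {"patient_id": 1, "code": "A"},
    {"patient_id": 2, "code": "B"},
    {"patient_id": 3, "code": "C"},
]
reader = read_and_filter_fntr_sketch({1, 3})
filtered = reader(rows)
```

Because the filter is captured at construction time, the same `read_fn` can be handed to a generic map/reduce wrapper without that wrapper needing to know anything about patient IDs.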
28-32: Check for potential data loading issues. The code reads a Parquet file into a DataFrame. Ensure that the file path is correct and that the file exists in the expected location.

21-23: Ensure correct configuration path usage. The `hydra.main` decorator uses `PREPROCESS_CONFIG_YAML` for configuration. Verify that this path is correct and accessible in the runtime environment.

tests/test_reshard_to_split.py (1)
193-202: Ensure comprehensive test coverage. The test function `test_reshard_to_split` uses `single_stage_transform_tester` to validate the script. Ensure that this covers all critical paths and edge cases of the script.

tests/test_filter_measurements.py (1)

211-228: Ensure filtering logic is accurately tested. The new test function `test_match_revise_filter_measurements` uses specific filtering criteria. Verify that these criteria align with the intended logic and cover necessary edge cases.

Verification successful

Filtering logic in the test function is correctly aligned with the implementation. The test function `test_match_revise_filter_measurements` accurately reflects the filtering logic defined in the source code. It tests various codes and patient thresholds, ensuring comprehensive coverage of the filtering criteria.

- The filtering function `filter_measurements_fntr` in `src/MEDS_transforms/filters/filter_measurements.py` aligns with the test's logic.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the filtering logic in test_match_revise_filter_measurements aligns with the intended logic.
# Test: Check for additional filtering logic or criteria related to filter_measurements.
rg --type python $'filter_measurements'
```

Length of output: 69

Script:

```shell
#!/bin/bash
# Description: Verify the filtering logic in test_match_revise_filter_measurements aligns with the intended logic.
# Test: Search for the term 'filter_measurements' across all Python files to find related logic or criteria.
rg 'filter_measurements' --glob '*.py' -A 5
```

Length of output: 7435
src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2)

149-149: Streamline initialization of `shards_map_fp`. The direct assignment of `shards_map_fp` within the `main` function simplifies the code by reducing unnecessary return values from `stage_init(cfg)`. This is a good practice for improving code readability and maintainability.

196-196: Ensure existence of `cfg.shards_map_fp`. Before using `Path(cfg.shards_map_fp)`, verify that `cfg.shards_map_fp` is a valid and expected path. This prevents potential runtime errors if the configuration is incorrect.

src/MEDS_transforms/transforms/tokenization.py (2)
232-235: Check for unsupported configurations. Raising a `ValueError` when `include_only_train` is true ensures unsupported configurations are flagged early. This is a good practice to prevent unexpected behavior during execution.

20-20: Update import statements. The import of `shard_iterator` from `MEDS_transforms.mapreduce.utils` suggests a shift in utility functions. Ensure that `shard_iterator` is the correct choice for the intended functionality.

Verification successful

Import statement is correct. The `shard_iterator` function is correctly imported from `MEDS_transforms.mapreduce.utils` and is used across multiple files in the codebase. No updates to the import statement are necessary.

- `src/MEDS_transforms/mapreduce/utils.py`: Contains the definition of `shard_iterator`.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the implementation of `shard_iterator` in the codebase.
# Test: Search for the `shard_iterator` function definition. Expect: A valid function definition.
ast-grep --lang python --pattern $'def shard_iterator($_):'
```

Length of output: 60

Script:

```shell
#!/bin/bash
# Searching for any occurrences of `shard_iterator` in the codebase to verify its existence and usage.
rg 'shard_iterator'
```

Length of output: 2577
src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2)

12-12: Ensure consistency with new shard iterator. The import of `shard_iterator_by_shard_map` replaces the previous `shard_iterator`. Ensure that this change is consistent with the intended behavior of the `map_over` function and that it aligns with the overall data processing logic.

Verification successful

Consistent usage of `shard_iterator_by_shard_map` verified. The `shard_iterator_by_shard_map` function is correctly used as an argument in the `map_over` function within `merge_to_MEDS_cohort.py`, ensuring consistency with the intended data processing logic.

- `merge_to_MEDS_cohort.py`: `map_over` uses `shard_iterator_by_shard_map` as `shard_iterator_fntr`.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase.
# Test: Search for the function usage. Expect: Consistent usage with the intended behavior.
rg --type python 'shard_iterator_by_shard_map'
```

Length of output: 77

Script:

```shell
#!/bin/bash
# Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase.
# Test: Search for the function usage. Expect: Consistent usage with the intended behavior.
rg --type py 'shard_iterator_by_shard_map'
```

Length of output: 776

Script:

```shell
#!/bin/bash
# Description: Verify the usage of `map_over` in relation to `shard_iterator_by_shard_map`.
# Test: Search for the `map_over` function in the `merge_to_MEDS_cohort.py` file.
rg 'map_over' src/MEDS_transforms/extract/merge_to_MEDS_cohort.py -A 5
```

Length of output: 363
241-241: Verify the impact of iterator function change. The `shard_iterator_fntr` parameter has been updated to use `shard_iterator_by_shard_map`. Verify that this change does not introduce any issues in the `map_over` function's behavior.

docs/preprocessing_operation_prototypes.md (1)

45-46: Formatting improvements enhance readability. The formatting adjustments improve the document's visual consistency and readability. No functional content changes are made.

Also applies to: 74-74, 82-83, 113-113, 155-155, 203-203, 205-205, 210-211, 222-222, 230-233
src/MEDS_transforms/extract/split_and_shard_patients.py (3)

108-108: Initialize `splits` consistently. The `splits` variable is initialized with `external_splits`, which is appropriate. Ensure this initialization is consistent with the intended logic for handling splits.

111-139: Improved handling of split fractions and warnings. The restructuring of logic to handle split fractions and warnings about empty splits enhances robustness. Ensure that the warning messages are clear and informative.

268-272: Improved logging and file handling. The changes to path handling and logging improve clarity and maintainability. The use of `shards_map_fp` for file operations is a good practice.

src/MEDS_transforms/extract/README.md (1)

8-10: Formatting changes improve readability. The adjustments to indentation and alignment enhance the document's visual clarity.

Also applies to: 32-36, 38-44, 80-90, 92-93, 104-104, 220-221, 248-250, 252-255
tests/transform_tester_base.py (3)

303-305: New parameters enhance configurability. The addition of `do_use_config_yaml`, `input_shards_map`, and `input_splits_map` allows for more flexible input configurations.

366-374: Streamlined command execution logic. The use of `run_command_kwargs` to pass arguments improves readability and maintainability.

320-336: Ensure correct default handling for input maps. The logic for defaulting `input_shards_map` and `input_splits_map` is correctly implemented, but verify that these defaults align with expected usage scenarios.

Verification successful
Defaults for `input_shards_map` and `input_splits_map` are correctly handled and tested. The search results indicate that there are test cases where `input_shards_map` and `input_splits_map` are not explicitly provided, allowing the defaults to be used. This confirms that the default handling logic is being tested as expected.

- Test cases in `tests/test_tokenization.py`, `tests/test_reorder_measurements.py`, `tests/test_tensorization.py`, `tests/test_fit_vocabulary_indices.py`, `tests/test_filter_measurements.py`, `tests/test_normalization.py`, `tests/test_occlude_outliers.py`, and `tests/test_filter_patients.py` do not provide these maps, implying default usage.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify usage of `input_shards_map` and `input_splits_map` in tests.
# Test: Search for test cases using `single_stage_transform_tester` to ensure defaults are correctly applied.
rg --type python 'single_stage_transform_tester' -A 10
```

Length of output: 85

Script:

```shell
#!/bin/bash
# Description: Verify usage of `input_shards_map` and `input_splits_map` in tests.
# Test: Search for test cases using `single_stage_transform_tester` to ensure defaults are correctly applied.
rg --type py 'single_stage_transform_tester' -A 10
```

Length of output: 14376
src/MEDS_transforms/utils.py (2)

Line range hint 225-352: Improved stage configuration logic. The refined handling of metadata stages and `train_only` logic enhances configurability. Verify that all stages are handled correctly.

Verification successful

Verification of `populate_stage` logic improvements. The `populate_stage` function is tested through doctests within `src/MEDS_transforms/utils.py`, which cover various stage configurations and expected outcomes. These examples verify the function's behavior and logic improvements. Consider adding more comprehensive unit tests for additional coverage.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify correctness of stage configuration logic in `populate_stage`.
# Test: Search for test cases or usage examples of `populate_stage` to ensure all stages are configured correctly.
rg --type python 'populate_stage' -A 10
```

Length of output: 70

Script:

```shell
#!/bin/bash
# Description: Verify correctness of stage configuration logic in `populate_stage`.
# Test: Search for test cases or usage examples of `populate_stage` to ensure all stages are configured correctly.
rg --type py 'populate_stage' -A 10
```

Length of output: 6119
Line range hint 61-98: Simplified return type improves clarity. The change to return only three `Path` objects reduces complexity. Ensure that dependent code is updated accordingly.

Verification successful

Simplified return type is consistent with current usage. The change to return only three `Path` objects in `stage_init` aligns with its usage across the codebase, where the function's return values are unpacked into three variables. No updates to dependent code are necessary.

- `src/MEDS_transforms/reshard_to_split.py`
- `src/MEDS_transforms/mapreduce/mapper.py`
- `src/MEDS_transforms/extract/split_and_shard_patients.py`
- `src/MEDS_transforms/extract/finalize_MEDS_metadata.py`
- `src/MEDS_transforms/extract/convert_to_sharded_events.py`
- `src/MEDS_transforms/extract/extract_code_metadata.py`

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify usage of `stage_init` to ensure dependent code is updated.
# Test: Search for all usages of `stage_init` to check for consistency with the new return type.
rg --type python 'stage_init' -A 5
```

Length of output: 65

Script:

```shell
#!/bin/bash
# Description: Verify usage of `stage_init` to ensure dependent code is updated.
# Test: Search for all usages of `stage_init` to check for consistency with the new return type.
rg --type py 'stage_init' -A 5
```

Length of output: 6872
src/MEDS_transforms/extract/extract_code_metadata.py (2)

352-352: Change in unpacking of `stage_init` return values is appropriate. The `stage_init` function now returns three values, and the change in unpacking aligns with the updated return signature. Ensure that the removed value is not used elsewhere in the code.

20-20: Verify the new import path for `cfg_to_expr`. The import path has changed, which may indicate a refactor. Ensure that this path is correct and consistent with the project's module organization.

Verification successful

Import Path Verified: `cfg_to_expr`

The import path for `cfg_to_expr` in `src/MEDS_transforms/extract/extract_code_metadata.py` is correct. The function is defined in `src/MEDS_transforms/parser.py`, matching the import statement. No issues found.

- `src/MEDS_transforms/parser.py`: Function `cfg_to_expr` is correctly defined here.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the existence and correctness of the new import path for `cfg_to_expr`.
# Test: Check if the `cfg_to_expr` function is correctly imported from the new location.
ast-grep --lang python --pattern $'def cfg_to_expr($_) { $$$ }'
```

Length of output: 514

Script:

```shell
#!/bin/bash
# Description: Verify the existence and correctness of the new import path for `cfg_to_expr`.
# Test: Check if the `cfg_to_expr` function is correctly defined in the new location.
ast-grep --lang python --pattern 'def cfg_to_expr($$):'
```

Length of output: 57

Script:

```shell
#!/bin/bash
# Description: Search for the `cfg_to_expr` function definition across the entire codebase.
# Test: Find any definition of the `cfg_to_expr` function in the codebase.
rg 'def cfg_to_expr' -A 2
```

Length of output: 278
src/MEDS_transforms/mapreduce/utils.py (3)

182-182: Consistent return types in `rwlock_wrap`. The return statements have been updated to consistently return tuples, enhancing code consistency.

Also applies to: 216-216

261-474: Enhanced functionality in `shard_iterator`. The function now returns a tuple with input-output pairs and a boolean for training shards, improving its interface and usability.

477-560: `shard_iterator_by_shard_map` function is robust and well-integrated. The function correctly iterates over shards based on a shard map file and handles configuration errors effectively.
tests/test_extract.py (1)

444-447: Update to `shards_fp` in `test_extraction` is appropriate. The change from `splits_fp` to `shards_fp` reflects an update in data organization and is correctly integrated into the test.
228-240
: Clarify the return type ofis_match_revise
.The function
is_match_revise
now has a return type annotation-> bool
. This is a good practice as it improves readability and helps with static analysis.
242-289
: Ensure comprehensive validation invalidate_match_revise
.The function
validate_match_revise
performs thorough validation of thestage_cfg
. The error messages are clear, which aids in debugging.README.md (1)
3-4: Verify badge links and tokens. Ensure that the URLs for the badges are correct and that the tokens are valid. This is crucial for accurate status display.

Verification successful

Badge links and tokens are valid. Both badge URLs returned HTTP 200 status codes, confirming their accessibility and validity. No issues were found with the badge links or tokens.

Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the URLs and tokens for the badges in the README.md file.
# Test: Check if the badge URLs are accessible. Expect: HTTP 200 status code.
curl -I https://codecov.io/gh/mmcdermott/MEDS_transforms/graph/badge.svg?token=5RORKQOZF9
curl -I https://readthedocs.org/projects/meds-transforms/badge/?version=latest
```

Length of output: 3136
src/MEDS_transforms/aggregate_code_metadata.py (3)

63-66: Document the new quantile aggregation clearly. The addition of `VALUES_QUANTILES` to the `METADATA_FN` class is well-documented. Ensure that the configuration file is updated to include examples of how to specify quantiles.

601-618: Test the integration of quantile aggregation in `reducer_fntr`. The integration of quantile aggregation in the `reducer_fntr` function appears correct. Ensure that tests cover scenarios where quantiles are part of the aggregation configuration.

105-150: Ensure correctness of `quantile_reducer`. The `quantile_reducer` function is correctly implemented to calculate specified quantiles. Ensure that it handles edge cases, such as empty datasets, gracefully.
20-20
: Simplified import statement.The removal of parentheses around imported functions improves readability without affecting functionality.
683-685
: Improved clarity by reading shards directly.The change to read shards directly from
cfg.shards_map_fp
instead of usingshards_map_fn
fromstage_init
enhances clarity and reduces dependencies.
`relative_table_file_stem.event_name.code` field. This field can be a string literal, a reference to an
input column (denoted by the `col(...)` syntax), or a list of same. If it is a list, the output code will
be a `"//"` separated string of each field in the list. Each field in the list (again either a string
literal or a input column reference) is interpreted either as the specified string literal or as the
Correct article usage.

Use "an" instead of "a" before a vowel sound.

```diff
- or a input column reference
+ or an input column reference
```
Tools

LanguageTool

[misspelling] ~83-~83: Use "an" instead of "a" if the following word starts with a vowel sound, e.g. "an article", "an hour". Context: ...t (again either a string literal or a input column reference) is interpreted ... (EN_A_VS_AN)
```diff
@@ -19,7 +19,7 @@ def shard_patients(
     patients: np.ndarray,
     n_patients_per_shard: int = 50000,
     external_splits: dict[str, Sequence[SUBJ_ID_T]] | None = None,
-    split_fracs_dict: dict[str, float] = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+    split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
```
Avoid mutable default arguments.

Using mutable data structures like dictionaries as default arguments can lead to unexpected behavior. Replace with `None` and initialize within the function.

```diff
- split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+ split_fracs_dict: dict[str, float] | None = None,
```

Initialize within the function:

```python
if split_fracs_dict is None:
    split_fracs_dict = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}
```
Tools

Ruff

22-22: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function. (B006)
```python
def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:
    """Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.

    Args:
        shards: The list of shards to shuffle.
        cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following
            keys (some are optional, as marked below):
            - `worker` (optional): The worker ID for the MR worker; this is also used to seed the
              randomization process. If not provided, the randomization process will be unseeded.

    Returns:
        The shuffled list of shards.

    Examples:
        >>> cfg = DictConfig({"worker": 1})
        >>> shards = ["train/0", "train/1", "tuning", "held_out"]
        >>> shuffle_shards(shards, cfg)
        ['train/1', 'held_out', 'tuning', 'train/0']
        >>> cfg = DictConfig({"worker": 2})
        >>> shuffle_shards(shards, cfg)
        ['tuning', 'held_out', 'train/1', 'train/0']
    """

    if "worker" in cfg:
        add_str = str(cfg["worker"])
    else:
        add_str = str(datetime.now())

    shard_keys = []
    for shard in shards:
        shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()
        if shard_hash in shard_keys:
            raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
        shard_keys.append(int(shard_hash, 16))

    return [shard for _, shard in sorted(zip(shard_keys, shards))]
```
`shuffle_shards` function is well-implemented.

The function shuffles shards deterministically based on the worker ID, and uses SHA-256 to avoid hash collisions. Consider using a ternary operator for the `add_str` assignment for clarity.

```diff
-    if "worker" in cfg:
-        add_str = str(cfg["worker"])
-    else:
-        add_str = str(datetime.now())
+    add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())
```
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:
    """Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.

    Args:
        shards: The list of shards to shuffle.
        cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following
            keys (some are optional, as marked below):
            - `worker` (optional): The worker ID for the MR worker; this is also used to seed the
              randomization process. If not provided, the randomization process will be unseeded.

    Returns:
        The shuffled list of shards.

    Examples:
        >>> cfg = DictConfig({"worker": 1})
        >>> shards = ["train/0", "train/1", "tuning", "held_out"]
        >>> shuffle_shards(shards, cfg)
        ['train/1', 'held_out', 'tuning', 'train/0']
        >>> cfg = DictConfig({"worker": 2})
        >>> shuffle_shards(shards, cfg)
        ['tuning', 'held_out', 'train/1', 'train/0']
    """
    add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())
    shard_keys = []
    for shard in shards:
        shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()
        if shard_hash in shard_keys:
            raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
        shard_keys.append(int(shard_hash, 16))
    return [shard for _, shard in sorted(zip(shard_keys, shards))]
Tools
Ruff
247-250: Use ternary operator `add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())` instead of `if`-`else`-block (SIM108)
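The hash-based ordering the function relies on can be reproduced in isolation. Below is a minimal sketch (the name `hash_shuffle` is illustrative, not the project's actual function) showing that keying items by SHA-256 of a seed string plus the item yields a stable, seed-dependent permutation:

```python
import hashlib


def hash_shuffle(items: list[str], seed: str) -> list[str]:
    # Key each item by the integer value of SHA-256(seed + item);
    # sorting by that key gives a deterministic pseudo-random order.
    keys = [int(hashlib.sha256((seed + it).encode("utf-8")).hexdigest(), 16) for it in items]
    return [it for _, it in sorted(zip(keys, items))]


shards = ["train/0", "train/1", "tuning", "held_out"]
# The same seed always produces the same order, and the result is a permutation.
assert hash_shuffle(shards, "1") == hash_shuffle(shards, "1")
assert sorted(hash_shuffle(shards, "2")) == sorted(shards)
```

Because the key is a hash of the seed concatenated with the shard name, every worker with the same `worker` ID walks the shards in the same order, while different IDs walk them in different orders.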
def read_and_filter_fntr(filter_expr: pl.Expr, read_fn: Callable[[Path], DF_T]) -> Callable[[Path], DF_T]:
    """Create a function that reads a DataFrame from a file and filters it based on a given expression.

    This is specified as a functor in this way to allow it to modify arbitrary other read functions for use in
    different mapreduce pipelines.

    Args:
        filter_expr: The filter expression to apply to the DataFrame.
        read_fn: The read function to use to read the DataFrame.

    Returns:
        A function that reads a DataFrame from a file and filters it based on the given expression.

    Examples:
        >>> dfs = {
        ...     "df1": pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
        ...     "df2": pl.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]})
        ... }
        >>> read_fn = lambda key: dfs[key]
        >>> fn = read_and_filter_fntr((pl.col("a") % 2) == 0, read_fn)
        >>> fn("df1")
        shape: (1, 2)
        ┌─────┬─────┐
        │ a   ┆ b   │
        │ --- ┆ --- │
        │ i64 ┆ i64 │
        ╞═════╪═════╡
        │ 2   ┆ 5   │
        └─────┴─────┘
        >>> fn("df2")
        shape: (2, 2)
        ┌─────┬─────┐
        │ a   ┆ b   │
        │ --- ┆ --- │
        │ i64 ┆ i64 │
        ╞═════╪═════╡
        │ 4   ┆ 7   │
        │ 6   ┆ 9   │
        └─────┴─────┘
        >>> fn = read_and_filter_fntr((pl.col("b") % 2) == 0, read_fn)
        >>> fn("df1")
        shape: (2, 2)
        ┌─────┬─────┐
        │ a   ┆ b   │
        │ --- ┆ --- │
        │ i64 ┆ i64 │
        ╞═════╪═════╡
        │ 1   ┆ 4   │
        │ 3   ┆ 6   │
        └─────┴─────┘
        >>> fn("df2")
        shape: (1, 2)
        ┌─────┬─────┐
        │ a   ┆ b   │
        │ --- ┆ --- │
        │ i64 ┆ i64 │
        ╞═════╪═════╡
        │ 5   ┆ 8   │
        └─────┴─────┘
    """

    def read_and_filter(in_fp: Path) -> DF_T:
        return read_fn(in_fp).filter(filter_expr)
Update default argument handling for `read_fn`. The default for `read_fn` is a `partial` call. Function calls in default arguments are evaluated once, at function definition time (Ruff B008), so it is better to define the default outside the function.
# Define the default read function outside
default_read_fn = partial(pl.scan_parquet, glob=False)
def read_and_filter_fntr(filter_expr: pl.Expr, read_fn: Callable[[Path], DF_T] = default_read_fn) -> Callable[[Path], DF_T]:
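The rationale behind this (Ruff's B008) can be demonstrated directly: a call in a default argument runs once, when `def` executes, not once per call. A minimal illustration, unrelated to the library code:

```python
import time
from datetime import datetime


def stamp(ts=datetime.now()):  # the call runs once, when `def` executes
    return ts


a = stamp()
time.sleep(0.01)
b = stamp()
assert a == b  # both calls return the same frozen timestamp


# The conventional fix: a sentinel default, with the call made per invocation.
def stamp_fixed(ts=None):
    return ts if ts is not None else datetime.now()


c = stamp_fixed()
time.sleep(0.01)
d = stamp_fixed()
assert c != d  # now each call gets a fresh timestamp
```

For a `partial` default like the one here, the evaluated-once behavior is usually harmless in practice (the `partial` object is immutable), but hoisting it to a module-level singleton makes the single evaluation explicit and silences the lint.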
def map_over(
    cfg: DictConfig,
    compute_fn: COMPUTE_FN_T | None = None,
    read_fn: Callable[[Path], DF_T] = partial(pl.scan_parquet, glob=False),
    write_fn: Callable[[DF_T, Path], None] = write_lazyframe,
    shard_iterator_fntr: SHARD_ITR_FNTR_T = shard_iterator,
) -> list[Path]:
    stage_init(cfg)

    start = datetime.now()

    train_only = cfg.stage_cfg.get("train_only", False)
    split_fp = Path(cfg.input_dir) / "metadata" / "patient_split.parquet"

    shards, includes_only_train = shard_iterator_fntr(cfg)

    if train_only:
        if includes_only_train:
            logger.info(
                f"Processing train split only via shard prefix. Not filtering with {str(split_fp.resolve())}."
            )
        elif split_fp.exists():
            logger.info(f"Processing train split only by filtering read dfs via {str(split_fp.resolve())}")
            train_patients = (
                pl.scan_parquet(split_fp)
                .filter(pl.col("split") == "train")
                .select(pl.col("patient_id"))
                .collect()
                .to_list()
            )
            read_fn = read_and_filter_fntr(train_patients, read_fn)
        else:
            raise FileNotFoundError(
                f"Train split requested, but shard prefixes can't be used and "
                f"patient split file not found at {str(split_fp.resolve())}."
            )
    elif includes_only_train:
        raise ValueError("All splits should be used, but shard iterator is returning only train splits?!?")

    if is_match_revise(cfg.stage_cfg):
        compute_fn = match_revise_fntr(cfg, cfg.stage_cfg, compute_fn)
    else:
        compute_fn = bind_compute_fn(cfg, cfg.stage_cfg, compute_fn)

    all_out_fps = []
    for in_fp, out_fp in shards:
Refactor `map_over` to avoid function calls in default arguments. The `read_fn` default uses `partial`, which is evaluated once at definition time. Define it outside the function (Ruff B008).
# Define the default read function outside
default_read_fn = partial(pl.scan_parquet, glob=False)
def map_over(
cfg: DictConfig,
compute_fn: COMPUTE_FN_T | None = None,
read_fn: Callable[[Path], DF_T] = default_read_fn,
write_fn: Callable[[DF_T, Path], None] = write_lazyframe,
shard_iterator_fntr: SHARD_ITR_FNTR_T = shard_iterator,
) -> list[Path]:
Tools
Ruff
574-574: Do not perform function call `partial` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)
- The time since the last event of a certain type.
- The patient's age as of each unique timepoint.
- The time-of-day of each event.
- Adding a "dummy" event to the dataset for each patient that occurs at the end of the observation
Use "who" instead of "that" when referring to people, for better grammatical correctness.
- Adding a "dummy" event to the dataset for each patient that occurs at the end of the observation period.
+ Adding a "dummy" event to the dataset for each patient who occurs at the end of the observation period.
Committable suggestion was skipped due to low confidence.
Tools
LanguageTool
[style] ~315-~315: Consider using "who" when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ... (THAT_WHO)
2. Figure out how to ensure that each pre-processing step reads from the right prior files. Likely need some
   kind of a "prior stage name" config variable.
Correct usage of "kind of".
The phrase "kind of a" should be corrected to "kind of" for proper grammar.
- Likely need some kind of a "prior stage name" config variable.
+ Likely need some kind of "prior stage name" config variable.
Committable suggestion
- kind of a "prior stage name" config variable.
+ kind of "prior stage name" config variable.
Tools
LanguageTool
[grammar] ~446-~446: If 'kind' is a classification term, 'a' is not necessary. Use "kind of". (The phrases 'kind of' and 'sort of' are informal if they mean 'to some extent'.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. (KIND_OF_A)
should be included during code grouping and modeling operations. The goal of this step is to ensure that
the only columns that need be processed going into the pre-processing, tokenization, and tensorization
stage are expressible in the `code` and `numeric_values` columns of the dataset, which helps
standardize further downstream use.
Add missing preposition for clarity.
A preposition is missing, which can affect the sentence's clarity.
- ensure that the only columns that need be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset
+ ensure that the only columns that need to be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset
Committable suggestion
- the only columns that need be processed going into the pre-processing, tokenization, and tensorization
+ the only columns that need to be processed going into the pre-processing, tokenization, and tensorization
Tools
LanguageTool
[uncategorized] ~328-~328: Possible missing preposition found.
Context: ...ure that the only columns that need be processed going into the pre-processing... (AI_HYDRA_LEO_MISSING_TO)
2. This pipeline preferentially uses disk over memory and compute through aggressive caching. You should
   have sufficient disk space to store multiple copies of your raw dataset comfortably.
3. This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker
Add missing comma for clarity.
A comma is needed to separate clauses for better readability.
- This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker nodes have access to a distributed file system.
+ This pipeline can be run on a single machine or across many worker nodes on a cluster, provided the worker nodes have access to a distributed file system.
Committable suggestion
- 3. This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker
+ 3. This pipeline can be run on a single machine or across many worker nodes on a cluster, provided the worker
Tools
LanguageTool
[uncategorized] ~207-~207: Possible missing comma found.
Context: ...achine or across many worker nodes on a cluster provided the worker nodes have acce... (AI_HYDRA_LEO_MISSING_COMMA)
stage's script in sequence.
- The user can leverage arbitrary scheduling systems (e.g., Slurm, LSF, Kubernetes, etc.) to run each stage
  in parallel on a cluster, either by manually constructing the appropriate worker scripts to run each stage's
  script and simply launching as many worker jobs as is desired or by using Hydra launchers such as the
  `submitit` launcher to automate the creation of appropriate scheduler worker jobs. Note this will typically
  required a distributed file system to work correctly, as these scripts use manually created file locks to
  avoid duplicative work.
- The user can run each stage in parallel on a single machine by launching multiple copies of the same
  script in different terminal sessions or all at once via the Hydra `joblib` launcher. This can result in a
  significant speedup depending on the machine configuration as sharding ensures that parallelism can be used
  with minimal file read contention.
Add missing commas for clarity.
There are missing commas in the sentences, which can affect readability.
- The user can run each stage in parallel on a single machine by launching multiple copies of the same script in different terminal sessions or all at once via the Hydra `joblib` launcher.
+ The user can run each stage in parallel on a single machine by launching multiple copies of the same script, in different terminal sessions, or all at once via the Hydra `joblib` launcher.
Committable suggestion
Tools
LanguageTool
[grammar] ~103-~103: The modal verb 'will' requires the verb's base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre... (MD_BASEFORM)
[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydra `joblib` l... (AI_HYDRA_LEO_MISSING_COMMA)
Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Files selected for processing (1)
- src/MEDS_transforms/reshard_to_split.py (1 hunks)

Additional context used

Ruff
src/MEDS_transforms/reshard_to_split.py
118-118: Function definition does not bind loop variable `in_fps` (B023)

Additional comments not posted (1)
src/MEDS_transforms/reshard_to_split.py (1)
4-18: Imports look good. The imports are necessary for the functionality provided in the file.
    if shards_fp.is_file():
        if cfg.do_overwrite:
            logger.warning(f"Overwriting {str(shards_fp.resolve())}")
            shards_fp.unlink()
        else:
            raise FileExistsError(f"{str(shards_fp.resolve())} already exists.")

    shards_fp.write_text(json.dumps(new_sharded_splits))

    output_dir = Path(cfg.stage_cfg.output_dir)

    orig_shards_iter, include_only_train = shard_iterator(cfg, out_suffix="")
    if include_only_train:
        raise ValueError("This stage does not support include_only_train=True")

    orig_shards_iter = [(in_fp, out_fp.relative_to(output_dir)) for in_fp, out_fp in orig_shards_iter]

    new_shards = shuffle_shards(list(new_sharded_splits.keys()), cfg)
    new_shards_iter = [(shard_name, output_dir / shard_name) for shard_name in new_shards]

    # Step 1: Sub-sharding stage
    logger.info("Starting sub-sharding")
    subshard_fps = defaultdict(list)

    for in_fp, orig_shard_name in orig_shards_iter:
        for subshard_name, out_dir in new_shards_iter:
            out_fp = out_dir / f"{orig_shard_name}.parquet"
            subshard_fps[subshard_name].append(out_fp)
            patients = new_sharded_splits[subshard_name]

            if not patients:
                raise ValueError(f"No patients found for {subshard_name}!")

            logger.info(
                f"Sub-sharding {str(in_fp.resolve())} to {len(patients)} patients in {str(out_fp.resolve())}"
            )

            rwlock_wrap(
                in_fp,
                out_fp,
                read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
                write_lazyframe,
                identity_fn,
                do_return=False,
                do_overwrite=cfg.do_overwrite,
            )

    logger.info("Merging sub-shards")
    for subshard_name, subshard_dir in new_shards_iter:
        in_dir = subshard_dir
        in_fps = subshard_fps[subshard_name]
        out_fp = subshard_dir.with_suffix(".parquet")

        logger.info(f"Merging {subshard_dir}/**/*.parquet into {str(out_fp.resolve())}")

        if not subshard_fps:
            raise ValueError(f"No subshards found for {subshard_name}!")

        if out_fp.is_file():
            logger.info(f"Output file {str(out_fp.resolve())} already exists. Skipping.")
            continue

        while not (all(fp.is_file() for fp in in_fps) or out_fp.is_file()):
            logger.info("Waiting to begin merging for all sub-shard files to be written...")
            time.sleep(cfg.polling_time)

        def read_fn(fp: Path) -> pl.LazyFrame:
            return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
                by=["patient_id", "time"], maintain_order=True, multithreaded=False
            )

        logger.info(f"Merging files to {str(out_fp.resolve())}")
        result_computed, _ = rwlock_wrap(
            in_dir,
            out_fp,
            read_fn,
            write_lazyframe,
            identity_fn,
            do_return=False,
            do_overwrite=cfg.do_overwrite,
        )

        if result_computed:
            logger.info(f"Cleaning up subsharded files in {str(subshard_dir.resolve())}/*.")
            for fp in in_fps:
                if fp.exists():
                    fp.unlink()
            try:
                for root, dirs, files in subshard_dir.walk(top_down=False):
                    walked_dir = root.relative_to(subshard_dir)
                    if files:
                        raise FileExistsError(f"Files found in {walked_dir} after cleanup!: {files}")
                    for d in dirs:
                        (root / d).rmdir()
                subshard_dir.rmdir()
            except OSError as e:
                contents_str = "\n".join([str(f) for f in subshard_dir.iterdir()])
                raise ValueError(f"Could not remove {str(subshard_dir)}. Contents:\n{contents_str}") from e

    logger.info(f"Done with {cfg.stage}")
Consider modularizing the `main` function. The `main` function is lengthy and could benefit from being broken down into smaller, more focused functions. This would improve readability and maintainability.

Tools
Ruff
118-118: Function definition does not bind loop variable `in_fps` (B023)
            rwlock_wrap(
                in_fp,
                out_fp,
                read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
                write_lazyframe,
                identity_fn,
                do_return=False,
                do_overwrite=cfg.do_overwrite,
            )
Add error handling to rwlock_wrap
usage.
Consider adding error handling for potential issues during file operations to ensure robustness.
Also applies to: 123-131
        def read_fn(fp: Path) -> pl.LazyFrame:
            return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
                by=["patient_id", "time"], maintain_order=True, multithreaded=False
            )
Resolve the loop variable binding issue.
The static analysis tool indicates that `read_fn` closes over the loop variable `in_fps` without binding it, so every `read_fn` defined in the loop would see the value of `in_fps` from the final iteration. A common fix is to bind the current value explicitly, e.g. via a default argument:
- def read_fn(fp: Path) -> pl.LazyFrame:
-     return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
+ def read_fn(fp: Path, in_fps: list[Path] = in_fps) -> pl.LazyFrame:
+     return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
Committable suggestion was skipped due to low confidence.
Tools
Ruff
118-118: Function definition does not bind loop variable `in_fps` (B023)
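Ruff's B023 warns about exactly this pitfall: Python closures capture variables by reference, not by value, so functions defined in a loop all share the loop variable's final value. A minimal demonstration of the problem and the usual default-argument fix (names here are illustrative only):

```python
# Late binding: every lambda references the same variable `x`,
# which holds 3 once the loop finishes.
funcs = []
for x in [1, 2, 3]:
    funcs.append(lambda: x)
assert [f() for f in funcs] == [3, 3, 3]

# The fix: a default argument is evaluated at definition time,
# snapshotting the current value of `x` for each lambda.
fixed = []
for x in [1, 2, 3]:
    fixed.append(lambda x=x: x)
assert [f() for f in fixed] == [1, 2, 3]
```

In this file the pattern is only latent, since each `read_fn` is consumed by `rwlock_wrap` within the same iteration, but binding `in_fps` explicitly makes the code robust if the call is ever deferred.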
Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Files selected for processing (1)
- src/MEDS_transforms/reshard_to_split.py (1 hunks)

Additional context used

Ruff
src/MEDS_transforms/reshard_to_split.py
120-120: Function definition does not bind loop variable `in_fps` (B023)

Additional comments not posted (4)
src/MEDS_transforms/reshard_to_split.py (4)
1-18: Header and imports look good. The shebang, docstring, and imports are clear and appropriate for the file's functionality.

90-98: Add error handling to `rwlock_wrap` usage. Consider adding error handling for potential issues during file operations to ensure robustness.

119-122: Resolve the loop variable binding issue. The static analysis tool indicates a potential issue with the function definition not binding the loop variable `in_fps`. Ensure that `fp` is correctly used within `read_fn`.

Tools
Ruff
120-120: Function definition does not bind loop variable `in_fps` (B023)

135-150: Ensure robust cleanup logic. The cleanup logic attempts to remove files and directories. Ensure that appropriate error handling is in place for potential issues like permission errors.

Verification successful

Ensure comprehensive error handling in cleanup logic.

The code includes a `try` block for handling potential issues during directory cleanup. Ensure that the corresponding `except` block is robust enough to handle exceptions such as permission errors or other filesystem-related issues.

- Review the `except` block following the `try` statement to ensure comprehensive error handling.

Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify cleanup logic robustness by checking for error handling in file system operations.
# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --type python -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'

Length of output: 286

Script:
#!/bin/bash
# Description: Re-verify cleanup logic robustness by checking for error handling in file system operations with multiline support.
# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --type python --multiline -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'

Length of output: 137

Script:
#!/bin/bash
# Description: Re-verify cleanup logic robustness by checking for error handling in file system operations.
# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --multiline -t py -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'

Length of output: 98

Script:
#!/bin/bash
# Description: Broaden the search to identify any try-except blocks related to file system operations.
# Test: Search for try-except blocks in the specified file to ensure error handling is present.
rg --multiline -t py -A 5 'try:' src/MEDS_transforms/reshard_to_split.py

Length of output: 396
def main(cfg: DictConfig):
    """Re-shard a MEDS cohort in a manner that subdivides patient splits."""

    stage_init(cfg)

    splits_file = Path(cfg.input_dir) / "metadata" / "patient_splits.parquet"
    splits_df = pl.read_parquet(splits_file, use_pyarrow=True)
    splits_map = defaultdict(list)
    for pt_id, sp in splits_df.iter_rows():
        splits_map[sp].append(pt_id)

    new_sharded_splits = shard_patients(
        patients=splits_df["patient_id"].to_numpy(),
        n_patients_per_shard=cfg.stage_cfg.n_patients_per_shard,
        external_splits=splits_map,
        split_fracs_dict=None,
        seed=cfg.get("seed", 1),
    )

    output_dir = Path(cfg.stage_cfg.output_dir)

    # Write shards to file
    if "shards_map_fp" in cfg:
        shards_fp = Path(cfg.shards_map_fp)
    else:
        shards_fp = output_dir / ".shards.json"

    if shards_fp.is_file():
        if cfg.do_overwrite:
            logger.warning(f"Overwriting {str(shards_fp.resolve())}")
            shards_fp.unlink()
        else:
            old_shards_map = json.loads(shards_fp.read_text())
            if old_shards_map != new_sharded_splits:
                raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")

    shards_fp.write_text(json.dumps(new_sharded_splits))

    output_dir = Path(cfg.stage_cfg.output_dir)

    orig_shards_iter, include_only_train = shard_iterator(cfg, out_suffix="")
    if include_only_train:
        raise ValueError("This stage does not support include_only_train=True")

    orig_shards_iter = [(in_fp, out_fp.relative_to(output_dir)) for in_fp, out_fp in orig_shards_iter]

    new_shards = shuffle_shards(list(new_sharded_splits.keys()), cfg)
    new_shards_iter = [(shard_name, output_dir / shard_name) for shard_name in new_shards]

    # Step 1: Sub-sharding stage
    logger.info("Starting sub-sharding")
    subshard_fps = defaultdict(list)

    for in_fp, orig_shard_name in orig_shards_iter:
        for subshard_name, out_dir in new_shards_iter:
            out_fp = out_dir / f"{orig_shard_name}.parquet"
            subshard_fps[subshard_name].append(out_fp)
            patients = new_sharded_splits[subshard_name]

            if not patients:
                raise ValueError(f"No patients found for {subshard_name}!")

            logger.info(
                f"Sub-sharding {str(in_fp.resolve())} to {len(patients)} patients in {str(out_fp.resolve())}"
            )

            rwlock_wrap(
                in_fp,
                out_fp,
                read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
                write_lazyframe,
                identity_fn,
                do_return=False,
                do_overwrite=cfg.do_overwrite,
            )
Consider modularizing the `main` function. The `main` function is lengthy and could benefit from being broken down into smaller, more focused functions. This would improve readability and maintainability.
    if shards_fp.is_file():
        if cfg.do_overwrite:
            logger.warning(f"Overwriting {str(shards_fp.resolve())}")
            shards_fp.unlink()
        else:
            old_shards_map = json.loads(shards_fp.read_text())
            if old_shards_map != new_sharded_splits:
                raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")
Add error handling for file existence and differences.
The current logic raises a FileExistsError
if the shard map differs. Consider logging the differences for better debugging.
- raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")
+ logger.error(f"Shard map differs: {old_shards_map} vs {new_sharded_splits}")
+ raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")
Committable suggestion
    if shards_fp.is_file():
        if cfg.do_overwrite:
            logger.warning(f"Overwriting {str(shards_fp.resolve())}")
            shards_fp.unlink()
        else:
            old_shards_map = json.loads(shards_fp.read_text())
            if old_shards_map != new_sharded_splits:
                logger.error(f"Shard map differs: {old_shards_map} vs {new_sharded_splits}")
                raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")
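For large shard maps, logging both dictionaries in full can be unreadable; reporting only the entries that actually differ keeps the error message useful. A small hedged helper (not part of the library, names illustrative) might look like:

```python
def shard_map_diff(old: dict, new: dict) -> dict:
    """Return the keys whose values differ, with both sides shown.

    Keys present on only one side appear with None for the missing side.
    """
    diff = {}
    for k in old.keys() | new.keys():
        if old.get(k) != new.get(k):
            diff[k] = {"old": old.get(k), "new": new.get(k)}
    return diff


old = {"train/0": [1, 2], "tuning": [3]}
new = {"train/0": [1, 2], "tuning": [4], "held_out": [5]}
d = shard_map_diff(old, new)
assert set(d) == {"tuning", "held_out"}  # only the differing entries survive
```

This could feed the `logger.error` call before raising, e.g. logging `shard_map_diff(old_shards_map, new_sharded_splits)` instead of both full maps.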
Not yet ready -- see #139
…arallel_mode Fixes brittle reduce stage checking in Reshard stage.
Summary by CodeRabbit

New Features
- Introduced an `.editorconfig` for consistent coding style across the project.

Bug Fixes

Documentation
- Updated `mkdocs.yml`.
- Added `pipeline_configuration.md` detailing the data processing framework.

Refactor

Style

Tests