Release candidate for 0.0.4 #146
Conversation
…an example of loading MEDS_transforms configs as defaults dynamically.
…low for pushing to PyPI. The PyPI push workflow requires local tagging via `git tag 0.0.3` and then `git push origin 0.0.3`.
Updated workflows and README.
This moves the parser to a general import, and absorbs some type checking commits from another PR to clean up the PR.
…implemented fully.
…ng splits parquet file.
Improved compliance by removing creation of shards.json file and adding `patient_splits.parquet` file.
Adds a reshard-by-split stage as a transform.
…arallel_mode Fixes brittle reduce stage checking in Reshard stage.
Walkthrough

This update introduces several enhancements across the project, including a new …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant GitHub
    participant PyPI
    participant Docs
    User->>GitHub: Push code changes
    GitHub->>GitHub: Run build workflow
    GitHub->>PyPI: Publish package
    GitHub->>Docs: Generate documentation
    Docs-->>User: Access updated docs
```
Actionable comments posted: 5
Outside diff range, codebase verification and nitpick comments (21)
src/MEDS_transforms/__init__.py (1)
26-33: `DEPRECATED_NAMES` is not used in the codebase.

The `DEPRECATED_NAMES` dictionary is defined but not utilized anywhere in the codebase. Ensure that these mappings are applied where necessary to maintain backward compatibility in data transformations.

- `src/MEDS_transforms/__init__.py`: Declaration of `DEPRECATED_NAMES`.

Analysis chain
Check `DEPRECATED_NAMES` for backward compatibility.

The `DEPRECATED_NAMES` dictionary maps old column names to new ones. Ensure these mappings are correctly applied to maintain backward compatibility.

Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Verify the usage of `DEPRECATED_NAMES` in the codebase.
# Test: Search for `DEPRECATED_NAMES` usage. Expect: Correct application in data transformation logic.
rg --type python 'DEPRECATED_NAMES'
```

Length of output: 66

Script:

```shell
#!/bin/bash
# Description: Verify the usage of `DEPRECATED_NAMES` in the codebase.
# Test: Search for `DEPRECATED_NAMES` usage. Expect: Correct application in data transformation logic.
rg --type py 'DEPRECATED_NAMES'
```

Length of output: 84
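For illustration only, here is a minimal sketch of how such a deprecated-name mapping could be applied during data loading; the mapping entries and the `apply_deprecated_renames` helper below are hypothetical, not the repository's actual code.

```python
import polars as pl

# Hypothetical old -> new column name mapping; the real entries live in
# `DEPRECATED_NAMES` in src/MEDS_transforms/__init__.py.
DEPRECATED_NAMES = {"old_column_name": "new_column_name"}


def apply_deprecated_renames(df: pl.DataFrame) -> pl.DataFrame:
    """Rename any deprecated columns that are present so older inputs keep working."""
    present = {old: new for old, new in DEPRECATED_NAMES.items() if old in df.columns}
    return df.rename(present) if present else df
```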
docs/gen_ref_pages.py (1)
1-1: Add a detailed module docstring.

The current docstring provides a brief description. Consider expanding it to include details about the script's purpose, usage, and any assumptions or dependencies.

```diff
-"""Generate the code reference pages."""
+"""Generate the code reference pages for the project.
+
+This script traverses the source directory to create markdown documentation files for each
+Python module. It integrates with MkDocs to build a navigable API reference.
+"""
```

.github/workflows/python-build.yaml (2)
1-1: Clarify the workflow name.

The name "Publish Python 🐍 distribution 📦 to PyPI and TestPyPI" suggests publishing to both PyPI and TestPyPI, but the workflow only targets PyPI. Consider updating the name for clarity.

```diff
-name: Publish Python 🐍 distribution 📦 to PyPI and TestPyPI
+name: Publish Python 🐍 distribution 📦 to PyPI
```
82-82: Improve GitHub release notes.

Currently, the release notes are empty. Consider adding meaningful release notes to provide context for the release.

```diff
-          --notes ""
+          --notes "Release notes for version ${{ github.ref_name }}"
```

.pre-commit-config.yaml (1)
129-129: Improve readability of `nbqa-flake8` args.

The `args` for `nbqa-flake8` have been reformatted to a single line. Consider using a multi-line format for better readability.

```diff
-args: ["--extend-ignore=E203,E402,E501,F401,F841", "--exclude=logs/*,data/*"]
+args:
+  - "--extend-ignore=E203,E402,E501,F401,F841"
+  - "--exclude=logs/*,data/*"
```

docs/pipeline_configuration.md (4)
16-29: Correct punctuation in the list of stages.

There are loose punctuation marks at the start of each stage description. Consider reformatting for consistency.

```diff
-1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the
+1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the data).
-2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes
+2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes that occur more than 10 times).
-3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code that is in the
+3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code that is in the current running metadata file).
```
46-46: Consider adding a comma for clarity.

A comma is missing after "directory" for improved readability.

```diff
-This stage will read in the data from the `$INPUT_DIR/data` directory as there has still been no
+This stage will read in the data from the `$INPUT_DIR/data` directory, as there has still been no
```
82-82: Clarify the use of 'reduce' in the context.

The use of 'reduce' might be unclear. Consider rephrasing for clarity.

```diff
-(which do not have reduce stages).
+(which do not include reduction stages).
```
1-7: Misalignment Between Documentation and Code Implementation

The introduction in the documentation mentions the design philosophy of being "simple, reusable, efficient, and configurable." However, the implementation primarily references "simple" and lacks mentions of "reusable," "efficient," and "configurable." This suggests a potential misalignment between the documentation and the codebase. Consider reviewing the implementation to ensure consistency with the documented design philosophy.

- Files with "simple": Multiple occurrences in `src/MEDS_transforms/parser.py`, `aggregate_code_metadata.py`, and others.
- Missing Terms: No instances of "reusable," "efficient," or "configurable" found in the implementation.
Analysis chain
Ensure clarity and consistency in the introduction.
The introduction outlines the design philosophy. Ensure it aligns with the actual implementation and is clear to the intended audience.
Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify the alignment of the introduction with the implementation.
# Test: Search for design philosophy keywords in the implementation. Expect: Consistent terminology.
rg 'simple|reusable|efficient|configurable' src/
```

Length of output: 3042
tests/utils.py (4)
107-107: Consider renaming `do_use_config_yaml` for clarity.

The parameter `do_use_config_yaml` could be renamed to something more intuitive, like `use_yaml_config`, to improve readability.

```diff
-    do_use_config_yaml: bool = False,
+    use_yaml_config: bool = False,
```
143-143: Improve command logging for better debugging.

Consider adding more context to the command log, such as the environment or additional parameters, to aid in debugging.

```diff
-    err_cmd_lines.append(f"Running command: {full_cmd}")
+    err_cmd_lines.append(f"Running command: {full_cmd} with environment: {os.environ}")
```
Line range hint 12-12: Reminder: Add doctests.

The TODO comment indicates that doctests are missing. Adding them will ensure the function behaves as expected.
Do you want me to generate the doctests or open a GitHub issue to track this task?
114-116: Consider enhancing `config_name` validation.

The current validation ensures `config_name` is not `None` when `do_use_config_yaml` is `True`. However, consider adding checks to ensure it is a valid non-empty string to prevent potential issues.

- `tests/utils.py`: Enhance validation logic for `config_name`.

Analysis chain
Ensure `config_name` validation is sufficient.

The check for `config_name` when `do_use_config_yaml` is `True` is crucial. Ensure that this validation is comprehensive and that `config_name` is not just `None` but also a valid string.

Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that `config_name` is a valid string when `do_use_config_yaml` is True. # Test: Check for non-empty string assignment to `config_name`. rg --type python -A 5 $'do_use_config_yaml=True' | grep -B 5 -A 5 'config_name'Length of output: 107
Script:
#!/bin/bash # Corrected script to verify that `config_name` is a valid string when `do_use_config_yaml` is True. # Search for instances where `do_use_config_yaml` is set to True and check how `config_name` is handled. rg 'do_use_config_yaml\s*=\s*True' -A 5 --type pyLength of output: 158
Script:
#!/bin/bash # Extend the search to include more context around the usage of `do_use_config_yaml=True` in `tests/test_filter_measurements.py`. # Search for the instance and include more lines to see how `config_name` is handled. rg 'do_use_config_yaml\s*=\s*True' -A 10 -B 5 tests/test_filter_measurements.pyLength of output: 359
Script:
#!/bin/bash # Search for `config_name` in `tests/test_filter_measurements.py` to understand its usage and validation. rg 'config_name' -A 5 -B 5 tests/test_filter_measurements.pyLength of output: 60
Script:
#!/bin/bash # Search the entire codebase for `config_name` to identify its definition and validation logic. rg 'config_name' -A 5 -B 5 --type pyLength of output: 21468
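A minimal sketch of the stricter validation suggested above (the helper name is illustrative; the real check would live next to the existing `config_name` handling in `tests/utils.py`):

```python
def validate_config_name(config_name: str | None, do_use_config_yaml: bool) -> None:
    """Require a non-empty string config name whenever a YAML config is requested."""
    if not do_use_config_yaml:
        return
    if not isinstance(config_name, str) or not config_name.strip():
        raise ValueError(
            f"config_name must be a non-empty string when do_use_config_yaml=True; got {config_name!r}"
        )
```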
eICU_Example/README.md (1)
13-27: Address markdownlint indentation issues.

Adjust the indentation to match markdownlint expectations for unordered lists (2 spaces rather than 4).

```diff
-    - [ ] Identifying the pre-MEDS steps for eICU
+  - [ ] Identifying the pre-MEDS steps for eICU
-    - [ ] Test that it runs at all.
+  - [ ] Test that it runs at all.
-    - [ ] Test that the output is as expected.
+  - [ ] Test that the output is as expected.
-    - [ ] Sub-sharding
+  - [ ] Sub-sharding
-    - [ ] Patient split gathering
+  - [ ] Patient split gathering
-    - [ ] Event extraction
+  - [ ] Event extraction
-    - [ ] Merging
+  - [ ] Merging
-    - [ ] Basic validation
+  - [ ] Basic validation
-    - [ ] Detailed validation
+  - [ ] Detailed validation
```
docs/preprocessing_operation_prototypes.md (1)
74-74: Consider rephrasing for conciseness.

The phrase "prior to" might be wordy. Consider using "before" for conciseness.

```diff
-patient data filters should be applied prior to aggregation.
+patient data filters should be applied before aggregation.
```

src/MEDS_transforms/extract/README.md (2)
83-83: Correct grammatical error: Use "an" instead of "a".

Change "a input" to "an input" for grammatical correctness.

```diff
-literal or a input column reference) is interpreted
+literal or an input column reference) is interpreted
```
255-255: Consider more formal language.

Replace "feel free to" with a more formal alternative to enhance professionalism.

```diff
-Feel free to reach out if you have any questions or concerns about this.
+Please reach out if you have any questions or concerns about this.
```
src/MEDS_transforms/utils.py (1)
Line range hint 61-98: Update the function signature and return statement.

The `stage_init` function's return type annotation should be updated to reflect the removal of `shards_map_fp` from the return tuple. The return type annotation should match the actual return statement.

```diff
-def stage_init(cfg: DictConfig) -> tuple[Path, Path, Path]:
+def stage_init(cfg: DictConfig) -> tuple[Path, Path, Path, Path]:
```

src/MEDS_transforms/mapreduce/utils.py (1)
16-46: Add error logging for exception handling.

In the `is_complete_parquet_file` function, consider adding logging for exceptions to aid in debugging when a file is not complete.

```diff
-    except Exception:
+    except Exception as e:
+        logger.error(f"Failed to read Parquet file {fp}: {e}")
```

tests/test_extract.py (1)
444-445: Update assertion message for clarity.

The assertion message should clearly indicate the expected file path for clarity.

```diff
-    assert shards_fp.is_file(), f"Expected splits @ {str(shards_fp.resolve())} to exist."
+    assert shards_fp.is_file(), f"Expected shards file @ {str(shards_fp.resolve())} to exist."
```

README.md (1)
98-108: Consider style improvement and approve changes.

The modifications enhance clarity. However, consider addressing the style issue highlighted by static analysis regarding the missing comma:

```diff
-script in different terminal sessions or all at once via the Hydra `joblib` launcher
+script in different terminal sessions, or all at once via the Hydra `joblib` launcher
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (42)
- .editorconfig (1 hunks)
- .github/workflows/python-build.yaml (1 hunks)
- .github/workflows/tests.yaml (1 hunks)
- .pre-commit-config.yaml (3 hunks)
- .readthedocs.yaml (1 hunks)
- MIMIC-IV_Example/README.md (2 hunks)
- README.md (8 hunks)
- docs/gen_ref_pages.py (1 hunks)
- docs/index.md (1 hunks)
- docs/javascripts/mathjax.js (1 hunks)
- docs/pipeline_configuration.md (1 hunks)
- docs/preprocessing_operation_prototypes.md (7 hunks)
- eICU_Example/README.md (10 hunks)
- eICU_Example/configs/table_preprocessors.yaml (1 hunks)
- mkdocs.yml (1 hunks)
- pyproject.toml (2 hunks)
- src/MEDS_transforms/__init__.py (2 hunks)
- src/MEDS_transforms/aggregate_code_metadata.py (12 hunks)
- src/MEDS_transforms/configs/extract.yaml (1 hunks)
- src/MEDS_transforms/configs/preprocess.yaml (2 hunks)
- src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml (1 hunks)
- src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml (1 hunks)
- src/MEDS_transforms/extract/README.md (6 hunks)
- src/MEDS_transforms/extract/__init__.py (2 hunks)
- src/MEDS_transforms/extract/convert_to_sharded_events.py (2 hunks)
- src/MEDS_transforms/extract/extract_code_metadata.py (2 hunks)
- src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2 hunks)
- src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2 hunks)
- src/MEDS_transforms/extract/split_and_shard_patients.py (7 hunks)
- src/MEDS_transforms/mapreduce/mapper.py (4 hunks)
- src/MEDS_transforms/mapreduce/utils.py (8 hunks)
- src/MEDS_transforms/reshard_to_split.py (1 hunks)
- src/MEDS_transforms/transforms/tokenization.py (5 hunks)
- src/MEDS_transforms/utils.py (8 hunks)
- tests/test_add_time_derived_measurements.py (1 hunks)
- tests/test_extract.py (1 hunks)
- tests/test_filter_measurements.py (2 hunks)
- tests/test_occlude_outliers.py (1 hunks)
- tests/test_reorder_measurements.py (1 hunks)
- tests/test_reshard_to_split.py (1 hunks)
- tests/transform_tester_base.py (7 hunks)
- tests/utils.py (2 hunks)
Files skipped from review due to trivial changes (11)
- .editorconfig
- .readthedocs.yaml
- MIMIC-IV_Example/README.md
- docs/index.md
- eICU_Example/configs/table_preprocessors.yaml
- src/MEDS_transforms/configs/extract.yaml
- src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml
- src/MEDS_transforms/extract/extract_code_metadata.py
- tests/test_add_time_derived_measurements.py
- tests/test_occlude_outliers.py
- tests/test_reorder_measurements.py
Additional context used
LanguageTool
docs/pipeline_configuration.md
[uncategorized] ~16-~16: Loose punctuation mark.
Context: ...s of the following stages: 1.stage_1
: A metadata map-reduce stage (e.g., coun...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~18-~18: Loose punctuation mark.
Context: ...the codes in the data). 2.stage_2
: A metadata-only processing stage (e.g.,...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~20-~20: Loose punctuation mark.
Context: ... occur more than 10 times). 3.stage_3
: A data processing stage (e.g., filterin...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... occur more than 10 times). 4.stage_4
: A metadata map-reduce stage (e.g., comp...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~24-~24: Loose punctuation mark.
Context: ...al values in the data). 5.stage_5
: A data processing stage (e.g., occludin...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~26-~26: Loose punctuation mark.
Context: ... deviations from the mean). 6.stage_6
: A metadata map-reduce stage (e.g., comp...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~28-~28: Loose punctuation mark.
Context: ...al values in the data). 7.stage_7
: A data processing stage (e.g., normaliz...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~33-~33: Loose punctuation mark.
Context: ...s in the following manner. 1.stage_1
: - As there is no preceding data sta...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...1/codes.parquetdirectory. 2.
stage_2`: - This stage will read in the metad...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~45-~45: Loose punctuation mark.
Context: ...2/codes.parquetdirectory. 3.
stage_3`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~46-~46: Possible missing comma found.
Context: ... in the data from the$INPUT_DIR/data
directory as there has still been no prio...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~51-~51: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 4.
stage_4`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~60-~60: Loose punctuation mark.
Context: ...tage_4/codes.parquetfile. 5.
stage_5`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 6.
stage_6`: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...n to the stage directories. 7.stage_7
: - This stage will read in the data ...(UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~82-~82: The verb form ‘reduce’ does not seem to be suitable in this context.
Context: ...ssing stages (which do not have reduce stages).(HAVE_VB)
docs/preprocessing_operation_prototypes.md
[style] ~80-~80: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ... patient data filters should be applied prior to aggregation. 2. What aggregation functi...(EN_WORDINESS_PREMIUM_PRIOR_TO)
src/MEDS_transforms/extract/README.md
[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...(EN_A_VS_AN)
[style] ~255-~255: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... extraction tool for your raw data. Feel free to reach out if you have any questions or ...(FEEL_FREE_TO_STYLE_ME)
README.md
[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...(MD_BASEFORM)
[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydrajoblib
l...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~181-~181: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ... and external ontologies? Will they need to specify select key-event concepts to id...(REP_NEED_TO_VB)
[style] ~306-~306: This phrase is redundant. Consider using “outside”.
Context: ... - Removing numeric values that fall outside of pre-specified, per-code ranges (e.g., f...(OUTSIDE_OF)
[style] ~315-~315: Consider using “who” when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...(THAT_WHO)
[style] ~324-~324: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... each code occurs in the dataset. - Computing appropriate numeric bins for each code ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...(KIND_OF_A)
Ruff
src/MEDS_transforms/reshard_to_split.py
132-132: Function definition does not bind loop variable `in_fps` (B023)

138-138: Function definition does not bind loop variable `in_fps` (B023)

147-147: Function definition does not bind loop variable `subshard_dir` (B023)

147-147: Function definition does not bind loop variable `out_fp` (B023)

153-153: Function definition does not bind loop variable `subshard_dir` (B023)

154-154: Function definition does not bind loop variable `in_fps` (B023)

158-158: Function definition does not bind loop variable `subshard_dir` (B023)

159-159: Function definition does not bind loop variable `subshard_dir` (B023)

164-164: Function definition does not bind loop variable `subshard_dir` (B023)

166-166: Function definition does not bind loop variable `subshard_dir` (B023)

167-167: Function definition does not bind loop variable `subshard_dir` (B023)
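The B023 findings above all follow the same pattern: a closure defined inside a loop captures the loop variable by reference, so every closure ends up seeing the last value. A generic illustration of the problem and the usual fix (binding the current value as a default argument, or via `functools.partial`) is sketched below; it is not the actual code from `reshard_to_split.py`.

```python
from functools import partial


def process(in_fps: list[str]) -> str:
    # Stand-in for the real per-shard map function.
    return ",".join(in_fps)


def make_jobs_buggy(shards: list[list[str]]):
    jobs = []
    for in_fps in shards:
        jobs.append(lambda: process(in_fps))  # B023: every job sees the final `in_fps`
    return jobs


def make_jobs_fixed(shards: list[list[str]]):
    jobs = []
    for in_fps in shards:
        # Bind the current value at definition time...
        jobs.append(lambda in_fps=in_fps: process(in_fps))
        # ...or equivalently: jobs.append(partial(process, in_fps))
    return jobs
```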
src/MEDS_transforms/extract/split_and_shard_patients.py
22-22: Do not use mutable data structures for argument defaults

Replace with `None`; initialize within function

(B006)
src/MEDS_transforms/mapreduce/utils.py
275-278: Use ternary operator `add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())` instead of `if`-`else`-block

Replace `if`-`else`-block with `add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())`

(SIM108)
src/MEDS_transforms/mapreduce/mapper.py
574-574: Do not perform function call `partial` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)
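The B008 finding follows the standard pattern: a call placed in an argument default runs once at function-definition time and is then shared across calls. A generic sketch of the flagged shape and the usual `None`-default fix (illustrative names, not the actual `map_over` signature):

```python
from functools import partial


def scale(x: float, factor: float) -> float:
    return x * factor


# B008: the `partial(...)` call in the default is evaluated once, when the function is defined.
def apply_flagged(x: float, fn=partial(scale, factor=2.0)) -> float:
    return fn(x)


# Preferred: default to None and construct the callable inside the function body.
def apply_fixed(x: float, fn=None) -> float:
    if fn is None:
        fn = partial(scale, factor=2.0)
    return fn(x)
```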
Markdownlint
eICU_Example/README.md
13-13: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
15-15: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
16-16: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
21-21: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
22-22: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
23-23: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
24-24: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
26-26: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
27-27: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
Additional comments not posted (73)
src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml (1)
3-3: LGTM! But verify the impact of the directory change.

The change in the `output_dir` to `${cohort_dir}/metadata` is a positive step towards better file organization. However, ensure that the rest of the code that relies on the output directory is updated accordingly.
Verification successful
No references to old output directory structure found.
The change to
output_dir: ${cohort_dir}/metadata
does not appear to impact other parts of the codebase, as no references to the old structure were found in YAML files.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for code that references the old output directory structure. # Test: Search for usage of the old output directory. Expect: No references to the old structure. # Note: Adjust the pattern if the variable is used differently in the code. ast-g-r --lang yaml --pattern $'output_dir: $COHORT_DIR'Length of output: 132
Script:
#!/bin/bash # Description: Check for code that references the old output directory structure. # Test: Search for usage of the old output directory. Expect: No references to the old structure. # Note: Adjust the pattern if the variable is used differently in the code. ast-grep --lang yaml --pattern $'output_dir: $COHORT_DIR'Length of output: 167
Script:
#!/bin/bash # Description: Search for references to the old output directory structure in YAML files. # Use ripgrep to search for the pattern in YAML files. rg 'output_dir: \$COHORT_DIR' --glob '*.yaml' -A 2Length of output: 52
docs/javascripts/mathjax.js (1)
1-19
: MathJax configuration is well-structured.The configuration for MathJax is standard and should work well for rendering mathematical expressions in the documentation.
src/MEDS_transforms/__init__.py (3)
4-4
: Consider the impact of thepolars
import.The introduction of
polars
suggests a shift towards using this library for data manipulation. Ensure that all necessary dependencies are installed and documented.Verification successful
Polars is included in the dependencies.
The
polars
library is listed in thepyproject.toml
file, ensuring it is part of the project's dependencies. No further action is needed regarding its inclusion.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the presence of `polars` in the dependencies. # Test: Search for `polars` in the dependency files. Expect: Occurrences in requirements or setup files. rg --type yaml --type toml --type ini --type txt 'polars'Length of output: 85
Script:
#!/bin/bash # Description: Re-verify the presence of `polars` in the dependencies. # Test: Search for `polars` in the common dependency files. Expect: Occurrences in requirements or setup files. rg --type yaml --type toml --type txt 'polars' --glob '*requirements*.txt' --glob 'Pipfile' --glob 'pyproject.toml' --glob 'setup.cfg'Length of output: 369
17-24
: VerifyMANDATORY_TYPES
for type consistency.The
MANDATORY_TYPES
dictionary specifies expected data types usingpolars
. Ensure these types are enforced and compatible with data sources.Verification successful
Indirect enforcement of
MANDATORY_TYPES
through related constructsThe
MANDATORY_TYPES
dictionary itself is not directly used in the codebase. However, related constructs likeMEDS_DATA_MANDATORY_TYPES
andMEDS_METADATA_MANDATORY_TYPES
are used in various files to enforce type consistency. These related constructs ensure that the expected data types are enforced across the codebase.
finalize_MEDS_metadata.py
extract_code_metadata.py
finalize_MEDS_data.py
These files contain logic to validate and enforce schema types, indirectly reflecting the intended use of
MANDATORY_TYPES
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage and enforcement of `MANDATORY_TYPES` in the codebase. # Test: Search for `MANDATORY_TYPES` usage. Expect: Consistent enforcement in data validation logic. rg --type python 'MANDATORY_TYPES'Length of output: 65
Script:
#!/bin/bash # Description: Verify the usage and enforcement of `MANDATORY_TYPES` in the codebase. # Corrected Test: Search for `MANDATORY_TYPES` usage in Python files. rg 'MANDATORY_TYPES' --glob '*.py' -A 5Length of output: 8562
15-15
: EnsureMANDATORY_COLUMNS
aligns with data processing requirements.The
MANDATORY_COLUMNS
list defines essential columns for data processing. Verify that these columns are consistently used throughout the data pipeline.Verification successful
MANDATORY_COLUMNS
is consistently used in data processing.The
MANDATORY_COLUMNS
list is effectively utilized in the data processing pipeline, ensuring that data types are checked and columns are selected appropriately. This aligns with the data processing requirements.
src/MEDS_transforms/extract/finalize_MEDS_data.py
: Used for retyping columns and selecting mandatory columns in the DataFrame.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `MANDATORY_COLUMNS` in the codebase. # Test: Search for `MANDATORY_COLUMNS` usage. Expect: Consistent references in data processing logic. rg --type python 'MANDATORY_COLUMNS'Length of output: 67
Script:
#!/bin/bash # Description: Verify the usage of `MANDATORY_COLUMNS` in the codebase. # Test: Search for `MANDATORY_COLUMNS` usage. Expect: Consistent references in data processing logic. rg 'MANDATORY_COLUMNS' --glob '*.py' -A 3Length of output: 801
Script:
#!/bin/bash # Description: Verify the usage of `MEDS_DATA_MANDATORY_TYPES` in the codebase. # Test: Search for `MEDS_DATA_MANDATORY_TYPES` usage. Expect: Consistent references in data processing logic. rg 'MEDS_DATA_MANDATORY_TYPES' --glob '*.py' -A 3Length of output: 1858
src/MEDS_transforms/configs/preprocess.yaml (1)
4-4
: Evaluate the addition ofreshard_to_split
.The
reshard_to_split
stage enhances the pipeline's reshaping capabilities. Verify its integration and ensure it aligns with the overall data processing strategy.Verification successful
Integration of
reshard_to_split
verified.The
reshard_to_split
stage is integrated into the pipeline, with references found in configuration and test files, indicating proper setup and testing. Further verification of its logic and alignment with the data processing strategy may be needed manually.
tests/test_reshard_to_split.py
: Contains a test function forreshard_to_split
.src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml
: Configuration entry forreshard_to_split
.src/MEDS_transforms/configs/preprocess.yaml
: Includesreshard_to_split
in the list.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the integration of `reshard_to_split` in the pipeline. # Test: Search for `reshard_to_split` usage. Expect: Proper integration in pipeline configuration and logic. rg --type yaml --type python 'reshard_to_split'Length of output: 78
Script:
#!/bin/bash # Description: Verify the integration of `reshard_to_split` in the pipeline. # Test: Search for `reshard_to_split` usage. Expect: Proper integration in pipeline configuration and logic. rg --type yaml --type py 'reshard_to_split'Length of output: 514
.github/workflows/tests.yaml (2)
36-36
: Review the addition of JUnit XML reporting.The
--junitxml=junit.xml
option enhances test reporting. Ensure the generated reports are utilized effectively in the CI/CD pipeline.
42-46
: Check the conditional upload of test results to Codecov.The step to upload test results is conditioned on the job not being cancelled. Verify that this condition is correctly implemented and that the results are uploaded as expected.
mkdocs.yml (6)
1-4
: Ensure accurate project metadata.The site metadata, such as
site_name
,repo_url
,site_description
, andsite_author
, should accurately reflect the project details. Ensure these values are correct and up-to-date.
14-16
: Review theme settings.Ensure the chosen theme and locale settings meet the project's requirements. The
material
theme is widely used and customizable, but verify that it aligns with the desired presentation.
18-29
: Validate markdown extensions configuration.The listed markdown extensions enhance the documentation's capabilities. Verify that these extensions are necessary and correctly configured for the project's needs.
31-33
: Check extra JavaScript inclusion.Ensure the JavaScript files included are necessary and correctly configured. Verify that the URLs are correct and accessible.
35-45
: Review plugin configuration.Ensure that all listed plugins are necessary and correctly configured. Verify that they are compatible with the current MkDocs setup and project requirements.
6-12
: Verify navigation links.Ensure that the navigation links point to the correct markdown files and URLs. This includes checking that all referenced files exist and are properly named.
Verification successful
Navigation links verified successfully.
All referenced markdown files in
mkdocs.yml
exist and are correctly named.
index.md
found atdocs/index.md
pipeline_configuration.md
found atdocs/pipeline_configuration.md
preprocessing_operation_prototypes.md
found atdocs/preprocessing_operation_prototypes.md
terminology.md
found atdocs/terminology.md
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that all navigation links in mkdocs.yml point to existing files. # Test: Check for the existence of the referenced markdown files. nav_files=("index.md" "pipeline_configuration.md" "preprocessing_operation_prototypes.md" "terminology.md") for file in "${nav_files[@]}"; do if ! fd --type f --name "$file" docs; then echo "Missing file: $file" fi doneLength of output: 1757
Script:
#!/bin/bash # Corrected script to verify that all navigation links in mkdocs.yml point to existing files. # Test: Check for the existence of the referenced markdown files. nav_files=("index.md" "pipeline_configuration.md" "preprocessing_operation_prototypes.md" "terminology.md") for file in "${nav_files[@]}"; do if ! fd --type f "$file"; then echo "Missing file: $file" fi doneLength of output: 489
docs/gen_ref_pages.py (5)
12-15
: Ensure correct path handling.The script uses
Path.rglob
to find Python files. Verify that this approach correctly captures all relevant files, especially in complex directory structures.
32-32
: Verify navigation structure.Ensure that the navigation structure correctly reflects the project's module hierarchy. This affects how the documentation is organized and presented.
34-35
: Check module documentation generation.Ensure that the
mkdocstrings
syntax (::: {ident}
) is correctly generating documentation for each module.
37-40
: Ensure file writing operations are correct.Verify that the file writing operations correctly create and update the documentation files without errors.
42-43
: Validate navigation file generation.Ensure the
SUMMARY.md
file is generated correctly and reflects the intended navigation structure.pyproject.toml (2)
30-34
: Verify documentation dependencies.Ensure that the specified versions of documentation dependencies are compatible with each other and meet the project's requirements.
50-50
: Check new script entry for correctness.The new script entry
MEDS_transform-reshard_to_split
should be verified to ensure it correctly points to the intended function and module..github/workflows/python-build.yaml (3)
39-39
: Replace placeholder in environment URL.The URL contains a placeholder
<package-name>
. Ensure this is replaced with the actual package name before deploying.- url: https://pypi.org/p/MEDS-transforms # Replace <package-name> with your PyPI project name + url: https://pypi.org/p/<actual-package-name>Likely invalid or redundant comment.
51-51
: Ensure correct version of PyPI publish action.Using a specific version or release tag for
pypa/gh-action-pypi-publish
ensures stability and compatibility.Verification successful
Version of PyPI publish action is correctly specified.
The workflow uses
pypa/gh-action-pypi-publish@release/v1
, which is a specific release tag ensuring stability and compatibility. No changes are necessary.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the version of `pypa/gh-action-pypi-publish` used in the workflow. # Test: Check if a specific version or release tag is used. Expect: Specific version or release tag. rg 'uses: pypa/gh-action-pypi-publish@release/v1' .github/workflows/python-build.yamlLength of output: 140
15-15
: Verify Python version compatibility.Ensure that Python 3.12 is compatible with your project dependencies and target environments, as it is a relatively new release.
Verification successful
Python 3.12 Compatibility Confirmed
The
pyproject.toml
file specifiesrequires-python = ">=3.12"
, indicating compatibility with Python 3.12. There are no conflicting constraints in other dependency files.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify compatibility of Python 3.12 with project dependencies. # Test: Check if Python 3.12 is specified in the dependency files. Expect: Compatibility declarations. rg --type yaml 'python_version: "3.12"' || rg --type toml 'python = "3.12"'Length of output: 76
Script:
#!/bin/bash # Description: Search for Python version constraints in common dependency files. # Check for Python version constraints in requirements.txt rg 'python_version' $(fd requirements.txt) # Check for Python version constraints in setup.py rg 'python_requires' $(fd setup.py) # Check for Python version constraints in pyproject.toml rg 'requires-python' $(fd pyproject.toml)Length of output: 162
.pre-commit-config.yaml (2)
101-103
: Ensure additional dependencies are installed.The added Markdown formatting tools (
mdformat-mkdocs
,mdformat-toc
,mdformat-admon
) require additional dependencies. Verify these are installed in the environment.
4-4
: Review the updated exclude pattern.The
exclude
pattern now includes specific README files. Ensure these are intentionally excluded from pre-commit checks.tests/utils.py (2)
Line range hint 56-56: LGTM!

The `dict_to_hydra_kwargs` function is well-implemented with comprehensive error handling and documentation.
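For readers unfamiliar with the helper, a minimal sketch of what a `dict_to_hydra_kwargs`-style utility typically does (an illustrative reimplementation, not the actual function in `tests/utils.py`):

```python
def dict_to_hydra_kwargs(d: dict, prefix: str = "") -> list[str]:
    """Flatten a nested dict into Hydra-style `key=value` override strings."""
    overrides = []
    for key, value in d.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            overrides.extend(dict_to_hydra_kwargs(value, prefix=f"{dotted}."))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


# Example (hypothetical keys): {"stage_cfg": {"n_patients_per_shard": 2}, "worker": 0}
#   -> ["stage_cfg.n_patients_per_shard=2", "worker=0"]
```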
125-127: Remember to clean up temporary directories.

Ensure that `conf_dir.cleanup()` is called in all execution paths to prevent resource leaks.
27-35
: LGTM!The
valid_json_file
function is well-implemented with proper error handling for JSON decoding.
52-53
: LGTM!The
write_json
function is well-implemented and correctly handles JSON serialization.
182-183
: LGTM!The
__main__
block is standard and correctly invokes themain
function.
38-49
: LGTM! But verify external dependencies.The
make_new_shards_fn
function is well-implemented. Ensure that theshard_patients
function and configuration settings are correctly defined and used.
59-179
: Verify re-sharding logic and external utilities.The
main
function appears to handle the re-sharding process correctly. Ensure that the logic is sound and that external utilities such asrwlock_wrap
andshard_iterator
are correctly used and defined.Tools
Ruff
132-132: Function definition does not bind loop variable
in_fps
(B023)
138-138: Function definition does not bind loop variable
in_fps
(B023)
147-147: Function definition does not bind loop variable
subshard_dir
(B023)
147-147: Function definition does not bind loop variable
out_fp
(B023)
153-153: Function definition does not bind loop variable
subshard_dir
(B023)
154-154: Function definition does not bind loop variable
in_fps
(B023)
158-158: Function definition does not bind loop variable
subshard_dir
(B023)
159-159: Function definition does not bind loop variable
subshard_dir
(B023)
164-164: Function definition does not bind loop variable
subshard_dir
(B023)
166-166: Function definition does not bind loop variable
subshard_dir
(B023)
167-167: Function definition does not bind loop variable
subshard_dir
(B023)
tests/test_reshard_to_split.py (2)
11-190
: LGTM!The test setup is comprehensive and covers various scenarios for the re-sharding process.
193-202
: Verify test utility integration.The
test_reshard_to_split
function appears to correctly test the re-sharding process. Ensure that thesingle_stage_transform_tester
utility is correctly integrated and used.Verification successful
Integration of
single_stage_transform_tester
is correct.The
single_stage_transform_tester
function is correctly defined in thetransform_tester_base
module and imported into the test file. Its integration and usage in thetest_reshard_to_split
function are appropriate. No issues found.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the definition and usage of `single_stage_transform_tester`. # Test: Search for the `single_stage_transform_tester` function definition. ast-grep --lang python --pattern $'def single_stage_transform_tester($_, $_, $_, $_, $_, $_):'Length of output: 95
Script:
#!/bin/bash # Description: Search for import statements related to `single_stage_transform_tester`. # Test: Search for the import of `single_stage_transform_tester`. rg 'import .*single_stage_transform_tester' -A 2 # Test: Search for `from ... import ...` statements that include `single_stage_transform_tester`. rg 'from .* import .*single_stage_transform_tester' -A 2Length of output: 4185
Script:
#!/bin/bash # Description: Locate the definition of `single_stage_transform_tester` in `transform_tester_base`. # Find the file `transform_tester_base` and search for the function definition. fd 'transform_tester_base' | xargs rg 'def single_stage_transform_tester'Length of output: 108
tests/test_filter_measurements.py (3)
8-8
: Consolidated imports improve readability.The consolidation of imports enhances clarity without affecting functionality.
Line range hint
106-116
:
Existing test function is well-structured.The
test_filter_measurements
function uses thesingle_stage_transform_tester
effectively to validate filtering logic.
211-228
: New test function is well-structured and comprehensive.The
test_match_revise_filter_measurements
function effectively tests filtering logic with specific criteria, enhancing test coverage.src/MEDS_transforms/extract/finalize_MEDS_metadata.py (1)
Line range hint
149-196
:
Improved handling ofshards_map_fp
enhances maintainability.The direct assignment of
shards_map_fp
from configuration clarifies its source and reduces dependency onstage_init
output.eICU_Example/README.md (1)
Line range hint
92-195
:
Formatting changes enhance document readability.The adjustments to step numbering and indentation improve the logical flow and visual structure of the document.
src/MEDS_transforms/transforms/tokenization.py (5)
Line range hint
34-50
: Functionfill_to_nans
is well-implemented.The function is correctly implemented with clear documentation and examples.
Line range hint
53-69
: Functionsplit_static_and_dynamic
is well-implemented.The function effectively splits data into static and dynamic components using Polars LazyFrame operations.
Line range hint
72-108
: Functionextract_statics_and_schema
is well-implemented.The function correctly extracts static data and schema information using group-by operations.
Line range hint
111-143
: Functionextract_seq_of_patient_events
is well-implemented.The function effectively extracts sequences of patient events using Polars expressions.
Line range hint
231-267
: Functionmain
is well-implemented with improved logic.The changes streamline data processing and enhance clarity. The conditional check for
include_only_train
adds robustness.Ensure that
shard_iterator
is used correctly throughout the codebase.Verification successful
Consistent Usage of
shard_iterator
ConfirmedThe
shard_iterator
function is consistently used across multiple files in the codebase, includingtokenization.py
,reshard_to_split.py
, andmapper.py
. This indicates that the changes intokenization.py
regarding the use ofshard_iterator
are aligned with the overall structure and usage patterns within the project.
- Files with
shard_iterator
usage:
src/MEDS_transforms/transforms/tokenization.py
src/MEDS_transforms/reshard_to_split.py
src/MEDS_transforms/mapreduce/mapper.py
src/MEDS_transforms/mapreduce/utils.py
src/MEDS_transforms/transforms/tensorization.py
src/MEDS_transforms/extract/merge_to_MEDS_cohort.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `shard_iterator` in the codebase. # Test: Search for the function usage. Expect: Consistent usage across the codebase. rg --type python -A 5 $'shard_iterator'Length of output: 69
Script:
#!/bin/bash # Description: Verify the usage of `shard_iterator` in the codebase. # Test: Search for the function usage. Expect: Consistent usage across the codebase. rg --type py -A 5 $'shard_iterator'Length of output: 12884
src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2)
Line range hint
16-99
: Functionmerge_subdirs_and_sort
is well-implemented.The function effectively merges and sorts parquet files with comprehensive documentation and logging.
Line range hint
102-144
: Functionmain
is well-implemented with enhanced shard handling.The use of
shard_iterator_by_shard_map
improves shard handling. The configuration ofread_fn
usingpartial
is efficient.Ensure that
shard_iterator_by_shard_map
is used correctly throughout the codebase.Verification successful
Correct usage of
shard_iterator_by_shard_map
confirmed.The
shard_iterator_by_shard_map
function is correctly defined and used within the codebase. It is imported and utilized appropriately inmerge_subdirs_and_sort
inmerge_to_MEDS_cohort.py
.
- File:
src/MEDS_transforms/extract/merge_to_MEDS_cohort.py
- Function:
merge_subdirs_and_sort
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase. # Test: Search for the function usage. Expect: Consistent usage across the codebase. rg --type python -A 5 $'shard_iterator_by_shard_map'Length of output: 82
Script:
#!/bin/bash # Corrected: Verify the usage of `shard_iterator_by_shard_map` in the codebase. # Test: Search for the function usage. Expect: Consistent usage across the codebase. rg --type py -A 5 $'shard_iterator_by_shard_map'Length of output: 3856
src/MEDS_transforms/extract/split_and_shard_patients.py (3)
111-135
: LGTM! Validation logic is correctly adjusted.The validation of the sum of split fractions now only occurs when there are patients to split, which optimizes the control flow.
Line range hint
205-272
: LGTM! Improved file handling and logging.The changes enhance clarity and reliability in constructing paths for logging and writing output.
136-142
: Verify handling ofsplit_fracs_dict
whenNone
.Ensure that the logic correctly handles cases where
split_fracs_dict
isNone
and external splits cover all patients.Verification successful
Handling of
split_fracs_dict
whenNone
is correctly implemented.The
shard_patients
function allowssplit_fracs_dict
to beNone
when external splits fully specify the population, as indicated in the docstring. This ensures that the logic handles the scenario appropriately without errors. No further action is needed.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify handling of `split_fracs_dict` when it is `None`. # Test: Search for the function call to `shard_patients` with `split_fracs_dict=None`. Expect: No runtime errors. rg --type python 'shard_patients\(.*split_fracs_dict=None.*\)'Length of output: 95
Script:
#!/bin/bash # Search for the definition of the `shard_patients` function and inspect its implementation. ast-grep --lang python --pattern 'def shard_patients($_, split_fracs_dict, $$$):'Length of output: 83
Script:
#!/bin/bash # Search for calls to `shard_patients` with `split_fracs_dict=None`. rg 'shard_patients\(.*split_fracs_dict=None.*\)' -A 3Length of output: 571
Script:
#!/bin/bash # Search for the definition of the `shard_patients` function using `rg`. rg 'def shard_patients' -A 10Length of output: 1042
Script:
#!/bin/bash # Extract more lines from the `shard_patients` function to understand its handling of `split_fracs_dict`. rg 'def shard_patients' -A 50Length of output: 6031
tests/transform_tester_base.py (1)
Line range hint
303-374
: LGTM! Enhanced configurability and data handling.The introduction of new parameters and refactoring improves flexibility and readability. Ensure that the new parameters are correctly integrated.
src/MEDS_transforms/utils.py (1)
352-352
: Ensure logic consistency withtrain_only
.The change from
process_split
totrain_only
should be verified for consistency across the codebase to ensure it aligns with the intended logic.Verification successful
The
train_only
logic change is consistent across the codebase.The
train_only
key is used consistently in configuration and logic checks across multiple files, indicating that the change aligns with the intended functionality.
- Instances found in
src/MEDS_transforms/utils.py
,src/MEDS_transforms/mapreduce/utils.py
, andsrc/MEDS_transforms/mapreduce/mapper.py
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify consistency of `train_only` usage across the codebase. # Test: Search for `train_only` usage. Expect: Consistent usage across all instances. rg --type python "train_only"Length of output: 60
Script:
#!/bin/bash # Correcting the file type for Python and searching for `train_only` usage. rg --type py "train_only"Length of output: 2069
src/MEDS_transforms/mapreduce/utils.py (4)
120-124: LGTM!

The `default_file_checker` function correctly checks for file existence and completeness.
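For context, a minimal sketch of a Parquet completeness check with the error logging suggested in the nitpick above; it assumes `pyarrow` and a `loguru` logger, and it is not the repository's actual `is_complete_parquet_file`/`default_file_checker` implementation.

```python
from pathlib import Path

import pyarrow.parquet as pq
from loguru import logger


def is_complete_parquet_file(fp: Path) -> bool:
    """Return True if the file's footer is readable and row counts are internally consistent."""
    try:
        metadata = pq.read_metadata(fp)
    except Exception as e:  # an interrupted write typically leaves a missing or corrupt footer
        logger.error(f"Failed to read Parquet metadata for {fp}: {e}")
        return False
    return metadata.num_rows == sum(
        metadata.row_group(i).num_rows for i in range(metadata.num_row_groups)
    )


def default_file_checker(fp: Path) -> bool:
    """A file counts as done only if it exists and is a complete Parquet file."""
    return fp.is_file() and is_complete_parquet_file(fp)
```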
505-588: LGTM!

The `shard_iterator_by_shard_map` function correctly handles shard iteration based on a map file.
136-137
: Ensureout_fp_checker
is used consistently.The addition of
out_fp_checker
enhances flexibility. Ensure its usage is consistent throughout the function.Verification successful
Verification Successful: Consistent Usage of
out_fp_checker
The
out_fp_checker
is used consistently within therwlock_wrap
function to determine the existence of the output file path, influencing the function's control flow appropriately. No issues found.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify consistent usage of `out_fp_checker` in `rwlock_wrap`. # Test: Check for all instances of `out_fp_checker` in `rwlock_wrap`. ast-grep --lang python --pattern $'def rwlock_wrap[$$$]( $$$, out_fp_checker: Callable[[Path], bool] = default_file_checker, $$$ ) -> bool: $$$ if out_fp_checker(out_fp): $$$ $$$ 'Length of output: 11117
480-490
: Verifytrain_only
logic for shard filtering.The logic for filtering shards with
train_only
should be verified to ensure it behaves as expected.Verification successful
Verified
train_only
logic inshard_iterator
.The logic for filtering shards based on the
train_only
configuration is correctly implemented in theshard_iterator
function. The function filters shards to include only those with the "train/" prefix whentrain_only
is true and such shards exist. Otherwise, it logs an informational message and processes all shards.
- The
train_only
logic is correctly applied in theshard_iterator
function.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify `train_only` logic for shard filtering in `shard_iterator`. # Test: Ensure `train_only` correctly filters shards. ast-grep --lang python --pattern $'def shard_iterator[$$$]( $$$, train_only: bool = False, $$$ ) -> tuple[list[tuple[Path, Path]], bool]: $$$ train_only = cfg.stage_cfg.get("train_only", None) train_shards = [shard_name for shard_name in shards if shard_name.startswith("train/")] if train_only and train_shards: shards = train_shards includes_only_train = True $$$ 'Length of output: 416
Script:
#!/bin/bash # Search for the `shard_iterator` function in the codebase to verify its implementation. rg 'def shard_iterator' -A 20Length of output: 4139
Script:
#!/bin/bash # Search for the full implementation of the `shard_iterator` function to verify the `train_only` logic. rg 'def shard_iterator' -A 50Length of output: 9686
Script:
#!/bin/bash # Search for the full implementation of the `shard_iterator` function with more lines to verify the `train_only` logic. rg 'def shard_iterator' -A 100Length of output: 16912
Script:
#!/bin/bash # Search for the full implementation of the `shard_iterator` function with more lines to verify the `train_only` logic. rg 'def shard_iterator' -A 150Length of output: 21323
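A simplified sketch of the `train/`-prefix filtering behavior described above (the real `shard_iterator` operates on the Hydra config and file paths; the names here are illustrative):

```python
def filter_to_train_shards(
    shards: dict[str, list[int]], train_only: bool
) -> tuple[dict[str, list[int]], bool]:
    """Keep only 'train/' shards when requested and available; otherwise keep everything."""
    train_shards = {name: ids for name, ids in shards.items() if name.startswith("train/")}
    if train_only and train_shards:
        return train_shards, True  # includes_only_train
    return shards, False


# filter_to_train_shards({"train/0": [1, 2], "held_out/0": [3]}, train_only=True)
#   -> ({"train/0": [1, 2]}, True)
```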
tests/test_extract.py (1)
447-447
: LGTM!The changes to the test logic from splits to shards maintain the integrity of the test.
src/MEDS_transforms/mapreduce/mapper.py (6)
228-239
: Function implementation looks good.The
is_match_revise
function correctly checks for the presence of the_match_revise
key in the configuration.
242-288
: Validation logic is comprehensive.The
validate_match_revise
function effectively ensures the configuration is in the correct format. Consider adding test cases to cover all possible edge cases.
291-444
: Complex function requires thorough testing.The
match_revise_fntr
function is well-documented and handles complex logic for match and revise operations. Ensure comprehensive test coverage to validate its behavior under various configurations.
Line range hint
447-568
: Pattern matching logic is well-implemented.The
bind_compute_fn
function effectively uses pattern matching to bind parameters. Verify that all possible compute function types are correctly handled.
571-616
: Control flow refactoring looks good.The
map_over
function has been refactored to include match and revise logic and handle "train only" splits. Verify the control flow to ensure all scenarios are correctly managed.Tools
Ruff
574-574: Do not perform function call
partial
in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable(B008)
157-219: LGTM! Verify updated function usage.
The change to use a filter expression instead of a list of patient IDs enhances flexibility. Ensure that all calls to `read_and_filter_fntr` in the codebase are updated to match this new signature.

Verification successful

Function usage updated correctly.
The `read_and_filter_fntr` function is used with the new signature in the codebase, ensuring compatibility with the recent changes.

- File: `src/MEDS_transforms/reshard_to_split.py`
- Usage: Matches the new signature with a `pl.Expr` for filtering.

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify all function calls to `read_and_filter_fntr` match the new signature.
# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type python -A 5 $'read_and_filter_fntr'

Length of output: 75

Script:
#!/bin/bash
# Description: Verify all function calls to `read_and_filter_fntr` match the new signature.
# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type py -A 5 $'read_and_filter_fntr'

Length of output: 2936
README.md (3)
3-4: Badges enhance visibility.
The addition of badges for code coverage and documentation status improves the project's visibility and accessibility. Ensure the links are correct and up-to-date.

25-33: Installation instructions are clearer.
The reformatting of installation instructions improves readability. Ensure the commands are accurate and up-to-date.

40-78: Consider grammatical improvement and approve formatting.
The reformatting enhances clarity. However, consider addressing the grammatical issue highlighted by static analysis regarding the use of "will":

- Note this will typically required a distributed file system
+ Note this will typically require a distributed file system

Likely invalid or redundant comment.
src/MEDS_transforms/aggregate_code_metadata.py (3)
106-151: Quantile reducer function looks good.
The `quantile_reducer` function effectively calculates specified quantiles. Verify its integration with existing aggregation logic to ensure seamless functionality.
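As a sketch of this kind of per-code quantile aggregation (the column names, and the use of a recent Polars version with `group_by`, are assumptions; this is not the actual `quantile_reducer`):

import polars as pl

def values_quantiles(df: pl.DataFrame, quantiles: list[float]) -> pl.DataFrame:
    # Group by code and pack the requested value quantiles into a single struct column.
    return df.group_by("code").agg(
        pl.struct(
            [pl.col("numeric_value").quantile(q).alias(f"values/quantile/{q}") for q in quantiles]
        ).alias("values/quantiles")
    )

# values_quantiles(pl.DataFrame({"code": ["A", "A", "B"], "numeric_value": [1.0, 3.0, 2.0]}), [0.5])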
64-67: Enumeration update is appropriate.
The addition of `VALUES_QUANTILES` to the `METADATA_FN` enumeration enhances statistical analysis capabilities. Ensure it is correctly integrated into the system.
165-171: Aggregation dictionary update is correct.
The addition of `values/quantiles` mapped to `quantile_reducer` in the `CODE_METADATA_AGGREGATIONS` dictionary appears correct. Verify its usage in the aggregation process to ensure proper functionality.

src/MEDS_transforms/extract/convert_to_sharded_events.py (2)
20-20: Consolidated imports for improved readability.
The consolidation of imports from `MEDS_transforms.utils` into a single line enhances the readability of the code. This change is approved.
683-685: Streamlined initialization of `shards`.
The change from returning `shards_map_fn` to directly using `cfg.shards_map_fp` simplifies the code and improves readability. Ensure that all usages of `stage_init` in the codebase are updated to reflect this change.

Verification successful

Consistent usage of `stage_init` across the codebase.
The changes to the `stage_init` function's return values and their usage in `convert_to_sharded_events.py` are consistent with other parts of the codebase. The updates have been correctly applied.

- Confirmed in `src/MEDS_transforms/extract/convert_to_sharded_events.py` and other related files.

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `stage_init` and ensure it aligns with the new return values.
# Test: Search for the function usage. Expect: Only occurrences of the new usage pattern.
rg --type python -A 5 $'stage_init'

Length of output: 65

Script:
#!/bin/bash
# Correct the `rg` command to search for Python files by using the `--type-add` option.
rg --type-add 'py:*.py' --type py -A 5 $'stage_init'

Length of output: 6830
src/MEDS_transforms/extract/__init__.py (2)
3-3: Imports are well-structured and necessary.
The addition of `MANDATORY_COLUMNS` and `MANDATORY_TYPES` is appropriate as they are used in constructing `MEDS_DATA_MANDATORY_TYPES`.
16-16: Dynamic construction of `MEDS_DATA_MANDATORY_TYPES` is efficient and maintainable.
The use of a dictionary comprehension to map columns to types enhances flexibility and centralizes configuration, which is a good practice.
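The pattern being praised here, sketched with placeholder values (the real `MANDATORY_COLUMNS` and `MANDATORY_TYPES` come from the MEDS schema imports and may not have exactly this shape):

import polars as pl

# Placeholder stand-ins for the imported schema constants.
MANDATORY_COLUMNS = ["patient_id", "time", "code", "numeric_value"]
MANDATORY_TYPES = {
    "patient_id": pl.Int64,
    "time": pl.Datetime("us"),
    "code": pl.Utf8,
    "numeric_value": pl.Float32,
}

# A single comprehension keeps the mandatory-type mapping in sync with the column list.
MEDS_DATA_MANDATORY_TYPES = {col: MANDATORY_TYPES[col] for col in MANDATORY_COLUMNS}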
@@ -19,7 +19,7 @@ def shard_patients(
     patients: np.ndarray,
     n_patients_per_shard: int = 50000,
     external_splits: dict[str, Sequence[SUBJ_ID_T]] | None = None,
-    split_fracs_dict: dict[str, float] = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+    split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
Avoid mutable default arguments.
Using mutable data structures like dictionaries as default arguments can lead to unexpected behavior. Replace with `None` and initialize within the function.
- split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+ split_fracs_dict: dict[str, float] | None = None,
Then, initialize within the function:
if split_fracs_dict is None:
    split_fracs_dict = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}
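For context on why B006 matters, a standalone illustration of the shared-default pitfall (unrelated to `shard_patients` specifically):

def collect(item, bucket=[]):  # anti-pattern: this one list is created once and reused
    bucket.append(item)
    return bucket

collect(1)  # [1]
collect(2)  # [1, 2]  <- state leaks across calls

def collect_fixed(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket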
Tools
Ruff
22-22: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function. (B006)
if parts[-1] == "__init__": | ||
parts = parts[:-1] | ||
doc_path = doc_path.with_name("index.md") | ||
full_doc_path = full_doc_path.with_name("index.md") | ||
|
||
readme_path = Path("/".join(parts + ("README.md",))) | ||
if (src / readme_path).exists(): | ||
md_file_lines.append(f'--8<-- "src/{str(readme_path)}"') | ||
elif parts[-1] == "__main__": | ||
continue |
Optimize conditional logic for special files.
The handling of __init__.py
and __main__.py
files is clear, but consider refactoring the logic for clarity and maintainability.
if parts[-1] in {"__init__", "__main__"}:
if parts[-1] == "__init__":
parts = parts[:-1]
doc_path = doc_path.with_name("index.md")
full_doc_path = full_doc_path.with_name("index.md")
readme_path = Path("/".join(parts + ("README.md",)))
if (src / readme_path).exists():
md_file_lines.append(f'--8<-- "src/{str(readme_path)}"')
continue
@@ -0,0 +1,95 @@
name: Publish Python 🐍 distribution 📦 to PyPI and TestPyPI

on: push
Trigger on specific branches or tags.
Currently, the workflow triggers on all pushes. Consider specifying branches or tags to avoid unnecessary builds.
on:
  push:
    branches:
      - main
      - 'release/*'
    tags:
      - 'v*'
def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:
    """Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.

    Args:
        shards: The list of shards to shuffle.
        cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following
            keys (some are optional, as marked below):
            - `worker` (optional): The worker ID for the MR worker; this is also used to seed the
              randomization process. If not provided, the randomization process will be unseeded.

    Returns:
        The shuffled list of shards.

    Examples:
        >>> cfg = DictConfig({"worker": 1})
        >>> shards = ["train/0", "train/1", "tuning", "held_out"]
        >>> shuffle_shards(shards, cfg)
        ['train/1', 'held_out', 'tuning', 'train/0']
        >>> cfg = DictConfig({"worker": 2})
        >>> shuffle_shards(shards, cfg)
        ['tuning', 'held_out', 'train/1', 'train/0']
    """

    if "worker" in cfg:
        add_str = str(cfg["worker"])
    else:
        add_str = str(datetime.now())

    shard_keys = []
    for shard in shards:
        shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()
        if shard_hash in shard_keys:
            raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
        shard_keys.append(int(shard_hash, 16))

    return [shard for _, shard in sorted(zip(shard_keys, shards))]
Handle potential hash collisions.
Consider adding a mechanism to handle hash collisions more gracefully in `shuffle_shards`.
-        if shard_hash in shard_keys:
-            raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
+        if shard_hash in shard_keys:
+            logger.warning(f"Hash collision for shard {shard} with add_str {add_str}. Retrying with a new seed.")
+            return shuffle_shards(shards, cfg)
Committable suggestion was skipped due to low confidence.
Tools
Ruff
275-278: Use ternary operator `add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())` instead of `if`-`else`-block.
Replace `if`-`else`-block with `add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())`. (SIM108)
files. Similar to task configuration files, but for models.
2. Figure out how to ensure that each pre-processing step reads from the right prior files. Likely need some
kind of a "prior stage name" config variable.
kind of a "prior stage name" config variable.
Consider grammatical improvement and approve content.
The section provides clear insights. However, consider addressing the grammatical issue highlighted by static analysis regarding the use of "kind of a":
- kind of a "prior stage name" config variable
+ kind of "prior stage name" config variable
Tools
LanguageTool
[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...(KIND_OF_A)
Summary by CodeRabbit

New Features
- `.editorconfig` file to enforce consistent coding styles across the project.

Bug Fixes

Documentation

Tests

Chores