Release candidate for 0.0.4 #138

mmcdermott · 2024-08-10T22:18:48Z

Summary by CodeRabbit

New Features
- Introduced .editorconfig for consistent coding style across the project.
- Added automated workflows for building, publishing, and testing Python packages.
- Implemented a script to generate code reference pages for documentation.
- Added a utility for re-sharding a MEDS cohort into subsharded splits.
Bug Fixes
- Enhanced test functions to validate new data handling logic and configurations.
Documentation
- Improved README files and added new documentation for data processing pipeline and configuration.
- Introduced new files for better documentation structure, including mkdocs.yml.
- Launched a new pipeline_configuration.md for detailing the data processing framework.
Refactor
- Streamlined the handling of mandatory data types and columns in the codebase.
- Improved modularity and clarity in various processing functions and configurations.
- Enhanced filtering and processing logic in data transformation workflows.
Style
- Updated formatting and consistency across multiple documentation and code files.
Tests
- Added new test cases for transforming and reshaping patient data, enhancing coverage and validation.
- Updated testing framework to accommodate new data structures and configurations.

…hing works yet.

…an example of loading MEDS_transforms configs as defaults dynamically.

…low for pushing to pypi. The pypi push workflow requires local tagging via 'git tag 0.0.3' then 'git push origin 0.0.3'

Updated workflows and README.

coderabbitai · 2024-08-10T22:20:12Z

Caution

Review failed

The pull request is closed.

Walkthrough

The recent changes significantly enhance coding standards, automation, and data processing capabilities within the project. Key updates include the introduction of an .editorconfig for consistent formatting, improved CI/CD workflows for building and testing Python packages, and new scripts for efficient data sharding. These modifications streamline operations, bolster documentation, and enhance testing frameworks, leading to improved maintainability and collaboration.

Changes

Files	Change Summary
`.editorconfig`	Established coding style guidelines for consistency, including character encoding, line endings, and indentation rules.
`.github/workflows/`	Added automated workflows for publishing and testing Python packages; improved test result reporting in the testing workflow.
`.pre-commit-config.yaml`	Updated exclusion patterns and added new markdown formatting hooks to enhance pre-commit checks.
`.readthedocs.yaml`	Configured automated documentation builds on Read the Docs with specified tools and requirements for MkDocs.
`mkdocs.yml`	Created a configuration for the project documentation site, defining navigation and metadata with enhancements for markdown.
`src/MEDS_transforms/__init__.py`	Introduced mandatory data structures for improved data integrity and handling.
`src/MEDS_transforms/aggregate_code_metadata.py`	Added quantile calculation functionality to enhance data aggregation processes.
`src/MEDS_transforms/configs/`	Updated several YAML configuration files for data processing and sharding, including path adjustments and new processing stages.
`src/MEDS_transforms/mapreduce/`	Enhanced data filtering and processing logic with new utility functions and improved error handling.
`src/MEDS_transforms/reshard_to_split.py`	Added a utility for re-sharding patient data into subsharded splits with configuration management and data integrity checks.
`tests/`	Introduced new tests for reshaping data shards and enhanced existing tests for clarity and structure.
`docs/`	Updated documentation to reflect new features and workflows, along with the addition of new reference files.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CI_CD as "CI/CD System"
    participant PackageRepo as "Package Repository"
    participant TestResults as "Testing Framework"
    
    User->>+CI_CD: Push code changes
    CI_CD->>+PackageRepo: Build and publish package
    CI_CD->>+TestResults: Run automated tests
    TestResults-->>-CI_CD: Return test results
    CI_CD-->>-User: Notify completion of workflow

🐰 In the garden where code does bloom,
A rabbit hops with joy, dispelling gloom.
New scripts and flows, oh what a delight,
With colors of success shining so bright!
For every change, a hop and a cheer,
In the world of coding, we have nothing to fear! 🌼

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov-commenter · 2024-08-10T22:20:19Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 90.60403% with 28 lines in your changes missing coverage. Please review.

Project coverage is 92.91%. Comparing base (d0e7930) to head (e0d1ecb).

✅ All tests successful. No failed tests found.

Files	Patch %	Lines
src/MEDS_transforms/reshard_to_split.py	80.45%	17 Missing ⚠️
src/MEDS_transforms/mapreduce/mapper.py	91.89%	6 Missing ⚠️
src/MEDS_transforms/mapreduce/utils.py	94.64%	3 Missing ⚠️
...EDS_transforms/extract/split_and_shard_patients.py	96.55%	1 Missing ⚠️
src/MEDS_transforms/transforms/tokenization.py	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
+ Coverage   92.48%   92.91%   +0.42%     
==========================================
  Files          24       25       +1     
  Lines        1664     1863     +199     
==========================================
+ Hits         1539     1731     +192     
- Misses        125      132       +7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Adds a reshard-by-split stage as a transform.

coderabbitai

Actionable comments posted: 12

Outside diff range, codebase verification and nitpick comments (19)

docs/javascripts/mathjax.js (1)
14-19: Ensure MathJax is initialized correctly on document events.

The subscription to the document event to clear and reset MathJax is a good practice to ensure that mathematical expressions are rendered correctly after dynamic content changes.

Consider adding comments to explain why these specific MathJax methods are called during the event subscription for better maintainability.
+// Clear and reset MathJax to ensure proper rendering of dynamic content
docs/gen_ref_pages.py (2)
1-2: Add module-level docstring.

The file starts with a docstring, but it could be expanded to explain the purpose and usage of the script in more detail.
- """Generate the code reference pages."""
+ """Generate the code reference pages for the project.

This script uses mkdocs_gen_files to create Markdown documentation files
for each Python module in the src directory. It handles special cases for
__init__.py and __main__.py files.
"""
34-35: Consider adding more context to identifiers.

The identifier ident is constructed from module path parts. Consider adding more context or comments to explain its purpose.
- ident = ".".join(parts)
+ # Construct the module identifier for mkdocs.
+ ident = ".".join(parts)
.github/workflows/python-build.yaml (2)
1-4: Clarify event trigger.

The workflow is triggered on every push. Consider specifying branches or tags to avoid unnecessary runs.
- on: push
+ on:
+   push:
+     branches:
+       - main
+     tags:
+       - 'v*.*.*'
16-21: Recommendation: Use a virtual environment for secure installation of build tools.

The current workflow does not utilize a virtual environment for installing the build tool, which could lead to conflicts with other packages. It is recommended to modify the workflow to create and activate a virtual environment before running the pip install command. This ensures the installation is isolated and secure.

Modify the workflow to include python3 -m venv venv && source venv/bin/activate before installing packages.

Analysis chain

Ensure secure installation of build tools.

The installation of build tools should be secure. Consider using a virtual environment or ensuring the installation command is safe.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the security of build tool installation.

# Test: Check if the build tool installation is performed in a secure manner.
# Expect: Installation should be isolated or verified to be secure.
python3 -m venv venv && source venv/bin/activate && pip install build --user
Length of output: 97

Script:
#!/bin/bash
# Search for virtual environment setup or secure installation practices in the workflow file.
rg 'python3 -m venv|pip install' .github/workflows/python-build.yaml -A 3
Length of output: 186
docs/pipeline_configuration.md (3)
3-7: Clarify the design philosophy.

The introduction is clear but could benefit from a more concise explanation of the design philosophy to enhance readability.
- MEDS Transform's key design philosophy is that you should be able to design the necessary data pre-processing
- and transformation pipeline you need simply by composing a series of simple, reusable, efficient, and
- configurable local transformations into a larger pipeline. This can allow researchers to balance having highly
- flexible, customizable pipelines with ensuring that all operations in the pipeline are simple, shareable,
- efficient, and easy to validate.
+ MEDS Transform's design philosophy emphasizes creating data pre-processing and transformation pipelines by composing simple, reusable, and efficient local transformations. This approach balances flexibility and customization with simplicity, shareability, efficiency, and ease of validation.
16-29: Address loose punctuation in stage descriptions.

Some stage descriptions have loose punctuation marks that can be tightened for better readability.
- 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the
-     data).
+ 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the data).
- 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes
-     that occur more than 10 times).
+ 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes that occur more than 10 times).
- 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code that is in the
-     current running metadata file, which, due to `stage_2`, are those codes that occur more than 10 times).
+ 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code in the current running metadata file, which, due to `stage_2`, are those codes that occur more than 10 times).
Tools

LanguageTool

[uncategorized] ~16-~16: Loose punctuation mark.
Context: ...s of the following stages: 1. stage_1: A metadata map-reduce stage (e.g., coun...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~18-~18: Loose punctuation mark.
Context: ...the codes in the data). 2. stage_2: A metadata-only processing stage (e.g.,...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~20-~20: Loose punctuation mark.
Context: ... occur more than 10 times). 3. stage_3: A data processing stage (e.g., filterin...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... occur more than 10 times). 4. stage_4: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~24-~24: Loose punctuation mark.
Context: ...al values in the data). 5. stage_5: A data processing stage (e.g., occludin...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~26-~26: Loose punctuation mark.
Context: ... deviations from the mean). 6. stage_6: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~28-~28: Loose punctuation mark.
Context: ...al values in the data). 7. stage_7: A data processing stage (e.g., normaliz...

(UNLIKELY_OPENING_PUNCTUATION)

33-81: Improve clarity and grammar in file management descriptions.

Some descriptions can be made clearer, and a missing comma in stage 4 needs to be addressed.
- 1. `stage_1`:
-     - As there is no preceding data stage, this stage will read the data in from `$INPUT_DIR/data`
-         (the `data` suffix is the default data directory for MEDS datasets).
+ 1. `stage_1`:
+     - This stage reads data from `$INPUT_DIR/data` (the `data` suffix is the default data directory for MEDS datasets) as there is no preceding data stage.
- 4. `stage_4`:
-     - This stage will read in the data from the `$COHORT_DIR/stage_3` directory as that is the prior data
-         processing stage.
+ 4. `stage_4`:
+     - This stage reads data from the `$COHORT_DIR/stage_3` directory, as that is the prior data processing stage.
Tools

LanguageTool

[uncategorized] ~33-~33: Loose punctuation mark.
Context: ...s in the following manner. 1. stage_1: - As there is no preceding data sta...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...1/codes.parquetdirectory. 2.stage_2`: - This stage will read in the metad...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~45-~45: Loose punctuation mark.
Context: ...2/codes.parquetdirectory. 3.stage_3`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~51-~51: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 4.stage_4`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~52-~52: Possible missing comma found.
Context: ...the data from the $COHORT_DIR/stage_3 directory as that is the prior data proce...

(AI_HYDRA_LEO_MISSING_COMMA)

[uncategorized] ~60-~60: Loose punctuation mark.
Context: ...tage_4/codes.parquetfile. 5.stage_5`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 6.stage_6`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...n to the stage directories. 7. stage_7: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)
tests/utils.py (1)
107-107: Inconsistent usage of do_use_config_yaml in test cases

The do_use_config_yaml parameter is used in tests/transform_tester_base.py but is missing in tests/test_extract.py, where run_command is invoked multiple times. Please review the test cases to ensure consistent usage of this parameter where applicable.

tests/transform_tester_base.py: do_use_config_yaml is explicitly set.

tests/test_extract.py: do_use_config_yaml is not used.

Analysis chain

Ensure correct usage of do_use_config_yaml.

The new parameter do_use_config_yaml adds flexibility to the function. Ensure that its usage is consistent across all test cases.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify consistent usage of `do_use_config_yaml` in test cases.

# Test: Search for test cases using `run_command`. Expect: Proper usage of `do_use_config_yaml`.
rg --type python 'run_command' -A 3
Length of output: 66

Script:
#!/bin/bash
# Description: Verify consistent usage of `do_use_config_yaml` in test cases.

# Test: Search for test cases using `run_command`. Expect: Proper usage of `do_use_config_yaml`.
rg --type py 'run_command' -A 3
Length of output: 3436
src/MEDS_transforms/reshard_to_split.py (2)
45-55: Handle file existence more gracefully.

The code raises a FileExistsError if a file already exists and do_overwrite is not set. Consider logging this event before raising the error for better traceability.
+ logger.error(f"{str(shards_fp.resolve())} already exists and overwrite is not allowed.")
  raise FileExistsError(f"{str(shards_fp.resolve())} already exists.")
137-144: Clarify exception handling and cleanup logic.

The cleanup logic in the try block could benefit from more detailed logging. Consider adding logs for each step in the cleanup process to help diagnose potential issues.
+ logger.debug(f"Attempting to remove directory {str(subshard_dir.resolve())}.")
  subshard_dir.rmdir()
tests/test_reshard_to_split.py (1)
1-5: Clarify environment variable usage in comments.

The comment mentions setting an environment variable to use local scripts. Ensure this is clear and accurate for users running the tests.
- Set the bash env variable `DO_USE_LOCAL_SCRIPTS=1` to use the local py files, rather than the installed
- scripts.
+ Set the environment variable `DO_USE_LOCAL_SCRIPTS=1` to use local Python files instead of installed scripts.
eICU_Example/README.md (1)
13-13: Fix unordered list indentation.

Adjust the indentation of the unordered list items to meet markdownlint standards and improve readability.
-    - [ ] Identifying the pre-MEDS steps for eICU
+  - [ ] Identifying the pre-MEDS steps for eICU
-    - [ ] Test that it runs at all.
-    - [ ] Test that the output is as expected.
+  - [ ] Test that it runs at all.
+  - [ ] Test that the output is as expected.
-    - [ ] Sub-sharding
-    - [ ] Patient split gathering
-    - [ ] Event extraction
-    - [ ] Merging
+  - [ ] Sub-sharding
+  - [ ] Patient split gathering
+  - [ ] Event extraction
+  - [ ] Merging
-    - [ ] Basic validation
-    - [ ] Detailed validation
+  - [ ] Basic validation
+  - [ ] Detailed validation
Also applies to: 15-16, 21-24, 26-27

Tools

Markdownlint

13-13: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)
src/MEDS_transforms/transforms/tokenization.py (1)
237-241: Ensure necessary directories are created before writing files.

The directories "schemas" and "event_seqs" do not exist, which could lead to runtime errors when writing files. Ensure that the code creates these directories if they do not already exist before writing to them.

Consider adding logic to create the directories before file operations.

Analysis chain

Simplify output file path management.

The direct handling of shards_single_output simplifies file path management, reducing complexity. Ensure that all necessary directories exist before writing to prevent runtime errors.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that the necessary directories for output files exist.

# Test: Check for the existence of required directories. Expect: Directories should exist or be created.
for dir in "schemas" "event_seqs"; do
  if [ ! -d "$dir" ]; then
    echo "Directory $dir does not exist."
  fi
done
Length of output: 284
docs/preprocessing_operation_prototypes.md (1)
80-80: Consider simplifying phrasing.

The phrase "prior to" could be simplified to "before" for conciseness.
- patient data filters should be applied prior to aggregation.
+ patient data filters should be applied before aggregation.
Tools

LanguageTool

[style] ~80-~80: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ... patient data filters should be applied prior to aggregation. 2. What aggregation functi...

(EN_WORDINESS_PREMIUM_PRIOR_TO)
src/MEDS_transforms/extract/README.md (1)
255-255: Consider revising for professionalism.

The phrase "feel free to" is common and can be replaced with a more formal alternative.
- Feel free to reach out if you have any questions or concerns about this.
+ Please reach out if you have any questions or concerns about this.
Tools

LanguageTool

[style] ~255-~255: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... extraction tool for your raw data. Feel free to reach out if you have any questions or ...

(FEEL_FREE_TO_STYLE_ME)
src/MEDS_transforms/mapreduce/mapper.py (1)
Line range hint 447-569:
Avoid using partial in default arguments.

The compute_fn argument in bind_compute_fn uses partial. It's recommended to avoid function calls in default arguments. Define the partial outside the function.
# Define the default compute function outside
default_compute_fn = partial(pl.scan_parquet, glob=False)

def bind_compute_fn(cfg: DictConfig, stage_cfg: DictConfig, compute_fn: ANY_COMPUTE_FN_T = default_compute_fn) -> COMPUTE_FN_T:
README.md (2)
25-33: Improve consistency in installation instructions.

The installation instructions are clear, but ensure consistency in formatting and phrasing across bullet points for better readability.
- For a pypi installation, install with `pip install MEDS-transforms`.
+ For a PyPI installation, use `pip install MEDS-transforms`.
40-78: Enhance clarity in the design philosophy section.

The design philosophy section is well-structured. Consider adding examples or links to relevant sections for each philosophy point to enhance understanding.
- This representation is the Medical Event Data Standard (MEDS) format, and in this repository we use it in the "flat" format, where data is organized as rows of `patient_id`, `time`, `code`, `numeric_value` columns.
+ This representation is the Medical Event Data Standard (MEDS) format. In this repository, we use the "flat" format, where data is organized as rows of `patient_id`, `time`, `code`, and `numeric_value` columns.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d0e7930 and 5e47f0c.

Files selected for processing (42)

.editorconfig (1 hunks)
.github/workflows/python-build.yaml (1 hunks)
.github/workflows/tests.yaml (1 hunks)
.pre-commit-config.yaml (3 hunks)
.readthedocs.yaml (1 hunks)
MIMIC-IV_Example/README.md (2 hunks)
README.md (8 hunks)
docs/gen_ref_pages.py (1 hunks)
docs/index.md (1 hunks)
docs/javascripts/mathjax.js (1 hunks)
docs/pipeline_configuration.md (1 hunks)
docs/preprocessing_operation_prototypes.md (7 hunks)
eICU_Example/README.md (10 hunks)
eICU_Example/configs/table_preprocessors.yaml (1 hunks)
mkdocs.yml (1 hunks)
pyproject.toml (2 hunks)
src/MEDS_transforms/init.py (2 hunks)
src/MEDS_transforms/aggregate_code_metadata.py (10 hunks)
src/MEDS_transforms/configs/extract.yaml (1 hunks)
src/MEDS_transforms/configs/preprocess.yaml (2 hunks)
src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml (1 hunks)
src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml (1 hunks)
src/MEDS_transforms/extract/README.md (6 hunks)
src/MEDS_transforms/extract/init.py (2 hunks)
src/MEDS_transforms/extract/convert_to_sharded_events.py (2 hunks)
src/MEDS_transforms/extract/extract_code_metadata.py (2 hunks)
src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2 hunks)
src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2 hunks)
src/MEDS_transforms/extract/split_and_shard_patients.py (7 hunks)
src/MEDS_transforms/mapreduce/mapper.py (4 hunks)
src/MEDS_transforms/mapreduce/utils.py (3 hunks)
src/MEDS_transforms/reshard_to_split.py (1 hunks)
src/MEDS_transforms/transforms/tokenization.py (3 hunks)
src/MEDS_transforms/utils.py (8 hunks)
tests/test_add_time_derived_measurements.py (1 hunks)
tests/test_extract.py (1 hunks)
tests/test_filter_measurements.py (2 hunks)
tests/test_occlude_outliers.py (1 hunks)
tests/test_reorder_measurements.py (1 hunks)
tests/test_reshard_to_split.py (1 hunks)
tests/transform_tester_base.py (7 hunks)
tests/utils.py (2 hunks)

Files skipped from review due to trivial changes (11)

.editorconfig
.readthedocs.yaml
MIMIC-IV_Example/README.md
docs/index.md
eICU_Example/configs/table_preprocessors.yaml
src/MEDS_transforms/configs/extract.yaml
src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml
src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml
tests/test_add_time_derived_measurements.py
tests/test_occlude_outliers.py
tests/test_reorder_measurements.py

Additional context used

LanguageTool

docs/pipeline_configuration.md

[uncategorized] ~16-~16: Loose punctuation mark.
Context: ...s of the following stages: 1. stage_1: A metadata map-reduce stage (e.g., coun...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~18-~18: Loose punctuation mark.
Context: ...the codes in the data). 2. stage_2: A metadata-only processing stage (e.g.,...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~20-~20: Loose punctuation mark.
Context: ... occur more than 10 times). 3. stage_3: A data processing stage (e.g., filterin...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... occur more than 10 times). 4. stage_4: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~24-~24: Loose punctuation mark.
Context: ...al values in the data). 5. stage_5: A data processing stage (e.g., occludin...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~26-~26: Loose punctuation mark.
Context: ... deviations from the mean). 6. stage_6: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~28-~28: Loose punctuation mark.
Context: ...al values in the data). 7. stage_7: A data processing stage (e.g., normaliz...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~33-~33: Loose punctuation mark.
Context: ...s in the following manner. 1. stage_1: - As there is no preceding data sta...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...1/codes.parquetdirectory. 2.stage_2`: - This stage will read in the metad...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~45-~45: Loose punctuation mark.
Context: ...2/codes.parquetdirectory. 3.stage_3`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~51-~51: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 4.stage_4`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~52-~52: Possible missing comma found.
Context: ...the data from the $COHORT_DIR/stage_3 directory as that is the prior data proce...

(AI_HYDRA_LEO_MISSING_COMMA)

[uncategorized] ~60-~60: Loose punctuation mark.
Context: ...tage_4/codes.parquetfile. 5.stage_5`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 6.stage_6`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...n to the stage directories. 7. stage_7: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)

[grammar] ~82-~82: The verb form ‘reduce’ does not seem to be suitable in this context.
Context: ...ssing stages (which do not have reduce stages).

(HAVE_VB)

docs/preprocessing_operation_prototypes.md

[style] ~80-~80: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ... patient data filters should be applied prior to aggregation. 2. What aggregation functi...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

src/MEDS_transforms/extract/README.md

[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...

(EN_A_VS_AN)

[style] ~255-~255: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... extraction tool for your raw data. Feel free to reach out if you have any questions or ...

(FEEL_FREE_TO_STYLE_ME)

README.md

[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...

(MD_BASEFORM)

[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydra joblib l...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~181-~181: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ... and external ontologies? Will they need to specify select key-event concepts to id...

(REP_NEED_TO_VB)

[uncategorized] ~207-~207: Possible missing comma found.
Context: ...achine or across many worker nodes on a cluster provided the worker nodes have acce...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~306-~306: This phrase is redundant. Consider using “outside”.
Context: ... - Removing numeric values that fall outside of pre-specified, per-code ranges (e.g., f...

(OUTSIDE_OF)

[style] ~315-~315: Consider using “who” when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...

(THAT_WHO)

[style] ~324-~324: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... each code occurs in the dataset. - Computing appropriate numeric bins for each code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

[uncategorized] ~328-~328: Possible missing preposition found.
Context: ...ure that the only columns that need be processed going into the pre-processing...

(AI_HYDRA_LEO_MISSING_TO)

[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...

(KIND_OF_A)

Ruff

src/MEDS_transforms/reshard_to_split.py

116-116: Function definition does not bind loop variable in_fps

(B023)

src/MEDS_transforms/extract/split_and_shard_patients.py

22-22: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

src/MEDS_transforms/mapreduce/utils.py

247-250: Use ternary operator add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now()) instead of if-else-block

Replace if-else-block with add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

(SIM108)

src/MEDS_transforms/mapreduce/mapper.py

574-574: Do not perform function call partial in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

Markdownlint

eICU_Example/README.md

13-13: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

15-15: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

16-16: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

21-21: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

22-22: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

23-23: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

24-24: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

26-26: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

27-27: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

Additional comments not posted (61)

docs/javascripts/mathjax.js (1)

1-12: MathJax configuration looks good.

The configuration for inline and display math is standard, and the use of processEscapes and processEnvironments is appropriate for handling TeX expressions.

src/MEDS_transforms/extract/__init__.py (2)

3-3: Imports are appropriate for the new functionality.

The import of MANDATORY_COLUMNS and MANDATORY_TYPES is necessary for constructing MEDS_DATA_MANDATORY_TYPES dynamically.

16-16: Dynamic construction of MEDS_DATA_MANDATORY_TYPES improves maintainability.

Using dictionary comprehension to construct MEDS_DATA_MANDATORY_TYPES from MANDATORY_COLUMNS and MANDATORY_TYPES enhances flexibility and reduces the risk of errors when updating mandatory fields.

src/MEDS_transforms/__init__.py (4)

4-4: Import of polars is appropriate for data type handling.

The inclusion of polars is necessary for defining data types in MANDATORY_TYPES, which enhances type safety.

15-15: Definition of MANDATORY_COLUMNS ensures data integrity.

Listing essential fields in MANDATORY_COLUMNS helps maintain data integrity by enforcing the presence of these fields in the dataset.

17-24: Use of polars types in MANDATORY_TYPES enhances type safety.

Defining expected data types with polars types ensures that data processing is consistent and reliable.

26-33: Mapping deprecated names maintains backward compatibility.

The DEPRECATED_NAMES dictionary helps transition to new naming conventions while maintaining backward compatibility with older datasets.
src/MEDS_transforms/configs/preprocess.yaml (1)
4-4: Ensure compatibility with the new stage reshard_to_split.

The addition of reshard_to_split in the stage_configs section modifies the data processing pipeline. Verify that all necessary components are in place to support this new stage.

Verification successful

The reshard_to_split stage is supported and configured properly.

The reshard_to_split stage is referenced in various parts of the codebase, including configuration files, test scripts, and as an entry point in pyproject.toml. This indicates that the necessary components are in place to support this new stage.

Entry Point: Defined in pyproject.toml as MEDS_transform-reshard_to_split.

Tests: Present in tests/test_reshard_to_split.py.

Configuration: Parameters set in src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Ensure compatibility with the new stage `reshard_to_split`.

# Test: Search for the `reshard_to_split` function or class definition. Expect: At least one definition.
ast-grep --lang python --pattern $'def reshard_to_split($_):\n  $$$'
Length of output: 68

Script:
#!/bin/bash
# Broaden the search for any occurrences of `reshard_to_split` in the codebase.
rg 'reshard_to_split' -A 3
Length of output: 1869
.github/workflows/tests.yaml (2)

36-36: Enhance test reporting with JUnit XML.

The addition of --junitxml=junit.xml in the pytest command enhances test reporting by generating a JUnit XML report, which is useful for CI/CD integration.

42-46: Robust test result upload to Codecov.

The new step to upload test results to Codecov, conditioned on the previous steps not being cancelled, adds robustness to the workflow. Ensure the CODECOV_TOKEN is securely managed.

mkdocs.yml (1)

1-45: Comprehensive MkDocs configuration.

The new mkdocs.yml file is well-structured, utilizing the Material theme and various extensions to enhance the documentation. Ensure that all referenced files and scripts, like docs/gen_ref_pages.py, exist and are correctly configured.
docs/gen_ref_pages.py (3)
12-30: Ensure correct handling of special cases.

The logic correctly skips __main__.py files and adjusts paths for __init__.py files. Ensure that this behavior aligns with your documentation requirements.

Verification successful

Handling of __init__.py and __main__.py files is correct.

The code correctly processes __init__.py files by converting them to index.md and skips __main__.py files as intended. The current implementation aligns with the documentation requirements.

__init__.py files found in the src directory will be converted to index.md.

No __main__.py files were found, confirming that none are incorrectly processed.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the handling of __init__.py and __main__.py files in the documentation generation.

# Test: List all __init__.py and __main__.py files in the src directory.
# Expect: __main__.py files are excluded from documentation, __init__.py files are converted to index.md.
fd --type f --name '__init__.py' src/
fd --type f --name '__main__.py' src/
Length of output: 577

Script:
#!/bin/bash
# Description: Verify the handling of __init__.py and __main__.py files in the documentation generation.

# Test: List all __init__.py and __main__.py files in the src directory.
# Expect: __main__.py files are excluded from documentation, __init__.py files are converted to index.md.
fd '__init__.py' src/
fd '__main__.py' src/
Length of output: 334
32-32: Verify navigation structure.

The navigation structure is built using mkdocs_gen_files.Nav. Ensure that the generated structure meets the project's documentation needs.

37-40: Check for file path correctness.

Ensure that the paths used for writing files and setting edit paths are correct and that they align with the project's structure.
pyproject.toml (2)
50-50: Verify new script entry.

The new script MEDS_transform-reshard_to_split is added. Ensure that the function it points to exists and is correctly implemented.

30-34: Verify documentation dependencies.

The new documentation dependencies enhance the project's documentation capabilities. Verify that the specified versions are compatible and meet the project's needs.

Verification successful

Documentation Dependencies Verified

All specified documentation dependencies are available on PyPI and compatible with the project's requirements.

mkdocs==1.6.0

mkdocs-material==9.5.31

mkdocstrings==0.25.2

mkdocs-gen-files==0.5.0

mkdocs-literate-nav==0.6.1

mkdocs-section-index==0.3.9

mkdocs-git-authors-plugin==0.9.0

mkdocs-git-revision-date-localized-plugin==1.2.6
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the compatibility of new documentation dependencies.

# Test: Check if the specified versions of mkdocs and related plugins are available.
# Expect: All specified versions should be available for installation.
pip install mkdocs==1.6.0 mkdocs-material==9.5.31 mkdocstrings[python,shell]==0.25.2 mkdocs-gen-files==0.5.0 mkdocs-literate-nav==0.6.1 mkdocs-section-index==0.3.9 mkdocs-git-authors-plugin==0.9.0 mkdocs-git-revision-date-localized-plugin==1.2.6
Length of output: 320

Script:
#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0"
  "mkdocs-material==9.5.31"
  "mkdocstrings[python,shell]==0.25.2"
  "mkdocs-gen-files==0.5.0"
  "mkdocs-literate-nav==0.6.1"
  "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0"
  "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  name_version=$(echo $pkg | sed 's/==/-/g' | sed 's/\[.*\]//g')
  curl -s "https://pypi.org/pypi/${name_version}/json" | jq '.info.version'
done
Length of output: 2130

Script:
#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0"
  "mkdocs-material==9.5.31"
  "mkdocstrings==0.25.2"
  "mkdocs-gen-files==0.5.0"
  "mkdocs-literate-nav==0.6.1"
  "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0"
  "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  name_version=$(echo $pkg | sed 's/==/-/g')
  package_name=$(echo $name_version | cut -d'-' -f1)
  version=$(echo $name_version | cut -d'-' -f2)
  curl -s "https://pypi.org/pypi/${package_name}/${version}/json" | jq '.info.version'
done
Length of output: 2945

Script:
#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0"
  "mkdocs-material==9.5.31"
  "mkdocstrings==0.25.2"
  "mkdocs-gen-files==0.5.0"
  "mkdocs-literate-nav==0.6.1"
  "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0"
  "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  package_name=$(echo $pkg | cut -d'=' -f1)
  version=$(echo $pkg | cut -d'=' -f3)
  curl -s "https://pypi.org/pypi/${package_name}/${version}/json" | jq '.info.version'
done
Length of output: 2434
.github/workflows/python-build.yaml (4)
72-78: Verify Sigstore signing configuration.

The Sigstore signing step should be correctly configured and secure. Verify the inputs and ensure that the signing process is reliable.

78-95: Check GitHub Release creation.

Ensure that the GitHub Release creation step is correctly configured and that all necessary artifacts are uploaded.

Verification successful

GitHub Release creation is correctly configured.

The workflow configuration for creating a GitHub Release and uploading artifacts is properly set up. The dist/** pattern ensures that all necessary distribution files and signatures are uploaded to the release.

The github-release job downloads artifacts, signs them, and uploads them to the GitHub release.

The use of ${{ github.ref_name }} and ${{ github.repository }} in the gh commands is appropriate within the GitHub Actions environment.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the configuration for GitHub Release creation.

# Test: Check if the GitHub Release step uploads all necessary artifacts.
# Expect: All distribution files and signatures should be uploaded.
gh release view '${{ github.ref_name }}' --repo '${{ github.repository }}'
Length of output: 148

Script:
#!/bin/bash
# Search for the GitHub Release creation and artifact upload steps in the workflow file
# to verify if all necessary artifacts are included.

# Find the workflow file
fd 'python-build.yaml' --type f --search-path .github/workflows --exec cat
Length of output: 2813
30-33: Confirm tag-based publishing.

The workflow publishes to PyPI only on tag pushes. Ensure that this condition is correctly implemented and tested.

Verification successful

Tag-based publishing condition is correctly implemented.

The publish-to-pypi job in the GitHub Actions workflow is configured to trigger only on tag pushes using the condition startsWith(github.ref, 'refs/tags/'). This ensures that publishing to PyPI occurs as expected when a tag is pushed.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify tag-based publishing condition.

# Test: Check if the workflow correctly identifies tag pushes.
# Expect: Workflow should trigger publishing only on tag pushes.
git tag v0.0.4 && git push origin v0.0.4
Length of output: 132

Script:
#!/bin/bash
# Verify the 'publish-to-pypi' job's condition in the GitHub Actions workflow file.
# We are looking for the 'if' condition that should trigger the job only on tag pushes.
rg -A 3 'publish-to-pypi:' .github/workflows/python-build.yaml
Length of output: 224
11-15: Verify Python version setup.

Ensure that the specified Python version (3.12) is compatible with all dependencies and the project requirements.
.pre-commit-config.yaml (3)
129-129: LGTM! Improved readability of nbqa-flake8 arguments.

The reformatting of arguments into a single line enhances readability without affecting functionality.

4-4: Verify the new exclusion pattern.

Ensure that the updated exclusion pattern aligns with the project's requirements and does not exclude important files from pre-commit checks.

Verification successful

Exclusion pattern verified successfully.

The exclusion pattern in .pre-commit-config.yaml correctly excludes the specified files (docs/index.md, MIMIC-IV_Example/README.md, and eICU_Example/README.md) without affecting other critical files. No issues found with the current configuration.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the exclusion pattern in `.pre-commit-config.yaml`.

# Test: List files that match the exclusion pattern. Expect: No critical files should be excluded.
rg --files --glob '!docs/index.md' --glob '!MIMIC-IV_Example/README.md' --glob '!eICU_Example/README.md'
Length of output: 3835
101-103: Verify compatibility of new markdown hooks.

Ensure that the newly added hooks (mdformat-mkdocs, mdformat-toc, mdformat-admon) are compatible with the project's markdown files and do not introduce formatting issues.
tests/utils.py (2)

7-7: LGTM! Necessary import for configuration handling.

The import of OmegaConf is necessary for handling YAML configuration files in the updated run_command function.

114-166: Review error handling and resource cleanup.

The error handling and resource cleanup logic have been improved. Ensure that temporary directories are always cleaned up to prevent resource leaks.
src/MEDS_transforms/reshard_to_split.py (4)
4-17: Consider removing unused imports.

The imports from hydra, loguru, and omegaconf are appropriate for configuration management and logging. However, ensure that all imported modules are used in the file.

87-95: Verify the filtering logic within rwlock_wrap.

Ensure that the filtering logic applied with read_and_filter_fntr is correct and that it efficiently filters the data based on patient IDs.

Verification successful

Filtering logic within rwlock_wrap is correctly implemented.

The function read_and_filter_fntr is used to apply a filter expression to data read from a file. In the context of rwlock_wrap, it effectively filters the data based on the specified patient IDs using the expression pl.col("patient_id").is_in(patients). The implementation aligns with the intended functionality.

src/MEDS_transforms/reshard_to_split.py: Usage of read_and_filter_fntr with patient ID filtering.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the filtering logic applied in rwlock_wrap.

# Test: Search for read_and_filter_fntr usage to confirm filtering logic.
rg --type python $'read_and_filter_fntr'
Length of output: 70

Script:
#!/bin/bash
# Search for the definition and usage of read_and_filter_fntr to verify the filtering logic.
rg 'read_and_filter_fntr' -A 5
Length of output: 2990
28-32: Check for potential data loading issues.

The code reads a Parquet file into a DataFrame. Ensure that the file path is correct and that the file exists in the expected location.

21-23: Ensure correct configuration path usage.

The hydra.main decorator uses PREPROCESS_CONFIG_YAML for configuration. Verify that this path is correct and accessible in the runtime environment.
tests/test_reshard_to_split.py (1)

193-202: Ensure comprehensive test coverage.

The test function test_reshard_to_split uses single_stage_transform_tester to validate the script. Ensure that this covers all critical paths and edge cases of the script.
tests/test_filter_measurements.py (1)
211-228: Ensure filtering logic is accurately tested.

The new test function test_match_revise_filter_measurements uses specific filtering criteria. Verify that these criteria align with the intended logic and cover necessary edge cases.

Verification successful

Filtering logic in test function is correctly aligned with implementation.

The test function test_match_revise_filter_measurements accurately reflects the filtering logic defined in the source code. It tests various codes and patient thresholds, ensuring comprehensive coverage of the filtering criteria.

The filtering function filter_measurements_fntr in src/MEDS_transforms/filters/filter_measurements.py aligns with the test's logic.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the filtering logic in test_match_revise_filter_measurements aligns with the intended logic.

# Test: Check for additional filtering logic or criteria related to filter_measurements.
rg --type python $'filter_measurements'
Length of output: 69

Script:
#!/bin/bash
# Description: Verify the filtering logic in test_match_revise_filter_measurements aligns with the intended logic.

# Test: Search for the term 'filter_measurements' across all Python files to find related logic or criteria.
rg 'filter_measurements' --glob '*.py' -A 5
Length of output: 7435
src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2)

149-149: Streamline initialization of shards_map_fp.

The direct assignment of shards_map_fp within the main function simplifies the code by reducing unnecessary return values from stage_init(cfg). This is a good practice for improving code readability and maintainability.

196-196: Ensure existence of cfg.shards_map_fp.

Before using Path(cfg.shards_map_fp), verify that cfg.shards_map_fp is a valid and expected path. This prevents potential runtime errors if the configuration is incorrect.
src/MEDS_transforms/transforms/tokenization.py (2)
232-235: Check for unsupported configurations.

Raising a ValueError when include_only_train is true ensures unsupported configurations are flagged early. This is a good practice to prevent unexpected behavior during execution.

20-20: Update import statements.

The import of shard_iterator from MEDS_transforms.mapreduce.utils suggests a shift in utility functions. Ensure that shard_iterator is the correct choice for the intended functionality.

Verification successful

Import statement is correct.

The shard_iterator function is correctly imported from MEDS_transforms.mapreduce.utils and is used across multiple files in the codebase. No updates to the import statement are necessary.

src/MEDS_transforms/mapreduce/utils.py: Contains the definition of shard_iterator.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the implementation of `shard_iterator` in the codebase.

# Test: Search for the `shard_iterator` function definition. Expect: A valid function definition.
ast-grep --lang python --pattern $'def shard_iterator($_):'
Length of output: 60

Script:
#!/bin/bash
# Searching for any occurrences of `shard_iterator` in the codebase to verify its existence and usage.
rg 'shard_iterator'
Length of output: 2577
src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2)
12-12: Ensure consistency with new shard iterator.

The import of shard_iterator_by_shard_map replaces the previous shard_iterator. Ensure that this change is consistent with the intended behavior of the map_over function and that it aligns with the overall data processing logic.

Verification successful

Consistent usage of shard_iterator_by_shard_map verified.

The shard_iterator_by_shard_map function is correctly used as an argument in the map_over function within merge_to_MEDS_cohort.py, ensuring consistency with the intended data processing logic.

merge_to_MEDS_cohort.py: map_over uses shard_iterator_by_shard_map as shard_iterator_fntr.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage with the intended behavior.
rg --type python 'shard_iterator_by_shard_map'
Length of output: 77

Script:
#!/bin/bash
# Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage with the intended behavior.
rg --type py 'shard_iterator_by_shard_map'
Length of output: 776

Script:
#!/bin/bash
# Description: Verify the usage of `map_over` in relation to `shard_iterator_by_shard_map`.

# Test: Search for the `map_over` function in the `merge_to_MEDS_cohort.py` file.
rg 'map_over' src/MEDS_transforms/extract/merge_to_MEDS_cohort.py -A 5
Length of output: 363
241-241: Verify the impact of iterator function change.

The shard_iterator_fntr parameter has been updated to use shard_iterator_by_shard_map. Verify that this change does not introduce any issues in the map_over function's behavior.
docs/preprocessing_operation_prototypes.md (1)

45-46: Formatting improvements enhance readability.

The formatting adjustments improve the document's visual consistency and readability. No functional content changes are made.

Also applies to: 74-74, 82-83, 113-113, 155-155, 203-203, 205-205, 210-211, 222-222, 230-233

src/MEDS_transforms/extract/split_and_shard_patients.py (3)

108-108: Initialize splits consistently.

The splits variable is initialized with external_splits, which is appropriate. Ensure this initialization is consistent with the intended logic for handling splits.

111-139: Improved handling of split fractions and warnings.

The restructuring of logic to handle split fractions and warnings about empty splits enhances robustness. Ensure that the warning messages are clear and informative.

268-272: Improved logging and file handling.

The changes to path handling and logging improve clarity and maintainability. The use of shards_map_fp for file operations is a good practice.

src/MEDS_transforms/extract/README.md (1)

8-10: Formatting changes improve readability.

The adjustments to indentation and alignment enhance the document's visual clarity.

Also applies to: 32-36, 38-44, 80-90, 92-93, 104-104, 220-221, 248-250, 252-255
tests/transform_tester_base.py (3)
303-305: New parameters enhance configurability.

The addition of do_use_config_yaml, input_shards_map, and input_splits_map allows for more flexible input configurations.

366-374: Streamlined command execution logic.

The use of run_command_kwargs to pass arguments improves readability and maintainability.

320-336: Ensure correct default handling for input maps.

The logic for defaulting input_shards_map and input_splits_map is correctly implemented, but verify that these defaults align with expected usage scenarios.

Verification successful

Defaults for input_shards_map and input_splits_map are correctly handled and tested.

The search results indicate that there are test cases where input_shards_map and input_splits_map are not explicitly provided, allowing the defaults to be used. This confirms that the default handling logic is being tested as expected.

Test cases in tests/test_tokenization.py, tests/test_reorder_measurements.py, tests/test_tensorization.py, tests/test_fit_vocabulary_indices.py, tests/test_filter_measurements.py, tests/test_normalization.py, tests/test_occlude_outliers.py, and tests/test_filter_patients.py do not provide these maps, implying default usage.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify usage of `input_shards_map` and `input_splits_map` in tests.

# Test: Search for test cases using `single_stage_transform_tester` to ensure defaults are correctly applied.
rg --type python 'single_stage_transform_tester' -A 10
Length of output: 85

Script:
#!/bin/bash
# Description: Verify usage of `input_shards_map` and `input_splits_map` in tests.

# Test: Search for test cases using `single_stage_transform_tester` to ensure defaults are correctly applied.
rg --type py 'single_stage_transform_tester' -A 10
Length of output: 14376
src/MEDS_transforms/utils.py (2)
Line range hint 225-352: Improved stage configuration logic.

The refined handling of metadata stages and train_only logic enhances configurability. Verify that all stages are handled correctly.

Verification successful

Verification of populate_stage logic improvements.

The populate_stage function is tested through doctests within src/MEDS_transforms/utils.py, which cover various stage configurations and expected outcomes. These examples verify the function's behavior and logic improvements. Consider adding more comprehensive unit tests for additional coverage.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify correctness of stage configuration logic in `populate_stage`.

# Test: Search for test cases or usage examples of `populate_stage` to ensure all stages are configured correctly.
rg --type python 'populate_stage' -A 10
Length of output: 70

Script:
#!/bin/bash
# Description: Verify correctness of stage configuration logic in `populate_stage`.

# Test: Search for test cases or usage examples of `populate_stage` to ensure all stages are configured correctly.
rg --type py 'populate_stage' -A 10
Length of output: 6119
Line range hint 61-98: Simplified return type improves clarity.

The change to return only three Path objects reduces complexity. Ensure that dependent code is updated accordingly.

Verification successful

Simplified return type is consistent with current usage.

The change to return only three Path objects in stage_init aligns with its usage across the codebase, where the function's return values are unpacked into three variables. No updates to dependent code are necessary.

src/MEDS_transforms/reshard_to_split.py

src/MEDS_transforms/mapreduce/mapper.py

src/MEDS_transforms/extract/split_and_shard_patients.py

src/MEDS_transforms/extract/finalize_MEDS_metadata.py

src/MEDS_transforms/extract/convert_to_sharded_events.py

src/MEDS_transforms/extract/extract_code_metadata.py
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify usage of `stage_init` to ensure dependent code is updated.

# Test: Search for all usages of `stage_init` to check for consistency with the new return type.
rg --type python 'stage_init' -A 5
Length of output: 65

Script:
#!/bin/bash
# Description: Verify usage of `stage_init` to ensure dependent code is updated.

# Test: Search for all usages of `stage_init` to check for consistency with the new return type.
rg --type py 'stage_init' -A 5
Length of output: 6872
src/MEDS_transforms/extract/extract_code_metadata.py (2)
352-352: Change in unpacking of stage_init return values is appropriate.

The stage_init function now returns three values, and the change in unpacking aligns with the updated return signature. Ensure that the removed value is not used elsewhere in the code.

20-20: Verify the new import path for cfg_to_expr.

The import path has changed, which may indicate a refactor. Ensure that this path is correct and consistent with the project's module organization.

Verification successful

Import Path Verified: cfg_to_expr

The import path for cfg_to_expr in src/MEDS_transforms/extract/extract_code_metadata.py is correct. The function is defined in src/MEDS_transforms/parser.py, matching the import statement. No issues found.

src/MEDS_transforms/parser.py: Function cfg_to_expr is correctly defined here.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the existence and correctness of the new import path for `cfg_to_expr`.

# Test: Check if the `cfg_to_expr` function is correctly imported from the new location.
ast-grep --lang python --pattern $'def cfg_to_expr($_) { $$$ }'
Length of output: 514

Script:
#!/bin/bash
# Description: Verify the existence and correctness of the new import path for `cfg_to_expr`.

# Test: Check if the `cfg_to_expr` function is correctly defined in the new location.
ast-grep --lang python --pattern 'def cfg_to_expr($$):'
Length of output: 57

Script:
#!/bin/bash
# Description: Search for the `cfg_to_expr` function definition across the entire codebase.

# Test: Find any definition of the `cfg_to_expr` function in the codebase.
rg 'def cfg_to_expr' -A 2
Length of output: 278
src/MEDS_transforms/mapreduce/utils.py (3)

182-182: Consistent return types in rwlock_wrap.

The return statements have been updated to consistently return tuples, enhancing code consistency.

Also applies to: 216-216

261-474: Enhanced functionality in shard_iterator.

The function now returns a tuple with input-output pairs and a boolean for training shards, improving its interface and usability.

477-560: shard_iterator_by_shard_map function is robust and well-integrated.

The function correctly iterates over shards based on a shard map file and handles configuration errors effectively.

tests/test_extract.py (1)

444-447: Update to shards_fp in test_extraction is appropriate.

The change from splits_fp to shards_fp reflects an update in data organization and is correctly integrated into the test.

src/MEDS_transforms/mapreduce/mapper.py (2)

228-240: Clarify the return type of is_match_revise.

The function is_match_revise now has a return type annotation -> bool. This is a good practice as it improves readability and helps with static analysis.

242-289: Ensure comprehensive validation in validate_match_revise.

The function validate_match_revise performs thorough validation of the stage_cfg. The error messages are clear, which aids in debugging.
README.md (1)
3-4: Verify badge links and tokens.

Ensure that the URLs for the badges are correct and that the tokens are valid. This is crucial for accurate status display.

Verification successful

Badge links and tokens are valid.

Both badge URLs returned HTTP 200 status codes, confirming their accessibility and validity. No issues were found with the badge links or tokens.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the URLs and tokens for the badges in the README.md file.

# Test: Check if the badge URLs are accessible. Expect: HTTP 200 status code.
curl -I https://codecov.io/gh/mmcdermott/MEDS_transforms/graph/badge.svg?token=5RORKQOZF9
curl -I https://readthedocs.org/projects/meds-transforms/badge/?version=latest
Length of output: 3136
src/MEDS_transforms/aggregate_code_metadata.py (3)

63-66: Document the new quantile aggregation clearly.

The addition of VALUES_QUANTILES to the METADATA_FN class is well-documented. Ensure that the configuration file is updated to include examples of how to specify quantiles.

601-618: Test the integration of quantile aggregation in reducer_fntr.

The integration of quantile aggregation in the reducer_fntr function appears correct. Ensure that tests cover scenarios where quantiles are part of the aggregation configuration.

105-150: Ensure correctness of quantile_reducer.

The quantile_reducer function is correctly implemented to calculate specified quantiles. Ensure that it handles edge cases, such as empty datasets, gracefully.

src/MEDS_transforms/extract/convert_to_sharded_events.py (2)

20-20: Simplified import statement.

The removal of parentheses around imported functions improves readability without affecting functionality.

683-685: Improved clarity by reading shards directly.

The change to read shards directly from cfg.shards_map_fp instead of using shards_map_fn from stage_init enhances clarity and reduces dependencies.

coderabbitai · 2024-08-10T22:47:27Z

src/MEDS_transforms/extract/README.md

+    `relative_table_file_stem.event_name.code` field. This field can be a string literal, a reference to an
+    input column (denoted by the `col(...)` syntax), or a list of same. If it is a list, the output code will
+    be a `"//"` separated string of each field in the list. Each field in the list (again either a string
+    literal or a input column reference) is interpreted either as the specified string literal or as the


Correct article usage.

Use "an" instead of "a" before a vowel sound.

- or a input column reference + or an input column reference

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

literal or a input column reference) is interpreted either as the specified string literal or as the

literal or an input column reference) is interpreted either as the specified string literal or as the

Tools

LanguageTool

[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...

(EN_A_VS_AN)

coderabbitai · 2024-08-10T22:47:27Z

src/MEDS_transforms/extract/split_and_shard_patients.py

@@ -19,7 +19,7 @@ def shard_patients[
    patients: np.ndarray,
    n_patients_per_shard: int = 50000,
    external_splits: dict[str, Sequence[SUBJ_ID_T]] | None = None,
-    split_fracs_dict: dict[str, float] = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+    split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},


Avoid mutable default arguments.

Using mutable data structures like dictionaries as default arguments can lead to unexpected behavior. Replace with None and initialize within the function.

- split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}, + split_fracs_dict: dict[str, float] | None = None,

Initialize within the function:

if split_fracs_dict is None: split_fracs_dict = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}

Tools

Ruff

22-22: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

coderabbitai · 2024-08-10T22:47:28Z

src/MEDS_transforms/mapreduce/utils.py

+def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:
+    """Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.
+
+    Args:
+        shards: The list of shards to shuffle.
+        cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following
+            keys (some are optional, as marked below):
+            - `worker` (optional): The worker ID for the MR worker; this is also used to seed the
+              randomization process. If not provided, the randomization process will be unseeded.
+
+    Returns:
+        The shuffled list of shards.
+
+    Examples:
+        >>> cfg = DictConfig({"worker": 1})
+        >>> shards = ["train/0", "train/1", "tuning", "held_out"]
+        >>> shuffle_shards(shards, cfg)
+        ['train/1', 'held_out', 'tuning', 'train/0']
+        >>> cfg = DictConfig({"worker": 2})
+        >>> shuffle_shards(shards, cfg)
+        ['tuning', 'held_out', 'train/1', 'train/0']
+    """
+
+    if "worker" in cfg:
+        add_str = str(cfg["worker"])
+    else:
+        add_str = str(datetime.now())
+
+    shard_keys = []
+    for shard in shards:
+        shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()
+        if shard_hash in shard_keys:
+            raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
+        shard_keys.append(int(shard_hash, 16))
+
+    return [shard for _, shard in sorted(zip(shard_keys, shards))]


shuffle_shards function is well-implemented.

The function shuffles shards deterministically based on the worker ID, and uses SHA-256 to avoid hash collisions. Consider using a ternary operator for add_str assignment for clarity.

- if "worker" in cfg: - add_str = str(cfg["worker"]) - else: - add_str = str(datetime.now()) + add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:

"""Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.

Args:

shards: The list of shards to shuffle.

cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following

keys (some are optional, as marked below):

- `worker` (optional): The worker ID for the MR worker; this is also used to seed the

randomization process. If not provided, the randomization process will be unseeded.

Returns:

The shuffled list of shards.

Examples:

>>> cfg = DictConfig({"worker": 1})

>>> shards = ["train/0", "train/1", "tuning", "held_out"]

>>> shuffle_shards(shards, cfg)

['train/1', 'held_out', 'tuning', 'train/0']

>>> cfg = DictConfig({"worker": 2})

>>> shuffle_shards(shards, cfg)

['tuning', 'held_out', 'train/1', 'train/0']

"""

if "worker" in cfg:

add_str = str(cfg["worker"])

else:

add_str = str(datetime.now())

shard_keys = []

for shard in shards:

shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()

if shard_hash in shard_keys:

raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")

shard_keys.append(int(shard_hash, 16))

return [shard for _, shard in sorted(zip(shard_keys, shards))]

def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:

"""Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.

Args:

shards: The list of shards to shuffle.

cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following

keys (some are optional, as marked below):

- `worker` (optional): The worker ID for the MR worker; this is also used to seed the

randomization process. If not provided, the randomization process will be unseeded.

Returns:

The shuffled list of shards.

Examples:

>>> cfg = DictConfig({"worker": 1})

>>> shards = ["train/0", "train/1", "tuning", "held_out"]

>>> shuffle_shards(shards, cfg)

['train/1', 'held_out', 'tuning', 'train/0']

>>> cfg = DictConfig({"worker": 2})

>>> shuffle_shards(shards, cfg)

['tuning', 'held_out', 'train/1', 'train/0']

"""

add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

shard_keys = []

for shard in shards:

shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()

if shard_hash in shard_keys:

raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")

shard_keys.append(int(shard_hash, 16))

return [shard for _, shard in sorted(zip(shard_keys, shards))]

Tools

Ruff

247-250: Use ternary operator add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now()) instead of if-else-block

Replace if-else-block with add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

(SIM108)

coderabbitai · 2024-08-10T22:47:28Z

src/MEDS_transforms/mapreduce/mapper.py

+def read_and_filter_fntr(filter_expr: pl.Expr, read_fn: Callable[[Path], DF_T]) -> Callable[[Path], DF_T]:
+    """Create a function that reads a DataFrame from a file and filters it based on a given expression.
+
+    This is specified as a functor in this way to allow it to modify arbitrary other read functions for use in
+    different mapreduce pipelines.
+
+    Args:
+        filter_expr: The filter expression to apply to the DataFrame.
+        read_fn: The read function to use to read the DataFrame.
+
+    Returns:
+        A function that reads a DataFrame from a file and filters it based on the given expression.
+
+    Examples:
+        >>> dfs = {
+        ...     "df1": pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
+        ...     "df2": pl.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]})
+        ... }
+        >>> read_fn = lambda key: dfs[key]
+        >>> fn = read_and_filter_fntr((pl.col("a") % 2) == 0, read_fn)
+        >>> fn("df1")
+        shape: (1, 2)
+        ┌─────┬─────┐
+        │ a   ┆ b   │
+        │ --- ┆ --- │
+        │ i64 ┆ i64 │
+        ╞═════╪═════╡
+        │ 2   ┆ 5   │
+        └─────┴─────┘
+        >>> fn("df2")
+        shape: (2, 2)
+        ┌─────┬─────┐
+        │ a   ┆ b   │
+        │ --- ┆ --- │
+        │ i64 ┆ i64 │
+        ╞═════╪═════╡
+        │ 4   ┆ 7   │
+        │ 6   ┆ 9   │
+        └─────┴─────┘
+        >>> fn = read_and_filter_fntr((pl.col("b") % 2) == 0, read_fn)
+        >>> fn("df1")
+        shape: (2, 2)
+        ┌─────┬─────┐
+        │ a   ┆ b   │
+        │ --- ┆ --- │
+        │ i64 ┆ i64 │
+        ╞═════╪═════╡
+        │ 1   ┆ 4   │
+        │ 3   ┆ 6   │
+        └─────┴─────┘
+        >>> fn("df2")
+        shape: (1, 2)
+        ┌─────┬─────┐
+        │ a   ┆ b   │
+        │ --- ┆ --- │
+        │ i64 ┆ i64 │
+        ╞═════╪═════╡
+        │ 5   ┆ 8   │
+        └─────┴─────┘
+    """
+
    def read_and_filter(in_fp: Path) -> DF_T:
-        df = read_fn(in_fp)
-        return df.filter(pl.col("patient_id").isin(patients))
+        return read_fn(in_fp).filter(filter_expr)


Update default argument handling for read_fn.

The read_fn argument in read_and_filter_fntr is a partial function. It's better to avoid using function calls in default arguments due to Python's handling of mutable defaults. Consider defining the default outside the function.

# Define the default read function outside default_read_fn = partial(pl.scan_parquet, glob=False) def read_and_filter_fntr(filter_expr: pl.Expr, read_fn: Callable[[Path], DF_T] = default_read_fn) -> Callable[[Path], DF_T]:

coderabbitai · 2024-08-10T22:47:28Z

src/MEDS_transforms/mapreduce/mapper.py

+def map_over(
+    cfg: DictConfig,
+    compute_fn: COMPUTE_FN_T | None = None,
+    read_fn: Callable[[Path], DF_T] = partial(pl.scan_parquet, glob=False),
+    write_fn: Callable[[DF_T, Path], None] = write_lazyframe,
+    shard_iterator_fntr: SHARD_ITR_FNTR_T = shard_iterator,
+) -> list[Path]:
+    stage_init(cfg)
+
+    start = datetime.now()
+
+    train_only = cfg.stage_cfg.get("train_only", False)
+    split_fp = Path(cfg.input_dir) / "metadata" / "patient_split.parquet"
+
+    shards, includes_only_train = shard_iterator_fntr(cfg)
+
+    if train_only:
+        if includes_only_train:
+            logger.info(
+                f"Processing train split only via shard prefix. Not filtering with {str(split_fp.resolve())}."
+            )
+        elif split_fp.exists():
+            logger.info(f"Processing train split only by filtering read dfs via {str(split_fp.resolve())}")
+            train_patients = (
+                pl.scan_parquet(split_fp)
+                .filter(pl.col("split") == "train")
+                .select(pl.col("patient_id"))
+                .collect()
+                .to_list()
+            )
+            read_fn = read_and_filter_fntr(train_patients, read_fn)
+        else:
+            raise FileNotFoundError(
+                f"Train split requested, but shard prefixes can't be used and "
+                f"patient split file not found at {str(split_fp.resolve())}."
+            )
+    elif includes_only_train:
+        raise ValueError("All splits should be used, but shard iterator is returning only train splits?!?")
+
+    if is_match_revise(cfg.stage_cfg):
+        compute_fn = match_revise_fntr(cfg, cfg.stage_cfg, compute_fn)
+    else:
+        compute_fn = bind_compute_fn(cfg, cfg.stage_cfg, compute_fn)
+
    all_out_fps = []
-    for in_fp, out_fp in shard_iterator_fntr(cfg):
+    for in_fp, out_fp in shards:


Refactor map_over to avoid function calls in default arguments.

The read_fn argument uses partial. Define this outside the function to avoid potential issues with mutable defaults.

# Define the default read function outside default_read_fn = partial(pl.scan_parquet, glob=False) def map_over( cfg: DictConfig, compute_fn: COMPUTE_FN_T | None = None, read_fn: Callable[[Path], DF_T] = default_read_fn, write_fn: Callable[[DF_T, Path], None] = write_lazyframe, shard_iterator_fntr: SHARD_ITR_FNTR_T = shard_iterator, ) -> list[Path]:

Tools

Ruff

574-574: Do not perform function call partial in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

coderabbitai · 2024-08-10T22:47:28Z

README.md

+        - The time since the last event of a certain type.
+        - The patient's age as of each unique timepoint.
+        - The time-of-day of each event.
+        - Adding a "dummy" event to the dataset for each patient that occurs at the end of the observation


Use "who" instead of "that" for referring to people.

Use "who" when referring to people for better grammatical correctness.

- Adding a "dummy" event to the dataset for each patient that occurs at the end of the observation period. + Adding a "dummy" event to the dataset for each patient who occurs at the end of the observation period.

Committable suggestion was skipped due to low confidence.

Tools

LanguageTool

[style] ~315-~315: Consider using “who” when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...

(THAT_WHO)

coderabbitai · 2024-08-10T22:47:28Z

README.md

 2. Figure out how to ensure that each pre-processing step reads from the right prior files. Likely need some
-   kind of a "prior stage name" config variable.
+    kind of a "prior stage name" config variable.


Correct usage of "kind of".

The phrase "kind of a" should be corrected to "kind of" for proper grammar.

- Likely need some kind of a "prior stage name" config variable. + Likely need some kind of "prior stage name" config variable.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

kind of a "prior stage name" config variable.

Likely need some kind of "prior stage name" config variable.

Tools

LanguageTool

[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...

(KIND_OF_A)

coderabbitai · 2024-08-10T22:47:28Z

README.md

-   stage are expressible in the `code` and `numeric_values` columns of the dataset, which helps
-   standardize further downstream use.
+    should be included during code grouping and modeling operations. The goal of this step is to ensure that
+    the only columns that need be processed going into the pre-processing, tokenization, and tensorization


Add missing preposition for clarity.

A preposition is missing, which can affect the sentence's clarity.

- ensure that the only columns that need be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset + ensure that the only columns that need to be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

the only columns that need be processed going into the pre-processing, tokenization, and tensorization

ensure that the only columns that need to be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset

Tools

LanguageTool

[uncategorized] ~328-~328: Possible missing preposition found.
Context: ...ure that the only columns that need be processed going into the pre-processing...

(AI_HYDRA_LEO_MISSING_TO)

coderabbitai · 2024-08-10T22:47:28Z

README.md

 2. This pipeline preferentially uses disk over memory and compute through aggressive caching. You should
-   have sufficient disk space to store multiple copies of your raw dataset comfortably.
+    have sufficient disk space to store multiple copies of your raw dataset comfortably.
 3. This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker


Add missing comma for clarity.

A comma is needed to separate clauses for better readability.

- This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker nodes have access to a distributed file system. + This pipeline can be run on a single machine or across many worker nodes on a cluster, provided the worker nodes have access to a distributed file system.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

3. This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker

3. This pipeline can be run on a single machine or across many worker nodes on a cluster, provided the worker

Tools

LanguageTool

[uncategorized] ~207-~207: Possible missing comma found.
Context: ...achine or across many worker nodes on a cluster provided the worker nodes have acce...

(AI_HYDRA_LEO_MISSING_COMMA)

coderabbitai · 2024-08-10T22:47:28Z

README.md

+    stage's script in sequence.
 - The user can leverage arbitrary scheduling systems (e.g., Slurm, LSF, Kubernetes, etc.) to run each stage
-  in parallel on a cluster, either by manually constructing the appropriate worker scripts to run each stage's
-  script and simply launching as many worker jobs as is desired or by using Hydra launchers such as the
-  `submitit` launcher to automate the creation of appropriate scheduler worker jobs. Note this will typically
-  required a distributed file system to work correctly, as these scripts use manually created file locks to
-  avoid duplicative work.
+    in parallel on a cluster, either by manually constructing the appropriate worker scripts to run each stage's
+    script and simply launching as many worker jobs as is desired or by using Hydra launchers such as the
+    `submitit` launcher to automate the creation of appropriate scheduler worker jobs. Note this will typically
+    required a distributed file system to work correctly, as these scripts use manually created file locks to
+    avoid duplicative work.
 - The user can run each stage in parallel on a single machine by launching multiple copies of the same
-  script in different terminal sessions or all at once via the Hydra `joblib` launcher. This can result in a
-  significant speedup depending on the machine configuration as sharding ensures that parallelism can be used
-  with minimal file read contention.
+    script in different terminal sessions or all at once via the Hydra `joblib` launcher. This can result in a
+    significant speedup depending on the machine configuration as sharding ensures that parallelism can be used
+    with minimal file read contention.


Add missing commas for clarity.

There are missing commas in the sentences, which can affect readability.

- The user can run each stage in parallel on a single machine by launching multiple copies of the same script in different terminal sessions or all at once via the Hydra `joblib` launcher. + The user can run each stage in parallel on a single machine by launching multiple copies of the same script, in different terminal sessions, or all at once via the Hydra `joblib` launcher.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

stage's script in sequence.

- The user can leverage arbitrary scheduling systems (e.g., Slurm, LSF, Kubernetes, etc.) to run each stage

in parallel on a cluster, either by manually constructing the appropriate worker scripts to run each stage's

script and simply launching as many worker jobs as is desired or by using Hydra launchers such as the

`submitit` launcher to automate the creation of appropriate scheduler worker jobs. Note this will typically

required a distributed file system to work correctly, as these scripts use manually created file locks to

avoid duplicative work.

in parallel on a cluster, either by manually constructing the appropriate worker scripts to run each stage's

script and simply launching as many worker jobs as is desired or by using Hydra launchers such as the

`submitit` launcher to automate the creation of appropriate scheduler worker jobs. Note this will typically

required a distributed file system to work correctly, as these scripts use manually created file locks to

avoid duplicative work.

- The user can run each stage in parallel on a single machine by launching multiple copies of the same

script in different terminal sessions or all at once via the Hydra `joblib` launcher. This can result in a

significant speedup depending on the machine configuration as sharding ensures that parallelism can be used

with minimal file read contention.

script in different terminal sessions or all at once via the Hydra `joblib` launcher. This can result in a

significant speedup depending on the machine configuration as sharding ensures that parallelism can be used

with minimal file read contention.

stage's script in sequence.

- The user can leverage arbitrary scheduling systems (e.g., Slurm, LSF, Kubernetes, etc.) to run each stage

in parallel on a cluster, either by manually constructing the appropriate worker scripts to run each stage's

script and simply launching as many worker jobs as is desired or by using Hydra launchers such as the

`submitit` launcher to automate the creation of appropriate scheduler worker jobs. Note this will typically

required a distributed file system to work correctly, as these scripts use manually created file locks to

avoid duplicative work.

- The user can run each stage in parallel on a single machine by launching multiple copies of the same

script, in different terminal sessions, or all at once via the Hydra `joblib` launcher. This can result in a

significant speedup depending on the machine configuration as sharding ensures that parallelism can be used

with minimal file read contention.

Tools

LanguageTool

[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...

(MD_BASEFORM)

[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydra joblib l...

(AI_HYDRA_LEO_MISSING_COMMA)

coderabbitai

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5e47f0c and e0d1ecb.

Files selected for processing (1)

src/MEDS_transforms/reshard_to_split.py (1 hunks)

Additional context used

Ruff

src/MEDS_transforms/reshard_to_split.py

118-118: Function definition does not bind loop variable in_fps

(B023)

Additional comments not posted (1)

src/MEDS_transforms/reshard_to_split.py (1)

4-18: Imports look good.

The imports are necessary for the functionality provided in the file.

coderabbitai · 2024-08-10T23:33:55Z

src/MEDS_transforms/reshard_to_split.py

+    if shards_fp.is_file():
+        if cfg.do_overwrite:
+            logger.warning(f"Overwriting {str(shards_fp.resolve())}")
+            shards_fp.unlink()
+        else:
+            raise FileExistsError(f"{str(shards_fp.resolve())} already exists.")
+
+    shards_fp.write_text(json.dumps(new_sharded_splits))
+
+    output_dir = Path(cfg.stage_cfg.output_dir)
+
+    orig_shards_iter, include_only_train = shard_iterator(cfg, out_suffix="")
+    if include_only_train:
+        raise ValueError("This stage does not support include_only_train=True")
+
+    orig_shards_iter = [(in_fp, out_fp.relative_to(output_dir)) for in_fp, out_fp in orig_shards_iter]
+
+    new_shards = shuffle_shards(list(new_sharded_splits.keys()), cfg)
+    new_shards_iter = [(shard_name, output_dir / shard_name) for shard_name in new_shards]
+
+    # Step 1: Sub-sharding stage
+    logger.info("Starting sub-sharding")
+    subshard_fps = defaultdict(list)
+
+    for in_fp, orig_shard_name in orig_shards_iter:
+        for subshard_name, out_dir in new_shards_iter:
+            out_fp = out_dir / f"{orig_shard_name}.parquet"
+            subshard_fps[subshard_name].append(out_fp)
+            patients = new_sharded_splits[subshard_name]
+
+            if not patients:
+                raise ValueError(f"No patients found for {subshard_name}!")
+
+            logger.info(
+                f"Sub-sharding {str(in_fp.resolve())} to {len(patients)} patients in {str(out_fp.resolve())}"
+            )
+
+            rwlock_wrap(
+                in_fp,
+                out_fp,
+                read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
+                write_lazyframe,
+                identity_fn,
+                do_return=False,
+                do_overwrite=cfg.do_overwrite,
+            )
+
+    logger.info("Merging sub-shards")
+    for subshard_name, subshard_dir in new_shards_iter:
+        in_dir = subshard_dir
+        in_fps = subshard_fps[subshard_name]
+        out_fp = subshard_dir.with_suffix(".parquet")
+
+        logger.info(f"Merging {subshard_dir}/**/*.parquet into {str(out_fp.resolve())}")
+
+        if not subshard_fps:
+            raise ValueError(f"No subshards found for {subshard_name}!")
+
+        if out_fp.is_file():
+            logger.info(f"Output file {str(out_fp.resolve())} already exists. Skipping.")
+            continue
+
+        while not (all(fp.is_file() for fp in in_fps) or out_fp.is_file()):
+            logger.info("Waiting to begin merging for all sub-shard files to be written...")
+            time.sleep(cfg.polling_time)
+
+        def read_fn(fp: Path) -> pl.LazyFrame:
+            return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
+                by=["patient_id", "time"], maintain_order=True, multithreaded=False
+            )
+
+        logger.info(f"Merging files to {str(out_fp.resolve())}")
+        result_computed, _ = rwlock_wrap(
+            in_dir,
+            out_fp,
+            read_fn,
+            write_lazyframe,
+            identity_fn,
+            do_return=False,
+            do_overwrite=cfg.do_overwrite,
+        )
+
+        if result_computed:
+            logger.info(f"Cleaning up subsharded files in {str(subshard_dir.resolve())}/*.")
+            for fp in in_fps:
+                if fp.exists():
+                    fp.unlink()
+            try:
+                for root, dirs, files in subshard_dir.walk(top_down=False):
+                    walked_dir = root.relative_to(subshard_dir)
+                    if files:
+                        raise FileExistsError(f"Files found in {walked_dir} after cleanup!: {files}")
+                    for d in dirs:
+                        (root / d).rmdir()
+                subshard_dir.rmdir()
+            except OSError as e:
+                contents_str = "\n".join([str(f) for f in subshard_dir.iterdir()])
+                raise ValueError(f"Could not remove {str(subshard_dir)}. Contents:\n{contents_str}") from e
+
+    logger.info(f"Done with {cfg.stage}")


Consider modularizing the main function.

The main function is lengthy and could benefit from being broken down into smaller, more focused functions. This would improve readability and maintainability.

Tools

Ruff

118-118: Function definition does not bind loop variable in_fps

(B023)

coderabbitai · 2024-08-10T23:33:55Z

src/MEDS_transforms/reshard_to_split.py

+            rwlock_wrap(
+                in_fp,
+                out_fp,
+                read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
+                write_lazyframe,
+                identity_fn,
+                do_return=False,
+                do_overwrite=cfg.do_overwrite,
+            )


Add error handling to rwlock_wrap usage.

Consider adding error handling for potential issues during file operations to ensure robustness.

Also applies to: 123-131

coderabbitai · 2024-08-10T23:33:55Z

src/MEDS_transforms/reshard_to_split.py

+        def read_fn(fp: Path) -> pl.LazyFrame:
+            return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
+                by=["patient_id", "time"], maintain_order=True, multithreaded=False
+            )


Resolve the loop variable binding issue.

The static analysis tool indicates a potential issue with the function definition not binding the loop variable in_fps. Ensure that fp is correctly used within read_fn.

- return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort( + return pl.concat([pl.scan_parquet(fp, glob=False) for fp in [fp]], how="diagonal_relaxed").sort(

Committable suggestion was skipped due to low confidence.

Tools

Ruff

118-118: Function definition does not bind loop variable in_fps

(B023)

…s on disk.

coderabbitai

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e0d1ecb and 2d1c4cf.

Files selected for processing (1)

src/MEDS_transforms/reshard_to_split.py (1 hunks)

Additional context used

Ruff

src/MEDS_transforms/reshard_to_split.py

120-120: Function definition does not bind loop variable in_fps

(B023)

Additional comments not posted (4)

src/MEDS_transforms/reshard_to_split.py (4)
1-18: Header and imports look good.

The shebang, docstring, and imports are clear and appropriate for the file's functionality.

90-98: Add error handling to rwlock_wrap usage.

Consider adding error handling for potential issues during file operations to ensure robustness.

119-122: Resolve the loop variable binding issue.

The static analysis tool indicates a potential issue with the function definition not binding the loop variable in_fps. Ensure that fp is correctly used within read_fn.

Tools

Ruff

120-120: Function definition does not bind loop variable in_fps

(B023)

135-150: Ensure robust cleanup logic.

The cleanup logic attempts to remove files and directories. Ensure that appropriate error handling is in place for potential issues like permission errors.

Verification successful

Ensure comprehensive error handling in cleanup logic.

The code includes a try block for handling potential issues during directory cleanup. Ensure that the corresponding except block is robust enough to handle exceptions such as permission errors or other filesystem-related issues.

Review the except block following the try statement to ensure comprehensive error handling.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify cleanup logic robustness by checking for error handling in file system operations.

# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --type python -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'
Length of output: 286

Script:
#!/bin/bash
# Description: Re-verify cleanup logic robustness by checking for error handling in file system operations with multiline support.

# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --type python --multiline -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'
Length of output: 137

Script:
#!/bin/bash
# Description: Re-verify cleanup logic robustness by checking for error handling in file system operations.

# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --multiline -t py -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'
Length of output: 98

Script:
#!/bin/bash
# Description: Broaden the search to identify any try-except blocks related to file system operations.

# Test: Search for try-except blocks in the specified file to ensure error handling is present.
rg --multiline -t py -A 5 'try:' src/MEDS_transforms/reshard_to_split.py
Length of output: 396

coderabbitai · 2024-08-11T02:42:42Z

src/MEDS_transforms/reshard_to_split.py

+def main(cfg: DictConfig):
+    """Re-shard a MEDS cohort to in a manner that subdivides patient splits."""
+
+    stage_init(cfg)
+
+    splits_file = Path(cfg.input_dir) / "metadata" / "patient_splits.parquet"
+    splits_df = pl.read_parquet(splits_file, use_pyarrow=True)
+    splits_map = defaultdict(list)
+    for pt_id, sp in splits_df.iter_rows():
+        splits_map[sp].append(pt_id)
+
+    new_sharded_splits = shard_patients(
+        patients=splits_df["patient_id"].to_numpy(),
+        n_patients_per_shard=cfg.stage_cfg.n_patients_per_shard,
+        external_splits=splits_map,
+        split_fracs_dict=None,
+        seed=cfg.get("seed", 1),
+    )
+
+    output_dir = Path(cfg.stage_cfg.output_dir)
+
+    # Write shards to file
+    if "shards_map_fp" in cfg:
+        shards_fp = Path(cfg.shards_map_fp)
+    else:
+        shards_fp = output_dir / ".shards.json"
+
+    if shards_fp.is_file():
+        if cfg.do_overwrite:
+            logger.warning(f"Overwriting {str(shards_fp.resolve())}")
+            shards_fp.unlink()
+        else:
+            old_shards_map = json.loads(shards_fp.read_text())
+            if old_shards_map != new_sharded_splits:
+                raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")
+
+    shards_fp.write_text(json.dumps(new_sharded_splits))
+
+    output_dir = Path(cfg.stage_cfg.output_dir)
+
+    orig_shards_iter, include_only_train = shard_iterator(cfg, out_suffix="")
+    if include_only_train:
+        raise ValueError("This stage does not support include_only_train=True")
+
+    orig_shards_iter = [(in_fp, out_fp.relative_to(output_dir)) for in_fp, out_fp in orig_shards_iter]
+
+    new_shards = shuffle_shards(list(new_sharded_splits.keys()), cfg)
+    new_shards_iter = [(shard_name, output_dir / shard_name) for shard_name in new_shards]
+
+    # Step 1: Sub-sharding stage
+    logger.info("Starting sub-sharding")
+    subshard_fps = defaultdict(list)
+
+    for in_fp, orig_shard_name in orig_shards_iter:
+        for subshard_name, out_dir in new_shards_iter:
+            out_fp = out_dir / f"{orig_shard_name}.parquet"
+            subshard_fps[subshard_name].append(out_fp)
+            patients = new_sharded_splits[subshard_name]
+
+            if not patients:
+                raise ValueError(f"No patients found for {subshard_name}!")
+
+            logger.info(
+                f"Sub-sharding {str(in_fp.resolve())} to {len(patients)} patients in {str(out_fp.resolve())}"
+            )
+
+            rwlock_wrap(
+                in_fp,
+                out_fp,
+                read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
+                write_lazyframe,
+                identity_fn,
+                do_return=False,
+                do_overwrite=cfg.do_overwrite,
+            )


Consider modularizing the main function.

The main function is lengthy and could benefit from being broken down into smaller, more focused functions. This would improve readability and maintainability.

coderabbitai · 2024-08-11T02:42:42Z

src/MEDS_transforms/reshard_to_split.py

+    if shards_fp.is_file():
+        if cfg.do_overwrite:
+            logger.warning(f"Overwriting {str(shards_fp.resolve())}")
+            shards_fp.unlink()
+        else:
+            old_shards_map = json.loads(shards_fp.read_text())
+            if old_shards_map != new_sharded_splits:
+                raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")


Add error handling for file existence and differences.

The current logic raises a FileExistsError if the shard map differs. Consider logging the differences for better debugging.

- raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.") + logger.error(f"Shard map differs: {old_shards_map} vs {new_sharded_splits}") + raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if shards_fp.is_file():

if cfg.do_overwrite:

logger.warning(f"Overwriting {str(shards_fp.resolve())}")

shards_fp.unlink()

else:

old_shards_map = json.loads(shards_fp.read_text())

if old_shards_map != new_sharded_splits:

raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")

if shards_fp.is_file():

if cfg.do_overwrite:

logger.warning(f"Overwriting {str(shards_fp.resolve())}")

shards_fp.unlink()

else:

old_shards_map = json.loads(shards_fp.read_text())

if old_shards_map != new_sharded_splits:

logger.error(f"Shard map differs: {old_shards_map} vs {new_sharded_splits}")

raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")

mmcdermott · 2024-08-11T04:11:56Z

Not yet ready -- see #139

…arallel_mode Fixes brittle reduce stage checking in Reshard stage.

mmcdermott and others added 30 commits July 29, 2024 17:33

Added a start at a quantile implementation in the existing MR framework.

f9d7a0e

Merge branch 'dev' into 51_add_quantile_aggregation

7685436

Got the reducer to return reasonable numbers at least.

9c3b497

Merged.

2d87cae

Added doctests and parametrization.

4b3cfec

VERY preliminary code. Just committing so as to not lose changes. Not…

e721506

…hing works yet.

Merge branch 'dev' into 101_match_revise

10b741a

Merge branch 'dev' into alt_51_raw_values

76ce942

Merged.

209abc1

Merge branch 'dev' into alt_51_raw_values

d898cb4

Renamed to 'code_modifiers'

bb88c1c

Merge branch 'dev' into 101_match_revise

5a39b7c

Merge branch 'dev' into alt_51_raw_values

949867a

Very preliminary implementation of match and revise

8e1a78d

Started adding documentation and tests; not yet complete

e260088

Tests added for everything except bind compute function.

8750b93

Added all doctests.

8e418ab

Added an integration test via the filter_measurement transform; also …

f19ecc7

…an example of loading MEDS_transforms configs as defaults dynamically.

Attempted to update test workflow, add codecov badge, and add a workf…

39aebbc

…low for pushing to pypi. The pypi push workflow requires local tagging via 'git tag 0.0.3' then 'git push origin 0.0.3'

Merge pull request #123 from mmcdermott/update_CI_and_README

33ebdd9

Updated workflows and README.

Merge branch 'dev' into alt_51_raw_values

cff193b

Merge branch 'dev' into 101_match_revise

3b7aae0

Fix split_and_shard_patients when the full split definition is provided

c9c581f

External splits go after internal splits

0b924b3

Merge branch 'refs/heads/dev' into fix-external-split

f352654

Added mkdocs documentation starter code.

3c69775

Added readthedocs file.

ebabad7

Added docs badge.

3840385

Updated tests workflow to ignore docs.

8075963

Updated docs to move basic docs into readthedocs.

1dedfad

mmcdermott added 7 commits August 10, 2024 17:14

Updated test.

6ec7a8f

started file; very much not ready yet.

dd9ba08

Improved some documentation and error handling for splitting patients.

4cc65b4

Merge branch 'dev' into 134_reshard_by_split

ba9c3b4

Implemented preliminary version of reshard to split.

d70128f

fixed some more minor errors; test fails for content reasons now.

0db3a23

code was actually fine; it was a test error.

829cede

Merge pull request #135 from mmcdermott/134_reshard_by_split

5e47f0c

Adds a reshard-by-split stage as a transform.

mmcdermott marked this pull request as ready for review August 10, 2024 22:41

coderabbitai bot reviewed Aug 10, 2024

View reviewed changes

coderabbitai bot had a problem deploying to pypi August 10, 2024 22:47 Failure

Corrected a small parallelism issue in reshard.

e0d1ecb

coderabbitai bot reviewed Aug 10, 2024

View reviewed changes

mmcdermott added 2 commits August 10, 2024 22:39

Made reshard not error out if the new split is identical to the shard…

ac1587e

…s on disk.

Fixed formatting.

2d1c4cf

coderabbitai bot reviewed Aug 11, 2024

View reviewed changes

mmcdermott closed this Aug 11, 2024

mmcdermott added 8 commits August 11, 2024 07:58

Move more to locked computation.

7b02df8

changes in progress to improve robustness

b1378bc

changes in progress to improve robustness

3a95794

Made file checker wait for valid parquet files

d9ed79a

Removed unneeded line.

d1f5953

Made rwlock more robust and eliminated unused return mode.

e0c0f46

Made resharding use rwlock wrap

3fc909e

Merge pull request #143 from mmcdermott/139_fix_reshard_to_split_in_p…

cda4f06

…arallel_mode Fixes brittle reduce stage checking in Reshard stage.

mmcdermott reopened this Aug 11, 2024

mmcdermott closed this Aug 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release candidate for 0.0.4 #138

Release candidate for 0.0.4 #138

mmcdermott commented Aug 10, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 10, 2024 •

edited

Loading

Review failed

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

codecov-commenter commented Aug 10, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot left a comment

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot Aug 10, 2024

coderabbitai bot left a comment

coderabbitai bot Aug 11, 2024

coderabbitai bot Aug 11, 2024

mmcdermott commented Aug 11, 2024

	literal or a input column reference) is interpreted either as the specified string literal or as the
	literal or an input column reference) is interpreted either as the specified string literal or as the

	kind of a "prior stage name" config variable.
	Likely need some kind of "prior stage name" config variable.

	the only columns that need be processed going into the pre-processing, tokenization, and tensorization
	ensure that the only columns that need to be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset

	3. This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker
	3. This pipeline can be run on a single machine or across many worker nodes on a cluster, provided the worker

Release candidate for 0.0.4 #138

Release candidate for 0.0.4 #138

Conversation

mmcdermott commented Aug 10, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Aug 10, 2024 • edited Loading

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

codecov-commenter commented Aug 10, 2024 • edited Loading

Codecov Report

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 10, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Aug 11, 2024

Choose a reason for hiding this comment

coderabbitai bot Aug 11, 2024

Choose a reason for hiding this comment

mmcdermott commented Aug 11, 2024

mmcdermott commented Aug 10, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 10, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

codecov-commenter commented Aug 10, 2024 •

edited

Loading