Release candidate for 0.0.4 #138

Closed
wants to merge 83 commits into from

Conversation

@mmcdermott (Owner) commented Aug 10, 2024

Summary by CodeRabbit

  • New Features

    • Introduced .editorconfig for consistent coding style across the project.
    • Added automated workflows for building, publishing, and testing Python packages.
    • Implemented a script to generate code reference pages for documentation.
    • Added a utility for re-sharding a MEDS cohort into subsharded splits.
  • Bug Fixes

    • Enhanced test functions to validate new data handling logic and configurations.
  • Documentation

    • Improved README files and added new documentation for data processing pipeline and configuration.
    • Introduced new files for better documentation structure, including mkdocs.yml.
    • Launched a new pipeline_configuration.md for detailing the data processing framework.
  • Refactor

    • Streamlined the handling of mandatory data types and columns in the codebase.
    • Improved modularity and clarity in various processing functions and configurations.
    • Enhanced filtering and processing logic in data transformation workflows.
  • Style

    • Updated formatting and consistency across multiple documentation and code files.
  • Tests

    • Added new test cases for transforming and reshaping patient data, enhancing coverage and validation.
    • Updated testing framework to accommodate new data structures and configurations.

mmcdermott and others added 30 commits July 29, 2024 17:33
…an example of loading MEDS_transforms configs as defaults dynamically.
…low for pushing to pypi. The pypi push workflow requires local tagging via 'git tag 0.0.3' then 'git push origin 0.0.3'
coderabbitai bot (Contributor) commented Aug 10, 2024

Caution

Review failed

The pull request is closed.

Walkthrough

The recent changes significantly enhance coding standards, automation, and data processing capabilities within the project. Key updates include the introduction of an .editorconfig for consistent formatting, improved CI/CD workflows for building and testing Python packages, and new scripts for efficient data sharding. These modifications streamline operations, bolster documentation, and enhance testing frameworks, leading to improved maintainability and collaboration.

Changes

Files Change Summary
.editorconfig Established coding style guidelines for consistency, including character encoding, line endings, and indentation rules.
.github/workflows/ Added automated workflows for publishing and testing Python packages; improved test result reporting in the testing workflow.
.pre-commit-config.yaml Updated exclusion patterns and added new markdown formatting hooks to enhance pre-commit checks.
.readthedocs.yaml Configured automated documentation builds on Read the Docs with specified tools and requirements for MkDocs.
mkdocs.yml Created a configuration for the project documentation site, defining navigation and metadata with enhancements for markdown.
src/MEDS_transforms/__init__.py Introduced mandatory data structures for improved data integrity and handling.
src/MEDS_transforms/aggregate_code_metadata.py Added quantile calculation functionality to enhance data aggregation processes.
src/MEDS_transforms/configs/ Updated several YAML configuration files for data processing and sharding, including path adjustments and new processing stages.
src/MEDS_transforms/mapreduce/ Enhanced data filtering and processing logic with new utility functions and improved error handling.
src/MEDS_transforms/reshard_to_split.py Added a utility for re-sharding patient data into subsharded splits with configuration management and data integrity checks.
tests/ Introduced new tests for reshaping data shards and enhanced existing tests for clarity and structure.
docs/ Updated documentation to reflect new features and workflows, along with the addition of new reference files.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CI_CD as "CI/CD System"
    participant PackageRepo as "Package Repository"
    participant TestResults as "Testing Framework"
    
    User->>+CI_CD: Push code changes
    CI_CD->>+PackageRepo: Build and publish package
    CI_CD->>+TestResults: Run automated tests
    TestResults-->>-CI_CD: Return test results
    CI_CD-->>-User: Notify completion of workflow

🐰 In the garden where code does bloom,
A rabbit hops with joy, dispelling gloom.
New scripts and flows, oh what a delight,
With colors of success shining so bright!
For every change, a hop and a cheer,
In the world of coding, we have nothing to fear! 🌼



@codecov-commenter commented Aug 10, 2024

⚠️ Please install the Codecov GitHub App to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 90.60403% with 28 lines in your changes missing coverage. Please review.

Project coverage is 92.91%. Comparing base (d0e7930) to head (e0d1ecb).

✅ All tests successful. No failed tests found.

Files Patch % Lines
src/MEDS_transforms/reshard_to_split.py 80.45% 17 Missing ⚠️
src/MEDS_transforms/mapreduce/mapper.py 91.89% 6 Missing ⚠️
src/MEDS_transforms/mapreduce/utils.py 94.64% 3 Missing ⚠️
...EDS_transforms/extract/split_and_shard_patients.py 96.55% 1 Missing ⚠️
src/MEDS_transforms/transforms/tokenization.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
+ Coverage   92.48%   92.91%   +0.42%     
==========================================
  Files          24       25       +1     
  Lines        1664     1863     +199     
==========================================
+ Hits         1539     1731     +192     
- Misses        125      132       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Adds a reshard-by-split stage as a transform.
@mmcdermott mmcdermott marked this pull request as ready for review August 10, 2024 22:41
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 12

Outside diff range, codebase verification and nitpick comments (19)
docs/javascripts/mathjax.js (1)

14-19: Ensure MathJax is initialized correctly on document events.

The subscription to the document event to clear and reset MathJax is a good practice to ensure that mathematical expressions are rendered correctly after dynamic content changes.

Consider adding comments to explain why these specific MathJax methods are called during the event subscription for better maintainability.

+// Clear and reset MathJax to ensure proper rendering of dynamic content
docs/gen_ref_pages.py (2)

1-2: Add module-level docstring.

The file starts with a docstring, but it could be expanded to explain the purpose and usage of the script in more detail.

- """Generate the code reference pages."""
+ """Generate the code reference pages for the project.

This script uses mkdocs_gen_files to create Markdown documentation files
for each Python module in the src directory. It handles special cases for
__init__.py and __main__.py files.
"""

34-35: Consider adding more context to identifiers.

The identifier ident is constructed from module path parts. Consider adding more context or comments to explain its purpose.

- ident = ".".join(parts)
+ # Construct the module identifier for mkdocs.
+ ident = ".".join(parts)
.github/workflows/python-build.yaml (2)

1-4: Clarify event trigger.

The workflow is triggered on every push. Consider specifying branches or tags to avoid unnecessary runs.

- on: push
+ on:
+   push:
+     branches:
+       - main
+     tags:
+       - 'v*.*.*'

16-21: Recommendation: Use a virtual environment for secure installation of build tools.

The current workflow does not utilize a virtual environment for installing the build tool, which could lead to conflicts with other packages. It is recommended to modify the workflow to create and activate a virtual environment before running the pip install command. This ensures the installation is isolated and secure.

  • Modify the workflow to include python3 -m venv venv && source venv/bin/activate before installing packages.
Analysis chain

Ensure secure installation of build tools.

The installation of build tools should be secure. Consider using a virtual environment or ensuring the installation command is safe.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the security of build tool installation.

# Test: Check if the build tool installation is performed in a secure manner.
# Expect: Installation should be isolated or verified to be secure.
python3 -m venv venv && source venv/bin/activate && pip install build --user

Length of output: 97


Script:

#!/bin/bash
# Search for virtual environment setup or secure installation practices in the workflow file.
rg 'python3 -m venv|pip install' .github/workflows/python-build.yaml -A 3

Length of output: 186

docs/pipeline_configuration.md (3)

3-7: Clarify the design philosophy.

The introduction is clear but could benefit from a more concise explanation of the design philosophy to enhance readability.

- MEDS Transform's key design philosophy is that you should be able to design the necessary data pre-processing
- and transformation pipeline you need simply by composing a series of simple, reusable, efficient, and
- configurable local transformations into a larger pipeline. This can allow researchers to balance having highly
- flexible, customizable pipelines with ensuring that all operations in the pipeline are simple, shareable,
- efficient, and easy to validate.
+ MEDS Transform's design philosophy emphasizes creating data pre-processing and transformation pipelines by composing simple, reusable, and efficient local transformations. This approach balances flexibility and customization with simplicity, shareability, efficiency, and ease of validation.

16-29: Address loose punctuation in stage descriptions.

Some stage descriptions have loose punctuation marks that can be tightened for better readability.

- 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the
-     data).
+ 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the data).
- 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes
-     that occur more than 10 times).
+ 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes that occur more than 10 times).
- 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code that is in the
-     current running metadata file, which, due to `stage_2`, are those codes that occur more than 10 times).
+ 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code in the current running metadata file, which, due to `stage_2`, are those codes that occur more than 10 times).


33-81: Improve clarity and grammar in file management descriptions.

Some descriptions can be made clearer, and a missing comma in stage 4 needs to be addressed.

- 1. `stage_1`:
-     - As there is no preceding data stage, this stage will read the data in from `$INPUT_DIR/data`
-         (the `data` suffix is the default data directory for MEDS datasets).
+ 1. `stage_1`:
+     - This stage reads data from `$INPUT_DIR/data` (the `data` suffix is the default data directory for MEDS datasets) as there is no preceding data stage.
- 4. `stage_4`:
-     - This stage will read in the data from the `$COHORT_DIR/stage_3` directory as that is the prior data
-         processing stage.
+ 4. `stage_4`:
+     - This stage reads data from the `$COHORT_DIR/stage_3` directory, as that is the prior data processing stage.

tests/utils.py (1)

107-107: Inconsistent usage of do_use_config_yaml in test cases

The do_use_config_yaml parameter is used in tests/transform_tester_base.py but is missing in tests/test_extract.py, where run_command is invoked multiple times. Please review the test cases to ensure consistent usage of this parameter where applicable.

  • tests/transform_tester_base.py: do_use_config_yaml is explicitly set.
  • tests/test_extract.py: do_use_config_yaml is not used.
Analysis chain

Ensure correct usage of do_use_config_yaml.

The new do_use_config_yaml parameter adds flexibility to run_command; verify that it is used consistently across all test cases that rely on it.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of `do_use_config_yaml` in test cases.

# Test: Search for test cases using `run_command`. Expect: Proper usage of `do_use_config_yaml`.
rg --type python 'run_command' -A 3

Length of output: 66


Script:

#!/bin/bash
# Description: Verify consistent usage of `do_use_config_yaml` in test cases.

# Test: Search for test cases using `run_command`. Expect: Proper usage of `do_use_config_yaml`.
rg --type py 'run_command' -A 3

Length of output: 3436

src/MEDS_transforms/reshard_to_split.py (2)

45-55: Handle file existence more gracefully.

The code raises a FileExistsError if a file already exists and do_overwrite is not set. Consider logging this event before raising the error for better traceability.

+ logger.error(f"{str(shards_fp.resolve())} already exists and overwrite is not allowed.")
  raise FileExistsError(f"{str(shards_fp.resolve())} already exists.")

137-144: Clarify exception handling and cleanup logic.

The cleanup logic in the try block could benefit from more detailed logging. Consider adding logs for each step in the cleanup process to help diagnose potential issues.

+ logger.debug(f"Attempting to remove directory {str(subshard_dir.resolve())}.")
  subshard_dir.rmdir()
tests/test_reshard_to_split.py (1)

1-5: Clarify environment variable usage in comments.

The comment mentions setting an environment variable to use local scripts. Ensure this is clear and accurate for users running the tests.

- Set the bash env variable `DO_USE_LOCAL_SCRIPTS=1` to use the local py files, rather than the installed
- scripts.
+ Set the environment variable `DO_USE_LOCAL_SCRIPTS=1` to use local Python files instead of installed scripts.
eICU_Example/README.md (1)

13-13: Fix unordered list indentation.

Adjust the indentation of the unordered list items to meet markdownlint standards and improve readability.

-    - [ ] Identifying the pre-MEDS steps for eICU
+  - [ ] Identifying the pre-MEDS steps for eICU
-    - [ ] Test that it runs at all.
-    - [ ] Test that the output is as expected.
+  - [ ] Test that it runs at all.
+  - [ ] Test that the output is as expected.
-    - [ ] Sub-sharding
-    - [ ] Patient split gathering
-    - [ ] Event extraction
-    - [ ] Merging
+  - [ ] Sub-sharding
+  - [ ] Patient split gathering
+  - [ ] Event extraction
+  - [ ] Merging
-    - [ ] Basic validation
-    - [ ] Detailed validation
+  - [ ] Basic validation
+  - [ ] Detailed validation

Also applies to: 15-16, 21-24, 26-27


src/MEDS_transforms/transforms/tokenization.py (1)

237-241: Ensure necessary directories are created before writing files.

The directories "schemas" and "event_seqs" do not exist, which could lead to runtime errors when writing files. Ensure that the code creates these directories if they do not already exist before writing to them.

  • Consider adding logic to create the directories before file operations.
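
A minimal sketch of such a guard, assuming pathlib-style output paths (the function and variable names are illustrative, not the actual tokenization code):

from pathlib import Path

import polars as pl

def write_df(df: pl.DataFrame, out_fp: Path) -> None:
    # Create the "schemas"/"event_seqs"-style parent directory before writing,
    # so write_parquet does not fail on a missing directory.
    out_fp.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(out_fp)
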
Analysis chain

Simplify output file path management.

The direct handling of shards_single_output simplifies file path management, reducing complexity. Ensure that all necessary directories exist before writing to prevent runtime errors.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the necessary directories for output files exist.

# Test: Check for the existence of required directories. Expect: Directories should exist or be created.
for dir in "schemas" "event_seqs"; do
  if [ ! -d "$dir" ]; then
    echo "Directory $dir does not exist."
  fi
done

Length of output: 284

docs/preprocessing_operation_prototypes.md (1)

80-80: Consider simplifying phrasing.

The phrase "prior to" could be simplified to "before" for conciseness.

- patient data filters should be applied prior to aggregation.
+ patient data filters should be applied before aggregation.

src/MEDS_transforms/extract/README.md (1)

255-255: Consider revising for professionalism.

The phrase "feel free to" is common and can be replaced with a more formal alternative.

- Feel free to reach out if you have any questions or concerns about this.
+ Please reach out if you have any questions or concerns about this.

src/MEDS_transforms/mapreduce/mapper.py (1)

Line range hint 447-569:
Avoid using partial in default arguments.

The compute_fn argument in bind_compute_fn uses partial. It's recommended to avoid function calls in default arguments. Define the partial outside the function.

# Define the default compute function outside
default_compute_fn = partial(pl.scan_parquet, glob=False)

def bind_compute_fn(cfg: DictConfig, stage_cfg: DictConfig, compute_fn: ANY_COMPUTE_FN_T = default_compute_fn) -> COMPUTE_FN_T:
README.md (2)

25-33: Improve consistency in installation instructions.

The installation instructions are clear, but ensure consistency in formatting and phrasing across bullet points for better readability.

- For a pypi installation, install with `pip install MEDS-transforms`.
+ For a PyPI installation, use `pip install MEDS-transforms`.

40-78: Enhance clarity in the design philosophy section.

The design philosophy section is well-structured. Consider adding examples or links to relevant sections for each philosophy point to enhance understanding.

- This representation is the Medical Event Data Standard (MEDS) format, and in this repository we use it in the "flat" format, where data is organized as rows of `patient_id`, `time`, `code`, `numeric_value` columns.
+ This representation is the Medical Event Data Standard (MEDS) format. In this repository, we use the "flat" format, where data is organized as rows of `patient_id`, `time`, `code`, and `numeric_value` columns.
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d0e7930 and 5e47f0c.

Files selected for processing (42)
  • .editorconfig (1 hunks)
  • .github/workflows/python-build.yaml (1 hunks)
  • .github/workflows/tests.yaml (1 hunks)
  • .pre-commit-config.yaml (3 hunks)
  • .readthedocs.yaml (1 hunks)
  • MIMIC-IV_Example/README.md (2 hunks)
  • README.md (8 hunks)
  • docs/gen_ref_pages.py (1 hunks)
  • docs/index.md (1 hunks)
  • docs/javascripts/mathjax.js (1 hunks)
  • docs/pipeline_configuration.md (1 hunks)
  • docs/preprocessing_operation_prototypes.md (7 hunks)
  • eICU_Example/README.md (10 hunks)
  • eICU_Example/configs/table_preprocessors.yaml (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyproject.toml (2 hunks)
  • src/MEDS_transforms/init.py (2 hunks)
  • src/MEDS_transforms/aggregate_code_metadata.py (10 hunks)
  • src/MEDS_transforms/configs/extract.yaml (1 hunks)
  • src/MEDS_transforms/configs/preprocess.yaml (2 hunks)
  • src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml (1 hunks)
  • src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml (1 hunks)
  • src/MEDS_transforms/extract/README.md (6 hunks)
  • src/MEDS_transforms/extract/init.py (2 hunks)
  • src/MEDS_transforms/extract/convert_to_sharded_events.py (2 hunks)
  • src/MEDS_transforms/extract/extract_code_metadata.py (2 hunks)
  • src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2 hunks)
  • src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2 hunks)
  • src/MEDS_transforms/extract/split_and_shard_patients.py (7 hunks)
  • src/MEDS_transforms/mapreduce/mapper.py (4 hunks)
  • src/MEDS_transforms/mapreduce/utils.py (3 hunks)
  • src/MEDS_transforms/reshard_to_split.py (1 hunks)
  • src/MEDS_transforms/transforms/tokenization.py (3 hunks)
  • src/MEDS_transforms/utils.py (8 hunks)
  • tests/test_add_time_derived_measurements.py (1 hunks)
  • tests/test_extract.py (1 hunks)
  • tests/test_filter_measurements.py (2 hunks)
  • tests/test_occlude_outliers.py (1 hunks)
  • tests/test_reorder_measurements.py (1 hunks)
  • tests/test_reshard_to_split.py (1 hunks)
  • tests/transform_tester_base.py (7 hunks)
  • tests/utils.py (2 hunks)
Files skipped from review due to trivial changes (11)
  • .editorconfig
  • .readthedocs.yaml
  • MIMIC-IV_Example/README.md
  • docs/index.md
  • eICU_Example/configs/table_preprocessors.yaml
  • src/MEDS_transforms/configs/extract.yaml
  • src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml
  • src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml
  • tests/test_add_time_derived_measurements.py
  • tests/test_occlude_outliers.py
  • tests/test_reorder_measurements.py
Additional context used
LanguageTool
docs/pipeline_configuration.md

[uncategorized] ~16-~16: Loose punctuation mark.
Context: ...s of the following stages: 1. stage_1: A metadata map-reduce stage (e.g., coun...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~18-~18: Loose punctuation mark.
Context: ...the codes in the data). 2. stage_2: A metadata-only processing stage (e.g.,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~20-~20: Loose punctuation mark.
Context: ... occur more than 10 times). 3. stage_3: A data processing stage (e.g., filterin...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... occur more than 10 times). 4. stage_4: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~24-~24: Loose punctuation mark.
Context: ...al values in the data). 5. stage_5: A data processing stage (e.g., occludin...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~26-~26: Loose punctuation mark.
Context: ... deviations from the mean). 6. stage_6: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~28-~28: Loose punctuation mark.
Context: ...al values in the data). 7. stage_7: A data processing stage (e.g., normaliz...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~33-~33: Loose punctuation mark.
Context: ...s in the following manner. 1. stage_1: - As there is no preceding data sta...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...1/codes.parquetdirectory. 2.stage_2`: - This stage will read in the metad...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~45-~45: Loose punctuation mark.
Context: ...2/codes.parquetdirectory. 3.stage_3`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~51-~51: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 4.stage_4`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~52-~52: Possible missing comma found.
Context: ...the data from the $COHORT_DIR/stage_3 directory as that is the prior data proce...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~60-~60: Loose punctuation mark.
Context: ...tage_4/codes.parquetfile. 5.stage_5`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...$SHARD_NAME.parquetfiles. 6.stage_6`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...n to the stage directories. 7. stage_7: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[grammar] ~82-~82: The verb form ‘reduce’ does not seem to be suitable in this context.
Context: ...ssing stages (which do not have reduce stages).

(HAVE_VB)

docs/preprocessing_operation_prototypes.md

[style] ~80-~80: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ... patient data filters should be applied prior to aggregation. 2. What aggregation functi...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

src/MEDS_transforms/extract/README.md

[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...

(EN_A_VS_AN)


[style] ~255-~255: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... extraction tool for your raw data. Feel free to reach out if you have any questions or ...

(FEEL_FREE_TO_STYLE_ME)

README.md

[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...

(MD_BASEFORM)


[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydra joblib l...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~181-~181: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ... and external ontologies? Will they need to specify select key-event concepts to id...

(REP_NEED_TO_VB)


[uncategorized] ~207-~207: Possible missing comma found.
Context: ...achine or across many worker nodes on a cluster provided the worker nodes have acce...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~306-~306: This phrase is redundant. Consider using “outside”.
Context: ... - Removing numeric values that fall outside of pre-specified, per-code ranges (e.g., f...

(OUTSIDE_OF)


[style] ~315-~315: Consider using “who” when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...

(THAT_WHO)


[style] ~324-~324: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... each code occurs in the dataset. - Computing appropriate numeric bins for each code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[uncategorized] ~328-~328: Possible missing preposition found.
Context: ...ure that the only columns that need be processed going into the pre-processing...

(AI_HYDRA_LEO_MISSING_TO)


[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...

(KIND_OF_A)

Ruff
src/MEDS_transforms/reshard_to_split.py

116-116: Function definition does not bind loop variable in_fps

(B023)
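
For context, the usual B023 fix is to bind the loop variable at function-definition time, e.g. via a default argument; a small illustrative sketch (not the actual reshard_to_split.py code):

read_fns = []
for in_fps in (["a.parquet"], ["b.parquet"]):
    # Without the default argument, every closure would see the *last* value of in_fps.
    read_fns.append(lambda in_fps=in_fps: [fp.upper() for fp in in_fps])

print([fn() for fn in read_fns])  # [['A.PARQUET'], ['B.PARQUET']]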

src/MEDS_transforms/extract/split_and_shard_patients.py

22-22: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
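
The standard B006 fix defaults to None and initializes inside the function; a sketch with an illustrative parameter name and types:

def assign_splits(external_splits: dict[str, list[int]] | None = None) -> dict[str, list[int]]:
    # A mutable default ({} or []) would be shared across calls; default to None
    # and create a fresh dict inside the function instead.
    if external_splits is None:
        external_splits = {}
    return external_splits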

src/MEDS_transforms/mapreduce/utils.py

247-250: Use ternary operator add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now()) instead of if-else-block

Replace if-else-block with add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

(SIM108)
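
Spelled out, the suggestion replaces a four-line if/else with a single conditional expression (here cfg is a plain-dict stand-in for the Hydra config used in mapreduce/utils.py):

from datetime import datetime

cfg = {"worker": 3}
# Before:
#   if "worker" in cfg:
#       add_str = str(cfg["worker"])
#   else:
#       add_str = str(datetime.now())
# After (SIM108):
add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())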

src/MEDS_transforms/mapreduce/mapper.py

574-574: Do not perform function call partial in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

Markdownlint
eICU_Example/README.md

13-13: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


15-15: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


16-16: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


21-21: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


22-22: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


23-23: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


24-24: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


26-26: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


27-27: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

Additional comments not posted (61)
docs/javascripts/mathjax.js (1)

1-12: MathJax configuration looks good.

The configuration for inline and display math is standard, and the use of processEscapes and processEnvironments is appropriate for handling TeX expressions.

src/MEDS_transforms/extract/__init__.py (2)

3-3: Imports are appropriate for the new functionality.

The import of MANDATORY_COLUMNS and MANDATORY_TYPES is necessary for constructing MEDS_DATA_MANDATORY_TYPES dynamically.


16-16: Dynamic construction of MEDS_DATA_MANDATORY_TYPES improves maintainability.

Using dictionary comprehension to construct MEDS_DATA_MANDATORY_TYPES from MANDATORY_COLUMNS and MANDATORY_TYPES enhances flexibility and reduces the risk of errors when updating mandatory fields.

src/MEDS_transforms/__init__.py (4)

4-4: Import of polars is appropriate for data type handling.

The inclusion of polars is necessary for defining data types in MANDATORY_TYPES, which enhances type safety.


15-15: Definition of MANDATORY_COLUMNS ensures data integrity.

Listing essential fields in MANDATORY_COLUMNS helps maintain data integrity by enforcing the presence of these fields in the dataset.


17-24: Use of polars types in MANDATORY_TYPES enhances type safety.

Defining expected data types with polars types ensures that data processing is consistent and reliable.


26-33: Mapping deprecated names maintains backward compatibility.

The DEPRECATED_NAMES dictionary helps transition to new naming conventions while maintaining backward compatibility with older datasets.
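
For readers outside the diff, a rough illustration of what these module-level structures could look like; the column names follow the MEDS "flat" schema (patient_id, time, code, numeric_value) referenced in the README, while the exact dtypes and deprecated-name entries here are assumptions, not the actual contents of __init__.py:

import polars as pl

# Columns every MEDS data shard must provide.
MANDATORY_COLUMNS = ["patient_id", "time", "code", "numeric_value"]

# Expected dtypes for those columns, expressed as polars types.
MANDATORY_TYPES = {
    "patient_id": pl.Int64,
    "time": pl.Datetime("us"),
    "code": pl.Utf8,
    "numeric_value": pl.Float32,
}

# Old column names mapped to their current equivalents for backward compatibility.
DEPRECATED_NAMES = {"timestamp": "time", "numerical_value": "numeric_value"}

# extract/__init__.py can then derive its mapping dynamically, as noted above:
MEDS_DATA_MANDATORY_TYPES = {col: MANDATORY_TYPES[col] for col in MANDATORY_COLUMNS}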

src/MEDS_transforms/configs/preprocess.yaml (1)

4-4: Ensure compatibility with the new stage reshard_to_split.

The addition of reshard_to_split in the stage_configs section modifies the data processing pipeline. Verify that all necessary components are in place to support this new stage.

Verification successful

The reshard_to_split stage is supported and configured properly.

The reshard_to_split stage is referenced in various parts of the codebase, including configuration files, test scripts, and as an entry point in pyproject.toml. This indicates that the necessary components are in place to support this new stage.

  • Entry Point: Defined in pyproject.toml as MEDS_transform-reshard_to_split.
  • Tests: Present in tests/test_reshard_to_split.py.
  • Configuration: Parameters set in src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Ensure compatibility with the new stage `reshard_to_split`.

# Test: Search for the `reshard_to_split` function or class definition. Expect: At least one definition.
ast-grep --lang python --pattern $'def reshard_to_split($_):\n  $$$'

Length of output: 68


Script:

#!/bin/bash
# Broaden the search for any occurrences of `reshard_to_split` in the codebase.
rg 'reshard_to_split' -A 3

Length of output: 1869

.github/workflows/tests.yaml (2)

36-36: Enhance test reporting with JUnit XML.

The addition of --junitxml=junit.xml in the pytest command enhances test reporting by generating a JUnit XML report, which is useful for CI/CD integration.


42-46: Robust test result upload to Codecov.

The new step to upload test results to Codecov, conditioned on the previous steps not being cancelled, adds robustness to the workflow. Ensure the CODECOV_TOKEN is securely managed.

mkdocs.yml (1)

1-45: Comprehensive MkDocs configuration.

The new mkdocs.yml file is well-structured, utilizing the Material theme and various extensions to enhance the documentation. Ensure that all referenced files and scripts, like docs/gen_ref_pages.py, exist and are correctly configured.

docs/gen_ref_pages.py (3)

12-30: Ensure correct handling of special cases.

The logic correctly skips __main__.py files and adjusts paths for __init__.py files. Ensure that this behavior aligns with your documentation requirements.

Verification successful

Handling of __init__.py and __main__.py files is correct.

The code correctly processes __init__.py files by converting them to index.md and skips __main__.py files as intended. The current implementation aligns with the documentation requirements.

  • __init__.py files found in the src directory will be converted to index.md.
  • No __main__.py files were found, confirming that none are incorrectly processed.
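
For reference, this follows the standard mkdocs-gen-files recipe for auto-generated API pages; a condensed sketch of that recipe (the project's script may differ in details):

from pathlib import Path

import mkdocs_gen_files

for path in sorted(Path("src").rglob("*.py")):
    module_path = path.relative_to("src").with_suffix("")
    doc_path = path.relative_to("src").with_suffix(".md")
    parts = tuple(module_path.parts)
    if parts[-1] == "__init__":
        # Package __init__.py files become the section index page.
        parts = parts[:-1]
        doc_path = doc_path.with_name("index.md")
    elif parts[-1] == "__main__":
        # CLI entry points are skipped entirely.
        continue
    with mkdocs_gen_files.open(Path("reference") / doc_path, "w") as fd:
        # mkdocstrings renders the module identified by the dotted path.
        fd.write(f"::: {'.'.join(parts)}")
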
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the handling of __init__.py and __main__.py files in the documentation generation.

# Test: List all __init__.py and __main__.py files in the src directory.
# Expect: __main__.py files are excluded from documentation, __init__.py files are converted to index.md.
fd --type f --name '__init__.py' src/
fd --type f --name '__main__.py' src/

Length of output: 577


Script:

#!/bin/bash
# Description: Verify the handling of __init__.py and __main__.py files in the documentation generation.

# Test: List all __init__.py and __main__.py files in the src directory.
# Expect: __main__.py files are excluded from documentation, __init__.py files are converted to index.md.
fd '__init__.py' src/
fd '__main__.py' src/

Length of output: 334


32-32: Verify navigation structure.

The navigation structure is built using mkdocs_gen_files.Nav. Ensure that the generated structure meets the project's documentation needs.


37-40: Check for file path correctness.

Ensure that the paths used for writing files and setting edit paths are correct and that they align with the project's structure.

pyproject.toml (2)

50-50: Verify new script entry.

The new script MEDS_transform-reshard_to_split is added. Ensure that the function it points to exists and is correctly implemented.


30-34: Verify documentation dependencies.

The new documentation dependencies enhance the project's documentation capabilities. Verify that the specified versions are compatible and meet the project's needs.

Verification successful

Documentation Dependencies Verified

All specified documentation dependencies are available on PyPI and compatible with the project's requirements.

  • mkdocs==1.6.0
  • mkdocs-material==9.5.31
  • mkdocstrings==0.25.2
  • mkdocs-gen-files==0.5.0
  • mkdocs-literate-nav==0.6.1
  • mkdocs-section-index==0.3.9
  • mkdocs-git-authors-plugin==0.9.0
  • mkdocs-git-revision-date-localized-plugin==1.2.6
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the compatibility of new documentation dependencies.

# Test: Check if the specified versions of mkdocs and related plugins are available.
# Expect: All specified versions should be available for installation.
pip install mkdocs==1.6.0 mkdocs-material==9.5.31 mkdocstrings[python,shell]==0.25.2 mkdocs-gen-files==0.5.0 mkdocs-literate-nav==0.6.1 mkdocs-section-index==0.3.9 mkdocs-git-authors-plugin==0.9.0 mkdocs-git-revision-date-localized-plugin==1.2.6

Length of output: 320


Script:

#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0"
  "mkdocs-material==9.5.31"
  "mkdocstrings[python,shell]==0.25.2"
  "mkdocs-gen-files==0.5.0"
  "mkdocs-literate-nav==0.6.1"
  "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0"
  "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  name_version=$(echo $pkg | sed 's/==/-/g' | sed 's/\[.*\]//g')
  curl -s "https://pypi.org/pypi/${name_version}/json" | jq '.info.version'
done

Length of output: 2130


Script:

#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0"
  "mkdocs-material==9.5.31"
  "mkdocstrings==0.25.2"
  "mkdocs-gen-files==0.5.0"
  "mkdocs-literate-nav==0.6.1"
  "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0"
  "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  name_version=$(echo $pkg | sed 's/==/-/g')
  package_name=$(echo $name_version | cut -d'-' -f1)
  version=$(echo $name_version | cut -d'-' -f2)
  curl -s "https://pypi.org/pypi/${package_name}/${version}/json" | jq '.info.version'
done

Length of output: 2945


Script:

#!/bin/bash
# Description: Check the availability of specified package versions on PyPI.

# List of packages and their versions
packages=(
  "mkdocs==1.6.0"
  "mkdocs-material==9.5.31"
  "mkdocstrings==0.25.2"
  "mkdocs-gen-files==0.5.0"
  "mkdocs-literate-nav==0.6.1"
  "mkdocs-section-index==0.3.9"
  "mkdocs-git-authors-plugin==0.9.0"
  "mkdocs-git-revision-date-localized-plugin==1.2.6"
)

# Check each package version on PyPI
for pkg in "${packages[@]}"; do
  package_name=$(echo $pkg | cut -d'=' -f1)
  version=$(echo $pkg | cut -d'=' -f3)
  curl -s "https://pypi.org/pypi/${package_name}/${version}/json" | jq '.info.version'
done

Length of output: 2434

.github/workflows/python-build.yaml (4)

72-78: Verify Sigstore signing configuration.

The Sigstore signing step should be correctly configured and secure. Verify the inputs and ensure that the signing process is reliable.


78-95: Check GitHub Release creation.

Ensure that the GitHub Release creation step is correctly configured and that all necessary artifacts are uploaded.

Verification successful

GitHub Release creation is correctly configured.

The workflow configuration for creating a GitHub Release and uploading artifacts is properly set up. The dist/** pattern ensures that all necessary distribution files and signatures are uploaded to the release.

  • The github-release job downloads artifacts, signs them, and uploads them to the GitHub release.
  • The use of ${{ github.ref_name }} and ${{ github.repository }} in the gh commands is appropriate within the GitHub Actions environment.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the configuration for GitHub Release creation.

# Test: Check if the GitHub Release step uploads all necessary artifacts.
# Expect: All distribution files and signatures should be uploaded.
gh release view '${{ github.ref_name }}' --repo '${{ github.repository }}'

Length of output: 148


Script:

#!/bin/bash
# Search for the GitHub Release creation and artifact upload steps in the workflow file
# to verify if all necessary artifacts are included.

# Find the workflow file
fd 'python-build.yaml' --type f --search-path .github/workflows --exec cat

Length of output: 2813


30-33: Confirm tag-based publishing.

The workflow publishes to PyPI only on tag pushes. Ensure that this condition is correctly implemented and tested.

Verification successful

Tag-based publishing condition is correctly implemented.

The publish-to-pypi job in the GitHub Actions workflow is configured to trigger only on tag pushes using the condition startsWith(github.ref, 'refs/tags/'). This ensures that publishing to PyPI occurs as expected when a tag is pushed.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify tag-based publishing condition.

# Test: Check if the workflow correctly identifies tag pushes.
# Expect: Workflow should trigger publishing only on tag pushes.
git tag v0.0.4 && git push origin v0.0.4

Length of output: 132


Script:

#!/bin/bash
# Verify the 'publish-to-pypi' job's condition in the GitHub Actions workflow file.
# We are looking for the 'if' condition that should trigger the job only on tag pushes.
rg -A 3 'publish-to-pypi:' .github/workflows/python-build.yaml

Length of output: 224


11-15: Verify Python version setup.

Ensure that the specified Python version (3.12) is compatible with all dependencies and the project requirements.

.pre-commit-config.yaml (3)

129-129: LGTM! Improved readability of nbqa-flake8 arguments.

The reformatting of arguments into a single line enhances readability without affecting functionality.


4-4: Verify the new exclusion pattern.

Ensure that the updated exclusion pattern aligns with the project's requirements and does not exclude important files from pre-commit checks.

Verification successful

Exclusion pattern verified successfully.

The exclusion pattern in .pre-commit-config.yaml correctly excludes the specified files (docs/index.md, MIMIC-IV_Example/README.md, and eICU_Example/README.md) without affecting other critical files. No issues found with the current configuration.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the exclusion pattern in `.pre-commit-config.yaml`.

# Test: List files that match the exclusion pattern. Expect: No critical files should be excluded.
rg --files --glob '!docs/index.md' --glob '!MIMIC-IV_Example/README.md' --glob '!eICU_Example/README.md'

Length of output: 3835


101-103: Verify compatibility of new markdown hooks.

Ensure that the newly added hooks (mdformat-mkdocs, mdformat-toc, mdformat-admon) are compatible with the project's markdown files and do not introduce formatting issues.

tests/utils.py (2)

7-7: LGTM! Necessary import for configuration handling.

The import of OmegaConf is necessary for handling YAML configuration files in the updated run_command function.


114-166: Review error handling and resource cleanup.

The error handling and resource cleanup logic have been improved. Ensure that temporary directories are always cleaned up to prevent resource leaks.
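
One way to guarantee cleanup regardless of command failure is a context manager; a minimal sketch of the pattern (not the project's actual run_command implementation):

import subprocess
import tempfile
from pathlib import Path

def run_with_temp_config(command: list[str], config_text: str) -> subprocess.CompletedProcess:
    # TemporaryDirectory is removed even if the command fails or an exception is raised.
    with tempfile.TemporaryDirectory() as tmpdir:
        config_fp = Path(tmpdir) / "config.yaml"
        config_fp.write_text(config_text)
        # How the config path is consumed is up to the caller; here it is passed as an argument.
        return subprocess.run([*command, str(config_fp)], capture_output=True, text=True)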

src/MEDS_transforms/reshard_to_split.py (4)

4-17: Consider removing unused imports.

The imports from hydra, loguru, and omegaconf are appropriate for configuration management and logging. However, ensure that all imported modules are used in the file.


87-95: Verify the filtering logic within rwlock_wrap.

Ensure that the filtering logic applied with read_and_filter_fntr is correct and that it efficiently filters the data based on patient IDs.

Verification successful

Filtering logic within rwlock_wrap is correctly implemented.

The function read_and_filter_fntr is used to apply a filter expression to data read from a file. In the context of rwlock_wrap, it effectively filters the data based on the specified patient IDs using the expression pl.col("patient_id").is_in(patients). The implementation aligns with the intended functionality.

  • src/MEDS_transforms/reshard_to_split.py: Usage of read_and_filter_fntr with patient ID filtering.
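
To make the functor pattern concrete, a simplified sketch of how such a reader/filter factory can work (the real read_and_filter_fntr signature may differ):

from collections.abc import Callable
from pathlib import Path

import polars as pl

def read_and_filter_fntr(filter_expr: pl.Expr) -> Callable[[Path], pl.LazyFrame]:
    # Returns a reader that lazily scans a shard and applies the filter expression.
    def read_and_filter(fp: Path) -> pl.LazyFrame:
        return pl.scan_parquet(fp).filter(filter_expr)

    return read_and_filter

# e.g., restrict a shard to the patients assigned to one subshard:
reader = read_and_filter_fntr(pl.col("patient_id").is_in([1, 2, 3]))
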
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the filtering logic applied in rwlock_wrap.

# Test: Search for read_and_filter_fntr usage to confirm filtering logic.
rg --type python $'read_and_filter_fntr'

Length of output: 70


Script:

#!/bin/bash
# Search for the definition and usage of read_and_filter_fntr to verify the filtering logic.
rg 'read_and_filter_fntr' -A 5

Length of output: 2990


28-32: Check for potential data loading issues.

The code reads a Parquet file into a DataFrame. Ensure that the file path is correct and that the file exists in the expected location.


21-23: Ensure correct configuration path usage.

The hydra.main decorator uses PREPROCESS_CONFIG_YAML for configuration. Verify that this path is correct and accessible in the runtime environment.

tests/test_reshard_to_split.py (1)

193-202: Ensure comprehensive test coverage.

The test function test_reshard_to_split uses single_stage_transform_tester to validate the script. Ensure that this covers all critical paths and edge cases of the script.

tests/test_filter_measurements.py (1)

211-228: Ensure filtering logic is accurately tested.

The new test function test_match_revise_filter_measurements uses specific filtering criteria. Verify that these criteria align with the intended logic and cover necessary edge cases.

Verification successful

Filtering logic in test function is correctly aligned with implementation.

The test function test_match_revise_filter_measurements accurately reflects the filtering logic defined in the source code. It tests various codes and patient thresholds, ensuring comprehensive coverage of the filtering criteria.

  • The filtering function filter_measurements_fntr in src/MEDS_transforms/filters/filter_measurements.py aligns with the test's logic.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the filtering logic in test_match_revise_filter_measurements aligns with the intended logic.

# Test: Check for additional filtering logic or criteria related to filter_measurements.
rg --type python $'filter_measurements'

Length of output: 69


Script:

#!/bin/bash
# Description: Verify the filtering logic in test_match_revise_filter_measurements aligns with the intended logic.

# Test: Search for the term 'filter_measurements' across all Python files to find related logic or criteria.
rg 'filter_measurements' --glob '*.py' -A 5

Length of output: 7435

src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2)

149-149: Streamline initialization of shards_map_fp.

The direct assignment of shards_map_fp within the main function simplifies the code by reducing unnecessary return values from stage_init(cfg). This is a good practice for improving code readability and maintainability.


196-196: Ensure existence of cfg.shards_map_fp.

Before using Path(cfg.shards_map_fp), verify that cfg.shards_map_fp is a valid and expected path. This prevents potential runtime errors if the configuration is incorrect.
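
A small guard along these lines would fail fast with a clear error; this is a sketch only, reusing the attribute name from the comment above:

from pathlib import Path

from omegaconf import DictConfig

def resolve_shards_map(cfg: DictConfig) -> Path:
    # Validate the configured path before any downstream file operations.
    shards_map_fp = Path(cfg.shards_map_fp)
    if not shards_map_fp.is_file():
        raise FileNotFoundError(f"Expected shards map file at {shards_map_fp}.")
    return shards_map_fp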

src/MEDS_transforms/transforms/tokenization.py (2)

232-235: Check for unsupported configurations.

Raising a ValueError when include_only_train is true ensures unsupported configurations are flagged early. This is a good practice to prevent unexpected behavior during execution.


20-20: Update import statements.

The import of shard_iterator from MEDS_transforms.mapreduce.utils suggests a shift in utility functions. Ensure that shard_iterator is the correct choice for the intended functionality.

Verification successful

Import statement is correct.

The shard_iterator function is correctly imported from MEDS_transforms.mapreduce.utils and is used across multiple files in the codebase. No updates to the import statement are necessary.

  • src/MEDS_transforms/mapreduce/utils.py: Contains the definition of shard_iterator.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `shard_iterator` in the codebase.

# Test: Search for the `shard_iterator` function definition. Expect: A valid function definition.
ast-grep --lang python --pattern $'def shard_iterator($_):'

Length of output: 60


Script:

#!/bin/bash
# Searching for any occurrences of `shard_iterator` in the codebase to verify its existence and usage.
rg 'shard_iterator'

Length of output: 2577

src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2)

12-12: Ensure consistency with new shard iterator.

The import of shard_iterator_by_shard_map replaces the previous shard_iterator. Ensure that this change is consistent with the intended behavior of the map_over function and that it aligns with the overall data processing logic.

Verification successful

Consistent usage of shard_iterator_by_shard_map verified.

The shard_iterator_by_shard_map function is correctly used as an argument in the map_over function within merge_to_MEDS_cohort.py, ensuring consistency with the intended data processing logic.

  • merge_to_MEDS_cohort.py: map_over uses shard_iterator_by_shard_map as shard_iterator_fntr.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage with the intended behavior.
rg --type python 'shard_iterator_by_shard_map'

Length of output: 77


Script:

#!/bin/bash
# Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage with the intended behavior.
rg --type py 'shard_iterator_by_shard_map'

Length of output: 776


Script:

#!/bin/bash
# Description: Verify the usage of `map_over` in relation to `shard_iterator_by_shard_map`.

# Test: Search for the `map_over` function in the `merge_to_MEDS_cohort.py` file.
rg 'map_over' src/MEDS_transforms/extract/merge_to_MEDS_cohort.py -A 5

Length of output: 363


241-241: Verify the impact of iterator function change.

The shard_iterator_fntr parameter has been updated to use shard_iterator_by_shard_map. Verify that this change does not introduce any issues in the map_over function's behavior.

docs/preprocessing_operation_prototypes.md (1)

45-46: Formatting improvements enhance readability.

The formatting adjustments improve the document's visual consistency and readability. No functional content changes are made.

Also applies to: 74-74, 82-83, 113-113, 155-155, 203-203, 205-205, 210-211, 222-222, 230-233

src/MEDS_transforms/extract/split_and_shard_patients.py (3)

108-108: Initialize splits consistently.

The splits variable is initialized with external_splits, which is appropriate. Ensure this initialization is consistent with the intended logic for handling splits.


111-139: Improved handling of split fractions and warnings.

The restructuring of logic to handle split fractions and warnings about empty splits enhances robustness. Ensure that the warning messages are clear and informative.


268-272: Improved logging and file handling.

The changes to path handling and logging improve clarity and maintainability. The use of shards_map_fp for file operations is a good practice.

src/MEDS_transforms/extract/README.md (1)

8-10: Formatting changes improve readability.

The adjustments to indentation and alignment enhance the document's visual clarity.

Also applies to: 32-36, 38-44, 80-90, 92-93, 104-104, 220-221, 248-250, 252-255

tests/transform_tester_base.py (3)

303-305: New parameters enhance configurability.

The addition of do_use_config_yaml, input_shards_map, and input_splits_map allows for more flexible input configurations.


366-374: Streamlined command execution logic.

The use of run_command_kwargs to pass arguments improves readability and maintainability.


320-336: Ensure correct default handling for input maps.

The logic for defaulting input_shards_map and input_splits_map is correctly implemented, but verify that these defaults align with expected usage scenarios.

Verification successful

Defaults for input_shards_map and input_splits_map are correctly handled and tested.

The search results indicate that there are test cases where input_shards_map and input_splits_map are not explicitly provided, allowing the defaults to be used. This confirms that the default handling logic is being tested as expected.

  • Test cases in tests/test_tokenization.py, tests/test_reorder_measurements.py, tests/test_tensorization.py, tests/test_fit_vocabulary_indices.py, tests/test_filter_measurements.py, tests/test_normalization.py, tests/test_occlude_outliers.py, and tests/test_filter_patients.py do not provide these maps, implying default usage.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify usage of `input_shards_map` and `input_splits_map` in tests.

# Test: Search for test cases using `single_stage_transform_tester` to ensure defaults are correctly applied.
rg --type python 'single_stage_transform_tester' -A 10

Length of output: 85


Script:

#!/bin/bash
# Description: Verify usage of `input_shards_map` and `input_splits_map` in tests.

# Test: Search for test cases using `single_stage_transform_tester` to ensure defaults are correctly applied.
rg --type py 'single_stage_transform_tester' -A 10

Length of output: 14376

src/MEDS_transforms/utils.py (2)

Line range hint 225-352: Improved stage configuration logic.

The refined handling of metadata stages and train_only logic enhances configurability. Verify that all stages are handled correctly.

Verification successful

Verification of populate_stage logic improvements.

The populate_stage function is tested through doctests within src/MEDS_transforms/utils.py, which cover various stage configurations and expected outcomes. These examples verify the function's behavior and logic improvements. Consider adding more comprehensive unit tests for additional coverage.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify correctness of stage configuration logic in `populate_stage`.

# Test: Search for test cases or usage examples of `populate_stage` to ensure all stages are configured correctly.
rg --type python 'populate_stage' -A 10

Length of output: 70


Script:

#!/bin/bash
# Description: Verify correctness of stage configuration logic in `populate_stage`.

# Test: Search for test cases or usage examples of `populate_stage` to ensure all stages are configured correctly.
rg --type py 'populate_stage' -A 10

Length of output: 6119


Line range hint 61-98: Simplified return type improves clarity.

The change to return only three Path objects reduces complexity. Ensure that dependent code is updated accordingly.

Verification successful

Simplified return type is consistent with current usage.

The change to return only three Path objects in stage_init aligns with its usage across the codebase, where the function's return values are unpacked into three variables. No updates to dependent code are necessary.

  • src/MEDS_transforms/reshard_to_split.py
  • src/MEDS_transforms/mapreduce/mapper.py
  • src/MEDS_transforms/extract/split_and_shard_patients.py
  • src/MEDS_transforms/extract/finalize_MEDS_metadata.py
  • src/MEDS_transforms/extract/convert_to_sharded_events.py
  • src/MEDS_transforms/extract/extract_code_metadata.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify usage of `stage_init` to ensure dependent code is updated.

# Test: Search for all usages of `stage_init` to check for consistency with the new return type.
rg --type python 'stage_init' -A 5

Length of output: 65


Script:

#!/bin/bash
# Description: Verify usage of `stage_init` to ensure dependent code is updated.

# Test: Search for all usages of `stage_init` to check for consistency with the new return type.
rg --type py 'stage_init' -A 5

Length of output: 6872

src/MEDS_transforms/extract/extract_code_metadata.py (2)

352-352: Change in unpacking of stage_init return values is appropriate.

The stage_init function now returns three values, and the change in unpacking aligns with the updated return signature. Ensure that the removed value is not used elsewhere in the code.


20-20: Verify the new import path for cfg_to_expr.

The import path has changed, which may indicate a refactor. Ensure that this path is correct and consistent with the project's module organization.

Verification successful

Import Path Verified: cfg_to_expr

The import path for cfg_to_expr in src/MEDS_transforms/extract/extract_code_metadata.py is correct. The function is defined in src/MEDS_transforms/parser.py, matching the import statement. No issues found.

  • src/MEDS_transforms/parser.py: Function cfg_to_expr is correctly defined here.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and correctness of the new import path for `cfg_to_expr`.

# Test: Check if the `cfg_to_expr` function is correctly imported from the new location.
ast-grep --lang python --pattern $'def cfg_to_expr($_) { $$$ }'

Length of output: 514


Script:

#!/bin/bash
# Description: Verify the existence and correctness of the new import path for `cfg_to_expr`.

# Test: Check if the `cfg_to_expr` function is correctly defined in the new location.
ast-grep --lang python --pattern 'def cfg_to_expr($$):'

Length of output: 57


Script:

#!/bin/bash
# Description: Search for the `cfg_to_expr` function definition across the entire codebase.

# Test: Find any definition of the `cfg_to_expr` function in the codebase.
rg 'def cfg_to_expr' -A 2

Length of output: 278

src/MEDS_transforms/mapreduce/utils.py (3)

182-182: Consistent return types in rwlock_wrap.

The return statements have been updated to consistently return tuples, enhancing code consistency.

Also applies to: 216-216


261-474: Enhanced functionality in shard_iterator.

The function now returns a tuple with input-output pairs and a boolean for training shards, improving its interface and usability.


477-560: shard_iterator_by_shard_map function is robust and well-integrated.

The function correctly iterates over shards based on a shard map file and handles configuration errors effectively.
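For readers skimming this review, a minimal sketch of the shard-map pattern described above may help. This is not the library's implementation: the ".shards.json"-style file, the JSON layout, and the parquet naming below are assumptions made purely for illustration.

import json
from pathlib import Path
from collections.abc import Iterator


def iter_shards_from_map(
    shards_map_fp: Path, input_dir: Path, output_dir: Path
) -> Iterator[tuple[Path, Path]]:
    """Yield (input_path, output_path) pairs for every shard named in a shard map file.

    Hypothetical sketch: assumes the map is a JSON object keyed by shard name
    (e.g. "train/0") with the assigned patient IDs as values.
    """
    if not shards_map_fp.is_file():
        raise FileNotFoundError(f"Shard map not found at {shards_map_fp}")

    shards_map = json.loads(shards_map_fp.read_text())
    for shard_name in shards_map:
        yield input_dir / f"{shard_name}.parquet", output_dir / f"{shard_name}.parquet"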

tests/test_extract.py (1)

444-447: Update to shards_fp in test_extraction is appropriate.

The change from splits_fp to shards_fp reflects an update in data organization and is correctly integrated into the test.

src/MEDS_transforms/mapreduce/mapper.py (2)

228-240: Clarify the return type of is_match_revise.

The function is_match_revise now has a return type annotation -> bool. This is a good practice as it improves readability and helps with static analysis.


242-289: Ensure comprehensive validation in validate_match_revise.

The function validate_match_revise performs thorough validation of the stage_cfg. The error messages are clear, which aids in debugging.

README.md (1)

3-4: Verify badge links and tokens.

Ensure that the URLs for the badges are correct and that the tokens are valid. This is crucial for accurate status display.

Verification successful

Badge links and tokens are valid.

Both badge URLs returned HTTP 200 status codes, confirming their accessibility and validity. No issues were found with the badge links or tokens.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the URLs and tokens for the badges in the README.md file.

# Test: Check if the badge URLs are accessible. Expect: HTTP 200 status code.
curl -I https://codecov.io/gh/mmcdermott/MEDS_transforms/graph/badge.svg?token=5RORKQOZF9
curl -I https://readthedocs.org/projects/meds-transforms/badge/?version=latest

Length of output: 3136

src/MEDS_transforms/aggregate_code_metadata.py (3)

63-66: Document the new quantile aggregation clearly.

The addition of VALUES_QUANTILES to the METADATA_FN class is well-documented. Ensure that the configuration file is updated to include examples of how to specify quantiles.


601-618: Test the integration of quantile aggregation in reducer_fntr.

The integration of quantile aggregation in the reducer_fntr function appears correct. Ensure that tests cover scenarios where quantiles are part of the aggregation configuration.


105-150: Ensure correctness of quantile_reducer.

The quantile_reducer function is correctly implemented to calculate specified quantiles. Ensure that it handles edge cases, such as empty datasets, gracefully.
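As a rough, generic illustration of the quantile aggregation discussed here (not the project's actual quantile_reducer), a Polars sketch such as the one below computes per-code quantiles and yields nulls for empty or all-null groups; the column names and quantile levels are assumptions.

import polars as pl


def per_code_quantiles(df: pl.DataFrame, quantiles=(0.25, 0.5, 0.75)) -> pl.DataFrame:
    """Compute the requested quantiles of `numeric_value` for each `code` (illustrative only)."""
    return df.group_by("code").agg(
        [pl.col("numeric_value").quantile(q).alias(f"values/quantile/{q}") for q in quantiles]
    )


# A group whose values are all null produces null quantiles rather than raising.
df = pl.DataFrame({"code": ["A", "A", "B"], "numeric_value": [1.0, 3.0, None]})
print(per_code_quantiles(df))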

src/MEDS_transforms/extract/convert_to_sharded_events.py (2)

20-20: Simplified import statement.

The removal of parentheses around imported functions improves readability without affecting functionality.


683-685: Improved clarity by reading shards directly.

The change to read shards directly from cfg.shards_map_fp instead of using shards_map_fn from stage_init enhances clarity and reduces dependencies.

`relative_table_file_stem.event_name.code` field. This field can be a string literal, a reference to an
input column (denoted by the `col(...)` syntax), or a list of same. If it is a list, the output code will
be a `"//"` separated string of each field in the list. Each field in the list (again either a string
literal or a input column reference) is interpreted either as the specified string literal or as the

Correct article usage.

Use "an" instead of "a" before a vowel sound.

- or a input column reference
+ or an input column reference
Tools
LanguageTool

[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...

(EN_A_VS_AN)

@@ -19,7 +19,7 @@ def shard_patients[
    patients: np.ndarray,
    n_patients_per_shard: int = 50000,
    external_splits: dict[str, Sequence[SUBJ_ID_T]] | None = None,
-   split_fracs_dict: dict[str, float] = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+   split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},

Avoid mutable default arguments.

Using mutable data structures like dictionaries as default arguments can lead to unexpected behavior. Replace with None and initialize within the function.

- split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+ split_fracs_dict: dict[str, float] | None = None,

Initialize within the function:

if split_fracs_dict is None:
    split_fracs_dict = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}
Tools
Ruff

22-22: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
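To make the hazard concrete, here is a small, generic sketch (deliberately unrelated to shard_patients) showing how a mutable default is shared across calls and how the None-sentinel pattern avoids that.

def risky(counts: dict[str, int] = {}) -> dict[str, int]:
    # The same dict object is reused on every call, so state leaks between calls.
    counts["calls"] = counts.get("calls", 0) + 1
    return counts


def safe(counts: dict[str, int] | None = None) -> dict[str, int]:
    # A fresh dict is created per call unless the caller supplies one.
    if counts is None:
        counts = {}
    counts["calls"] = counts.get("calls", 0) + 1
    return counts


print(risky())  # {'calls': 1}
print(risky())  # {'calls': 2} -- shared state across calls
print(safe())   # {'calls': 1}
print(safe())   # {'calls': 1} -- independent per call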

Comment on lines +224 to +259
def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:
"""Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.

Args:
shards: The list of shards to shuffle.
cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following
keys (some are optional, as marked below):
- `worker` (optional): The worker ID for the MR worker; this is also used to seed the
randomization process. If not provided, the randomization process will be unseeded.

Returns:
The shuffled list of shards.

Examples:
>>> cfg = DictConfig({"worker": 1})
>>> shards = ["train/0", "train/1", "tuning", "held_out"]
>>> shuffle_shards(shards, cfg)
['train/1', 'held_out', 'tuning', 'train/0']
>>> cfg = DictConfig({"worker": 2})
>>> shuffle_shards(shards, cfg)
['tuning', 'held_out', 'train/1', 'train/0']
"""

if "worker" in cfg:
add_str = str(cfg["worker"])
else:
add_str = str(datetime.now())

shard_keys = []
for shard in shards:
shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()
if shard_hash in shard_keys:
raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
shard_keys.append(int(shard_hash, 16))

return [shard for _, shard in sorted(zip(shard_keys, shards))]

shuffle_shards function is well-implemented.

The function shuffles shards deterministically based on the worker ID, and uses SHA-256 to avoid hash collisions. Consider using a ternary operator for add_str assignment for clarity.

-    if "worker" in cfg:
-        add_str = str(cfg["worker"])
-    else:
-        add_str = str(datetime.now())
+    add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())
Tools
Ruff

247-250: Use ternary operator add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now()) instead of if-else-block

Replace if-else-block with add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

(SIM108)

Comment on lines +157 to +219
def read_and_filter_fntr(filter_expr: pl.Expr, read_fn: Callable[[Path], DF_T]) -> Callable[[Path], DF_T]:
"""Create a function that reads a DataFrame from a file and filters it based on a given expression.

This is specified as a functor in this way to allow it to modify arbitrary other read functions for use in
different mapreduce pipelines.

Args:
filter_expr: The filter expression to apply to the DataFrame.
read_fn: The read function to use to read the DataFrame.

Returns:
A function that reads a DataFrame from a file and filters it based on the given expression.

Examples:
>>> dfs = {
... "df1": pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
... "df2": pl.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]})
... }
>>> read_fn = lambda key: dfs[key]
>>> fn = read_and_filter_fntr((pl.col("a") % 2) == 0, read_fn)
>>> fn("df1")
shape: (1, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2 ┆ 5 │
└─────┴─────┘
>>> fn("df2")
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4 ┆ 7 │
│ 6 ┆ 9 │
└─────┴─────┘
>>> fn = read_and_filter_fntr((pl.col("b") % 2) == 0, read_fn)
>>> fn("df1")
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 3 ┆ 6 │
└─────┴─────┘
>>> fn("df2")
shape: (1, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5 ┆ 8 │
└─────┴─────┘
"""

def read_and_filter(in_fp: Path) -> DF_T:
-        df = read_fn(in_fp)
-        return df.filter(pl.col("patient_id").isin(patients))
+        return read_fn(in_fp).filter(filter_expr)

Update default argument handling for read_fn.

The read_fn argument in read_and_filter_fntr is a partial function. It's better to avoid using function calls in default arguments due to Python's handling of mutable defaults. Consider defining the default outside the function.

# Define the default read function outside
default_read_fn = partial(pl.scan_parquet, glob=False)

def read_and_filter_fntr(filter_expr: pl.Expr, read_fn: Callable[[Path], DF_T] = default_read_fn) -> Callable[[Path], DF_T]:

Comment on lines +571 to +616
def map_over(
cfg: DictConfig,
compute_fn: COMPUTE_FN_T | None = None,
read_fn: Callable[[Path], DF_T] = partial(pl.scan_parquet, glob=False),
write_fn: Callable[[DF_T, Path], None] = write_lazyframe,
shard_iterator_fntr: SHARD_ITR_FNTR_T = shard_iterator,
) -> list[Path]:
stage_init(cfg)

start = datetime.now()

train_only = cfg.stage_cfg.get("train_only", False)
split_fp = Path(cfg.input_dir) / "metadata" / "patient_split.parquet"

shards, includes_only_train = shard_iterator_fntr(cfg)

if train_only:
if includes_only_train:
logger.info(
f"Processing train split only via shard prefix. Not filtering with {str(split_fp.resolve())}."
)
elif split_fp.exists():
logger.info(f"Processing train split only by filtering read dfs via {str(split_fp.resolve())}")
train_patients = (
pl.scan_parquet(split_fp)
.filter(pl.col("split") == "train")
.select(pl.col("patient_id"))
.collect()
.to_list()
)
read_fn = read_and_filter_fntr(train_patients, read_fn)
else:
raise FileNotFoundError(
f"Train split requested, but shard prefixes can't be used and "
f"patient split file not found at {str(split_fp.resolve())}."
)
elif includes_only_train:
raise ValueError("All splits should be used, but shard iterator is returning only train splits?!?")

if is_match_revise(cfg.stage_cfg):
compute_fn = match_revise_fntr(cfg, cfg.stage_cfg, compute_fn)
else:
compute_fn = bind_compute_fn(cfg, cfg.stage_cfg, compute_fn)

all_out_fps = []
-    for in_fp, out_fp in shard_iterator_fntr(cfg):
+    for in_fp, out_fp in shards:

Refactor map_over to avoid function calls in default arguments.

The read_fn argument uses partial. Define this outside the function to avoid potential issues with mutable defaults.

# Define the default read function outside
default_read_fn = partial(pl.scan_parquet, glob=False)

def map_over(
    cfg: DictConfig,
    compute_fn: COMPUTE_FN_T | None = None,
    read_fn: Callable[[Path], DF_T] = default_read_fn,
    write_fn: Callable[[DF_T, Path], None] = write_lazyframe,
    shard_iterator_fntr: SHARD_ITR_FNTR_T = shard_iterator,
) -> list[Path]:
Tools
Ruff

574-574: Do not perform function call partial in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

- The time since the last event of a certain type.
- The patient's age as of each unique timepoint.
- The time-of-day of each event.
- Adding a "dummy" event to the dataset for each patient that occurs at the end of the observation

Use "who" instead of "that" for referring to people.

Use "who" when referring to people for better grammatical correctness.

- Adding a "dummy" event to the dataset for each patient that occurs at the end of the observation period.
+ Adding a "dummy" event to the dataset for each patient who occurs at the end of the observation period.

Committable suggestion was skipped due to low confidence.

Tools
LanguageTool

[style] ~315-~315: Consider using “who” when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...

(THAT_WHO)

2. Figure out how to ensure that each pre-processing step reads from the right prior files. Likely need some
kind of a "prior stage name" config variable.

Correct usage of "kind of".

The phrase "kind of a" should be corrected to "kind of" for proper grammar.

- Likely need some kind of a "prior stage name" config variable.
+ Likely need some kind of "prior stage name" config variable.
Tools
LanguageTool

[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...

(KIND_OF_A)

stage are expressible in the `code` and `numeric_values` columns of the dataset, which helps
standardize further downstream use.
should be included during code grouping and modeling operations. The goal of this step is to ensure that
the only columns that need be processed going into the pre-processing, tokenization, and tensorization

Add missing preposition for clarity.

A preposition is missing, which can affect the sentence's clarity.

- ensure that the only columns that need be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset
+ ensure that the only columns that need to be processed going into the pre-processing, tokenization, and tensorization stage are expressible in the `code` and `numeric_values` columns of the dataset
Tools
LanguageTool

[uncategorized] ~328-~328: Possible missing preposition found.
Context: ...ure that the only columns that need be processed going into the pre-processing...

(AI_HYDRA_LEO_MISSING_TO)

2. This pipeline preferentially uses disk over memory and compute through aggressive caching. You should
have sufficient disk space to store multiple copies of your raw dataset comfortably.
3. This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker

Add missing comma for clarity.

A comma is needed to separate clauses for better readability.

- This pipeline can be run on a single machine or across many worker nodes on a cluster provided the worker nodes have access to a distributed file system.
+ This pipeline can be run on a single machine or across many worker nodes on a cluster, provided the worker nodes have access to a distributed file system.
Tools
LanguageTool

[uncategorized] ~207-~207: Possible missing comma found.
Context: ...achine or across many worker nodes on a cluster provided the worker nodes have acce...

(AI_HYDRA_LEO_MISSING_COMMA)

Comment on lines +98 to +108
stage's script in sequence.
- The user can leverage arbitrary scheduling systems (e.g., Slurm, LSF, Kubernetes, etc.) to run each stage
in parallel on a cluster, either by manually constructing the appropriate worker scripts to run each stage's
script and simply launching as many worker jobs as is desired or by using Hydra launchers such as the
`submitit` launcher to automate the creation of appropriate scheduler worker jobs. Note this will typically
required a distributed file system to work correctly, as these scripts use manually created file locks to
avoid duplicative work.
- The user can run each stage in parallel on a single machine by launching multiple copies of the same
script in different terminal sessions or all at once via the Hydra `joblib` launcher. This can result in a
significant speedup depending on the machine configuration as sharding ensures that parallelism can be used
with minimal file read contention.

Add missing commas for clarity.

There are missing commas in the sentences, which can affect readability.

- The user can run each stage in parallel on a single machine by launching multiple copies of the same script in different terminal sessions or all at once via the Hydra `joblib` launcher.
+ The user can run each stage in parallel on a single machine by launching multiple copies of the same script, in different terminal sessions, or all at once via the Hydra `joblib` launcher.
Tools
LanguageTool

[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...

(MD_BASEFORM)


[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydra joblib l...

(AI_HYDRA_LEO_MISSING_COMMA)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5e47f0c and e0d1ecb.

Files selected for processing (1)
  • src/MEDS_transforms/reshard_to_split.py (1 hunks)
Additional context used
Ruff
src/MEDS_transforms/reshard_to_split.py

118-118: Function definition does not bind loop variable in_fps

(B023)

Additional comments not posted (1)
src/MEDS_transforms/reshard_to_split.py (1)

4-18: Imports look good.

The imports are necessary for the functionality provided in the file.

Comment on lines 21 to 150
if shards_fp.is_file():
if cfg.do_overwrite:
logger.warning(f"Overwriting {str(shards_fp.resolve())}")
shards_fp.unlink()
else:
raise FileExistsError(f"{str(shards_fp.resolve())} already exists.")

shards_fp.write_text(json.dumps(new_sharded_splits))

output_dir = Path(cfg.stage_cfg.output_dir)

orig_shards_iter, include_only_train = shard_iterator(cfg, out_suffix="")
if include_only_train:
raise ValueError("This stage does not support include_only_train=True")

orig_shards_iter = [(in_fp, out_fp.relative_to(output_dir)) for in_fp, out_fp in orig_shards_iter]

new_shards = shuffle_shards(list(new_sharded_splits.keys()), cfg)
new_shards_iter = [(shard_name, output_dir / shard_name) for shard_name in new_shards]

# Step 1: Sub-sharding stage
logger.info("Starting sub-sharding")
subshard_fps = defaultdict(list)

for in_fp, orig_shard_name in orig_shards_iter:
for subshard_name, out_dir in new_shards_iter:
out_fp = out_dir / f"{orig_shard_name}.parquet"
subshard_fps[subshard_name].append(out_fp)
patients = new_sharded_splits[subshard_name]

if not patients:
raise ValueError(f"No patients found for {subshard_name}!")

logger.info(
f"Sub-sharding {str(in_fp.resolve())} to {len(patients)} patients in {str(out_fp.resolve())}"
)

rwlock_wrap(
in_fp,
out_fp,
read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
write_lazyframe,
identity_fn,
do_return=False,
do_overwrite=cfg.do_overwrite,
)

logger.info("Merging sub-shards")
for subshard_name, subshard_dir in new_shards_iter:
in_dir = subshard_dir
in_fps = subshard_fps[subshard_name]
out_fp = subshard_dir.with_suffix(".parquet")

logger.info(f"Merging {subshard_dir}/**/*.parquet into {str(out_fp.resolve())}")

if not subshard_fps:
raise ValueError(f"No subshards found for {subshard_name}!")

if out_fp.is_file():
logger.info(f"Output file {str(out_fp.resolve())} already exists. Skipping.")
continue

while not (all(fp.is_file() for fp in in_fps) or out_fp.is_file()):
logger.info("Waiting to begin merging for all sub-shard files to be written...")
time.sleep(cfg.polling_time)

def read_fn(fp: Path) -> pl.LazyFrame:
return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
by=["patient_id", "time"], maintain_order=True, multithreaded=False
)

logger.info(f"Merging files to {str(out_fp.resolve())}")
result_computed, _ = rwlock_wrap(
in_dir,
out_fp,
read_fn,
write_lazyframe,
identity_fn,
do_return=False,
do_overwrite=cfg.do_overwrite,
)

if result_computed:
logger.info(f"Cleaning up subsharded files in {str(subshard_dir.resolve())}/*.")
for fp in in_fps:
if fp.exists():
fp.unlink()
try:
for root, dirs, files in subshard_dir.walk(top_down=False):
walked_dir = root.relative_to(subshard_dir)
if files:
raise FileExistsError(f"Files found in {walked_dir} after cleanup!: {files}")
for d in dirs:
(root / d).rmdir()
subshard_dir.rmdir()
except OSError as e:
contents_str = "\n".join([str(f) for f in subshard_dir.iterdir()])
raise ValueError(f"Could not remove {str(subshard_dir)}. Contents:\n{contents_str}") from e

logger.info(f"Done with {cfg.stage}")

Consider modularizing the main function.

The main function is lengthy and could benefit from being broken down into smaller, more focused functions. This would improve readability and maintainability.

Tools
Ruff

118-118: Function definition does not bind loop variable in_fps

(B023)
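One possible decomposition, sketched below purely as an outline — the helper names and signatures are invented here, not taken from the PR — is to give each phase of main its own function and keep main as the orchestrator:

# Hypothetical outline only; names, signatures, and return types are illustrative.
def write_shards_map(cfg, new_sharded_splits) -> None:
    """Persist the new shard map, honoring do_overwrite."""


def subshard_stage(cfg, orig_shards_iter, new_shards_iter, new_sharded_splits) -> dict:
    """Filter each original shard into per-split sub-shards; return sub-shard paths by shard."""


def merge_stage(cfg, new_shards_iter, subshard_fps) -> None:
    """Wait for sub-shards, merge them per new shard, then clean up."""


def main(cfg) -> None:
    ...  # orchestrate: write_shards_map -> subshard_stage -> merge_stage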

Comment on lines 88 to 96
rwlock_wrap(
in_fp,
out_fp,
read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
write_lazyframe,
identity_fn,
do_return=False,
do_overwrite=cfg.do_overwrite,
)

Add error handling to rwlock_wrap usage.

Consider adding error handling for potential issues during file operations to ensure robustness.

Also applies to: 123-131
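A hedged sketch of what that error handling could look like is below; the wrapper name and the choice to catch OSError are assumptions, and rwlock_wrap is passed in rather than imported so the sketch stays generic.

import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def rwlock_wrap_checked(rwlock_wrap, in_fp: Path, out_fp: Path, *args, **kwargs):
    """Run rwlock_wrap and log which shard failed before re-raising (illustrative only)."""
    try:
        return rwlock_wrap(in_fp, out_fp, *args, **kwargs)
    except OSError as e:  # e.g. missing files, permission errors, full disk
        logger.error("Shard processing failed for %s -> %s: %s", in_fp, out_fp, e)
        raise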

Comment on lines 117 to 120
def read_fn(fp: Path) -> pl.LazyFrame:
return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
by=["patient_id", "time"], maintain_order=True, multithreaded=False
)

Resolve the loop variable binding issue.

The static analysis tool indicates a potential issue with the function definition not binding the loop variable in_fps. Ensure that fp is correctly used within read_fn.

- return pl.concat([pl.scan_parquet(fp, glob=False) for fp in in_fps], how="diagonal_relaxed").sort(
+ return pl.concat([pl.scan_parquet(fp, glob=False) for fp in [fp]], how="diagonal_relaxed").sort(

Committable suggestion was skipped due to low confidence.

Tools
Ruff

118-118: Function definition does not bind loop variable in_fps

(B023)
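For reference, the usual remedy for B023 is to bind the value the closure needs at definition time, for example through a small factory; the sketch below is a generic illustration rather than a patch from the PR.

import polars as pl
from pathlib import Path


def make_read_fn(in_fps: list[Path]):
    """Return a reader bound to a fixed list of sub-shard paths.

    Because `in_fps` is a parameter here, each reader captures the value passed in,
    not whatever the enclosing loop variable happens to be later (the B023 hazard).
    """

    def read_fn(fp: Path) -> pl.LazyFrame:
        # `fp` is accepted for interface compatibility but unused, mirroring the snippet above.
        return pl.concat(
            [pl.scan_parquet(f, glob=False) for f in in_fps], how="diagonal_relaxed"
        ).sort(by=["patient_id", "time"], maintain_order=True, multithreaded=False)

    return read_fn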

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e0d1ecb and 2d1c4cf.

Files selected for processing (1)
  • src/MEDS_transforms/reshard_to_split.py (1 hunks)
Additional context used
Ruff
src/MEDS_transforms/reshard_to_split.py

120-120: Function definition does not bind loop variable in_fps

(B023)

Additional comments not posted (4)
src/MEDS_transforms/reshard_to_split.py (4)

1-18: Header and imports look good.

The shebang, docstring, and imports are clear and appropriate for the file's functionality.


90-98: Add error handling to rwlock_wrap usage.

Consider adding error handling for potential issues during file operations to ensure robustness.


119-122: Resolve the loop variable binding issue.

The static analysis tool indicates a potential issue with the function definition not binding the loop variable in_fps. Ensure that fp is correctly used within read_fn.

Tools
Ruff

120-120: Function definition does not bind loop variable in_fps

(B023)


135-150: Ensure robust cleanup logic.

The cleanup logic attempts to remove files and directories. Ensure that appropriate error handling is in place for potential issues like permission errors.

Verification successful

Ensure comprehensive error handling in cleanup logic.

The code includes a try block for handling potential issues during directory cleanup. Ensure that the corresponding except block is robust enough to handle exceptions such as permission errors or other filesystem-related issues.

  • Review the except block following the try statement to ensure comprehensive error handling.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify cleanup logic robustness by checking for error handling in file system operations.

# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --type python -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'

Length of output: 286


Script:

#!/bin/bash
# Description: Re-verify cleanup logic robustness by checking for error handling in file system operations with multiline support.

# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --type python --multiline -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'

Length of output: 137


Script:

#!/bin/bash
# Description: Re-verify cleanup logic robustness by checking for error handling in file system operations.

# Test: Search for error handling in cleanup logic. Expect: Appropriate error handling mechanisms.
rg --multiline -t py -A 5 $'try:\n    for root, dirs, files in subshard_dir.walk(top_down=False):'

Length of output: 98


Script:

#!/bin/bash
# Description: Broaden the search to identify any try-except blocks related to file system operations.

# Test: Search for try-except blocks in the specified file to ensure error handling is present.
rg --multiline -t py -A 5 'try:' src/MEDS_transforms/reshard_to_split.py

Length of output: 396
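Tying back to the cleanup-robustness point above, a defensive cleanup helper might look like the sketch below (standard library only; which exceptions to tolerate and the refusal to delete unexpected files are assumptions, not the PR's behavior).

import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


def cleanup_subshard_dir(subshard_dir: Path, expected_files: list[Path]) -> None:
    """Delete only the expected sub-shard files, then remove the directory (illustrative only)."""
    for fp in expected_files:
        try:
            fp.unlink(missing_ok=True)
        except OSError as e:  # e.g. permission errors
            logger.warning("Could not delete %s: %s", fp, e)

    leftovers = [p for p in subshard_dir.rglob("*") if p.is_file()]
    if leftovers:
        raise FileExistsError(f"Unexpected files remain in {subshard_dir}: {leftovers}")
    shutil.rmtree(subshard_dir)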

Comment on lines 24 to 98
def main(cfg: DictConfig):
"""Re-shard a MEDS cohort to in a manner that subdivides patient splits."""

stage_init(cfg)

splits_file = Path(cfg.input_dir) / "metadata" / "patient_splits.parquet"
splits_df = pl.read_parquet(splits_file, use_pyarrow=True)
splits_map = defaultdict(list)
for pt_id, sp in splits_df.iter_rows():
splits_map[sp].append(pt_id)

new_sharded_splits = shard_patients(
patients=splits_df["patient_id"].to_numpy(),
n_patients_per_shard=cfg.stage_cfg.n_patients_per_shard,
external_splits=splits_map,
split_fracs_dict=None,
seed=cfg.get("seed", 1),
)

output_dir = Path(cfg.stage_cfg.output_dir)

# Write shards to file
if "shards_map_fp" in cfg:
shards_fp = Path(cfg.shards_map_fp)
else:
shards_fp = output_dir / ".shards.json"

if shards_fp.is_file():
if cfg.do_overwrite:
logger.warning(f"Overwriting {str(shards_fp.resolve())}")
shards_fp.unlink()
else:
old_shards_map = json.loads(shards_fp.read_text())
if old_shards_map != new_sharded_splits:
raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")

shards_fp.write_text(json.dumps(new_sharded_splits))

output_dir = Path(cfg.stage_cfg.output_dir)

orig_shards_iter, include_only_train = shard_iterator(cfg, out_suffix="")
if include_only_train:
raise ValueError("This stage does not support include_only_train=True")

orig_shards_iter = [(in_fp, out_fp.relative_to(output_dir)) for in_fp, out_fp in orig_shards_iter]

new_shards = shuffle_shards(list(new_sharded_splits.keys()), cfg)
new_shards_iter = [(shard_name, output_dir / shard_name) for shard_name in new_shards]

# Step 1: Sub-sharding stage
logger.info("Starting sub-sharding")
subshard_fps = defaultdict(list)

for in_fp, orig_shard_name in orig_shards_iter:
for subshard_name, out_dir in new_shards_iter:
out_fp = out_dir / f"{orig_shard_name}.parquet"
subshard_fps[subshard_name].append(out_fp)
patients = new_sharded_splits[subshard_name]

if not patients:
raise ValueError(f"No patients found for {subshard_name}!")

logger.info(
f"Sub-sharding {str(in_fp.resolve())} to {len(patients)} patients in {str(out_fp.resolve())}"
)

rwlock_wrap(
in_fp,
out_fp,
read_and_filter_fntr(pl.col("patient_id").is_in(patients), pl.scan_parquet),
write_lazyframe,
identity_fn,
do_return=False,
do_overwrite=cfg.do_overwrite,
)

Consider modularizing the main function.

The main function is lengthy and could benefit from being broken down into smaller, more focused functions. This would improve readability and maintainability.

Comment on lines 51 to 58
if shards_fp.is_file():
if cfg.do_overwrite:
logger.warning(f"Overwriting {str(shards_fp.resolve())}")
shards_fp.unlink()
else:
old_shards_map = json.loads(shards_fp.read_text())
if old_shards_map != new_sharded_splits:
raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")

Add error handling for file existence and differences.

The current logic raises a FileExistsError if the shard map differs. Consider logging the differences for better debugging.

- raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")
+ logger.error(f"Shard map differs: {old_shards_map} vs {new_sharded_splits}")
+ raise FileExistsError(f"{str(shards_fp.resolve())} already exists and shard map differs.")

@mmcdermott
Owner Author

Not yet ready -- see #139

@mmcdermott mmcdermott closed this Aug 11, 2024
@mmcdermott mmcdermott reopened this Aug 11, 2024
@mmcdermott mmcdermott closed this Aug 11, 2024