Normalization stage is checking for aggregate_code_metadata/codes.parquet columns and metadata/codes.parquet columns in data/codes.parquet #147

Oufattole · 2024-08-11T22:35:45Z

The normalization stage is failing for me because there is no data/codes.parquet file.

When I try to copy over the metadata/codes.parquet file:
cp "${MEDS_DIR}/data/metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet"
I get an error that there is no "values/sum" column.

And when I try to copy over the aggregate_code_metadata/codes.parquet:
cp "${MEDS_DIR}/aggregate_code_metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet"
I get an error that there is no "code/vocab_index" column.
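
To see at a glance which of the two columns a given file lacks, a small check like the one below may help. It is only a sketch: `missing_columns` and `REPORTED_COLUMNS` are hypothetical names, the two column names are simply the ones from the error messages above (not a full list of what normalization requires), and the example column listings are illustrative rather than read from real files.

```python
from typing import Iterable, List

# Columns named in the two error messages above. This is NOT an exhaustive
# list of what the normalization stage requires.
REPORTED_COLUMNS = ("values/sum", "code/vocab_index")


def missing_columns(
    available: Iterable[str], required: Iterable[str] = REPORTED_COLUMNS
) -> List[str]:
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


# Illustrative column listings (assumed, not taken from real files).
metadata_cols = ["code", "code/vocab_index"]
aggregate_cols = ["code", "values/sum"]

print(missing_columns(metadata_cols))   # columns the metadata copy lacks
print(missing_columns(aggregate_cols))  # columns the aggregate copy lacks
```

With polars installed, the real column names can be obtained without loading the data, e.g. by passing the keys of `pl.read_parquet_schema(path)` as `available`.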

What worked for me as a temporary solution was to spin up a simple hydra script to generate a code/vocab_index column:

from dataclasses import dataclass
from pathlib import Path

import hydra
from hydra.core.config_store import ConfigStore
import polars as pl
from loguru import logger
from omegaconf import MISSING


@dataclass
class Config:
    # Root of the MEDS output directory; must be supplied on the command line.
    meds_dir: str = MISSING


cs = ConfigStore.instance()
# Register the Config class under the name `config`.
cs.store(name="config", node=Config)


@hydra.main(version_base=None, config_name="config")
def main(cfg: Config):
    meds_dir = Path(cfg.meds_dir)
    # Start from the aggregated metadata, which already has the values/sum column.
    df = pl.read_parquet(meds_dir / "aggregate_code_metadata/codes.parquet")
    # Add a row-number column as code/vocab_index and write it where normalization looks.
    df.with_row_index("code/vocab_index").write_parquet(meds_dir / "data/codes.parquet")
    logger.info("Done adding code/vocab_index column to codes.parquet!")


if __name__ == "__main__":
    main()

This issue exists on the dev branch and on release 0.0.4


mmcdermott · 2024-08-14T03:27:37Z

I think this line should be removed: https://github.com/mmcdermott/MEDS_transforms/blob/158_fix_typing_issue/src/MEDS_transforms/configs/stage_configs/fit_vocabulary_indices.yaml#L4

That may not be the entire problem, but I suspect it is part of it.

mmcdermott · 2024-08-14T03:29:55Z

I believe this line: https://github.com/mmcdermott/MEDS_transforms/blob/158_fix_typing_issue/src/MEDS_transforms/utils.py#L307 should point to "reducer_output_dir", not "output_dir".

mmcdermott · 2024-08-14T03:30:24Z

And clearly a multi-stage, multi-metadata stage integration test is also needed, not just singleton stage testers.

mmcdermott · 2024-08-14T12:34:54Z

Subsidiary issues:

Multi-stage integration tests for pre-processing stages in sequence should be added. #160 (for testing)
Metadata input dir may be being set improperly to the last metadata stage's output directory instead of the reducer_output_dir #161 (for reducer output directory)
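
The suspected mix-up in #161 can be illustrated with a toy resolver. This is a sketch under stated assumptions, not the MEDS_transforms implementation: the `Stage` type, both functions, and all paths are hypothetical, and only the key names "output_dir" and "reducer_output_dir" and the stage names come from this thread.

```python
from typing import List, TypedDict


class Stage(TypedDict, total=False):
    name: str
    output_dir: str          # where the stage writes its per-shard outputs
    reducer_output_dir: str  # where a metadata stage writes its reduced codes.parquet


def metadata_input_dir_suspected_bug(prior: List[Stage], default: str) -> str:
    """Suspected behavior: use the last metadata stage's output_dir."""
    return prior[-1]["output_dir"] if prior else default


def metadata_input_dir_proposed(prior: List[Stage], default: str) -> str:
    """Proposed behavior: use the most recent reducer_output_dir, if any."""
    for stage in reversed(prior):
        if "reducer_output_dir" in stage:
            return stage["reducer_output_dir"]
    return default


# Hypothetical paths, for illustration only.
stages: List[Stage] = [
    {
        "name": "aggregate_code_metadata",
        "output_dir": "out/aggregate_code_metadata",
        "reducer_output_dir": "out/metadata",
    },
    {
        "name": "fit_vocabulary_indices",
        "output_dir": "out/fit_vocabulary_indices",
        "reducer_output_dir": "out/metadata",
    },
]

print(metadata_input_dir_suspected_bug(stages, "in/metadata"))
print(metadata_input_dir_proposed(stages, "in/metadata"))
```

Under these assumptions, the first function points a later stage at a per-stage directory, while the second points it at the shared reduced metadata directory.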


mmcdermott · 2024-08-15T02:04:06Z

Fixed by #167 and verified with a full, E2E preprocess pipeline integration test.

Oufattole added the "bug" (Something isn't working) label Aug 11, 2024

mmcdermott added the "priority:critical" (A critical priority issue that should be solved and pushed to a new minor version release ASAP.) and "MEDS-Transform" (Issues for the data pre-processing transformations in MEDS_transforms) labels Aug 14, 2024

mmcdermott added the Testing label Aug 14, 2024

mmcdermott mentioned this issue Aug 14, 2024

Metadata input dir may be being set improperly to the last metadata stage's output directory instead of the reducer_output_dir #161

Closed

mmcdermott added a commit that referenced this issue Aug 14, 2024

Added a multi-stage test which currently, appropriately, fails due to #161 or #147 (138eb1e)

mmcdermott mentioned this issue Aug 14, 2024

Adds a multi-stage integration test for pre-processing. #167

Merged

4 tasks

mmcdermott linked a pull request Aug 15, 2024 that will close this issue

Adds a multi-stage integration test for pre-processing. #167

Merged

4 tasks

mmcdermott closed this as completed Aug 15, 2024
