Normalization stage is checking for aggregate_code_metadata/codes.parquet columns and metadata/codes.parquet columns in data/codes.parquet #147

Closed
Tracked by #167
Oufattole opened this issue Aug 11, 2024 · 5 comments · Fixed by #167
Labels
bug Something isn't working MEDS-Transform Issues for the data pre-processing transformations in MEDS_transforms priority:critical A critical priority issue that should be solved and pushed to a new minor version release ASAP. Testing

Comments

@Oufattole
Collaborator

The normalization stage is failing for me because there is no data/codes.parquet file.

When I try to copy over the metadata/codes.parquet file:
cp "${MEDS_DIR}/data/metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet"
I get an error that there is no "values/sum" column.

And when I try to copy over the aggregate_code_metadata/codes.parquet:
cp "${MEDS_DIR}/aggregate_code_metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet"
I get an error that there is no "code/vocab_index" column.
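A quick way to see which of the two columns a given codes.parquet is missing, before running the stage, is to inspect only its schema. This is a minimal sketch: the helper names `missing_columns` and `check_codes_file` are hypothetical, and treating the two columns named in the errors above as the full requirement is an assumption, not the stage's documented contract.

```python
# Columns named in the two error messages above. Assuming this pair is the
# complete requirement is a guess, not something the stage documents.
REQUIRED_COLUMNS = ("code/vocab_index", "values/sum")


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


def check_codes_file(path, required=REQUIRED_COLUMNS):
    """Read only the parquet schema (no row data) and report missing columns."""
    import polars as pl  # imported lazily so missing_columns has no dependency

    # read_parquet_schema returns a mapping of column name to dtype;
    # iterating over it yields the column names.
    return missing_columns(pl.read_parquet_schema(path), required)
```

For example, `check_codes_file(meds_dir / "data/codes.parquet")` would return an empty list only when both columns are present in the same file.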

What worked for me as a temporary solution was to spin up a simple hydra script to generate a code/vocab_index column:

from dataclasses import dataclass
from pathlib import Path

import hydra
import polars as pl
from hydra.core.config_store import ConfigStore
from loguru import logger
from omegaconf import MISSING


@dataclass
class Config:
    # Root of the MEDS directory; must be supplied on the command line.
    meds_dir: str = MISSING


cs = ConfigStore.instance()
# Register the Config class under the name `config` so hydra.main can find it.
cs.store(name="config", node=Config)


@hydra.main(version_base=None, config_name="config")
def main(cfg: Config):
    meds_dir = Path(cfg.meds_dir)
    df = pl.read_parquet(meds_dir / "aggregate_code_metadata/codes.parquet")
    # Add a 0-based row index as the missing code/vocab_index column.
    df.with_row_index("code/vocab_index").write_parquet(meds_dir / "data/codes.parquet")
    logger.info("Done adding code/vocab_index column to codes.parquet!")


if __name__ == "__main__":
    main()

This issue exists on the dev branch and on release 0.0.4.

@Oufattole Oufattole added the bug Something isn't working label Aug 11, 2024
@mmcdermott mmcdermott added priority:critical A critical priority issue that should be solved and pushed to a new minor version release ASAP. MEDS-Transform Issues for the data pre-processing transformations in MEDS_transforms labels Aug 14, 2024
@mmcdermott
Owner

I think this line should be removed: https://github.com/mmcdermott/MEDS_transforms/blob/158_fix_typing_issue/src/MEDS_transforms/configs/stage_configs/fit_vocabulary_indices.yaml#L4

That may not be the entire problem, but I suspect it is part of it.

@mmcdermott
Owner

I believe this line: https://github.com/mmcdermott/MEDS_transforms/blob/158_fix_typing_issue/src/MEDS_transforms/utils.py#L307 should point to "reducer_output_dir", not "output_dir".

@mmcdermott
Owner

And clearly a multi-stage, multi-metadata stage integration test is also needed, not just singleton stage testers.
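To illustrate the kind of test meant here, the sketch below chains two metadata stages through files on disk and asserts that the final output still carries the columns produced by both. Everything in it is hypothetical: `aggregate_stage`, `vocab_stage`, and `run_pipeline` are toy stand-ins that use JSON files, not the MEDS_transforms stages or the test added in #167.

```python
import json
import tempfile
from pathlib import Path


def aggregate_stage(input_file, output_file):
    """Hypothetical metadata stage: compute a per-code sum of values."""
    events = json.loads(Path(input_file).read_text())
    sums = {}
    for event in events:
        sums[event["code"]] = sums.get(event["code"], 0) + event["value"]
    rows = [{"code": code, "values/sum": total} for code, total in sorted(sums.items())]
    Path(output_file).write_text(json.dumps(rows))


def vocab_stage(input_file, output_file):
    """Hypothetical metadata stage: add an index while keeping earlier columns."""
    rows = json.loads(Path(input_file).read_text())
    indexed = [{**row, "code/vocab_index": i} for i, row in enumerate(rows)]
    Path(output_file).write_text(json.dumps(indexed))


def run_pipeline(events, stages, workdir):
    """Run stages in order, feeding each one the previous stage's output file."""
    current = Path(workdir) / "input.json"
    current.write_text(json.dumps(events))
    for number, stage in enumerate(stages):
        output = Path(workdir) / f"stage_{number}.json"
        stage(current, output)
        current = output
    return json.loads(current.read_text())


def test_metadata_columns_survive_all_stages():
    events = [
        {"code": "HR", "value": 80},
        {"code": "HR", "value": 90},
        {"code": "TEMP", "value": 37},
    ]
    with tempfile.TemporaryDirectory() as workdir:
        final = run_pipeline(events, [aggregate_stage, vocab_stage], workdir)
    # A singleton-stage test would not catch a later stage dropping or
    # misplacing columns written by an earlier one; this check does.
    for row in final:
        assert {"code", "values/sum", "code/vocab_index"} <= set(row)
```

A test of this shape only passes when each stage reads from the location the previous stage actually wrote to, which is the class of bug described in this issue.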

@mmcdermott
Owner

Fixed by #167 and verified with a full, E2E preprocess pipeline integration test.
