Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Seeding] Create dummy csv generator #341

Merged
merged 41 commits into from
Nov 9, 2024
Merged

Conversation

anchit-chandran
Copy link
Contributor

@anchit-chandran anchit-chandran commented Oct 29, 2024

Signed-off-by: anchit-chandran anchit97123@gmail.com

Overview

Adds create_csv manage.py cmd to create csv of dummy pts, saving in a folder whose contents are gitignored (only takes <5secs to generate >10k rows.

Docstrings have documentation, snippet:

Example use:

    python manage.py create_csv \
        --pts=5 \
        --visits="CDCD DHPC ACDC CDCD" \
        --hb_target=T
        
    Will generate a csv file with 5 patients, each with 12 visits, with the visit encoding provided.
    The HbA1c target range for each visit will be set to 'TARGET'. 
    The resulting csv will have 5 * 12 = 60 rows (one for each visit).

Code changes

  • Adds create_csv file

Documentation changes (done or required as a result of this PR)

In docstrings

Related Issues

https://github.com/orgs/rcpch/projects/13/views/1?pane=issue&itemId=85172188&issue=rcpch%7Cnational-paediatric-diabetes-audit%7C328

Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
@anchit-chandran anchit-chandran marked this pull request as ready for review October 29, 2024 19:13
@eatyourpeas
Copy link
Member

This all looks good. A few small suggestions.

For the seed options, I think possibly the -T need not be mandatory and have a randomised option that mixes the target types, that might accept a percentage parameter (ie 50 would be 50% of the patients generated would have A or W). In a similar way other optional parameters might be age range and PDU, perhaps sex and diabetes type. For the visits, might be good to prescribe whether a visit does or does not have errors, possibly does not need to prescribe what the errors are - but that may be too hard.

My guess is when we ask people to test, they will want to upload spreadsheets to try it out. We should give them a choice of different ones that we pregenerate, some with errors and some without, and my guess is the parameters above will be the kind of thing they are interested in. So for example, we might generate a csv with 150 patients, 50% boys/50% girls, where 90% have T1DM, 9% have T2DM and the rest have a mix of MODY/CFRD with a range of 10-15 visits each. Of the 150, 15% might have above target HbA1c, 5% well above, the rest in range. In the visits, we could say 20% have an error of some kind, so that they could then correct the error in the csv and reupload and see that the submission is overwritten and the error is corrected.

I hope that makes sense.

Also I am finding that when I try and upload the spreadsheet generated i get a bunch of parse errors - these may all be related to the heading issues with extra spaces etc that @dc2007git is working on at the moment.

@anchit-chandran
Copy link
Contributor Author

anchit-chandran commented Oct 30, 2024

For the seed options, I think possibly the -T need not be mandatory and have a randomised option that mixes the target types, that might accept a percentage parameter (ie 50 would be 50% of the patients generated would have A or W). In a similar way other optional parameters might be age range and PDU, perhaps sex and diabetes type.

definitely agreed! That's WIP: #329

For the visits, might be good to prescribe whether a visit does or does not have errors, possibly does not need to prescribe what the errors are - but that may be too hard.

Currently no way to specify errors through generator (all visittypes result in valid measures). Woud you be able to advise classic errors values for current VisitTypes @eatyourpeas .

I'll update VisitType enum to include eg VisitType.ClinicError, and can update the _get_measures_for_visit_type fn so we return those error values in for the Visit model. Collated:

Gather all the measures for a clinic visit. These include
        - height
        - weight
        - HbA1c
        - treatment
        - CGM
        - BP
Gather all the measures for an annual review visit.
        These dictate the measures that are taken for an annual review visit:
            - "foot_examination_observation_date"
            - "retinal_screening_result"
            - "retinal_screening_observation_date"
            - "albumin_creatinine_ratio"
            - "albumin_creatinine_ratio_date"
            - "albuminuria_stage"
            - "total_cholesterol"
            - "total_cholesterol_date"
            - "thyroid_function_date"
            - "thyroid_treatment_status"
            - "coeliac_screen_date"
            - "gluten_free_diet"
            - "smoking_status"
            - "smoking_cessation_referral_date"
            - "carbohydrate_counting_level_three_education_date"
            - "flu_immunisation_recommended_date"
            - "ketone_meter_training"
            - "sick_day_rules_training_date"
Generates random dietician observations for a child.
        Allocate the visit date to the date of the observation.

        Returns dict of:
            dietician_additional_appointment_offered: int
            dietician_additional_appointment_date: date
Generates random psychological screening observations for a child.
        Allocate the visit date to the date of the observation.


            psychological_screening_assessment_date: date
            psychological_additional_support_status: int
Generates random hospital admission observations for a child.
        Allocate the visit date to the date of the observation.

 "hospital_admission_date": hospital_admission_date,
            "hospital_discharge_date": hospital_discharge_date,
            "hospital_admission_reason": hospital_admission_reason,
            "dka_additional_therapies": dka_additional_therapies,
            "hospital_admission_other": hospital_admission_other,

@anchit-chandran
Copy link
Contributor Author

Also I am finding that when I try and upload the spreadsheet generated i get a bunch of parse errors - these may all be related to the heading issues with extra spaces etc that @dc2007git is working on at the moment.

What do errors look like? I'm renaming columns to have those weird extra spaces here https://github.com/rcpch/national-paediatric-diabetes-audit/pull/341/files#diff-ae3c6e15c8db6e9b82c8e395e9afb5b7ce7c4ff9878b6567caf41a222b1b2949R248

@eatyourpeas
Copy link
Member

The first error I got was a typo in the clean method in the visit_form which i have fixed above

The new error looks to be getting a True when it is expecting an int

    return int(value)
django-1   |            ^^^^^^^^^^
django-1   | ValueError: invalid literal for int() with base 10: 'True'
django-1   | 
django-1   | During handling of the above exception, another exception occurred:
django-1   | 
django-1   | Traceback (most recent call last):
django-1   |   File "/app/project/npda/views/home.py", line 64, in home
django-1   |     csv_upload(
django-1   |   File "/app/project/npda/general_functions/csv_upload.py", line 293, in csv_upload
django-1   |     (patient_form, transfer_fields, visits) = validate_rows(rows)
django-1   |                                               ^^^^^^^^^^^^^^^^^^^
django-1   |   File "/app/project/npda/general_functions/csv_upload.py", line 240, in validate_rows
django-1   |     visits = rows.apply(
django-1   |              ^^^^^^^^^^^
django-1   |   File "/usr/local/lib/python3.12/site-packages/pandas/core/frame.py", line 10374, in apply
django-1   |     return op.apply().__finalize__(self, method="apply")
django-1   |            ^^^^^^^^^^
django-1   |   File "/usr/local/lib/python3.12/site-packages/pandas/core/apply.py", line 916, in apply
django-1   |     return self.apply_standard()
django-1   |            ^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/usr/local/lib/python3.12/site-packages/pandas/core/apply.py", line 1063, in apply_standard
django-1   |     results, res_index = self.apply_series_generator()
django-1   |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/usr/local/lib/python3.12/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
django-1   |     results[i] = self.func(v, *self.args, **self.kwargs)
django-1   |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/app/project/npda/general_functions/csv_upload.py", line 241, in <lambda>
django-1   |     lambda row: validate_visit_using_form(patient_form.instance, row),
django-1   |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/app/project/npda/general_functions/csv_upload.py", line 179, in validate_visit_using_form
django-1   |     fields = row_to_dict(
django-1   |              ^^^^^^^^^^^^
django-1   |   File "/app/project/npda/general_functions/csv_upload.py", line 139, in row_to_dict
django-1   |     model_field: csv_value_to_model_value(
django-1   |                  ^^^^^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/app/project/npda/general_functions/csv_upload.py", line 133, in csv_value_to_model_value
django-1   |     return model_field.to_python(value)
django-1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/usr/local/lib/python3.12/site-packages/django/db/models/fields/__init__.py", line 2142, in to_python
django-1   |     raise exceptions.ValidationError(
django-1   | django.core.exceptions.ValidationError: ['“True” value must be an integer.']
django-1   | 
django-1   | During handling of the above exception, another exception occurred:
django-1   | 
django-1   | Traceback (most recent call last):
django-1   |   File "/usr/local/lib/python3.12/site-packages/django/core/handlers/exception.py", line 55, in inner
django-1   |     response = get_response(request)
django-1   |                ^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/usr/local/lib/python3.12/site-packages/django/core/handlers/base.py", line 197, in _get_response
django-1   |     response = wrapped_callback(request, *callback_args, **callback_kwargs)
django-1   |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/app/project/npda/views/decorators.py", line 33, in wrapper
django-1   |     return view(request, *args, **kwargs)
django-1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django-1   |   File "/app/project/npda/views/home.py", line 85, in home
django-1   |     errors = error_list(error)
django-1   |              ^^^^^^^^^^^^^^^^^
django-1   |   File "/app/project/npda/views/home.py", line 33, in error_list
django-1   |     for field, errors in wrapper_error.error_dict.items():
django-1   |                          ^^^^^^^^^^^^^^^^^^^^^^^^
django-1   | AttributeError: 'ValidationError' object has no attribute 'error_dict'. Did you mean: 'error_list'?
django-1   | ERROR [django.server] "POST /home HTTP/1.1" 500 179886

@mbarton
Copy link
Member

mbarton commented Oct 31, 2024

I'll take a look at that second exception, seems like it's barfing trying to return unexpected errors. Raised #345

@mbarton
Copy link
Member

mbarton commented Oct 31, 2024

After this PR you should see what field and row are causing the issue: #346

Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
…rse_dates': 'Observation Date: Thyroid Function'`)

Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
Signed-off-by: anchit-chandran <anchit97123@gmail.com>
@mbarton
Copy link
Member

mbarton commented Nov 8, 2024

I tried this out again against live and got another crash, unrelated to the generator. Fixing here #357

@mbarton
Copy link
Member

mbarton commented Nov 8, 2024

The first error I got was a typo in the clean method in the visit_form which i have fixed above

@eatyourpeas I've pulled that commit out to its own PR and added a unit test #360

Copy link
Member

@mbarton mbarton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK now #360 is merged I'm able to upload a CSV file generated from this PR:

docker compose exec django python manage.py create_csv --pts 2 --visits "CDCD DHPC ACDC CDCD" --hb_target A

I get some validation errors.

Screenshot from 2024-11-08 15-50-49

They might be incorrect or rules we want to remove but let's address that in future PRs off live as this one has got quite big.

@anchit-chandran happy to merge?

@mbarton
Copy link
Member

mbarton commented Nov 8, 2024

The new error looks to be getting a True when it is expecting an int

Yes I think the values for closed_loop_system are sometimes coming out as True or False rather than the appropriate numbers (I guess related to #328?).

I've raised #362 which gives you back the column name and row index which makes debugging much easier

@eatyourpeas
Copy link
Member

I have hopefully fixed the datatypes clashing that we have had by defining explicitly what datatypes should be for each column in the Pandas read_csv method.
I have somehow introduced some broken tests in a previous PR which is now in live somehow so will fix here and then hopefully merge

Copy link
Member

@eatyourpeas eatyourpeas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is now passing all tests and it is possible to upload a csv generated by the generator. I have done this (hope this is ok Anchit) but introducing so dtype constraints at the time of dataframe creation. I also discovered that the patient_generator_extended was randomizing booleans for closed_loop rather than the constants so that was generating an error on import. I am finding that the generator does not seem to be equally randomizing sexes but returning lots of Nones or unknowns.
Finally, none of the visit data seems to be saving but no errors are being raised.
This PR has become kinda complicated now so I am going to approve this and raise 2 new issues:

  1. randomization of sex in generator - hopefully a simple fix
  2. visit data not saving on upload

@eatyourpeas eatyourpeas merged commit 99140d2 into live Nov 9, 2024
1 check passed
@eatyourpeas eatyourpeas deleted the anchit/kpis/single-pt branch November 9, 2024 15:05
@mbarton
Copy link
Member

mbarton commented Nov 9, 2024

Seen on STAGING (created by @anchit-chandran and merged by @eatyourpeas 5 minutes and 39 seconds ago) Please check your changes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants