Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: use determinisitic provider id #81

Merged
merged 3 commits into from
Oct 31, 2024

Conversation

lawrenceadams
Copy link
Collaborator

This PR cleans up code imported from upstream OHDSI/SyntheaETL SQL code, where models can produce non-deterministic output for location and provider tables.

For both tables, output is now ordered by:

  • state
  • city
  • zip
  • patient/provider id

Perhaps this is too complicated, open to other ideas! I think this is somewhat more logical than doing: city > state > zip > _id, but may not be correct to US-centric norms

 - upstream ohdsi example can have non-determinisitic cases: use `patient_id` to make reproducible
 - order by state > city > zip > id (I think most logical form)
@lawrenceadams lawrenceadams linked an issue Oct 8, 2024 that may be closed by this pull request
models/omop/location.sql Outdated Show resolved Hide resolved
@@ -1,5 +1,5 @@
SELECT
row_number() OVER (ORDER BY (SELECT null)) AS provider_id
row_number() OVER (ORDER BY provider_state, provider_city, provider_zip, provider_id) AS provider_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we just use provider_id? it should be the PK from the source Synthea table right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we could ~ I was going for something that would order somewhat logically (rather than UUIDv4s), but that's again not necessary

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh i see! just for readability/neatness. i'm fine to keep it this way then :)

@lawrenceadams
Copy link
Collaborator Author

I'll cleanup this branch when #83 is determined on this best approach :)

@lawrenceadams lawrenceadams marked this pull request as draft October 23, 2024 21:39
@lawrenceadams lawrenceadams changed the title refactor: use determinisitic provider and location id refactor: use determinisitic provider id Oct 31, 2024
@lawrenceadams lawrenceadams marked this pull request as ready for review October 31, 2024 07:31
@lawrenceadams
Copy link
Collaborator Author

As a result of #83 this only affects provider now - I think ORDER BY provider_state, provider_city, provider_zip, provider_id might be a bit overkill - maybe no provider_id is needed? But in reality it doesn't matter too much and we need to sort by something...

Copy link
Collaborator

@katy-sadowski katy-sadowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree! Let's go with this for now!

@lawrenceadams lawrenceadams merged commit 1c148f2 into main Oct 31, 2024
@lawrenceadams lawrenceadams deleted the 77-non-deterministic-use-of-row_number branch October 31, 2024 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-deterministic use of row_number()
2 participants