
Source Data Checks: Refactor column class for ingest & distribution #1281

Merged: 10 commits into main from sf-refactor-column-obj-for-ingest-dist on Nov 27, 2024

Conversation

@sf-dcp (Contributor) commented Nov 25, 2024

Small PR to refactor existing Column objects in ingest & distribution code into one.

Why?

For our own sanity when implementing data checks on source & product data.

Easiest to go commit by commit.

Major changes:

  1. Create base models.dataset.Column class.
    • Note: I added an is_required attribute, which will be used in validation.
  2. Revise old ingest and metadata Column objects to inherit from the new class
    • For metadata, remove a code snippet used in creating raw metadata templates.
  3. Move metadata_v2.Checks class to models.dataset
  4. Appease mypy
    • Fix inconsistencies in the allowed values for data_type.

Data checks will be implemented in the next PR.
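For context, a rough sketch of what the shared base class could look like (hypothetical: field names come from this description and the diff excerpts below; the allowed type values and the Checks fields are illustrative placeholders, not the actual dcpy code):

from typing import Literal

from pydantic import BaseModel

# Stand-in for the real alias; the actual allowed values live in dcpy and may differ.
COLUMN_TYPES = Literal["text", "integer", "decimal", "date", "datetime", "geometry", "bool"]


class Checks(BaseModel):
    # Placeholder for the Checks object moved over from metadata_v2; details TBD in the next PR.
    is_primary_key: bool | None = None
    non_nullable: bool | None = None


class Column(BaseModel):
    # Shared base for ingest & distribution columns (models.dataset.Column).
    id: str
    data_type: COLUMN_TYPES | None = None
    description: str | None = None
    is_required: bool = True  # new attribute, to be used in validation
    checks: Checks | None = None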

@sf-dcp force-pushed the sf-refactor-column-obj-for-ingest-dist branch 13 times, most recently from a31739b to 3ff8c50 on November 26, 2024 at 22:57
@sf-dcp force-pushed the sf-refactor-column-obj-for-ingest-dist branch from 3ff8c50 to 724704d on November 26, 2024 at 23:11
@sf-dcp (Contributor, Author) commented Nov 26, 2024

There is one failing pytest that I'm not sure how to address -- would love some advice.

In the dcpy.connectors.socrata.metadata module, we have a function (make_dcp_col) that generates metadata templates from the Socrata API with column data_type set to <FILL ME IN>. The test fails because <FILL ME IN> isn't a valid data_type value.

One quick solution is to add the placeholder value to the list of allowed values for data_type.

This is probably not something we will be using in the future; however, it could be useful for other agencies... But is it worth revising the validator to accommodate the edge case?

@sf-dcp marked this pull request as ready for review on November 26, 2024 at 23:23
@fvankrieken (Contributor) commented:
> There is one failing pytest that I'm not sure how to address -- would love some advice. [...] But is it worth revising the validator to accommodate the edge case?

Is there a way to avoid running validators at runtime in pydantic? Pass an extra arg to the constructor, or something along those lines?
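(For reference, one way pydantic v2 supports this is a validation context passed to model_validate, which a field_validator can consult. This is a sketch under the assumption that data_type is checked by a field validator; the names are illustrative, not the actual dcpy code:)

from pydantic import BaseModel, ValidationInfo, field_validator

ALLOWED_TYPES = {"text", "integer", "decimal"}  # stand-in for the real allowed values


class DatasetColumn(BaseModel):
    id: str
    data_type: str | None = None

    @field_validator("data_type")
    @classmethod
    def check_data_type(cls, v: str | None, info: ValidationInfo) -> str | None:
        # Skip the check when the caller explicitly opts out via context.
        if info.context and info.context.get("skip_data_type_check"):
            return v
        if v is not None and v not in ALLOWED_TYPES:
            raise ValueError(f"{v!r} is not a valid data_type")
        return v


# Normal construction still validates; model_validate can opt out per call:
col = DatasetColumn.model_validate(
    {"id": "bbl", "data_type": "<FILL ME IN>"},
    context={"skip_data_type_check": True},
)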

data_type: COLUMN_TYPES | None = None
description: str | None = None
is_required: bool = True
checks: Checks | None = None
Contributor:
Gut impression here - seems that checks should maybe be a list? Where checks might then take a form like:

- name (or id): {name}
  parameter: value

Or something like that. Could just be a list of strings too. Will think more (and look at our old discussions) and revisit tomorrow
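(A minimal sketch of that list shape, purely illustrative -- the names and parameter structure are not settled:)

from pydantic import BaseModel


class Check(BaseModel):
    name: str  # or "id"
    parameters: dict[str, str] = {}  # e.g. {"min": "0"}


class Column(BaseModel):
    id: str
    checks: list[Check] | None = None  # could also simply be list[str]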

Contributor (Author):
Not sure yet; I'm planning to focus on Checks in the next PR. In this PR, I just moved the existing Checks obj that Alex implemented to this file. If I don't do that, I run into circular imports between dataset.py & metadata_v2.py.

Contributor:
Yeah I think that makes sense to keep for later, since we're not actually attempting to implement checks yet.

Contributor:
def agree with an eventual list of checks

(
    c.override(column_defaults[c.id, c.data_type])
    if c.data_type and (c.id, c.data_type) in column_defaults
    else c
)
Contributor:
tiny nit, sorry - ✂️ these parentheses?

@@ -54,7 +54,7 @@ def make_dcp_col(c: pub.Socrata.Responses.Column) -> md.DatasetColumn:
     dcp_col["values"] = [
         {"value": s["item"], "description": FILL_ME_IN_PLACEHOLDER} for s in samples
     ]
-    md.DatasetColumn._validate_data_type = False
+    # md.DatasetColumn._validate_data_type = False  # legacy attribute used during migration, no longer there
Contributor:
Ah - this is why the error is popping up now?

Contributor:
Given that this is code to create a placeholder, it seems fine to leave, no? And it resolves our issue.

Or maybe it could even go on line 58? Not sure how this arg works, but maybe

md.DatasetColumn(**dcp_col, _validate_data_type=False)

would work too?

Contributor:
And per the commit message/comment - I think this is explicitly for the purpose of avoiding this error; I don't think it has anything to do with the migration from v1 to v2.

Contributor (Author):
Yep, this is the reason for the error. If I revert the code back, it should solve the error. Counter question: would we want this _validate_data_type attribute in the parent Column obj or in metadata.Column?

Would love to get @alexrichey's thoughts on this too.

Contributor:
I don't think we would. We only want to skip validation in this one case, when we're generating templates with placeholders. Other than that, we should always validate at runtime on instantiation.

@sf-dcp (Contributor, Author) commented Nov 27, 2024:
let me try the placeholder + the solution Finn referenced... will report back

Contributor (Author):

Tried the construct fn. It's now deprecated, and the new method is model_construct().

Sooo it seems like this method does no validation on input attributes & their values. For example, you can instantiate a model without any keys, or with random keys, like below:

[screenshot: model_construct() accepting missing and arbitrary keys without error]

Using this seems more problematic because we can potentially generate templates with invalid keys... @fvankrieken, I think we should go with the text field
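(Roughly what the screenshot demonstrates -- a minimal sketch, assuming pydantic v2; the model and field names are made up:)

from pydantic import BaseModel


class DatasetColumn(BaseModel):
    id: str
    data_type: str


# model_construct() skips validation entirely, so neither of these raises,
# even though required fields are missing or a key is unrecognized:
empty = DatasetColumn.model_construct()
weird = DatasetColumn.model_construct(not_a_real_key="oops")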

Contributor (Author):

Actually, wait. We won't generate a template with invalid keys like I said. But the risk is generating templates without required fields

Contributor:

Hmm. At least having keys validated would be nice.

I'm not horribly concerned though - this is a util that's designed to create an invalid model, and this seems like a one-liner command to run it without error. Tests can make sure that structure is as expected. In terms of how a user would use it, this seems like less of a source of error to me than autofilling data_types that users might then forget to update. And this is slightly better than say just a dict in that it still shows developers what is expected.

Contributor (Author):

Makes sense. Added an assert statement for key validation!

@fvankrieken (Contributor) commented:
Couple little notes but this seems very sensible

@sf-dcp (Contributor, Author) commented Nov 27, 2024

> Is there a way to avoid running validators at runtime in pydantic? Pass an extra arg to the constructor, or something along those lines?

Need to look into this. Though mypy may still complain
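(A hypothetical illustration of the mypy point: if COLUMN_TYPES is a Literal-style alias, the placeholder string gets flagged statically even when runtime validation is skipped:)

from typing import Literal

COLUMN_TYPES = Literal["text", "integer", "decimal"]  # stand-in for the real alias


def set_data_type(data_type: COLUMN_TYPES) -> None:
    ...


set_data_type("<FILL ME IN>")  # mypy error: incompatible argument type for a Literal parameter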

@alexrichey (Contributor) left a comment:

LGTM! though did you mean to delete lpc_scenic_landmarks.yml just now?

@sf-dcp (Contributor, Author) commented Nov 27, 2024

> LGTM! though did you mean to delete lpc_scenic_landmarks.yml just now?

🤦‍♀️ didn't commit this before switching to the current branch. ty!

@sf-dcp force-pushed the sf-refactor-column-obj-for-ingest-dist branch from 824abc6 to fde078e on November 27, 2024 at 16:51
codecov bot commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.41%. Comparing base (4a18ea9) to head (cd35d80).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1281   +/-   ##
=======================================
  Coverage   69.41%   69.41%           
=======================================
  Files         111      112    +1     
  Lines        5935     5935           
  Branches      661      660    -1     
=======================================
  Hits         4120     4120           
  Misses       1683     1683           
  Partials      132      132           


@sf-dcp merged commit 3cb4a39 into main on Nov 27, 2024
20 checks passed
@sf-dcp deleted the sf-refactor-column-obj-for-ingest-dist branch on November 27, 2024 at 18:01

# model_construct() method doesn't perform validation on keys, need this sanity check here
# instance keys == column model keys below:
assert (
Contributor:
Love it
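(The diff excerpt above is truncated at the assert; a plausible shape for that check, purely a guess based on the comment about instance keys vs. column model keys -- not the actual code from the PR:)

# Hypothetical reconstruction, not the actual assert from the PR.
# model_construct() doesn't validate keys, so compare them explicitly:
assert set(dcp_col.keys()) == set(md.DatasetColumn.model_fields.keys()), (
    f"instance keys don't match column model keys: {set(dcp_col) ^ set(md.DatasetColumn.model_fields)}"
)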
