-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Source Data Checks: Refactor column class for ingest & distribution #1281
Conversation
a31739b
to
3ff8c50
Compare
* Use new Column obj in ingest and distribution metadata
3ff8c50
to
724704d
Compare
There is one failing pytest that I'm not sure how to address -- would love an advice. In One quick solution is to add value to the list of allowed values for This is probably not something we will be using in the future; however, it could be useful for other agencies... But is it worth revising the validator to accommodate the edge case? |
Is there a way to avoid running validators at runtime in pydantic? Pass an extra arg to the constructor or something along those lines |
data_type: COLUMN_TYPES | None = None | ||
description: str | None = None | ||
is_required: bool = True | ||
checks: Checks | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gut impression here - seems that checks
should maybe be a list? Where then checks might take form
- name (or id): {name}
parameter: value
Or something like that. Could just be a list of strings too. Will think more (and look at our old discussions) and revisit tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure yet, I'm planning to focus on Checks in next PR. In this PR, I just moved the existing Checks obj Alex implemented to this file. If I don't do that, I run into an error where imports between dataset.py
& metadata_v2.py
are circular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that makes sense to keep for later, since we're not actually attempting to implement checks yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def agree with an eventual list of checks
c.override(column_defaults[c.id, c.data_type]) | ||
if c.data_type and (c.id, c.data_type) in column_defaults | ||
else c | ||
( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny nit, sorry - ✂️ these parentheses?
dcpy/connectors/socrata/metadata.py
Outdated
@@ -54,7 +54,7 @@ def make_dcp_col(c: pub.Socrata.Responses.Column) -> md.DatasetColumn: | |||
dcp_col["values"] = [ | |||
{"value": s["item"], "description": FILL_ME_IN_PLACEHOLDER} for s in samples | |||
] | |||
md.DatasetColumn._validate_data_type = False | |||
# md.DatasetColumn._validate_data_type = False # legacy attribute used during migration, no longer there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - this is why the error is popping up now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is code to create a placeholder, it seems fine to leave, no? And resolves our issue.
Or maybe it could even go in line 58? Not sure how this arg works. but maybe
md.DatasetColumn(**dcp_col, _validate_data_type = False)
would work too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And per the commit message/comment - this I think is explicitly for the purpose of avoiding this error - I don't think it has to do with anything from migration from v1 to v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is the reason for the error. If I revert the code back, it should solve the error. Counter q: would we won't this attribute _validate_data_type
in the parent Column
obj or metadata.Column
?
Would love to get @alexrichey 's thoughts too on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we would. We only want to not validate in this case, when we're generating it with placeholders. Other than that, we should always validate at runtime on instantiation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me try the placeholder + the solution Finn referenced... will report back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried the construct
fn. It's now deprecated and the new method is model_construct()
.
Sooo it seems like this method does no validation on input attributes & its values. For example, you can instantiate a model without any keys or random keys like below:
Using this seems more problematic because we can potentially generate templates with invalid keys... @fvankrieken, I think we should go with the text field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, wait. We won't generate a template with invalid keys like I said. But the risk is generating templates without required fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. At least having keys validated would be nice.
I'm not horribly concerned though - this is a util that's designed to create an invalid model, and this seems like a one-liner command to run it without error. Tests can make sure that structure is as expected. In terms of how a user would use it, this seems like less of a source of error to me than autofilling data_types that users might then forget to update. And this is slightly better than say just a dict in that it still shows developers what is expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Added an assert statement for key validation!
Couple little notes but this seems very sensible |
Need to look into this. Though |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! though did you mean to delete lpc_scenic_landmarks.yml
just now?
🤦♀️ didn't commit this before switching to the current branch. ty! |
824abc6
to
fde078e
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1281 +/- ##
=======================================
Coverage 69.41% 69.41%
=======================================
Files 111 112 +1
Lines 5935 5935
Branches 661 660 -1
=======================================
Hits 4120 4120
Misses 1683 1683
Partials 132 132 ☔ View full report in Codecov by Sentry. |
|
||
# model_construct() method doesn't perform validation on keys, need this sanity check here | ||
# instance keys == column model keys below: | ||
assert ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it
SmallPR to refactor existing Column objects in ingest & distribution code into one.Why?
For our own sanity when implementing data checks on source & product data.
Easiest to go commit by commit.
Major changes:
models.dataset.Column
class.is_required
attribute which will be used in validation.Column
objects to inherit from the new classmetadata_v2.Checks
class tomodels.dataset
data_type
allowed valuesData checks will be implemented in next PR.