-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Source Data Checks: Refactor column class for ingest & distribution #1281
Changes from 5 commits
457841b
c91d7c1
3446b58
2580855
724704d
1d5a270
fde078e
6811321
8734140
cd35d80
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
from dcpy.models.base import SortedSerializedBase | ||
from typing import Literal | ||
|
||
COLUMN_TYPES = Literal[ | ||
"text", | ||
"integer", | ||
"decimal", | ||
"number", # TODO: Need to delete. Keeping it now for compatibility with metadata files | ||
"geometry", | ||
"bool", | ||
"bbl", | ||
"date", | ||
"datetime", | ||
] | ||
|
||
|
||
# TODO: extend/modify Checks model | ||
class Checks(SortedSerializedBase): | ||
is_primary_key: bool | None = None | ||
non_nullable: bool | None = None | ||
|
||
|
||
class Column(SortedSerializedBase, extra="forbid"): | ||
""" | ||
An extensible base class for defining column metadata in ingest and product templates. | ||
""" | ||
|
||
id: str | ||
data_type: COLUMN_TYPES | None = None | ||
description: str | None = None | ||
is_required: bool = True | ||
checks: Checks | None = None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Gut impression here - seems that
Or something like that. Could just be a list of strings too. Will think more (and look at our old discussions) and revisit tomorrow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure yet, I'm planning to focus on Checks in next PR. In this PR, I just moved the existing Checks obj Alex implemented to this file. If I don't do that, I run into an error where imports between There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I think that makes sense to keep for later, since we're not actually attempting to implement checks yet. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. def agree with an eventual list of checks |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,12 @@ | ||
from __future__ import annotations | ||
|
||
from pydantic import field_validator | ||
from pydantic import BaseModel | ||
from typing import Any, List, Literal, get_args | ||
from typing import Any, List | ||
import unicodedata | ||
|
||
from dcpy.utils.collections import deep_merge_dict as merge | ||
from dcpy.models.base import SortedSerializedBase, YamlWriter, TemplatedYamlReader | ||
from dcpy.models.dataset import Column, COLUMN_TYPES | ||
|
||
ERROR_MISSING_COLUMN = "MISSING COLUMN" | ||
|
||
|
@@ -48,50 +48,26 @@ class CustomizableBase(SortedSerializedBase, extra="forbid"): | |
|
||
|
||
# COLUMNS | ||
# TODO: move to share with ingest.validate | ||
class Checks(CustomizableBase): | ||
is_primary_key: bool | None = None | ||
non_nullable: bool | None = None | ||
|
||
|
||
# TODO: move to share with ingest.validate | ||
COLUMN_TYPES = Literal[ | ||
"text", "number", "integer", "decimal", "geometry", "bool", "bbl", "datetime" | ||
] | ||
|
||
|
||
class ColumnValue(CustomizableBase): | ||
_head_sort_order = ["value", "description"] | ||
|
||
value: str | ||
description: str | None = None | ||
|
||
|
||
class DatasetColumn(CustomizableBase): | ||
class DatasetColumn(Column): | ||
_head_sort_order = ["id", "name", "data_type", "description"] | ||
_tail_sort_order = ["example", "values", "custom"] | ||
_validate_data_type = ( | ||
True # override, to generate md where we don't know the data_type | ||
) | ||
|
||
# Note: id isn't intended to be overrideable, but is always required as a | ||
# pointer back to the original column, so it is required here. | ||
id: str | ||
# pointer back to the original column. | ||
name: str | None = None | ||
data_type: str | None = None | ||
data_source: str | None = None | ||
description: str | None = None | ||
notes: str | None = None | ||
example: str | None = None | ||
checks: Checks | None = None | ||
deprecated: bool | None = None | ||
values: list[ColumnValue] | None = None | ||
|
||
@field_validator("data_type") | ||
def _validate_colum_types(cls, v): | ||
if cls._validate_data_type: | ||
sf-dcp marked this conversation as resolved.
Show resolved
Hide resolved
|
||
assert v in get_args(COLUMN_TYPES) | ||
return v | ||
custom: dict[str, Any] = {} | ||
|
||
def override(self, overrides: DatasetColumn) -> DatasetColumn: | ||
return DatasetColumn(**merge(self.model_dump(), overrides.model_dump())) | ||
|
@@ -374,11 +350,13 @@ def validate_consistency(self): | |
return errors | ||
|
||
def apply_column_defaults( | ||
self, column_defaults: dict[tuple[str, str], DatasetColumn] | ||
self, column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn] | ||
) -> list[DatasetColumn]: | ||
return [ | ||
c.override(column_defaults[c.id, c.data_type]) | ||
if c.data_type and (c.id, c.data_type) in column_defaults | ||
else c | ||
( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. tiny nit, sorry - ✂️ these parentheses? |
||
c.override(column_defaults[c.id, c.data_type]) | ||
if c.data_type and (c.id, c.data_type) in column_defaults | ||
else c | ||
) | ||
for c in self.columns | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - this is why the error is popping up now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is code to create a placeholder, it seems fine to leave, no? And resolves our issue.
Or maybe it could even go in line 58? Not sure how this arg works. but maybe
would work too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And per the commit message/comment - this I think is explicitly for the purpose of avoiding this error - I don't think it has to do with anything from migration from v1 to v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is the reason for the error. If I revert the code back, it should solve the error. Counter q: would we won't this attribute
_validate_data_type
in the parentColumn
obj ormetadata.Column
?Would love to get @alexrichey 's thoughts too on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we would. We only want to not validate in this case, when we're generating it with placeholders. Other than that, we should always validate at runtime on instantiation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me try the placeholder + the solution Finn referenced... will report back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried the
construct
fn. It's now deprecated and the new method ismodel_construct()
.Sooo it seems like this method does no validation on input attributes & its values. For example, you can instantiate a model without any keys or random keys like below:
Using this seems more problematic because we can potentially generate templates with invalid keys... @fvankrieken, I think we should go with the text field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, wait. We won't generate a template with invalid keys like I said. But the risk is generating templates without required fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. At least having keys validated would be nice.
I'm not horribly concerned though - this is a util that's designed to create an invalid model, and this seems like a one-liner command to run it without error. Tests can make sure that structure is as expected. In terms of how a user would use it, this seems like less of a source of error to me than autofilling data_types that users might then forget to update. And this is slightly better than say just a dict in that it still shows developers what is expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Added an assert statement for key validation!