Source Data Checks: Refactor column class for ingest & distribution #1281

Merged: 10 commits, Nov 27, 2024
2 changes: 1 addition & 1 deletion dcpy/connectors/socrata/metadata.py
@@ -54,7 +54,7 @@ def make_dcp_col(c: pub.Socrata.Responses.Column) -> md.DatasetColumn:
dcp_col["values"] = [
{"value": s["item"], "description": FILL_ME_IN_PLACEHOLDER} for s in samples
]
-md.DatasetColumn._validate_data_type = False
+# md.DatasetColumn._validate_data_type = False  # legacy attribute used during migration, no longer there
Contributor:
Ah - this is why the error is popping up now?

Contributor:
Given that this is code to create a placeholder, it seems fine to leave, no? And it resolves our issue.

Or maybe it could even go on line 58? Not sure how this arg works, but maybe

md.DatasetColumn(**dcp_col, _validate_data_type = False)

would work too?

Contributor:
And per the commit message/comment, I think this is explicitly for the purpose of avoiding this error; I don't think it has anything to do with the migration from v1 to v2.

Contributor Author:
Yep, this is the reason for the error. If I revert the code back, it should solve the error. Counter question: would we want this attribute _validate_data_type in the parent Column obj or metadata.Column?

Would love to get @alexrichey's thoughts on this too.

Contributor:
I don't think we would. We only want to skip validation in this one case, when we're generating the column with placeholders. Otherwise, we should always validate at runtime on instantiation.

Contributor Author (@sf-dcp, Nov 27, 2024):
Let me try the placeholder plus the solution Finn referenced... will report back.

Contributor Author:
Tried the construct fn. It's now deprecated, and the new method is model_construct().

So it seems like this method does no validation on input attributes or their values. For example, you can instantiate a model without any keys, or with random keys, like below:

[screenshot: model_construct() instantiating a model with missing/arbitrary keys]
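For reference, a minimal sketch of the behavior in the screenshot, using a stand-in model rather than the real md.DatasetColumn:

```python
from pydantic import BaseModel


class FakeColumn(BaseModel):
    # stand-in for md.DatasetColumn, trimmed for illustration
    id: str
    description: str | None = None


# model_construct() skips validation entirely: required fields can be
# missing, and unrecognized keys raise no error
missing = FakeColumn.model_construct()  # no `id`, yet no complaint
arbitrary = FakeColumn.model_construct(foo="bar")  # unknown key, no complaint
```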

Using this seems more problematic because we can potentially generate templates with invalid keys... @fvankrieken, I think we should go with the text field

Contributor Author:
Actually, wait. We won't generate a template with invalid keys like I said. But the risk is generating templates without required fields.

Contributor:
Hmm. At least having keys validated would be nice.

I'm not horribly concerned, though: this is a util that's designed to create an invalid model, and this seems like a one-liner command to run it without error. Tests can make sure the structure is as expected. In terms of how a user would use it, this seems like less of a source of error to me than autofilling data_types that users might then forget to update. And this is slightly better than, say, just a dict, in that it still shows developers what is expected.

Contributor Author:
Makes sense. Added an assert statement for key validation!
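A hypothetical sketch of what that key-validation assert could look like (the exact line isn't shown in this hunk; the use of model_fields here is an assumption):

```python
# hypothetical sketch; the actual code in make_dcp_col may differ
unknown_keys = set(dcp_col) - set(md.DatasetColumn.model_fields)
assert not unknown_keys, f"unexpected DatasetColumn keys: {unknown_keys}"
```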

return md.DatasetColumn(**dcp_col)


5 changes: 3 additions & 2 deletions dcpy/lifecycle/package/shapefiles.py
@@ -7,10 +7,11 @@
DatasetAttributes,
DatasetColumn,
ColumnValue,
+COLUMN_TYPES,
)
from dcpy.utils.logging import logger

-_shapefile_to_dcpy_types = {
+_shapefile_to_dcpy_types: dict[str, COLUMN_TYPES] = {
"OID": "integer",
"Integer": "integer",
"SmallInteger": "integer",
@@ -19,7 +20,7 @@
"String": "text",
"Date": "datetime",
"Geometry": "geometry",
"Boolean": "boolean",
"Boolean": "bool",
}
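Presumably the new dict[str, COLUMN_TYPES] annotation is what surfaced the stray "boolean" value: with the Literal in play, a type checker rejects any value outside the allowed set. A minimal sketch:

```python
from typing import Literal

COLUMN_TYPES = Literal["text", "integer", "bool"]  # trimmed for illustration

ok: dict[str, COLUMN_TYPES] = {"Boolean": "bool"}  # accepted
bad: dict[str, COLUMN_TYPES] = {"Boolean": "boolean"}  # mypy: invalid Literal value
```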


32 changes: 32 additions & 0 deletions dcpy/models/dataset.py
@@ -0,0 +1,32 @@
from dcpy.models.base import SortedSerializedBase
from typing import Literal

COLUMN_TYPES = Literal[
"text",
"integer",
"decimal",
"number", # TODO: Need to delete. Keeping it now for compatibility with metadata files
"geometry",
"bool",
"bbl",
"date",
"datetime",
]


# TODO: extend/modify Checks model
class Checks(SortedSerializedBase):
is_primary_key: bool | None = None
non_nullable: bool | None = None


class Column(SortedSerializedBase, extra="forbid"):
"""
An extensible base class for defining column metadata in ingest and product templates.
"""

id: str
data_type: COLUMN_TYPES | None = None
description: str | None = None
is_required: bool = True
checks: Checks | None = None
Contributor:
Gut impression here: seems that checks should maybe be a list? Where checks might then take the form

- name (or id): {name}
  parameter: value

Or something like that. Could just be a list of strings too. Will think more (and look at our old discussions) and revisit tomorrow.

Contributor Author:
Not sure yet; I'm planning to focus on Checks in the next PR. In this PR, I just moved the existing Checks obj Alex implemented to this file. If I don't do that, I run into an error where imports between dataset.py and metadata_v2.py are circular.
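For context, the dependency shape after the move, pieced together from the import lines in this PR's diffs:

```python
# dcpy/models/dataset.py holds the shared Column, Checks, and COLUMN_TYPES,
# and imports nothing from the two modules below, so both can depend on it
# without forming a cycle.

# in dcpy/models/lifecycle/ingest.py:
from dcpy.models.dataset import Column as BaseColumn, COLUMN_TYPES

# in dcpy/models/product/dataset/metadata_v2.py:
from dcpy.models.dataset import Column, COLUMN_TYPES
```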

Contributor:
Yeah I think that makes sense to keep for later, since we're not actually attempting to implement checks yet.

Contributor:
Def agree with an eventual list of checks.

11 changes: 5 additions & 6 deletions dcpy/models/lifecycle/ingest.py
@@ -9,6 +9,7 @@
from dcpy.models.connectors import web, socrata
from dcpy.models import file
from dcpy.models.base import SortedSerializedBase
+from dcpy.models.dataset import Column as BaseColumn, COLUMN_TYPES


class LocalFileSource(BaseModel, extra="forbid"):
@@ -77,12 +78,10 @@ class Ingestion(SortedSerializedBase):
processing_steps: list[ProcessingStep] = []


-class Column(SortedSerializedBase):
-id: str
-data_type: Literal[
-"text", "integer", "decimal", "geometry", "bool", "date", "datetime"
-]
-description: str | None = None
+class Column(BaseColumn):
+_head_sort_order = ["id", "data_type", "description"]
+
+data_type: COLUMN_TYPES # override BaseColumn `data_type` to be required field


class Template(BaseModel, extra="forbid"):
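The data_type override above relies on standard pydantic behavior: re-declaring an inherited field without a default makes it required in the subclass. A minimal sketch with stand-in models:

```python
from pydantic import BaseModel, ValidationError


class Base(BaseModel):
    id: str
    data_type: str | None = None  # optional in the base


class Strict(Base):
    data_type: str  # re-declared without a default, so now required


Strict(id="bbl", data_type="text")  # ok
try:
    Strict(id="bbl")  # missing data_type
except ValidationError as err:
    print(err)  # pydantic reports data_type as a missing required field
```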
44 changes: 11 additions & 33 deletions dcpy/models/product/dataset/metadata_v2.py
@@ -1,12 +1,12 @@
from __future__ import annotations

-from pydantic import field_validator
from pydantic import BaseModel
-from typing import Any, List, Literal, get_args
+from typing import Any, List
import unicodedata

from dcpy.utils.collections import deep_merge_dict as merge
from dcpy.models.base import SortedSerializedBase, YamlWriter, TemplatedYamlReader
+from dcpy.models.dataset import Column, COLUMN_TYPES

ERROR_MISSING_COLUMN = "MISSING COLUMN"

@@ -48,50 +48,26 @@ class CustomizableBase(SortedSerializedBase, extra="forbid"):


# COLUMNS
-# TODO: move to share with ingest.validate
-class Checks(CustomizableBase):
-is_primary_key: bool | None = None
-non_nullable: bool | None = None
-
-
-# TODO: move to share with ingest.validate
-COLUMN_TYPES = Literal[
-"text", "number", "integer", "decimal", "geometry", "bool", "bbl", "datetime"
-]


class ColumnValue(CustomizableBase):
_head_sort_order = ["value", "description"]

value: str
description: str | None = None


-class DatasetColumn(CustomizableBase):
+class DatasetColumn(Column):
_head_sort_order = ["id", "name", "data_type", "description"]
_tail_sort_order = ["example", "values", "custom"]
-_validate_data_type = (
-True # override, to generate md where we don't know the data_type
-)

# Note: id isn't intended to be overrideable, but is always required as a
-# pointer back to the original column, so it is required here.
-id: str
+# pointer back to the original column.
name: str | None = None
-data_type: str | None = None
data_source: str | None = None
-description: str | None = None
notes: str | None = None
example: str | None = None
-checks: Checks | None = None
deprecated: bool | None = None
values: list[ColumnValue] | None = None

-@field_validator("data_type")
-def _validate_colum_types(cls, v):
-if cls._validate_data_type:
-assert v in get_args(COLUMN_TYPES)
-return v
custom: dict[str, Any] = {}

def override(self, overrides: DatasetColumn) -> DatasetColumn:
return DatasetColumn(**merge(self.model_dump(), overrides.model_dump()))
@@ -374,11 +350,13 @@ def validate_consistency(self):
return errors

def apply_column_defaults(
-self, column_defaults: dict[tuple[str, str], DatasetColumn]
+self, column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn]
) -> list[DatasetColumn]:
return [
-c.override(column_defaults[c.id, c.data_type])
-if c.data_type and (c.id, c.data_type) in column_defaults
-else c
+(
Contributor:
tiny nit, sorry - ✂️ these parentheses?

+c.override(column_defaults[c.id, c.data_type])
+if c.data_type and (c.id, c.data_type) in column_defaults
+else c
+)
for c in self.columns
]
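To illustrate the newly typed keys: defaults are looked up by (column id, data type), and only columns matching both get the default merged in. A hedged sketch, where the ids, description, and the dataset_metadata instance are made up:

```python
# hypothetical usage of apply_column_defaults
defaults = {
    ("bbl", "bbl"): DatasetColumn(id="bbl", description="Borough, block, and lot"),
}
# columns whose (id, data_type) match a key are overridden via deep merge;
# all other columns pass through unchanged
columns = dataset_metadata.apply_column_defaults(defaults)
```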
11 changes: 7 additions & 4 deletions dcpy/models/product/metadata.py
@@ -9,6 +9,7 @@
Metadata as DatasetMetadata,
DatasetColumn,
DatasetOrgProductAttributesOverride,
+COLUMN_TYPES,
)
from dcpy.utils.collections import deep_merge_dict as merge

@@ -39,15 +40,15 @@ class ProductMetadata(SortedSerializedBase, extra="forbid"):
root_path: Path
metadata: ProductMetadataFile
template_vars: dict = {}
-column_defaults: dict[tuple[str, str], DatasetColumn] = {}
+column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn] = {}
org_attributes: DatasetOrgProductAttributesOverride

@classmethod
def from_path(
cls,
root_path: Path,
template_vars: dict = {},
-column_defaults: dict[tuple[str, str], DatasetColumn] = {},
+column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn] = {},
org_attributes: DatasetOrgProductAttributesOverride = DatasetOrgProductAttributesOverride(),
) -> ProductMetadata:
return ProductMetadata(
@@ -121,7 +122,7 @@ class OrgMetadata(SortedSerializedBase, extra="forbid"):
root_path: Path
template_vars: dict = Field(default_factory=dict)
metadata: OrgMetadataFile
-column_defaults: dict[tuple[str, str], DatasetColumn]
+column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn]

@classmethod
def get_string_snippets(cls, path: Path) -> dict:
@@ -136,7 +137,9 @@ def get_string_snippets(cls, path: Path) -> dict:
return yml

@classmethod
-def get_column_defaults(cls, path: Path) -> dict[tuple[str, str], DatasetColumn]:
+def get_column_defaults(
+cls, path: Path
+) -> dict[tuple[str, COLUMN_TYPES], DatasetColumn]:
c_path = path / "snippets" / "column_defaults.yml"
if not c_path.exists():
return {}
23 changes: 6 additions & 17 deletions dcpy/test/lifecycle/package/test_column_validation.py
@@ -63,8 +63,6 @@ def bbl(boro_code, block, lot):
def _fake_row(columns: list[md.DatasetColumn]):
row = {}

-found_bbl_parts = {}
-bbl_parts = {"boro_code", "block", "lot"}
found_bbl_name = ""
for c in columns:
if c.data_type == "bbl":
@@ -74,23 +72,14 @@
else:
val = fakes[c.data_type or ""]()
row[c.name] = val
-if c.data_type in {"boro_code", "block", "lot"}:
-found_bbl_parts[c.data_type] = val

-# Construct a BBL from found parts, or generate a new one
+# Generate a new bbl value
if found_bbl_name:
-if set(found_bbl_parts.keys()) == bbl_parts:
-row[found_bbl_name] = fakes["bbl"](
-found_bbl_parts["boro_code"],
-found_bbl_parts["block"],
-found_bbl_parts["lot"],
-)
-else:
-row[found_bbl_name] = fakes["bbl"](
-fakes["boro_code"](),
-fakes["block"](),
-fakes["lot"](),
-)
+row[found_bbl_name] = fakes["bbl"](
+fakes["boro_code"](),
+fakes["block"](),
+fakes["lot"](),
+)

for c in columns:
if c.checks and not c.checks.non_nullable and random.choice([True, False]):
3 changes: 2 additions & 1 deletion dcpy/test/models/product/test_metadata.py
@@ -3,6 +3,7 @@

from dcpy.models.product import metadata as md
from dcpy.models.product.dataset import metadata_v2 as ds_md
+from dcpy.models import dataset


@pytest.fixture
Expand Down Expand Up @@ -178,7 +179,7 @@ def test_column_defaults_applied(dataset_with_snippets: ds_md.Metadata):
name="uid",
data_type="text",
data_source="Department of City Planning",
-checks=ds_md.Checks(is_primary_key=True),
+checks=dataset.Checks(is_primary_key=True),
),
ds_md.DatasetColumn(
id="bbl",
Expand Down
Loading