
Source Data Checks: Refactor column class for ingest & distribution #1281

Merged: 10 commits into main from sf-refactor-column-obj-for-ingest-dist on Nov 27, 2024

Conversation

@sf-dcp (Contributor) commented Nov 25, 2024

Small PR to refactor existing Column objects in ingest & distribution code into one.

Why?

For our own sanity when implementing data checks on source & product data.

Easiest to go commit by commit.

Major changes:

  1. Create base models.dataset.Column class.
    • Note: I added an is_required attribute, which will be used in validation.
  2. Revise old ingest and metadata Column objects to inherit from the new class
    • For metadata, remove a code snippet used in creating raw metadata templates.
  3. Move metadata_v2.Checks class to models.dataset
  4. Appease mypy
    • Fix inconsistencies in the allowed values for data_type.

Data checks will be implemented in the next PR.
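For context, a rough sketch of what the shared base class could look like (hypothetical: field names come from this description and the diff excerpts below; the allowed type values and the Checks fields are illustrative placeholders, not the actual dcpy code):

from typing import Literal

from pydantic import BaseModel

# Stand-in for the real alias; the actual allowed values live in dcpy and may differ.
COLUMN_TYPES = Literal["text", "integer", "decimal", "date", "datetime", "geometry", "bool"]


class Checks(BaseModel):
    # Placeholder for the Checks object moved over from metadata_v2; details TBD in the next PR.
    is_primary_key: bool | None = None
    non_nullable: bool | None = None


class Column(BaseModel):
    # Shared base for ingest & distribution columns (models.dataset.Column).
    id: str
    data_type: COLUMN_TYPES | None = None
    description: str | None = None
    is_required: bool = True  # new attribute, to be used in validation
    checks: Checks | None = None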

@sf-dcp force-pushed the sf-refactor-column-obj-for-ingest-dist branch 13 times, most recently from a31739b to 3ff8c50 on November 26, 2024 at 22:57
@sf-dcp force-pushed the sf-refactor-column-obj-for-ingest-dist branch from 3ff8c50 to 724704d on November 26, 2024 at 23:11
@sf-dcp (Contributor, Author) commented Nov 26, 2024

There is one failing pytest that I'm not sure how to address -- would love some advice.

In the dcpy.connectors.socrata.metadata module, we have a function (make_dcp_col) that generates metadata templates from the Socrata API with column data_type set to <FILL ME IN>. The test fails because <FILL ME IN> isn't a valid data_type value.

One quick solution is to add the placeholder value to the list of allowed values for data_type.

This is probably not something we will be using in the future; however, it could be useful for other agencies... But is it worth revising the validator to accommodate the edge case?

@sf-dcp marked this pull request as ready for review on November 26, 2024 at 23:23
@fvankrieken (Contributor) commented:
> There is one failing pytest that I'm not sure how to address -- would love some advice. [...] But is it worth revising the validator to accommodate the edge case?

Is there a way to avoid running validators at runtime in pydantic? Pass an extra arg to the constructor, or something along those lines?
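(For reference, one way pydantic v2 supports this is a validation context passed to model_validate, which a field_validator can consult. This is a sketch under the assumption that data_type is checked by a field validator; the names are illustrative, not the actual dcpy code:)

from pydantic import BaseModel, ValidationInfo, field_validator

ALLOWED_TYPES = {"text", "integer", "decimal"}  # stand-in for the real allowed values


class DatasetColumn(BaseModel):
    id: str
    data_type: str | None = None

    @field_validator("data_type")
    @classmethod
    def check_data_type(cls, v: str | None, info: ValidationInfo) -> str | None:
        # Skip the check when the caller explicitly opts out via context.
        if info.context and info.context.get("skip_data_type_check"):
            return v
        if v is not None and v not in ALLOWED_TYPES:
            raise ValueError(f"{v!r} is not a valid data_type")
        return v


# Normal construction still validates; model_validate can opt out per call:
col = DatasetColumn.model_validate(
    {"id": "bbl", "data_type": "<FILL ME IN>"},
    context={"skip_data_type_check": True},
)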

data_type: COLUMN_TYPES | None = None
description: str | None = None
is_required: bool = True
checks: Checks | None = None
Contributor:
Gut impression here - seems that checks should maybe be a list? Where checks might then take a form like:

- name (or id): {name}
  parameter: value

Or something like that. Could just be a list of strings too. Will think more (and look at our old discussions) and revisit tomorrow
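(A minimal sketch of that list shape, purely illustrative -- the names and parameter structure are not settled:)

from pydantic import BaseModel


class Check(BaseModel):
    name: str  # or "id"
    parameters: dict[str, str] = {}  # e.g. {"min": "0"}


class Column(BaseModel):
    id: str
    checks: list[Check] | None = None  # could also simply be list[str]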

Contributor (Author):
Not sure yet; I'm planning to focus on Checks in the next PR. In this PR, I just moved the existing Checks obj that Alex implemented to this file. If I don't do that, I run into circular imports between dataset.py & metadata_v2.py.

Contributor:
Yeah I think that makes sense to keep for later, since we're not actually attempting to implement checks yet.

Contributor:
def agree with an eventual list of checks

(
    c.override(column_defaults[c.id, c.data_type])
    if c.data_type and (c.id, c.data_type) in column_defaults
    else c
)
Contributor:
tiny nit, sorry - ✂️ these parentheses?

@@ -54,7 +54,7 @@ def make_dcp_col(c: pub.Socrata.Responses.Column) -> md.DatasetColumn:
     dcp_col["values"] = [
         {"value": s["item"], "description": FILL_ME_IN_PLACEHOLDER} for s in samples
     ]
-    md.DatasetColumn._validate_data_type = False
+    # md.DatasetColumn._validate_data_type = False  # legacy attribute used during migration, no longer there
Contributor:
Ah - this is why the error is popping up now?

Contributor:
Given that this is code to create a placeholder, it seems fine to leave, no? And it resolves our issue.

Or maybe it could even go on line 58? Not sure how this arg works, but maybe

md.DatasetColumn(**dcp_col, _validate_data_type=False)

would work too?

Contributor:
And per the commit message/comment - I think this is explicitly for the purpose of avoiding this error; I don't think it has anything to do with the migration from v1 to v2.

Contributor (Author):
Yep, this is the reason for the error. If I revert the code back, it should solve the error. Counter question: would we want this _validate_data_type attribute in the parent Column obj or in metadata.Column?

Would love to get @alexrichey's thoughts on this too.

Contributor:
I don't think we would. We only want to skip validation in this one case, when we're generating templates with placeholders. Other than that, we should always validate at runtime on instantiation.

@sf-dcp (Contributor, Author) commented Nov 27, 2024:
let me try the placeholder + the solution Finn referenced... will report back

Contributor (Author):

Tried the construct fn. It's now deprecated, and the new method is model_construct().

Sooo it seems like this method does no validation on input attributes & their values. For example, you can instantiate a model without any keys, or with random keys, like below:

[screenshot: model_construct() accepting missing and arbitrary keys without error]

Using this seems more problematic because we can potentially generate templates with invalid keys... @fvankrieken, I think we should go with the text field
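(Roughly what the screenshot demonstrates -- a minimal sketch, assuming pydantic v2; the model and field names are made up:)

from pydantic import BaseModel


class DatasetColumn(BaseModel):
    id: str
    data_type: str


# model_construct() skips validation entirely, so neither of these raises,
# even though required fields are missing or a key is unrecognized:
empty = DatasetColumn.model_construct()
weird = DatasetColumn.model_construct(not_a_real_key="oops")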

Contributor (Author):

Actually, wait. We won't generate a template with invalid keys like I said. But the risk is generating templates without required fields

Contributor:

Hmm. At least having keys validated would be nice.

I'm not horribly concerned though - this is a util that's designed to create an invalid model, and this seems like a one-liner command to run it without error. Tests can make sure that structure is as expected. In terms of how a user would use it, this seems like less of a source of error to me than autofilling data_types that users might then forget to update. And this is slightly better than say just a dict in that it still shows developers what is expected.

Contributor (Author):

Makes sense. Added an assert statement for key validation!

@fvankrieken (Contributor) commented:
Couple little notes but this seems very sensible

@sf-dcp (Contributor, Author) commented Nov 27, 2024

> Is there a way to avoid running validators at runtime in pydantic? Pass an extra arg to the constructor, or something along those lines?

Need to look into this. Though mypy may still complain
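(A hypothetical illustration of the mypy point: if COLUMN_TYPES is a Literal-style alias, the placeholder string gets flagged statically even when runtime validation is skipped:)

from typing import Literal

COLUMN_TYPES = Literal["text", "integer", "decimal"]  # stand-in for the real alias


def set_data_type(data_type: COLUMN_TYPES) -> None:
    ...


set_data_type("<FILL ME IN>")  # mypy error: incompatible argument type for a Literal parameter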

@alexrichey (Contributor) left a comment:

LGTM! though did you mean to delete lpc_scenic_landmarks.yml just now?

@sf-dcp (Contributor, Author) commented Nov 27, 2024

> LGTM! though did you mean to delete lpc_scenic_landmarks.yml just now?

🤦‍♀️ didn't commit this before switching to the current branch. ty!

@sf-dcp force-pushed the sf-refactor-column-obj-for-ingest-dist branch from 824abc6 to fde078e on November 27, 2024 at 16:51
codecov bot commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.41%. Comparing base (4a18ea9) to head (cd35d80).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1281   +/-   ##
=======================================
  Coverage   69.41%   69.41%           
=======================================
  Files         111      112    +1     
  Lines        5935     5935           
  Branches      661      660    -1     
=======================================
  Hits         4120     4120           
  Misses       1683     1683           
  Partials      132      132           


@sf-dcp merged commit 3cb4a39 into main on Nov 27, 2024
20 checks passed
@sf-dcp deleted the sf-refactor-column-obj-for-ingest-dist branch on November 27, 2024 at 18:01

# model_construct() method doesn't perform validation on keys, need this sanity check here
# instance keys == column model keys below:
assert (
Contributor:
Love it
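(The diff excerpt above is truncated at the assert; a plausible shape for that check, purely a guess based on the comment about instance keys vs. column model keys -- not the actual code from the PR:)

# Hypothetical reconstruction, not the actual assert from the PR.
# model_construct() doesn't validate keys, so compare them explicitly:
assert set(dcp_col.keys()) == set(md.DatasetColumn.model_fields.keys()), (
    f"instance keys don't match column model keys: {set(dcp_col) ^ set(md.DatasetColumn.model_fields)}"
)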
