fix: change the validation logic for python_date_format #25510

Merged: john-bodley merged 21 commits into apache:master from mapledan:fix/dataset_datetime_format on Jan 23, 2024.
Changes from 4 commits

Commits (21):
c4d70e4  fix: change the validation logic for python_date_format (mapledan)
0108456  fix: strftime may receive None. (mapledan)
0cb09cc  add unit tests (mapledan)
7c7210f  re-run pre-commit (mapledan)
2f05b7f  rewrite the unit test (mapledan)
4c849e2  change the validate method to dateutil.parser.isoparse (mapledan)
dfb55e1  Adjust function return type (mapledan)
ff6ca17  `%Y/%m/%d` should be the raises case (mapledan)
baae852  Per the type hint value cannot be None (mapledan)
a6fa404  Merge branch 'master' of https://github.com/apache/superset into fix/… (mapledan)
0f320a8  Add the low-level consistency. (mapledan)
6f697c5  fix python_date_time maybe none (mapledan)
cefc744  fix python_date_format maybe none (mapledan)
e6f8e26  remove unnecessary test case (mapledan)
7f3055a  Updating UPDATING.md (mapledan)
20e484e  refactor: validation at the DAOs level (mapledan)
6aee27c  refactor: validation at the DAOs level (mapledan)
0269e95  fix(dataset): Ensure DATETIME FORMAT is ISO 8601 compliant (john-bodley)
94b1482  remove unnecessary test case (mapledan)
b5ca468  pylint: fix invalid-name (mapledan)
daf90cb  Merge branch 'master' of https://github.com/apache/superset into fix/… (mapledan)
@@ -0,0 +1,77 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# pylint: disable=import-outside-toplevel, invalid-name, unused-argument, redefined-outer-name
import pytest
from marshmallow import ValidationError


# pylint: disable=too-few-public-methods
@pytest.mark.parametrize(
    "payload",
    [
        {
            "column_name": "insert_time",
            "filterable": True,
            "groupby": True,
            "python_date_format": None,
        },
        {
            "column_name": "insert_time",
            "filterable": True,
            "groupby": True,
            "python_date_format": "epoch_ms",
        },
        {
            "column_name": "insert_time",
            "filterable": True,
            "groupby": True,
            "python_date_format": "epoch_s",
        },
        {
            "column_name": "insert_time",
            "filterable": True,
            "groupby": True,
            "python_date_format": "%Y/%m/%dT%H:%M:%S.%f",
        },
        {
            "column_name": "insert_time",
            "filterable": True,
            "groupby": True,
            "python_date_format": "%Y%m%d",
        },
    ],
)
def test_datasets_schema_update_column_datetime_format(payload) -> None:
    from superset.datasets.schemas import DatasetColumnsPutSchema

    schema = DatasetColumnsPutSchema()

    try:
        schema.load(payload)
    except ValidationError as err:
        assert False, err.messages
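The payloads above all pass validation: `None`, the two epoch sentinels, and strftime-style format strings that the pre-existing regex (quoted later in the review thread) accepts. A standalone sketch of that acceptance logic, using a hypothetical helper rather than the PR's actual schema code:

```python
import re

# Regex for strftime-style format strings, as quoted in the review thread
# below; note it deliberately tolerates "/" as a date separator.
FORMAT_RE = re.compile(
    r"(?P<date>%Y([-/]?%m([-/]?%d)?)?)([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?"
)


def is_valid_python_date_format(value):
    """Hypothetical stand-in for the schema's python_date_format check."""
    if value is None or value in ("epoch_s", "epoch_ms"):
        return True
    return FORMAT_RE.fullmatch(value) is not None


# Every payload in the parametrized test above should pass:
for fmt in (None, "epoch_ms", "epoch_s", "%Y/%m/%dT%H:%M:%S.%f", "%Y%m%d"):
    assert is_valid_python_date_format(fmt)
```

Note this is only the format-string side of the check; the review below debates whether a stricter value-based parser should be used instead.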
@mapledan would you mind using the dateutil.parser.isoparse method instead?
Sure, I can use the `dateutil.parser.isoparse` method to make it support a more accurate ISO 8601 format than the original regular expression validation. However, let me confirm again: does python_date_format only support the ISO 8601 format? Based on the description I read, it is written as follows: "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp." That's why I chose to use `strftime` for validation, to handle the user's defined expression.
@mapledan it should only support ISO 8601, in addition to `epoch_s` and `epoch_ms`. The reason is that ISO 8601 guarantees lexicographical ordering when comparing strings. This is mentioned throughout the code.
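To illustrate the ordering point (a quick sketch, not from the PR): ISO 8601 strings sort lexicographically in chronological order, while, for example, a day-first format does not.

```python
# ISO 8601: lexicographical order coincides with chronological order.
iso = ["2024-01-23", "2023-12-31", "2024-01-05"]
print(sorted(iso))  # ['2023-12-31', '2024-01-05', '2024-01-23']

# A day-first format breaks this: the 2023 date sorts last.
day_first = ["23/01/2024", "31/12/2023", "05/01/2024"]
print(sorted(day_first))  # ['05/01/2024', '23/01/2024', '31/12/2023']
```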
Sorry, I missed the preceding sentence: "which needs to adhere to the ISO 8601 standard to ensure that the lexicographical ordering coincides with the chronological ordering."

In my test case, `dateutil.parser.isoparse` strictly follows the ISO 8601 standard, so it does not allow the YYYY/MM/DD format. This will impact PR #24113, which allows the slash format.

I also traced python_date_format: it is used by `pandas.to_datetime` in `superset.utils.core.normalize_dttm_col`. Which should we use for the validation: the `pandas.to_datetime` method, a regex like `(?P<date>%Y([-/]?%m([-/]?%d)?)?)([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?`, or the `dateutil.parser.isoparse` method?
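For reference, a quick check (not part of the PR; assumes `python-dateutil` is installed) of how strict `dateutil.parser.isoparse` is: dash-separated ISO 8601 values parse, while slash-separated ones raise.

```python
from dateutil.parser import isoparse  # provided by the python-dateutil package


def is_iso8601(value: str) -> bool:
    """Hypothetical helper: True if `value` parses as strict ISO 8601."""
    try:
        isoparse(value)
        return True
    except (ValueError, OverflowError):
        return False


print(is_iso8601("2024-01-23T10:30:00"))  # True
print(is_iso8601("2024/01/23"))           # False: "/" is not an ISO 8601 separator
```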
Thanks @mapledan for flagging the other PR. Per the tooltip (see attached) it clearly states that the format should be ISO 8601, though the placeholder text uses `/`. Granted, the `/` (in addition to `-`) does adhere to lexicographical ordering, but we should strictly adhere to the ISO 8601 standard for historical reasons.

@jfrag1 this PR reverts the logic you defined in #24113. I've also authored #26017, which updates the placeholder text that likely led people astray.

@mapledan feel free to merge the logic from #26017 into your PR if you think it makes sense to have both changes combined.
I think if we want to move towards strict ISO, we definitely need a migration to update existing `python_date_format`s that currently use `/`. Without a migration, any existing datasets with columns that use `/` will not be able to be updated without changing the `python_date_format`, since they'll fail the validation.
@jfrag1 a migration isn't possible, as the string format is specified by the underlying column in the dataset. Granted, this change has the potential to churn users who were exposed to the relaxation in #24113 (i.e., since 3.0), and thus they would need to update their dataset definition to use a SQL expression instead.

@michael-s-molina this is likely another great example of a breaking change, and thus we likely need to wait until 4.0 to re-restrict the eligible formats.
@john-bodley #24113 didn't expose a new relaxation to users; that PR was a follow-up to #23678, which changed the endpoint used to update dataset definitions. The old endpoint did no validation at all for `python_date_format`, so there were people using slashes before that too (since the placeholder indicated to do so).
Also, I'm not sure what you mean by "a migration isn't possible": can't we have a migration that updates the table that stores column definitions?
@jfrag1 the column in question isn't in the metadata database; it's in the underlying analytical database(s), i.e., Trino, Hive, etc., where the temporal format is defined, and thus Superset merely reflects how the data is defined. It's difficult to estimate the impact of non-strict ISO 8601 validation, though since #24113 it only impacts newly registered datasets where the format is `%Y/%m/%d[...]`. Organizations likely have many registered temporal columns where the format is complete nonsense pre-#23678.

I think there are two options:

1. Allow `/`, i.e., what is currently implemented (for right or wrong)*. This is potentially risky, as the ISO 8601 standard guarantees lexicographical ordering** whereas custom formatters don't.
2. Mention in `UPDATING.md` the (re)restriction that only ISO 8601 formats are acceptable (enforced by way of validation at the API and database level) and that dataset owners will need to use a SQL expression instead to convert their string columns of the form `%Y/%m/%d` etc. to a `DATE`, `DATETIME`, etc. type.

* Note: due to the lack of validation prior to #23678 (when we switched to the new RESTful API), there are likely organizations which have datasets with a slew of formats in the `table_columns.python_datetime_format` column which don't adhere to the current format and thus won't work as expected. I was able to confirm this in Airbnb's Superset metadata database. As part of 4.0 we should add a database-level CHECK constraint for the `table_columns.python_datetime_format` column, which will force admins and dataset owners to correct any violating temporal columns.

** Lexicographical ordering is essential when we're filtering temporal string columns via the `>`, `>=`, `<`, and `<=` operators.

This is a classic "shift left" problem where the underlying validation should also reside in the database layer. This is mentioned in [SIP-99C] Proposal for model and business validation, and it has been somewhat of an Achilles heel for the community over the years.