fix: change the validation logic for python_date_format #25510

Merged: 21 commits, merged on Jan 23, 2024.
Changes shown below are from 4 of the 21 commits.
20 changes: 7 additions & 13 deletions superset/datasets/schemas.py
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 import json
-import re
+from datetime import datetime
 from typing import Any

 from flask_babel import lazy_gettext as _
@@ -43,18 +43,12 @@


 def validate_python_date_format(value: str) -> None:
-    regex = re.compile(
-        r"""
-        ^(
-            epoch_s|epoch_ms|
-            (?P<date>%Y([-/]%m([-/]%d)?)?)([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?
-        )$
-        """,
-        re.VERBOSE,
-    )
-    match = regex.match(value or "")
-    if not match:
-        raise ValidationError([_("Invalid date/timestamp format")])
+    if value in ("epoch_s", "epoch_ms"):
+        return
+    try:
+        datetime.now().strftime(value or "")
A review thread was opened on the new strftime call:

Member:
@mapledan would you mind using the dateutil.parser.isoparse method instead?
mapledan (Contributor, Author), Nov 17, 2023:
Sure, I can use the dateutil.parser.isoparse method; it supports a more accurate ISO 8601 check than the original regular-expression validation.
However, let me confirm: does python_date_format only support the ISO 8601 format? The description I read says: "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp."
That's why I chose to use strftime for validation, to handle the user's defined expression.

john-bodley (Member), Nov 17, 2023:
@mapledan it should only support ISO 8601, in addition to epoch_s and epoch_ms. The reason is that ISO 8601 guarantees lexicographical ordering when comparing strings.

This is mentioned throughout the code.
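A quick aside (an illustrative snippet, not part of this PR's diff) showing the ordering guarantee being referenced: ISO 8601 strings sort lexicographically in the same order as they do chronologically, which is not true of arbitrary strftime formats.

from datetime import datetime

dates = [datetime(2022, 12, 31), datetime(2023, 1, 2), datetime(2023, 11, 17)]

# ISO 8601 output: string sort order matches chronological order.
iso = [d.strftime("%Y-%m-%dT%H:%M:%S") for d in dates]
assert sorted(iso) == iso

# A custom format such as %m/%d/%Y breaks the guarantee:
# "12/31/2022" sorts after "01/02/2023" even though it is earlier in time.
custom = [d.strftime("%m/%d/%Y") for d in dates]
assert sorted(custom) != custom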

mapledan (Contributor, Author), Nov 17, 2023:
Sorry, I missed the preceding sentence: "which needs to adhere to the ISO 8601 standard to ensure that the lexicographical ordering coincides with the chronological ordering."

In my test case, dateutil.parser.isoparse strictly follows the ISO 8601 standard, so it does not allow the YYYY/MM/DD format. This will impact PR #24113, which allows the slash format. I also traced python_date_format: it is used by pandas.to_datetime in superset.utils.core.normalize_dttm_col.

Which should we use for validation: the pandas.to_datetime method, a regex like (?P<date>%Y([-/]?%m([-/]?%d)?)?)([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?, or the dateutil.parser.isoparse method?
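For reference, a small sketch (not taken from the PR) of the behaviour described above: dateutil.parser.isoparse accepts dash-separated ISO 8601 strings but rejects the slash variant, while pandas.to_datetime with an explicit format string accepts whatever that format describes.

import pandas as pd
from dateutil import parser

# Strict ISO 8601 parses fine...
parser.isoparse("2023-11-17T08:48:07")

# ...but the slash form produced by %Y/%m/%d is rejected.
try:
    parser.isoparse("2023/11/17")
except ValueError:
    print("isoparse rejects YYYY/MM/DD")

# pandas.to_datetime, by contrast, happily parses it when given the format.
pd.to_datetime("2023/11/17", format="%Y/%m/%d")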

john-bodley (Member), Nov 17, 2023:
Thanks @mapledan for flagging the other PR. Per the tooltip (see the attached screenshot), it clearly states that the format should be ISO 8601, though the placeholder text uses /. Granted, / (in addition to -) does preserve lexicographical ordering, but we should strictly adhere to the ISO 8601 standard for historical reasons.

[Screenshot: column tooltip, captured Nov 17, 2023]

@jfrag1 this PR reverts the logic you defined in #24113. I've also authored #26017, which updates the placeholder text that likely led people astray.

@mapledan feel free to merge the logic from #26017 into your PR if you think it makes sense to combine both changes.

Member:
I think if we want to move towards strict ISO, we definitely need a migration to update existing python_date_format values that currently use /. Without a migration, any existing datasets with columns that use / could not be updated without also changing the python_date_format, since they would fail the validation.

Member:
@jfrag1 a migration isn't possible, as the string format is specified by the underlying column in the dataset. Granted, this change has the potential to churn users who were exposed to the relaxation in #24113, i.e., since 3.0, and who would thus need to update their dataset definition to use a SQL expression instead.

@michael-s-molina this is likely another great example of a breaking change, and thus we likely need to wait until 4.0 to re-restrict the eligible formats.

Member:
@john-bodley #24113 didn't expose a new relaxation to users; that PR was a follow-up to #23678, which changed the endpoint used to update dataset definitions. The old endpoint did no validation at all for python_date_format, so there were people using slashes before that too (since the placeholder indicated to do so).

Member:
Also, I'm not sure what you mean by "a migration isn't possible"; can't we have a migration that updates the table that stores column definitions?

john-bodley (Member), Nov 17, 2023:
@jfrag1 the column in question isn't in the metadata database; it's in the underlying analytical database(s), i.e., Trino, Hive, etc., where the temporal format is defined, and thus Superset merely reflects how the data is defined. It's difficult to estimate the impact of non-strict ISO 8601 validation, though since #24113 it only impacts newly registered datasets where the format is %Y/%m/%d[...]. Organizations likely have many registered temporal columns where the format is complete nonsense pre-#23678.

I think there are two options:

  1. Relax the ISO 8601 requirement to include /, i.e., what is currently implemented (for right or wrong)*. This is potentially risky, as the ISO 8601 standard guarantees lexicographical ordering** whereas custom formatters don't.
  2. Wait until 4.0 and mention in UPDATING.md the (re)restriction that only ISO 8601 formats are acceptable (enforced by way of validation at the API and database level), and that dataset owners will need to use a SQL expression instead to convert their string columns of the form %Y/%m/%d etc. to a DATE, DATETIME, etc. type.

* Note that, due to the lack of validation prior to #23678 (when we switched to the new RESTful API), there are likely organizations whose datasets have a slew of formats in the table_columns.python_date_format column which don't adhere to the expected format and thus won't work as expected. I was able to confirm this in Airbnb's Superset metadata database. As part of 4.0 we should add a database-level CHECK constraint on that column, which will force admins and dataset owners to correct any violating temporal columns.

** Lexicographical ordering is essential when we're filtering temporal string columns via the >, >=, <, and <= operators.

This is a classic "shift left" problem where the underlying validation should also reside in the database layer. This is mentioned in [SIP-99C] Proposal for model and business validation, and has been somewhat of an Achilles heel for the community over the years.
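For concreteness, here is a hypothetical sketch (not code from this PR or from #26017) of what the stricter, ISO-8601-only validation described in option 2 could look like. Rendering a reference datetime and round-tripping it through isoparse is only one possible approach, and the function name is made up for illustration.

from datetime import datetime

from dateutil.parser import isoparse
from flask_babel import lazy_gettext as _
from marshmallow import ValidationError


def validate_python_date_format_strict(value: str) -> None:
    # Hypothetical ISO-8601-only variant; not the logic merged in this PR.
    if value in ("epoch_s", "epoch_ms"):
        return
    try:
        # Render a fixed reference datetime with the supplied format, then
        # require the result to parse as strict ISO 8601, so %Y-%m-%d passes
        # while %Y/%m/%d (and arbitrary custom formats) fail.
        rendered = datetime(2024, 1, 23, 12, 34, 56, 789000).strftime(value or "")
        isoparse(rendered)
    except ValueError as ex:
        raise ValidationError([_("Invalid date/timestamp format")]) from ex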

+    except ValueError as ex:
+        raise ValidationError([_("Invalid date/timestamp format")]) from ex


 class DatasetColumnsPutSchema(Schema):
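For context, a brief usage sketch (illustrative; it assumes superset.datasets.schemas is importable in your environment) of how the validator behaves after this change:

from superset.datasets.schemas import validate_python_date_format

# Accepted after this PR: the epoch sentinels plus any format string that
# datetime.strftime can render without raising.
for value in ("epoch_s", "epoch_ms", "%Y-%m-%d", "%Y/%m/%dT%H:%M:%S.%f"):
    validate_python_date_format(value)  # no exception raised

# ValidationError is raised only when strftime itself raises ValueError;
# exactly which format strings trigger that is platform-dependent.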
77 changes: 77 additions & 0 deletions tests/unit_tests/datasets/schema_tests.py
@@ -0,0 +1,77 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# pylint: disable=import-outside-toplevel, invalid-name, unused-argument, redefined-outer-name
import pytest
from marshmallow import ValidationError


# pylint: disable=too-few-public-methods
@pytest.mark.parametrize(
    "payload",
    [
        (
            {
                "column_name": "insert_time",
                "filterable": True,
                "groupby": True,
                "python_date_format": None,
            }
        ),
        (
            {
                "column_name": "insert_time",
                "filterable": True,
                "groupby": True,
                "python_date_format": "epoch_ms",
            }
        ),
        (
            {
                "column_name": "insert_time",
                "filterable": True,
                "groupby": True,
                "python_date_format": "epoch_s",
            }
        ),
        (
            {
                "column_name": "insert_time",
                "filterable": True,
                "groupby": True,
                "python_date_format": "%Y/%m/%dT%H:%M:%S.%f",
            }
        ),
        (
            {
                "column_name": "insert_time",
                "filterable": True,
                "groupby": True,
                "python_date_format": "%Y%m%d",
            }
        ),
    ],
)
def test_datasets_schema_update_column_datetime_format(payload) -> None:
    from superset.datasets.schemas import DatasetColumnsPutSchema

    schema = DatasetColumnsPutSchema()

    try:
        schema.load(payload)
    except ValidationError as err:
        assert False, err.messages