-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support all Spark patterns in cast(varchar as date) #5844
Conversation
✅ Deploy Preview for meta-velox ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
👷 Deploy Preview for meta-velox processing.
|
@mbasmanova PTAL. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marin-ma Does Presto support these patterns as well? Would you update documentation for CAST to add these new patterns?
CC: @kagamiori
I'm not sure about prestosql. Neither do I have an test environment to check. @kagamiori Could you help to check on prestosql? |
I just got a test environment for prestosql. Only |
@marin-ma Thank you for checking. Sounds like Presto and Spark behavior for cast(varchar as date) is different. Since, CastExpr is shared between the two, we would need to introduce a configuration property to control the behavior. Search for queryConfig.isCastToIntByTruncate() to see some examples. |
Double checked on my side too. These (except YYYY-MM-DD) are not supported in Presto. |
2cda6f8
to
2c5370e
Compare
@mbasmanova Updated & PTAL. Thanks! |
velox/type/TimestampConversion.cpp
Outdated
@@ -105,6 +105,8 @@ constexpr int32_t kCumulativeYearDays[] = { | |||
|
|||
namespace { | |||
|
|||
enum class ParseMode { STRICT, NON_STRICT, STANDARD_CAST, NON_STANDARD_CAST }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, review the coding guidelines and update accordingly.
velox/type/TimestampConversion.h
Outdated
/// `[+-]YYYY*-[M]M-[D]DT*` | ||
/// | ||
/// Throws VeloxUserError if the format or date is invalid. | ||
int32_t castFromDateString(const char* buf, size_t len, bool isNonStandardCast); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to extract this change into a separate PR and add tests.
Summary: This is a splitting PR from #5844 Since `fromDateString` cannot properly handle different cast behaviors and returns int64_t that may cause overflow of Date type, this PR add `castFromDateString` as a helper function, which can be used for handle different cast from string to date behavior from sparksql and prestosql. Below patterns are considered valid to cast from string to date in spark sql functions: Year only: "YYYY" Year and month only: "YYYY-MM" Digits after trailing spaces: "YYYY-MM-DD 123" Trailing character 'T': "YYYY-MM-DDT" Digits after 'T': "YYYY-MM-DDT123123" "[+-]" before any valid pattern. Below patterns are valid in presto sql: "[+-]YYYY-MM-DD" Pull Request resolved: #5994 Reviewed By: kgpai Differential Revision: D48234502 Pulled By: bikramSingh91 fbshipit-source-id: 1e773692cad438bc5d3b948b0ab33e5f39f89823
@bikramSingh91 Could you help to review? |
@bikramSingh91 Could you help to review this PR? This one is a follow-up of #5994 Thanks! |
…#5994) Summary: This is a splitting PR from facebookincubator#5844 Since `fromDateString` cannot properly handle different cast behaviors and returns int64_t that may cause overflow of Date type, this PR add `castFromDateString` as a helper function, which can be used for handle different cast from string to date behavior from sparksql and prestosql. Below patterns are considered valid to cast from string to date in spark sql functions: Year only: "YYYY" Year and month only: "YYYY-MM" Digits after trailing spaces: "YYYY-MM-DD 123" Trailing character 'T': "YYYY-MM-DDT" Digits after 'T': "YYYY-MM-DDT123123" "[+-]" before any valid pattern. Below patterns are valid in presto sql: "[+-]YYYY-MM-DD" Pull Request resolved: facebookincubator#5994 Reviewed By: kgpai Differential Revision: D48234502 Pulled By: bikramSingh91 fbshipit-source-id: 1e773692cad438bc5d3b948b0ab33e5f39f89823
@mbasmanova Could you help to review again? |
@@ -651,6 +658,68 @@ TEST_F(CastExprTest, invalidDate) { | |||
"date", {"2012-Oct-23"}, {0}, true, false, VARCHAR(), DATE()); | |||
} | |||
|
|||
TEST_F(CastExprTest, stringToDate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you move these tests to TEST_F(CastExprTest, date)
DATE()); | ||
} | ||
|
||
TEST_F(CastExprTest, stringToDateInvalid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you move these tests to TEST_F(CastExprTest, invalidDate)
velox/docs/configs.rst
Outdated
- bool | ||
- false | ||
- This flags allows the cast from string to date accept patterns other than the "[+-](YYYY-MM-DD)" format. | ||
Valid patterns include (YYYY, YYYY-MM, YYYY-MM-DD), and any patterns prefixed with [+-]". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also mention other cases like
Digits after trailing spaces: "YYYY-MM-DD 123"
Trailing character 'T': "YYYY-MM-DDT"
Digits after 'T': "YYYY-MM-DDT123123"
for (bool conf : {true, false}) { | ||
setCastStringToDateNonStandard(conf); | ||
testCast<std::string, int32_t>( | ||
"date", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if we should include a case for trailing (BC) for the case where it is valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. It's invalid in presto but valid in spark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bikramSingh91 My previous understanding of trailing (BC) was wrong. The trailing (BC) starts with an additional space. So cast("2015-03-18(BC)" as date
returns NULL in spark but cast("2015-03-18 (BC)" as date
returns 2015-03-18
velox/core/QueryConfig.h
Outdated
@@ -80,6 +80,11 @@ class QueryConfig { | |||
// truncating the decimal part instead of rounding. | |||
static constexpr const char* kCastToIntByTruncate = "cast_to_int_by_truncate"; | |||
|
|||
// This flags allows the cast from string to date accept patterns other than | |||
// the "(+-)[YYYY-MM-DD]" format. | |||
static constexpr const char* kCastStringToDateNonStandard = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you also mention this newly added query config in the PR description that enables this behavior.
velox/docs/configs.rst
Outdated
@@ -109,6 +109,11 @@ Expression Evaluation Configuration | |||
- bool | |||
- false | |||
- This flags forces the cast from float/double to integer to be performed by truncating the decimal part instead of rounding. | |||
* - cast_string_to_date_non_standard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also document configs that alter cast behavior in cast documentation: https://facebookincubator.github.io/velox/functions/presto/conversion.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova "non_standard" here means non ISO8601 standard. Or change it to cast_string_to_date_iso8601
and make the default value true?
@mbasmanova @bikramSingh91 Could you help to review again? Thanks! |
@marin-ma Can you please take a look at the failure in |
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@bikramSingh91 PTAL. Thanks! |
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@bikramSingh91 Could you help to check the error message of Linter? I can't see the errors. |
@@ -15,6 +15,7 @@ | |||
*/ | |||
|
|||
#include "velox/type/TimestampConversion.h" | |||
#include <common/base/tests/GTestUtils.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you please update this to use
#include "velox/common/base/tests/GTestUtils.h"
and add it under L21 instead
velox/type/TimestampConversion.cpp
Outdated
@@ -270,6 +270,15 @@ bool tryParseDateString( | |||
|
|||
// In standard-cast mode, no more trailing characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is outdated In standard-cast mode, no more trailing characters.
@@ -15,6 +15,7 @@ | |||
*/ | |||
|
|||
#include "velox/type/TimestampConversion.h" | |||
#include <common/base/tests/GTestUtils.h> | |||
#include <gmock/gmock.h> | |||
#include <gtest/gtest.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include #include "velox/common/base/tests/GTestUtils.h"
we can remove #include <gtest/gtest.h>
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
velox/type/TimestampConversion.cpp
Outdated
// Skip trailing spaces. | ||
while (pos < len && characterIsSpace(buf[pos])) { | ||
pos++; | ||
} | ||
// Check position. if end was not reached, non-space chars remaining. | ||
if (pos < len) { | ||
return false; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marin-ma Thanks for making all the changes till now. I have a last request, can you please move this part of the change out in a separate PR? This unfortunately, is a change in behavior of existing cast
and can result in issues for existing workloads that dont expect this behavior. It would be good to separate it in case we need to roll back this behavior change while still keeping all the spark specific functionality you added here.
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Thank you for moving out the changes to standard mode. I'll import these updates and will proceed with merging. |
@bikramSingh91 Removed the changes in the latest commit. Could you help review again? Thanks! |
@bikramSingh91 Could you help to review again? Thanks! |
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@marin-ma I was on PTO for that last week, will take this up today with high priority |
@bikramSingh91 merged this pull request in 8d6c296. |
…r#5844) Summary: Below patterns are considered valid to cast from string to date in spark sql functions: 1. Year only: "YYYY" 2. Year and month only: "YYYY-MM" 3. Any characters after trailing spaces: "YYYY-MM-DD ", "YYYY-MM-DD 123", "YYYY-MM-DD (BC)" 4. Any characters after trailing character 'T': "YYYY-MM-DDT" Below patterns are invalid: 1. Year is too large (exceed INT32_MAX): 20150318 2. Other separators "YYYY/MM/DD" Reference: Spark cast from string to date: https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298 Unit test: CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126 CastExprTest::fromStringToDateInvalid is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73 Pull Request resolved: facebookincubator#5844 Reviewed By: kevinwilfong Differential Revision: D48881690 Pulled By: bikramSingh91 fbshipit-source-id: 4236669585cf76762da1bc96d7976014aebab5d3
…r#5844) Summary: Below patterns are considered valid to cast from string to date in spark sql functions: 1. Year only: "YYYY" 2. Year and month only: "YYYY-MM" 3. Any characters after trailing spaces: "YYYY-MM-DD ", "YYYY-MM-DD 123", "YYYY-MM-DD (BC)" 4. Any characters after trailing character 'T': "YYYY-MM-DDT" Below patterns are invalid: 1. Year is too large (exceed INT32_MAX): 20150318 2. Other separators "YYYY/MM/DD" Reference: Spark cast from string to date: https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298 Unit test: CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126 CastExprTest::fromStringToDateInvalid is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73 Pull Request resolved: facebookincubator#5844 Reviewed By: kevinwilfong Differential Revision: D48881690 Pulled By: bikramSingh91 fbshipit-source-id: 4236669585cf76762da1bc96d7976014aebab5d3
…r#5844) Summary: Below patterns are considered valid to cast from string to date in spark sql functions: 1. Year only: "YYYY" 2. Year and month only: "YYYY-MM" 3. Any characters after trailing spaces: "YYYY-MM-DD ", "YYYY-MM-DD 123", "YYYY-MM-DD (BC)" 4. Any characters after trailing character 'T': "YYYY-MM-DDT" Below patterns are invalid: 1. Year is too large (exceed INT32_MAX): 20150318 2. Other separators "YYYY/MM/DD" Reference: Spark cast from string to date: https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298 Unit test: CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126 CastExprTest::fromStringToDateInvalid is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73 Pull Request resolved: facebookincubator#5844 Reviewed By: kevinwilfong Differential Revision: D48881690 Pulled By: bikramSingh91 fbshipit-source-id: 4236669585cf76762da1bc96d7976014aebab5d3
…#5994) Summary: This is a splitting PR from facebookincubator#5844 Since `fromDateString` cannot properly handle different cast behaviors and returns int64_t that may cause overflow of Date type, this PR add `castFromDateString` as a helper function, which can be used for handle different cast from string to date behavior from sparksql and prestosql. Below patterns are considered valid to cast from string to date in spark sql functions: Year only: "YYYY" Year and month only: "YYYY-MM" Digits after trailing spaces: "YYYY-MM-DD 123" Trailing character 'T': "YYYY-MM-DDT" Digits after 'T': "YYYY-MM-DDT123123" "[+-]" before any valid pattern. Below patterns are valid in presto sql: "[+-]YYYY-MM-DD" Pull Request resolved: facebookincubator#5994 Reviewed By: kgpai Differential Revision: D48234502 Pulled By: bikramSingh91 fbshipit-source-id: 1e773692cad438bc5d3b948b0ab33e5f39f89823
…r#5844) Summary: Below patterns are considered valid to cast from string to date in spark sql functions: 1. Year only: "YYYY" 2. Year and month only: "YYYY-MM" 3. Any characters after trailing spaces: "YYYY-MM-DD ", "YYYY-MM-DD 123", "YYYY-MM-DD (BC)" 4. Any characters after trailing character 'T': "YYYY-MM-DDT" Below patterns are invalid: 1. Year is too large (exceed INT32_MAX): 20150318 2. Other separators "YYYY/MM/DD" Reference: Spark cast from string to date: https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298 Unit test: CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126 CastExprTest::fromStringToDateInvalid is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73 Pull Request resolved: facebookincubator#5844 Reviewed By: kevinwilfong Differential Revision: D48881690 Pulled By: bikramSingh91 fbshipit-source-id: 4236669585cf76762da1bc96d7976014aebab5d3
Below patterns are considered valid to cast from string to date in spark sql functions:
Below patterns are invalid:
Reference:
Spark cast from string to date:
https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298
Unit test:
CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126
CastExprTest::fromStringToDateInvalid is derived from
https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73