Support all Spark patterns in cast(varchar as date) #5844

Closed
wants to merge 7 commits into from

marin-ma
Contributor

@marin-ma marin-ma commented Jul 26, 2023

The following patterns are valid when casting from string to date in Spark SQL:

  1. Year only: "YYYY"
  2. Year and month only: "YYYY-MM"
  3. Any characters after trailing spaces: "YYYY-MM-DD ", "YYYY-MM-DD 123", "YYYY-MM-DD (BC)"
  4. Any characters after trailing character 'T': "YYYY-MM-DDT"

The following patterns are invalid:

  1. Year is too large (exceeds INT32_MAX): 20150318
  2. Other separators "YYYY/MM/DD"

Reference:
Spark cast from string to date:
https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298

Unit test:
CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126
CastExprTest::fromStringToDateInvalid is derived from
https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73
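
The valid and invalid patterns listed in this description can be sketched as a small standalone parser. This is only an illustration under stated assumptions, not the Velox implementation: `parseSparkStyleDate`, `ParsedDate`, and `kMaxYearDigits` are hypothetical names, the 7-digit cap on the year is an illustrative choice that makes "20150318" fail, and trimming of leading or trailing whitespace around the whole input is omitted.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

struct ParsedDate {
  int year;
  int month;
  int day;
};

// Illustrative cap on year digits (an assumption, chosen so "20150318" fails).
constexpr int kMaxYearDigits = 7;

inline bool isLeapYear(int y) {
  return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

inline int daysInMonth(int y, int m) {
  static constexpr int kDays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  return (m == 2 && isLeapYear(y)) ? 29 : kDays[m - 1];
}

// Hypothetical helper: accepts [+-]YYYY, [+-]YYYY-[M]M and [+-]YYYY-[M]M-[D]D,
// where a full date may be followed by ' ' or 'T' and arbitrary text.
std::optional<ParsedDate> parseSparkStyleDate(std::string_view s) {
  std::size_t pos = 0;
  int sign = 1;
  if (!s.empty() && (s[0] == '+' || s[0] == '-')) {
    sign = s[0] == '-' ? -1 : 1;
    pos = 1;
  }
  int fields[3] = {1, 1, 1}; // A missing month or day defaults to 1.
  int index = 0;
  int value = 0;
  int digits = 0;
  auto validDigits = [](int field, int n) {
    return field == 0 ? (n >= 4 && n <= kMaxYearDigits) : (n >= 1 && n <= 2);
  };
  // Read digits and '-' separators until the first ' ' or 'T'.
  while (pos < s.size() && s[pos] != ' ' && s[pos] != 'T') {
    const char c = s[pos];
    if (c == '-' && index < 2) {
      if (!validDigits(index, digits)) {
        return std::nullopt;
      }
      fields[index++] = value;
      value = 0;
      digits = 0;
    } else if (c >= '0' && c <= '9' && digits < kMaxYearDigits) {
      value = value * 10 + (c - '0');
      ++digits;
    } else {
      return std::nullopt; // Other separators such as '/' or '(' end up here.
    }
    ++pos;
  }
  if (!validDigits(index, digits)) {
    return std::nullopt;
  }
  // "YYYY" and "YYYY-MM" must consume the whole input.
  if (index < 2 && pos < s.size()) {
    return std::nullopt;
  }
  fields[index] = value;
  const int year = sign * fields[0];
  if (fields[1] < 1 || fields[1] > 12 || fields[2] < 1 ||
      fields[2] > daysInMonth(year, fields[1])) {
    return std::nullopt;
  }
  return ParsedDate{year, fields[1], fields[2]};
}
```

With this sketch, "2015", "2015-03", "2015-03-18 123" and "2015-03-18T" all parse, while "20150318" and "2015/03/18" are rejected.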

@facebook-github-bot added the CLA Signed label Jul 26, 2023
@marin-ma changed the title from "Support more Date pattern" to "Support more date pattern for sparksql cast from string to date" Aug 1, 2023
@marin-ma
Contributor Author

marin-ma commented Aug 1, 2023

@mbasmanova PTAL. Thanks!

Contributor

@mbasmanova mbasmanova left a comment

@marin-ma Does Presto support these patterns as well? Would you update documentation for CAST to add these new patterns?

#5847

CC: @kagamiori

@marin-ma
Contributor Author

marin-ma commented Aug 1, 2023

@marin-ma Does Presto support these patterns as well? Would you update documentation for CAST to add these new patterns?

#5847

CC: @kagamiori

I'm not sure about prestosql, and I don't have a test environment to check. @kagamiori Could you help check this on prestosql?

@marin-ma
Contributor Author

marin-ma commented Aug 1, 2023

I just got a test environment for prestosql. Only SELECT cast('YYYY-MM-DD' as date) passes; all other patterns fail. It looks like prestosql supports only one pattern, 'YYYY-MM-DD'. @kagamiori

@mbasmanova
Contributor

@marin-ma Thank you for checking. It sounds like Presto and Spark behave differently for cast(varchar as date). Since CastExpr is shared between the two, we would need to introduce a configuration property to control the behavior. Search for queryConfig.isCastToIntByTruncate() to see some examples.
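
A configuration property of this kind can be sketched as follows. This is a minimal illustration, not the Velox `QueryConfig` class: `QueryConfigSketch`, `DateParseMode`, and `dateParseMode` are hypothetical names, and the only details taken from this thread are the key `cast_string_to_date_non_standard` and its default of false.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a query-level configuration object.
class QueryConfigSketch {
 public:
  static constexpr const char* kCastStringToDateNonStandard =
      "cast_string_to_date_non_standard";

  explicit QueryConfigSketch(std::unordered_map<std::string, std::string> values)
      : values_(std::move(values)) {}

  // Defaults to false, so the single-pattern (Presto-style) behavior is
  // kept unless the flag is explicitly set to "true".
  bool isCastStringToDateNonStandard() const {
    auto it = values_.find(kCastStringToDateNonStandard);
    return it != values_.end() && it->second == "true";
  }

 private:
  std::unordered_map<std::string, std::string> values_;
};

// Hypothetical parse modes for the shared cast expression.
enum class DateParseMode { kStandard, kNonStandard };

// The shared cast code reads the flag once and picks a parse mode.
inline DateParseMode dateParseMode(const QueryConfigSketch& config) {
  return config.isCastStringToDateNonStandard() ? DateParseMode::kNonStandard
                                                : DateParseMode::kStandard;
}
```

Keeping the default at false means existing Presto workloads see no change, and only a Spark integration that sets the flag gets the additional patterns.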

@kagamiori
Contributor

I just got a test environment for prestosql. Only SELECT cast('YYYY-MM-DD' as date) pass and all other patterns fails. Looks like prestosql only support one pattern 'YYYY-MM-DD' @kagamiori

Double checked on my side too. These (except YYYY-MM-DD) are not supported in Presto.

@marin-ma marin-ma force-pushed the date-pattern branch 2 times, most recently from 2cda6f8 to 2c5370e Compare August 2, 2023 10:37
@marin-ma
Contributor Author

marin-ma commented Aug 2, 2023

@marin-ma Thank you for checking. Sounds like Presto and Spark behavior for cast(varchar as date) is different. Since, CastExpr is shared between the two, we would need to introduce a configuration property to control the behavior. Search for queryConfig.isCastToIntByTruncate() to see some examples.

@mbasmanova Updated & PTAL. Thanks!

@@ -105,6 +105,8 @@ constexpr int32_t kCumulativeYearDays[] = {

namespace {

enum class ParseMode { STRICT, NON_STRICT, STANDARD_CAST, NON_STANDARD_CAST };
Contributor

Please, review the coding guidelines and update accordingly.

/// `[+-]YYYY*-[M]M-[D]DT*`
///
/// Throws VeloxUserError if the format or date is invalid.
int32_t castFromDateString(const char* buf, size_t len, bool isNonStandardCast);
Contributor

It would be nice to extract this change into a separate PR and add tests.

facebook-github-bot pushed a commit that referenced this pull request Aug 14, 2023
Summary:
This is a splitting PR from #5844

Since `fromDateString` cannot properly handle the different cast behaviors and returns int64_t, which may overflow the Date type, this PR adds `castFromDateString` as a helper function that can handle the different string-to-date cast behaviors of sparksql and prestosql.

Below patterns are considered valid to cast from string to date in spark sql functions:

Year only: "YYYY"
Year and month only: "YYYY-MM"
Digits after trailing spaces: "YYYY-MM-DD 123"
Trailing character 'T': "YYYY-MM-DDT"
Digits after 'T': "YYYY-MM-DDT123123"
"[+-]" before any valid pattern.

Below patterns are valid in presto sql:

"[+-]YYYY-MM-DD"

Pull Request resolved: #5994

Reviewed By: kgpai

Differential Revision: D48234502

Pulled By: bikramSingh91

fbshipit-source-id: 1e773692cad438bc5d3b948b0ab33e5f39f89823
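
The overflow concern in this summary can be illustrated with a short sketch: compute the day count in 64 bits, then narrow to 32 bits only when the value fits. The names `daysFromCivil` and `toDateDays` are hypothetical and not taken from Velox; the day-count formula is the widely used civil-date algorithm for the proleptic Gregorian calendar, assumed here for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Days since 1970-01-01 for a proleptic Gregorian date, computed in 64 bits
// so that very large years do not overflow the intermediate result.
inline int64_t daysFromCivil(int64_t y, unsigned m, unsigned d) {
  y -= m <= 2; // Treat January and February as months of the previous year.
  const int64_t era = (y >= 0 ? y : y - 399) / 400;
  const unsigned yoe = static_cast<unsigned>(y - era * 400); // [0, 399]
  const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1; // [0, 365]
  const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // [0, 146096]
  return era * 146097 + static_cast<int64_t>(doe) - 719468;
}

// Narrow to a 32-bit day count only when the value fits; otherwise report
// failure instead of silently wrapping around.
inline std::optional<int32_t> toDateDays(int64_t days) {
  if (days < std::numeric_limits<int32_t>::min() ||
      days > std::numeric_limits<int32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<int32_t>(days);
}
```

For example, 2015-03-18 maps to 16512 days and fits comfortably, while a date in year 20150318 produces a day count in the billions that does not fit in 32 bits.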
@marin-ma
Contributor Author

@bikramSingh91 Could you help to review?

@marin-ma
Contributor Author

@bikramSingh91 Could you help to review this PR? This one is a follow-up of #5994 Thanks!

@marin-ma
Contributor Author

@mbasmanova Could you help to review again?

@@ -651,6 +658,68 @@ TEST_F(CastExprTest, invalidDate) {
"date", {"2012-Oct-23"}, {0}, true, false, VARCHAR(), DATE());
}

TEST_F(CastExprTest, stringToDate) {
Contributor

nit: can you move these tests to TEST_F(CastExprTest, date)

DATE());
}

TEST_F(CastExprTest, stringToDateInvalid) {
Contributor

nit: can you move these tests to TEST_F(CastExprTest, invalidDate)

- bool
- false
- This flag allows the cast from string to date to accept patterns other than the "[+-](YYYY-MM-DD)" format.
Valid patterns include (YYYY, YYYY-MM, YYYY-MM-DD), and any of these patterns prefixed with [+-].
Contributor

should we also mention other cases like
Digits after trailing spaces: "YYYY-MM-DD 123"
Trailing character 'T': "YYYY-MM-DDT"
Digits after 'T': "YYYY-MM-DDT123123"

for (bool conf : {true, false}) {
setCastStringToDateNonStandard(conf);
testCast<std::string, int32_t>(
"date",
Contributor

wondering if we should include a case for trailing (BC) for the case where it is valid

Contributor Author

Added. It's invalid in presto but valid in spark.

Contributor Author

@bikramSingh91 My previous understanding of trailing (BC) was wrong. The trailing (BC) must be preceded by a space. So cast("2015-03-18(BC)" as date) returns NULL in Spark, but cast("2015-03-18 (BC)" as date) returns 2015-03-18.
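
The rule described in this comment can be sketched as a check on the character that immediately follows the date. `acceptsTrailingText` is a hypothetical helper, not a Velox function, and it assumes the first 10 characters have already been validated as a YYYY-MM-DD date.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: after a full YYYY-MM-DD date, extra text is tolerated
// only when it starts with a space or with 'T'.
inline bool acceptsTrailingText(std::string_view input) {
  constexpr std::size_t kDateLength = 10; // Length of "YYYY-MM-DD".
  if (input.size() < kDateLength) {
    return false;
  }
  if (input.size() == kDateLength) {
    return true; // Nothing follows the date.
  }
  const char next = input[kDateLength];
  return next == ' ' || next == 'T';
}
```

Under this rule "2015-03-18 (BC)" is accepted because a space follows the date, while "2015-03-18(BC)" is rejected because '(' follows it directly.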

@@ -80,6 +80,11 @@ class QueryConfig {
// truncating the decimal part instead of rounding.
static constexpr const char* kCastToIntByTruncate = "cast_to_int_by_truncate";

// This flag allows the cast from string to date to accept patterns other than
// the "(+-)[YYYY-MM-DD]" format.
static constexpr const char* kCastStringToDateNonStandard =
Contributor

nit: can you also mention this newly added query config in the PR description that enables this behavior.

@@ -109,6 +109,11 @@ Expression Evaluation Configuration
- bool
- false
- This flag forces the cast from float/double to integer to be performed by truncating the decimal part instead of rounding.
* - cast_string_to_date_non_standard
Contributor

Let's also document configs that alter cast behavior in cast documentation: https://facebookincubator.github.io/velox/functions/presto/conversion.html

Contributor Author

@mbasmanova "non_standard" here means non-ISO 8601 standard. Alternatively, should we rename it to cast_string_to_date_iso8601 and make the default value true?

@marin-ma
Contributor Author

@mbasmanova @bikramSingh91 Could you help to review again? Thanks!

@bikramSingh91
Contributor

@marin-ma Can you please take a look at the failure in castFromDateStringInvalid that is failing some of the CircleCI precommits?

@facebook-github-bot
Contributor

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@marin-ma
Contributor Author

marin-ma commented Sep 1, 2023

@marin-ma Can you please take a look at the failure in castFromDateStringInvalid that failing some of the circleCI precommits?

@bikramSingh91 PTAL. Thanks!

@marin-ma
Contributor Author

marin-ma commented Sep 5, 2023

@bikramSingh91 Could you help check the linter's error message? I can't see the errors.

@@ -15,6 +15,7 @@
*/

#include "velox/type/TimestampConversion.h"
#include <common/base/tests/GTestUtils.h>
Contributor

nit: can you please update this to use
#include "velox/common/base/tests/GTestUtils.h"
and add it under L21 instead

@@ -270,6 +270,15 @@ bool tryParseDateString(

// In standard-cast mode, no more trailing characters.
Contributor

This comment is outdated: "In standard-cast mode, no more trailing characters."

@@ -15,6 +15,7 @@
*/

#include "velox/type/TimestampConversion.h"
#include <common/base/tests/GTestUtils.h>
#include <gmock/gmock.h>
#include <gtest/gtest.h>
Contributor

If we include "velox/common/base/tests/GTestUtils.h", we can remove #include <gtest/gtest.h>.

Comment on lines 272 to 280
// Skip trailing spaces.
while (pos < len && characterIsSpace(buf[pos])) {
  pos++;
}
// Check position. if end was not reached, non-space chars remaining.
if (pos < len) {
  return false;
}

Contributor

@marin-ma Thanks for making all the changes so far. I have one last request: can you please move this part of the change out into a separate PR? Unfortunately, this changes the behavior of the existing cast and can cause issues for existing workloads that don't expect it. It would be good to separate it in case we need to roll back this behavior change while still keeping all the Spark-specific functionality you added here.

@bikramSingh91
Contributor

Thank you for moving out the changes to standard mode. I'll import these updates and will proceed with merging.

@marin-ma
Contributor Author

@bikramSingh91 Removed the changes in the latest commit. Could you help review again? Thanks!

@marin-ma
Contributor Author

@bikramSingh91 Could you help to review again? Thanks!

@bikramSingh91
Contributor

@marin-ma I was on PTO for the last week. I will take this up today with high priority.

@facebook-github-bot
Contributor

@bikramSingh91 merged this pull request in 8d6c296.

codyschierbeck pushed a commit to codyschierbeck/velox that referenced this pull request Sep 27, 2023
…r#5844)

Summary:
Below patterns are considered valid to cast from string to date in spark sql functions:

1. Year only: "YYYY"
2. Year and month only: "YYYY-MM"
3. Any characters after trailing spaces: "YYYY-MM-DD ", "YYYY-MM-DD 123", "YYYY-MM-DD (BC)"
4. Any characters after trailing character 'T': "YYYY-MM-DDT"

Below patterns are invalid:

1. Year is too large (exceed INT32_MAX): 20150318
2. Other separators "YYYY/MM/DD"

Reference:
Spark cast from string to date:
https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298

Unit test:
CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126
CastExprTest::fromStringToDateInvalid is derived from
https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73

Pull Request resolved: facebookincubator#5844

Reviewed By: kevinwilfong

Differential Revision: D48881690

Pulled By: bikramSingh91

fbshipit-source-id: 4236669585cf76762da1bc96d7976014aebab5d3