Support all Spark patterns in cast(varchar as date) (#5844)
Summary:
The following patterns are valid when casting a string to a date in Spark SQL functions:

1. Year only: "YYYY"
2. Year and month only: "YYYY-MM"
3. Full date followed by a space and any trailing characters: "YYYY-MM-DD ", "YYYY-MM-DD 123", "YYYY-MM-DD (BC)"
4. Full date followed by 'T' and any trailing characters: "YYYY-MM-DDT"

The following patterns are invalid:

1. Year is too large (exceeds INT32_MAX): "20150318"
2. Other separators: "YYYY/MM/DD"
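
As a rough illustration of the accepted shapes, the sketch below checks a string against the patterns listed above using `std::regex`. The function name `matchesSparkDatePattern` and the regular expression are assumptions made for this example; they are not the code in this commit, which changes `tryParseDateString`. The year is matched as one or more digits, which is also an assumption. The sketch checks shape only, so an out-of-range year such as "20150318" still matches here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of Velox: returns true if `s` has one of the
// Spark-style shapes listed in the commit message. Shape only; no range checks.
bool matchesSparkDatePattern(const std::string& s) {
  // Optional sign, year digits, optional "-month", optional "-day",
  // then optionally a space or 'T' followed by any characters.
  static const std::regex kPattern(
      R"([+-]?\d+(-\d{1,2}(-\d{1,2}([ T].*)?)?)?)");
  return std::regex_match(s, kPattern);
}
```

Because `std::regex_match` requires the whole input to match, a string such as "2015-03-18X" is rejected: the only way to continue after the day is a space or a 'T'.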

Reference:
Spark cast from string to date:
https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L286-L298

Unit test:
CastExprTest::fromStringToDate is derived from https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala#L103-L126
CastExprTest::fromStringToDateInvalid is derived from
https://github.com/apache/spark/blob/3a9185964a0de3c720a6b77d38a446258b73468e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala#L67-L73
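
The expected values in the Velox test further down are integers because a DATE is compared as the number of days since 1970-01-01. The sketch below recomputes a few of them with the widely used days-from-civil algorithm for the proleptic Gregorian calendar. The name `daysFromCivil` is an assumption for this example; it is not the helper the diff calls (`daysSinceEpochFromDate`).

```cpp
#include <cstdint>

// Hypothetical helper: days since 1970-01-01 for a proleptic Gregorian date.
// Precondition: 1 <= m <= 12 and d is a valid day of that month.
int64_t daysFromCivil(int64_t y, unsigned m, unsigned d) {
  // Shift the year so that it starts on March 1; leap days then fall at the end.
  y -= m <= 2;
  // A 400-year era always has exactly 146097 days.
  const int64_t era = (y >= 0 ? y : y - 399) / 400;
  const unsigned yoe = static_cast<unsigned>(y - era * 400);  // [0, 399]
  const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
  const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
  // 719468 is the number of days from 0000-03-01 to 1970-01-01.
  return era * 146097 + static_cast<int64_t>(doe) - 719468;
}
```

For example, `daysFromCivil(2015, 3, 18)` gives 16512, the value the test expects for "2015-03-18T", and `daysFromCivil(2015, 1, 1)` gives 16436, the value expected for the year-only input "2015".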

Pull Request resolved: #5844

Reviewed By: kevinwilfong

Differential Revision: D48881690

Pulled By: bikramSingh91

fbshipit-source-id: 4236669585cf76762da1bc96d7976014aebab5d3
marin-ma authored and facebook-github-bot committed Sep 15, 2023
1 parent ec641b0 commit 8d6c296
Showing 8 changed files with 262 additions and 93 deletions.
19 changes: 19 additions & 0 deletions velox/core/QueryConfig.h
@@ -80,6 +80,21 @@ class QueryConfig {
/// decimal part, otherwise rounds.
static constexpr const char* kCastToIntByTruncate = "cast_to_int_by_truncate";

/// If set, cast from string to date allows only ISO 8601 formatted strings:
/// [+-](YYYY-MM-DD). Otherwise, allows all patterns supported by Spark:
/// `[+-]yyyy*`
/// `[+-]yyyy*-[m]m`
/// `[+-]yyyy*-[m]m-[d]d`
/// `[+-]yyyy*-[m]m-[d]d *`
/// `[+-]yyyy*-[m]m-[d]dT*`
/// The asterisk `*` in `yyyy*` stands for any numbers.
/// For the last two patterns, the trailing `*` can represent none or any
/// sequence of characters, e.g:
/// "1970-01-01 123"
/// "1970-01-01 (BC)"
static constexpr const char* kCastStringToDateIsIso8601 =
"cast_string_to_date_is_iso_8601";

/// Used for backpressure to block local exchange producers when the local
/// exchange buffer reaches or exceeds this size.
static constexpr const char* kMaxLocalExchangeBufferSize =
@@ -336,6 +351,10 @@ class QueryConfig {
return get<bool>(kCastToIntByTruncate, false);
}

bool isIso8601() const {
return get<bool>(kCastStringToDateIsIso8601, true);
}

bool codegenEnabled() const {
return get<bool>(kCodegenEnabled, false);
}
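
The `isIso8601()` accessor above reads a boolean flag by key and falls back to `true` when the key is absent, so ISO 8601 mode is the default. A minimal sketch of that lookup-with-default behaviour follows. `ToyConfig`, `getBool`, and the accepted spellings of a true value are assumptions for this example and do not describe how Velox's `QueryConfig::get<bool>` parses values.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a string-keyed query config; not Velox's class.
class ToyConfig {
 public:
  explicit ToyConfig(std::unordered_map<std::string, std::string> values)
      : values_(std::move(values)) {}

  // Returns the stored flag, or `defaultValue` when the key is absent.
  // Accepting "true" and "1" is an assumption of this sketch.
  bool getBool(const std::string& key, bool defaultValue) const {
    auto it = values_.find(key);
    if (it == values_.end()) {
      return defaultValue;
    }
    return it->second == "true" || it->second == "1";
  }

  // Mirrors the shape of the accessor in the diff: default is ISO 8601 mode.
  bool isIso8601() const {
    return getBool("cast_string_to_date_is_iso_8601", true);
  }

 private:
  std::unordered_map<std::string, std::string> values_;
};
```

The test helper in this commit stores the flag with `std::to_string(value)`, which yields "1" or "0" for a `bool`; the toy parser accepts those spellings for that reason.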
17 changes: 17 additions & 0 deletions velox/docs/configs.rst
@@ -82,6 +82,8 @@ Generic Configuration
- 1000
- The minimum number of table rows that can trigger the parallel hash join table build.

.. _expression-evaluation-conf:

Expression Evaluation Configuration
-----------------------------------
.. list-table::
@@ -109,6 +111,21 @@ Expression Evaluation Configuration
- bool
- false
- This flag forces the cast from float/double/decimal/string to integer to be performed by truncating the decimal part instead of rounding.
* - cast_string_to_date_is_iso_8601
- bool
- true
- If set, cast from string to date allows only ISO 8601 formatted strings: ``[+-](YYYY-MM-DD)``.
Otherwise, allows all patterns supported by Spark:
* ``[+-]yyyy*``
* ``[+-]yyyy*-[m]m``
* ``[+-]yyyy*-[m]m-[d]d``
* ``[+-]yyyy*-[m]m-[d]d *``
* ``[+-]yyyy*-[m]m-[d]dT*``
The asterisk ``*`` in ``yyyy*`` stands for any numbers.
For the last two patterns, the trailing ``*`` can represent none or any sequence of characters, e.g:
* "1970-01-01 123"
* "1970-01-01 (BC)"
Regardless of this setting's value, leading spaces will be trimmed.

Memory Management
-----------------
43 changes: 39 additions & 4 deletions velox/docs/functions/presto/conversion.rst
@@ -582,20 +582,55 @@ Cast to Date
From strings
^^^^^^^^^^^^

Casting from a string to date is allowed if the string represents a date in the
format `YYYY-MM-DD`. Casting from invalid input values throws.
By default, only ISO 8601 strings are supported: `[+-]YYYY-MM-DD`.

Valid example
If ``cast_string_to_date_is_iso_8601`` is set to false, all Spark-supported patterns are allowed.
See the documentation for ``cast_string_to_date_is_iso_8601`` in :ref:`Expression Evaluation Configuration<expression-evaluation-conf>`
for the full list of supported patterns.

Casting from invalid input values throws.

Valid examples

**cast_string_to_date_is_iso_8601=true**

::

SELECT cast('1970-01-01' as date); -- 1970-01-01

Invalid example
**cast_string_to_date_is_iso_8601=false**

::

SELECT cast('1970' as date); -- 1970-01-01
SELECT cast('1970-01' as date); -- 1970-01-01
SELECT cast('1970-01-01' as date); -- 1970-01-01
SELECT cast('1970-01-01T123' as date); -- 1970-01-01
SELECT cast('1970-01-01 ' as date); -- 1970-01-01
SELECT cast('1970-01-01 (BC)' as date); -- 1970-01-01

Invalid examples

**cast_string_to_date_is_iso_8601=true**

::

SELECT cast('2012' as date); -- Invalid argument
SELECT cast('2012-10' as date); -- Invalid argument
SELECT cast('2012-10-23T123' as date); -- Invalid argument
SELECT cast('2012-10-23 (BC)' as date); -- Invalid argument
SELECT cast('2012-Oct-23' as date); -- Invalid argument
SELECT cast('2012/10/23' as date); -- Invalid argument
SELECT cast('2012.10.23' as date); -- Invalid argument
SELECT cast('2012-10-23 ' as date); -- Invalid argument

**cast_string_to_date_is_iso_8601=false**

::

SELECT cast('2012-Oct-23' as date); -- Invalid argument
SELECT cast('2012/10/23' as date); -- Invalid argument
SELECT cast('2012.10.23' as date); -- Invalid argument

From TIMESTAMP
^^^^^^^^^^^^^^
10 changes: 5 additions & 5 deletions velox/expression/CastExpr.cpp
@@ -103,14 +103,14 @@ VectorPtr CastExpr::castToDate(
switch (fromType->kind()) {
case TypeKind::VARCHAR: {
auto* inputVector = input.as<SimpleVector<StringView>>();
const auto& queryConfig = context.execCtx()->queryCtx()->queryConfig();
auto isIso8601 = queryConfig.isIso8601();
applyToSelectedNoThrowLocal(context, rows, castResult, [&](int row) {
try {
auto inputString = inputVector->valueAt(row);
resultFlatVector->set(row, DATE()->toDays(inputString));
} catch (const VeloxException& ue) {
if (!ue.isUserError()) {
throw;
}
resultFlatVector->set(
row, util::castFromDateString(inputString, isIso8601));
} catch (const VeloxUserError& ue) {
VELOX_USER_FAIL(
makeErrorMessage(input, row, DATE()) + " " + ue.message());
} catch (const std::exception& e) {
141 changes: 104 additions & 37 deletions velox/expression/tests/CastExprTest.cpp
@@ -71,6 +71,12 @@ class CastExprTest : public functions::test::CastBaseTest {
});
}

void setCastStringToDateIsIso8601(bool value) {
queryCtx_->testingOverrideConfigUnsafe({
{core::QueryConfig::kCastStringToDateIsIso8601, std::to_string(value)},
});
}

std::shared_ptr<core::ConstantTypedExpr> makeConstantNullExpr(TypeKind kind) {
return std::make_shared<core::ConstantTypedExpr>(
createType(kind, {}), variant(kind));
@@ -672,49 +678,110 @@ TEST_F(CastExprTest, timestampAdjustToTimezoneInvalid) {
}

TEST_F(CastExprTest, date) {
std::vector<std::optional<std::string>> input{
"1970-01-01",
"2020-01-01",
"2135-11-09",
"1969-12-27",
"1812-04-15",
"1920-01-02",
std::nullopt,
};
std::vector<std::optional<int32_t>> result{
0,
18262,
60577,
-5,
-57604,
-18262,
std::nullopt,
};

testCast<std::string, int32_t>(
"date", input, result, false, false, VARCHAR(), DATE());
for (bool isIso8601 : {true, false}) {
setCastStringToDateIsIso8601(isIso8601);
testCast<std::string, int32_t>(
"date",
{"1970-01-01",
"2020-01-01",
"2135-11-09",
"1969-12-27",
"1812-04-15",
"1920-01-02",
"12345-12-18",
"1970-1-2",
"1970-01-2",
"1970-1-02",
"+1970-01-02",
"-1-1-1",
" 1970-01-01",
std::nullopt},
{0,
18262,
60577,
-5,
-57604,
-18262,
3789742,
1,
1,
1,
1,
-719893,
0,
std::nullopt},
false,
false,
VARCHAR(),
DATE());
}

setCastIntByTruncate(true);
setCastStringToDateIsIso8601(false);
testCast<std::string, int32_t>(
"date", input, result, false, false, VARCHAR(), DATE());
"date",
{"12345",
"2015",
"2015-03",
"2015-03-18T",
"2015-03-18T123123",
"2015-03-18 123142",
"2015-03-18 (BC)"},
{3789391, 16436, 16495, 16512, 16512, 16512, 16512},
false,
false,
VARCHAR(),
DATE());
}

TEST_F(CastExprTest, invalidDate) {
testCast<int8_t, int32_t>("date", {12}, {0}, true, false, TINYINT(), DATE());
testCast<int16_t, int32_t>(
"date", {1234}, {0}, true, false, SMALLINT(), DATE());
testCast<int32_t, int32_t>(
"date", {1234}, {0}, true, false, INTEGER(), DATE());
testCast<int64_t, int32_t>(
"date", {1234}, {0}, true, false, BIGINT(), DATE());

testCast<float, int32_t>("date", {12.99}, {0}, true, false, REAL(), DATE());
testCast<double, int32_t>(
"date", {12.99}, {0}, true, false, DOUBLE(), DATE());

// Parsing an ill-formatted date.
for (bool isIso8601 : {true, false}) {
setCastStringToDateIsIso8601(isIso8601);

testCast<int8_t, int32_t>(
"date", {12}, {0}, true, false, TINYINT(), DATE());
testCast<int16_t, int32_t>(
"date", {1234}, {0}, true, false, SMALLINT(), DATE());
testCast<int32_t, int32_t>(
"date", {1234}, {0}, true, false, INTEGER(), DATE());
testCast<int64_t, int32_t>(
"date", {1234}, {0}, true, false, BIGINT(), DATE());

testCast<float, int32_t>("date", {12.99}, {0}, true, false, REAL(), DATE());
testCast<double, int32_t>(
"date", {12.99}, {0}, true, false, DOUBLE(), DATE());

// Parsing ill-formatted dates.
testCast<std::string, int32_t>(
"date", {"2012-Oct-23"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015-03-18X"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015/03/18"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015.03.18"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"20150318"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015-031-8"}, {0}, true, false, VARCHAR(), DATE());
}

setCastStringToDateIsIso8601(true);
testCast<std::string, int32_t>(
"date", {"12345"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015-03"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015-03-18 123412"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015-03-18T"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015-03-18T123412"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2015-03-18 (BC)"}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"1970-01-01 "}, {0}, true, false, VARCHAR(), DATE());
testCast<std::string, int32_t>(
"date", {"2012-Oct-23"}, {0}, true, false, VARCHAR(), DATE());
"date", {" 1970-01-01 "}, {0}, true, false, VARCHAR(), DATE());
}

TEST_F(CastExprTest, primitiveInvalidCornerCases) {
23 changes: 11 additions & 12 deletions velox/type/TimestampConversion.cpp
@@ -268,7 +268,6 @@ bool tryParseDateString(
return false;
}

// In standard-cast mode, no more trailing characters.
if (mode == ParseMode::kStandardCast) {
daysSinceEpoch = daysSinceEpochFromDate(year, month, day);

@@ -278,7 +277,7 @@ bool tryParseDateString(
return false;
}

// In non-standard cast mode, any optional trailing 'T' or spaces followed
// In non-standard cast mode, an optional trailing 'T' or space followed
// by any optional characters are valid patterns.
if (mode == ParseMode::kNonStandardCast) {
daysSinceEpoch = daysSinceEpochFromDate(year, month, day);
@@ -586,26 +585,26 @@ int64_t fromDateString(const char* str, size_t len) {
return daysSinceEpoch;
}

int32_t
castFromDateString(const char* str, size_t len, bool isNonStandardCast) {
int32_t castFromDateString(const char* str, size_t len, bool isIso8601) {
int64_t daysSinceEpoch;
size_t pos = 0;

auto mode = isNonStandardCast ? ParseMode::kNonStandardCast
: ParseMode::kStandardCast;
auto mode =
isIso8601 ? ParseMode::kStandardCast : ParseMode::kNonStandardCast;
if (!tryParseDateString(str, len, pos, daysSinceEpoch, mode)) {
if (isNonStandardCast) {
if (isIso8601) {
VELOX_USER_FAIL(
"Unable to parse date value: \"{}\"."
"Valid date string patterns include "
"(YYYY, YYYY-MM, YYYY-MM-DD), and any pattern prefixed with [+-]",
"Valid date string pattern is (YYYY-MM-DD), "
"and can be prefixed with [+-]",
std::string(str, len));

} else {
VELOX_USER_FAIL(
"Unable to parse date value: \"{}\"."
"Valid date string pattern is (YYYY-MM-DD), "
"and can be prefixed with [+-]",
"Valid date string patterns include "
"(yyyy*, yyyy*-[m]m, yyyy*-[m]m-[d]d, "
"yyyy*-[m]m-[d]d *, yyyy*-[m]m-[d]dT*), "
"and any pattern prefixed with [+-]",
std::string(str, len));
}
}
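
The function above follows a common structure: a non-throwing parse that reports success or failure, wrapped by a function that turns failure into a user-facing error whose text depends on the mode. A minimal sketch of that structure follows. All names here (`ParsedDate`, `readDigits`, `tryParseDate`, `castToDateOrThrow`) are hypothetical, the 9-digit year cap is an assumption, and the sketch checks only shape and coarse ranges, so it does not reproduce Velox's calendar validation or its rejection of overly large years.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical result type for illustration; not Velox's API.
struct ParsedDate {
  long long year;
  int month;
  int day;
};

// Reads 1..maxDigits decimal digits at `pos`; returns false if there are none.
inline bool readDigits(
    std::string_view s, std::size_t& pos, std::size_t maxDigits, long long& out) {
  const std::size_t start = pos;
  out = 0;
  while (pos < s.size() && pos - start < maxDigits && s[pos] >= '0' &&
         s[pos] <= '9') {
    out = out * 10 + (s[pos] - '0');
    ++pos;
  }
  return pos > start;
}

// Non-throwing parse; `out` is written only on success.
inline bool tryParseDate(std::string_view s, bool isIso8601, ParsedDate& out) {
  std::size_t pos = 0;
  while (pos < s.size() && s[pos] == ' ') {
    ++pos;  // Leading spaces are skipped in both modes.
  }
  bool negative = false;
  if (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) {
    negative = s[pos] == '-';
    ++pos;
  }
  long long year = 0, month = 1, day = 1;
  if (!readDigits(s, pos, 9, year)) {  // 9-digit cap: assumption of this sketch.
    return false;
  }
  bool hasDay = false;
  if (pos < s.size() && s[pos] == '-') {
    ++pos;
    if (!readDigits(s, pos, 2, month)) {
      return false;
    }
    if (pos < s.size() && s[pos] == '-') {
      ++pos;
      if (!readDigits(s, pos, 2, day)) {
        return false;
      }
      hasDay = true;
    }
  }
  if (month < 1 || month > 12 || day < 1 || day > 31) {
    return false;
  }
  const bool atEnd = pos == s.size();
  // ISO mode: full date and nothing after it. Otherwise: partial dates are
  // fine at end of input, and a full date may be followed by ' ' or 'T'.
  const bool ok = atEnd
      ? (hasDay || !isIso8601)
      : (!isIso8601 && hasDay && (s[pos] == ' ' || s[pos] == 'T'));
  if (ok) {
    out = ParsedDate{
        negative ? -year : year, static_cast<int>(month), static_cast<int>(day)};
  }
  return ok;
}

// Throwing wrapper: picks the error text according to the mode.
inline ParsedDate castToDateOrThrow(std::string_view s, bool isIso8601) {
  ParsedDate d{};
  if (!tryParseDate(s, isIso8601, d)) {
    throw std::invalid_argument(
        "Unable to parse date value: \"" + std::string(s) + "\". " +
        (isIso8601 ? "Expected [+-]YYYY-MM-DD."
                   : "Expected one of the Spark-style patterns."));
  }
  return d;
}
```

Keeping the parse non-throwing lets callers that only need a yes/no answer avoid exception overhead, while the wrapper is the single place where the mode-specific message is assembled.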
12 changes: 5 additions & 7 deletions velox/type/TimestampConversion.h
@@ -85,8 +85,8 @@ inline int64_t fromDateString(const StringView& str) {
}

/// Cast string to date.
/// When isNonStandardCast = false, only support "[+-]YYYY-MM-DD" format.
/// When isNonStandardCast = true, supported date formats include:
/// When isIso8601 = true, only support "[+-]YYYY-MM-DD" format (ISO 8601).
/// When isIso8601 = false, supported date formats include:
///
/// `[+-]YYYY*`
/// `[+-]YYYY*-[M]M`
@@ -96,12 +96,10 @@ inline int64_t fromDateString(const StringView& str) {
/// `[+-]YYYY*-[M]M-[D]DT*`
///
/// Throws VeloxUserError if the format or date is invalid.
int32_t castFromDateString(const char* buf, size_t len, bool isNonStandardCast);
int32_t castFromDateString(const char* buf, size_t len, bool isIso8601);

inline int32_t castFromDateString(
const StringView& str,
bool isNonStandardCast) {
return castFromDateString(str.data(), str.size(), isNonStandardCast);
inline int32_t castFromDateString(const StringView& str, bool isIso8601) {
return castFromDateString(str.data(), str.size(), isIso8601);
}

// Extracts the day of the week from the number of days since epoch