Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString #28582

Closed
wants to merge 12 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented May 19, 2020

What changes were proposed in this pull request?

  1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
    • TimestampFormatter:
      • def format(ts: Timestamp): String
      • def format(instant: Instant): String
    • DateFormatter:
      • def format(date: Date): String
      • def format(localDate: LocalDate): String
  2. Re-use the added methods from HiveResult.toHiveString
  3. Borrow the code for formatting of java.sql.Timestamp from Spark 2.4 DateTimeUtils.timestampToString to FractionTimestampFormatter because legacy formatters don't support variable length patterns for seconds fractions.

Why are the changes needed?

To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By existing tests for toHiveString and new tests in TimestampFormatterSuite.

@SparkQA
Copy link

SparkQA commented May 19, 2020

Test build #122854 has finished for PR 28582 at commit d8b0ed4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString [SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString May 20, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented May 20, 2020

@cloud-fan @HyukjinKwon @bogdanghit Please, review this PR.

@SparkQA
Copy link

SparkQA commented May 20, 2020

Test build #122887 has finished for PR 28582 at commit 3914313.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -100,10 +111,26 @@ class Iso8601TimestampFormatter(
*/
class FractionTimestampFormatter(zoneId: ZoneId)
extends Iso8601TimestampFormatter(
"", zoneId, TimestampFormatter.defaultLocale, needVarLengthSecondFraction = false) {
TimestampFormatter.defaultPattern,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong if we still pass the empty string as the pattern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty result

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it's used by the legacy formatter.


@transient
override protected lazy val formatter = DateTimeFormatterHelper.fractionFormatter

// Converts Timestamp to string according to Hive TimestampWritable convention.
// The code is borrowed from Spark 2.4 DateTimeUtils.timestampToString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really needed? DateTimeFormatterHelper.fractionFormatter should omit tailing 0 already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should omit tailing 0 already.

The reason of making this PR is to pass java.sql.Timestamp to a legacy formatter which can accept the type but the legacy formatter of fractionFormatter is SimpleDateFormat which cannot omit tailing 0.

is it really needed?

I think so. What do you propose instead of it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make the comment more clear:

// The new formatter will omit the trailing 0 in the timestamp string, but the legacy formatter can't.
// Here we borrow the code from Spark 2.4 DateTimeUtils.timestampToString to omit the
// trailing 0 for the legacy formatter as well.

We don't need to mention hive at all. This is just for internal consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I updated the comment.

@SparkQA
Copy link

SparkQA commented May 20, 2020

Test build #122883 has finished for PR 28582 at commit 6d8c2ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -584,7 +584,7 @@ select make_date(-44, 3, 15)
-- !query schema
struct<make_date(-44, 3, 15):date>
-- !query output
-0044-03-15
0045-03-15
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is compatible with Spark 2.4 because it uses SimpleDateFormat which cannot format negative years.

@SparkQA
Copy link

SparkQA commented May 20, 2020

Test build #122893 has finished for PR 28582 at commit 29d8e51.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 20, 2020

Test build #122902 has finished for PR 28582 at commit 31e8ffd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 5d67331 May 21, 2020
cloud-fan pushed a commit that referenced this pull request May 21, 2020
…n toHiveString

### What changes were proposed in this pull request?
1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
    - TimestampFormatter:
      - `def format(ts: Timestamp): String`
      - `def format(instant: Instant): String`
    - DateFormatter:
      - `def format(date: Date): String`
      - `def format(localDate: LocalDate): String`
2. Re-use the added methods from `HiveResult.toHiveString`
3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests for toHiveString and new tests in `TimestampFormatterSuite`.

Closes #28582 from MaxGekk/opt-format-old-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5d67331)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request May 25, 2020
…ectly

### What changes were proposed in this pull request?
Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR #28582 added the methods to avoid conversions to days and microseconds.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `RowJsonSuite`.

Closes #28620 from MaxGekk/toJson-format-Java-datetime-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request May 25, 2020
…ectly

### What changes were proposed in this pull request?
Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR #28582 added the methods to avoid conversions to days and microseconds.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `RowJsonSuite`.

Closes #28620 from MaxGekk/toJson-format-Java-datetime-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 7f36310)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants