
[SPARK-26456][SQL] Cast date/timestamp to string by Date/TimestampFormatter #23391

Closed
wants to merge 26 commits into from

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 27, 2018

What changes were proposed in this pull request?

In the PR, I propose to switch to `TimestampFormatter`/`DateFormatter` for casting dates/timestamps to strings. The changes should make date/timestamp casting consistent with the JSON/CSV datasources and with time-related functions like `to_date` and `to_unix_timestamp`/`from_unixtime`.

Local formatters are moved out of `DateTimeUtils` to the places where they are actually used. This avoids re-creating a formatter instance on each call. Another reason is to have a separate parser for `PartitioningUtils`, because the default parsing pattern cannot be used there (it expects the optional section `[.S]`).
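To illustrate the idea of hoisting the formatter out of the per-call path, here is a minimal Java sketch (the class and method names are hypothetical, not Spark's actual `TimestampFormatter` API):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class CastSketch {
    // Built once and reused; java.time.format.DateTimeFormatter is immutable
    // and thread-safe, unlike java.text.SimpleDateFormat, which must not be
    // shared across threads.
    static final DateTimeFormatter TIMESTAMP_FORMATTER =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String castTimestampToString(Instant instant, ZoneId zone) {
        return TIMESTAMP_FORMATTER.format(instant.atZone(zone));
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2018-12-27T10:15:30Z");
        System.out.println(castTimestampToString(ts, ZoneOffset.UTC)); // 2018-12-27 10:15:30
    }
}
```

Because the formatter is immutable, the shared instance is safe to use from multiple tasks, which is what makes moving it out of the per-call path worthwhile.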

How was this patch tested?

It was tested by `DateTimeUtilsSuite`, `CastSuite` and `JDBC*Suite`.

@MaxGekk MaxGekk changed the title [SPARK-26456][SQL] Cast date/timestamp by Date/TimestampFormatter [SPARK-26456][SQL] Cast date/timestamp to string by Date/TimestampFormatter Dec 27, 2018
@SparkQA

SparkQA commented Dec 27, 2018

Test build #100479 has finished for PR 23391 at commit 697688a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100480 has finished for PR 23391 at commit f0a9fe7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100494 has finished for PR 23391 at commit 5f0b0a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100501 has finished for PR 23391 at commit 8541602.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100499 has finished for PR 23391 at commit 56bdae4.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 28, 2018

@cloud-fan @srowen @dongjoon-hyun @HyukjinKwon May I ask you to review this PR?

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100510 has finished for PR 23391 at commit ecf0e89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -88,11 +88,18 @@ class LegacyFallbackDateFormatter(
}

object DateFormatter {
  val defaultPattern: String = "yyyy-MM-dd"
  val defaultLocale: Locale = Locale.US

  def apply(format: String, locale: Locale): DateFormatter = {
    if (SQLConf.get.legacyTimeParserEnabled) {
Contributor

When would you need this? You could argue that, since we are moving to Spark 3.0, we don't need to care as much about legacy behavior.

Member Author

In this PR, the date and timestamp patterns are fixed, so we shouldn't see any behavior changes; but `DateFormatter`/`TimestampFormatter` are also used by the CSV/JSON datasources and by functions where the user can set arbitrary patterns. Unfortunately, the patterns supported by `SimpleDateFormat` and `DateTimeFormatter` are not exactly the same. There are also other differences in their behavior: https://github.com/apache/spark/pull/23358/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9R42

One thing I have learned from other PRs: if I introduce a behavior change, I should leave users a way to switch back to the previous behavior. The old behavior can be deprecated later.
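The kind of behavioral difference between the legacy `java.text.SimpleDateFormat` and the java.time `DateTimeFormatter` being discussed here can be reproduced with a small self-contained example; the invalid date `2018-02-30` and the `yyyy-MM-dd` pattern below are just an illustration, not a case taken from the PR:

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class ParserDiff {
    public static void main(String[] args) throws Exception {
        // Legacy parser: lenient by default, so the invalid day-of-month
        // rolls the date forward into March.
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd");
        Date d = legacy.parse("2018-02-30");
        System.out.println(legacy.format(d)); // 2018-03-02

        // java.time parser: the default SMART resolver clamps the invalid
        // day-of-month to the last valid day of February instead.
        LocalDate ld = LocalDate.parse("2018-02-30",
            DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        System.out.println(ld); // 2018-02-28
    }
}
```

The two parsers accept the same input but produce different dates, which is exactly why a fallback flag for the old behavior was considered.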

Contributor

I am just saying that we are going to break stuff anyway. If the legacy behavior is somewhat unreasonable, then we should consider not supporting it.

Member Author

Getting rid of the flag is slightly out of this PR's scope, I believe. I would prefer to open a ticket and leave that to somebody much braver.

Member Author

The ticket to remove the flag: https://issues.apache.org/jira/browse/SPARK-26503

Member

For clarification, I don't think we should treat the previous behaviour as unreasonable ... I am okay with considering removing that legacy configuration, given that we're going ahead with 3.0, that it adds some maintenance overhead, and that it blocks some features.

Also, for clarification, it's kind of a breaking change. Consider code that depended on CSV fields being inferred as timestamps and that suddenly gets strings after the upgrade. This behaviour was even documented in 2.x (by referring to SimpleDateFormat).

@@ -111,6 +110,9 @@ class QueryExecution(
  protected def stringOrError[A](f: => A): String =
    try f.toString catch { case e: AnalysisException => e.toString }

  private val dateFormatter = DateFormatter()
Contributor
@hvanhovell hvanhovell Dec 28, 2018

We should probably get rid of the hiveResultString method for 3.0. It does not make much sense to keep it there.

Member Author

Should I create a separate JIRA for that?

Contributor

Yes please. We should just move that into a test class.

  def apply(format: String): DateFormatter = apply(format, defaultLocale)

  def apply(): DateFormatter = apply(defaultPattern)
Contributor

Both formatters seem to use thread-safe implementations. You could consider just returning cached instances here.

Member Author

At the moment, both formatters are created per partition at least, not per row. Do you think it makes sense to cache them?

Contributor

Ok, let's leave it for now.
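If caching were ever wanted here, one possible sketch (the `FormatterCache` name is hypothetical, not Spark code) would rely on `DateTimeFormatter` being immutable and thread-safe, memoizing one instance per (pattern, locale) pair:

```java
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FormatterCache {
    // DateTimeFormatter is immutable, so instances can be shared freely;
    // a concurrent map memoizes one instance per (pattern, locale) pair.
    private static final Map<String, DateTimeFormatter> CACHE = new ConcurrentHashMap<>();

    static DateTimeFormatter get(String pattern, Locale locale) {
        // NUL separator keeps distinct (pattern, locale) pairs from colliding.
        String key = pattern + '\u0000' + locale.toLanguageTag();
        return CACHE.computeIfAbsent(key,
            k -> DateTimeFormatter.ofPattern(pattern, locale));
    }

    public static void main(String[] args) {
        DateTimeFormatter a = get("yyyy-MM-dd", Locale.US);
        DateTimeFormatter b = get("yyyy-MM-dd", Locale.US);
        System.out.println(a == b); // true: the same cached instance
    }
}
```

Since the formatters in this PR are created per partition rather than per row, the reviewers agreed this optimization was not needed yet.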

@SparkQA

SparkQA commented Dec 29, 2018

Test build #100516 has finished for PR 23391 at commit c3066e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ormat

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
@SparkQA

SparkQA commented Jan 3, 2019

Test build #100687 has finished for PR 23391 at commit 5ae58a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ormat

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
@SparkQA

SparkQA commented Jan 7, 2019

Test build #100880 has finished for PR 23391 at commit 2397401.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveResultSuite extends SparkFunSuite with SharedSQLContext

@MaxGekk
Member Author

MaxGekk commented Jan 7, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100885 has finished for PR 23391 at commit 2397401.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveResultSuite extends SparkFunSuite with SharedSQLContext

@MaxGekk
Member Author

MaxGekk commented Jan 10, 2019

@srowen @HyukjinKwon @cloud-fan Could you look at the PR one more time, please?

Member
@srowen srowen left a comment

I confess that you know this change much better than I do, but from reviewing previous PRs and my understanding of the issues, this looks good.

srowen added a commit that referenced this pull request Jan 11, 2019
## What changes were proposed in this pull request?

Per discussion in #23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior.

This is a rebase of #23411

## How was this patch tested?

Existing tests.

Closes #23495 from srowen/SPARK-26503.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@SparkQA

SparkQA commented Jan 12, 2019

Test build #101098 has finished for PR 23391 at commit 206c955.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Let's also update the PR description. spark.sql.legacy.timeParser.enabled is now removed.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Jan 14, 2019

Test build #101168 has finished for PR 23391 at commit 483e95b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jan 14, 2019

jenkins, retest this, please

@HyukjinKwon
Member

Looks fine to me too.

@SparkQA

SparkQA commented Jan 14, 2019

Test build #101177 has finished for PR 23391 at commit 483e95b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 115fecf Jan 14, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Per discussion in apache#23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior.

This is a rebase of apache#23411

## How was this patch tested?

Existing tests.

Closes apache#23495 from srowen/SPARK-26503.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…matter

## What changes were proposed in this pull request?

In the PR, I propose to switch on `TimestampFormatter`/`DateFormatter` in casting dates/timestamps to strings. The changes should make the date/timestamp casting consistent to JSON/CSV datasources and time-related functions like `to_date`, `to_unix_timestamp`/`from_unixtime`.

Local formatters are moved out from `DateTimeUtils` to where they are actually used. It allows to avoid re-creation of new formatter instance per-each call. Another reason is to have separate parser for `PartitioningUtils` because default parsing pattern cannot be used (expected optional section `[.S]`).

## How was this patch tested?

It was tested by `DateTimeUtilsSuite`, `CastSuite` and `JDBC*Suite`.

Closes apache#23391 from MaxGekk/thread-local-date-format.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the thread-local-date-format branch August 17, 2019 13:36
case (t: Timestamp, TimestampType) =>
  val timeZone = DateTimeUtils.getTimeZone(SQLConf.get.sessionLocalTimeZone)
Contributor

@MaxGekk @cloud-fan By moving the timezone lookup from here into a lazy val in the object, it will be initialized only once, by the first session that uses it. Another session with a different sessionLocalTimeZone set will then get results in the wrong timezone.
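A minimal Java sketch of the failure mode described in this comment (the class and names are hypothetical): capturing the zone once at initialization versus resolving the session-local zone on each call:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimeZoneBug {
    // Buggy pattern: the zone is baked in once at class initialization, so
    // every later "session" formats with whatever zone was active first.
    static final DateTimeFormatter FROZEN =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                         .withZone(ZoneId.of("UTC"));

    // Fix: resolve the session-local zone per call (formatter instances can
    // still be pulled from a cache, as noted in the reply below).
    static String format(Instant ts, ZoneId sessionZone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                                .withZone(sessionZone)
                                .format(ts);
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2019-01-01T00:00:00Z");
        System.out.println(FROZEN.format(ts));                      // 2019-01-01 00:00:00
        System.out.println(format(ts, ZoneId.of("Europe/Moscow"))); // 2019-01-01 03:00:00
    }
}
```

The same instant renders differently per zone, so any zone captured at initialization silently produces wrong output for sessions configured with another timezone.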

Member Author

@juliuszsompolski Thank you for the bug report. I will fix the issue. I think it is ok to create formatters in place because they can be pulled from caches.

Member

good point!

Member Author

Here is the PR #28024
