
Bug Fix for Spark 3.x - Avoid converting converted Row values #868

Merged: 10 commits into dotnet:main on Mar 27, 2021

Conversation

suhsteve (Member) commented on Mar 26, 2021:

The RowPickler code in EvaluatePython.scala is unchanged between Spark 2.4.7 and Spark 3.0.0.

However, Spark 2.4.7 used Pyrolite 4.13, and starting with Spark 3.0.0 Pyrolite was updated to 4.30. In RowPickler, Spark pickles the row values using:

        while (i < row.values.length) {
          pickler.save(row.values(i))
          i += 1
        }

In pickler.save(Object), Pyrolite checks whether the object has already been memoized; if it hasn't, it processes and pickles the object. A PR in Nov 2017 (between the 4.13 and 4.30 releases) changed how the memoize check is done. Pyrolite 4.13 checked System.identityHashCode(obj), whereas Pyrolite 4.30 checks only obj.hashCode(), which is the default behavior unless the valueCompare flag is toggled; toggling that flag restores the 4.13 behavior. Spark 3.x, however, does not use the Pickler constructor that sets this flag.
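The consequence of the memoize change can be sketched as follows (a minimal illustration, not pyrolite's actual code — the `Row` class and memo dictionaries here are hypothetical stand-ins for its internal memo table):

```python
# Identity-keyed vs. value-keyed memoization, as a sketch of the
# pyrolite 4.13 -> 4.30 behavior change.

class Row:
    """Hypothetical row: equal by value, like duplicate timestamps in a Row."""
    def __init__(self, values):
        self.values = values

    def __eq__(self, other):
        return isinstance(other, Row) and self.values == other.values

    def __hash__(self):
        return hash(tuple(self.values))


a = Row(["1970-01-02"])
b = Row(["1970-01-02"])  # equal value, but a distinct instance

# pyrolite 4.30 default: memo keyed by value (hashCode/equals),
# so the second, distinct-but-equal object is a memo hit and is
# written as a back-reference to the first.
value_memo = {a: 0}
print(b in value_memo)         # True -> memo hit

# pyrolite 4.13 (and 4.30 with valueCompare toggled): memo keyed by
# identity, so the distinct instance is pickled independently.
identity_memo = {id(a): 0}
print(id(b) in identity_memo)  # False -> no memo hit
```

The memo hit in the value-keyed case means equal-but-distinct entries arrive on the C# side as references to one shared object, which is why converting a value in place could affect an already converted one.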

I have refactored the RowConstructor class a bit to make it easier to understand, and fixed the issue of converting already-converted row values.

Fixes #760

@suhsteve suhsteve linked an issue Mar 26, 2021 that may be closed by this pull request
@suhsteve suhsteve self-assigned this Mar 26, 2021
@suhsteve suhsteve added the "fixing bug" label Mar 26, 2021
imback82 (Contributor):

@suhsteve Can you check whether the test failures are related?

@@ -143,9 +143,9 @@ internal class PicklingSqlCommandExecutor : SqlCommandExecutor
// The following can happen if an UDF takes Row object(s).
// The JVM Spark side sends a Row object that wraps all the columns used
// in the UDF, thus, it is normalized below (the extra layer is removed).
Contributor:
Is this comment still relevant?

Member Author:

The worker will crash without this, so I believe it is?

Contributor:

Oh, I meant with respect to the code. I think "extra layer is removed" refers to the RowConstructor, but now that it's gone, is the comment up to date?

Member Author:

"Extra layer" can refer to the Row, so we take the Values out of it?

Member Author:

I'm okay with removing the parentheses, though, if things sound unclear.

Member Author:

@elvaliuliuliu do we need to update the description, or does it still apply?
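The normalization being discussed — the JVM side sends one Row wrapping all the columns used in the UDF, and that extra layer is stripped before the UDF runs — can be sketched roughly like this (a hypothetical Python analogue; the real implementation is the C# PicklingSqlCommandExecutor, and the `Row`/`normalize` names here are illustrative):

```python
# Sketch of stripping the wrapping Row layer before invoking a UDF.

class Row:
    """Hypothetical row holder: just an ordered list of column values."""
    def __init__(self, *values):
        self.values = list(values)


def normalize(arg):
    """If the argument is a single Row wrapping the UDF's columns,
    remove that extra layer and return the wrapped values directly;
    otherwise pass the argument through as a one-element list."""
    if isinstance(arg, Row):
        return arg.values
    return [arg]


wrapper = Row("col1", 42)
print(normalize(wrapper))  # ['col1', 42]
print(normalize(7))        # [7]
```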

src/csharp/Microsoft.Spark/Sql/RowConstructor.cs (outdated; resolved)
@@ -94,7 +94,7 @@ public Timestamp(DateTime dateTime)
/// <summary>
/// Readable string representation for this type.
/// </summary>
- public override string ToString() => _dateTime.ToString("yyyy-MM-dd HH:mm:ss.ffffff");
+ public override string ToString() => _dateTime.ToString("yyyy-MM-dd HH:mm:ss.ffffffZ");
suhsteve (Member Author) commented Mar 26, 2021:

I assume this was a bug? Converting to string and casting back to Timestamp in Spark shifted the time by 8 hours.
cc @elvaliuliuliu @imback82
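The 8-hour shift can be reproduced in miniature (an illustrative sketch, not the repo's code — the actual fix under discussion is in Microsoft.Spark's Timestamp.ToString): a timestamp string with no zone designator gets reinterpreted as local time by the consumer, so on a UTC-8 machine the round-trip drifts by 8 hours.

```python
from datetime import datetime, timezone, timedelta

s = "1970-01-02 00:00:00.000000"  # no timezone designator
naive = datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

# A consumer on a UTC-8 machine (e.g. Pacific Standard Time) that treats
# the naive value as local time shifts it by 8 hours when going to UTC.
pst = timezone(timedelta(hours=-8))
shifted = naive.replace(tzinfo=pst).astimezone(timezone.utc)
print(shifted.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 1970-01-02 08:00:00.000000

# With an explicit UTC marker attached, the value round-trips unchanged.
aware = naive.replace(tzinfo=timezone.utc)
print(aware.strftime("%Y-%m-%d %H:%M:%S.%f"))    # 1970-01-02 00:00:00.000000
```

This matches the test failure quoted later in the thread (expected 00:00:00, actual 08:00:00) on a machine whose local timezone is UTC-8.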

Member Author:

I'm adding a separate PR (#871) to address this instead.

Contributor:

Thanks. Let's discuss this in #871 and remove it from this PR.

@suhsteve suhsteve requested a review from Niharikadutta March 26, 2021 19:28
@suhsteve suhsteve mentioned this pull request Mar 26, 2021
Comment on lines -91 to -92
// It is possible that an entry of a Row (row1) may itself be a Row (row2).
// If the entry is a RowConstructor then it will be a RowConstructor
Contributor:

I guess we already have a test case handling this, right?

Member Author:

Yeah, there are a few tests that have rows as column values.

imback82 previously approved these changes Mar 26, 2021

imback82 (Contributor) left a comment:

LGTM (if tests pass), thanks @suhsteve!

imback82 (Contributor):

@suhsteve BTW, did you also test against the repro in #760?

suhsteve (Member Author):

> @suhsteve BTW, did you also test against the repro in #760?

Yeah, I ran it against the repro.

suhsteve (Member Author):

> @suhsteve BTW, did you also test against the repro in #760?

Hmm, I'm surprised it's passing. It was failing earlier for TestUdfWithDuplicateTimestamps.

suhsteve (Member Author):

> Hmm, I'm surprised it's passing. It was failing earlier for TestUdfWithDuplicateTimestamps.

The target machines must be using UTC as their timezone. It fails locally on my machine:

  Message: 
    Assert.Equal() Failure
    Expected: 1970-01-02 00:00:00.000000
    Actual:   1970-01-02 08:00:00.000000

imback82 (Contributor):

Can you push an empty commit?

@imback82 imback82 merged commit 33299cf into dotnet:main Mar 27, 2021
Labels: fixing bug
Linked issue that may be closed by merging: [BUG]: Applying UDFs to TimestampType causes occasional exception