
Bug Fix for Spark 3.x - Avoid converting converted Row values #868

Merged: 10 commits into dotnet:main on Mar 27, 2021

Conversation

suhsteve (Member) commented on Mar 26, 2021:

The RowPickler code in EvaluatePython.scala is unchanged between Spark 2.4.7 and Spark 3.0.0.

However, Spark 2.4.7 used Pyrolite 4.13, and starting with Spark 3.0.0 Pyrolite was updated to 4.30. In RowPickler, Spark pickles the row values using:

        while (i < row.values.length) {
          pickler.save(row.values(i))
          i += 1
        }

In pickler.save(Object), Pyrolite checks whether the object has already been memoized; if it hasn't, it processes and pickles the object. A PR in Nov 2017 (between the 4.13 and 4.30 releases) changed how the memoize check is done. Pyrolite 4.13 checked System.identityHashCode(obj), whereas Pyrolite 4.30 checks only obj.hashCode(), which is the default behavior unless the valueCompare flag is toggled; toggling that flag restores the 4.13 behavior. Spark 3.x, however, does not use the Pickler constructor that sets this flag.
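The consequence of the memoize change can be sketched as follows (a minimal illustration, not pyrolite's actual code — the `Row` class and memo dictionaries here are hypothetical stand-ins for its internal memo table):

```python
# Identity-keyed vs. value-keyed memoization, as a sketch of the
# pyrolite 4.13 -> 4.30 behavior change.

class Row:
    """Hypothetical row: equal by value, like duplicate timestamps in a Row."""
    def __init__(self, values):
        self.values = values

    def __eq__(self, other):
        return isinstance(other, Row) and self.values == other.values

    def __hash__(self):
        return hash(tuple(self.values))


a = Row(["1970-01-02"])
b = Row(["1970-01-02"])  # equal value, but a distinct instance

# pyrolite 4.30 default: memo keyed by value (hashCode/equals),
# so the second, distinct-but-equal object is a memo hit and is
# written as a back-reference to the first.
value_memo = {a: 0}
print(b in value_memo)         # True -> memo hit

# pyrolite 4.13 (and 4.30 with valueCompare toggled): memo keyed by
# identity, so the distinct instance is pickled independently.
identity_memo = {id(a): 0}
print(id(b) in identity_memo)  # False -> no memo hit
```

The memo hit in the value-keyed case means equal-but-distinct entries arrive on the C# side as references to one shared object, which is why converting a value in place could affect an already converted one.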

I have refactored the RowConstructor class a bit to make it easier to understand, and fixed the issue of converting already-converted row values.

Fixes #760

@suhsteve suhsteve linked an issue Mar 26, 2021 that may be closed by this pull request
@suhsteve suhsteve self-assigned this Mar 26, 2021
@suhsteve suhsteve added the "fixing bug" label Mar 26, 2021
imback82 (Contributor):

@suhsteve Can you check whether the test failures are related?

@@ -143,9 +143,9 @@ internal class PicklingSqlCommandExecutor : SqlCommandExecutor
// The following can happen if an UDF takes Row object(s).
// The JVM Spark side sends a Row object that wraps all the columns used
// in the UDF, thus, it is normalized below (the extra layer is removed).
Contributor:
Is this comment still relevant?

Member Author:

The worker will crash without this, so I believe it is?

Contributor:

Oh, I meant with respect to the code. I think "extra layer is removed" refers to the RowConstructor, but now that it's gone, is the comment up to date?

Member Author:

"Extra layer" can refer to the Row, so we take the Values out of it?

Member Author:

I'm okay with removing the parentheses, though, if things sound unclear.

Member Author:

@elvaliuliuliu do we need to update the description, or does it still apply?
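The normalization being discussed — the JVM side sends one Row wrapping all the columns used in the UDF, and that extra layer is stripped before the UDF runs — can be sketched roughly like this (a hypothetical Python analogue; the real implementation is the C# PicklingSqlCommandExecutor, and the `Row`/`normalize` names here are illustrative):

```python
# Sketch of stripping the wrapping Row layer before invoking a UDF.

class Row:
    """Hypothetical row holder: just an ordered list of column values."""
    def __init__(self, *values):
        self.values = list(values)


def normalize(arg):
    """If the argument is a single Row wrapping the UDF's columns,
    remove that extra layer and return the wrapped values directly;
    otherwise pass the argument through as a one-element list."""
    if isinstance(arg, Row):
        return arg.values
    return [arg]


wrapper = Row("col1", 42)
print(normalize(wrapper))  # ['col1', 42]
print(normalize(7))        # [7]
```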

src/csharp/Microsoft.Spark/Sql/RowConstructor.cs (outdated; resolved)
@@ -94,7 +94,7 @@ public Timestamp(DateTime dateTime)
/// <summary>
/// Readable string representation for this type.
/// </summary>
- public override string ToString() => _dateTime.ToString("yyyy-MM-dd HH:mm:ss.ffffff");
+ public override string ToString() => _dateTime.ToString("yyyy-MM-dd HH:mm:ss.ffffffZ");
suhsteve (Member Author) commented Mar 26, 2021:

I assume this was a bug? Converting to string and casting back to Timestamp in Spark shifted the time by 8 hours.
cc @elvaliuliuliu @imback82
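The 8-hour shift can be reproduced in miniature (an illustrative sketch, not the repo's code — the actual fix under discussion is in Microsoft.Spark's Timestamp.ToString): a timestamp string with no zone designator gets reinterpreted as local time by the consumer, so on a UTC-8 machine the round-trip drifts by 8 hours.

```python
from datetime import datetime, timezone, timedelta

s = "1970-01-02 00:00:00.000000"  # no timezone designator
naive = datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

# A consumer on a UTC-8 machine (e.g. Pacific Standard Time) that treats
# the naive value as local time shifts it by 8 hours when going to UTC.
pst = timezone(timedelta(hours=-8))
shifted = naive.replace(tzinfo=pst).astimezone(timezone.utc)
print(shifted.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 1970-01-02 08:00:00.000000

# With an explicit UTC marker attached, the value round-trips unchanged.
aware = naive.replace(tzinfo=timezone.utc)
print(aware.strftime("%Y-%m-%d %H:%M:%S.%f"))    # 1970-01-02 00:00:00.000000
```

This matches the test failure quoted later in the thread (expected 00:00:00, actual 08:00:00) on a machine whose local timezone is UTC-8.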

Member Author:

I'm adding a separate PR (#871) to address this instead.

Contributor:

Thanks. Let's discuss this in #871 and remove it from this PR.

@suhsteve suhsteve requested a review from Niharikadutta March 26, 2021 19:28
@suhsteve suhsteve mentioned this pull request Mar 26, 2021
Comment on lines -91 to -92
// It is possible that an entry of a Row (row1) may itself be a Row (row2).
// If the entry is a RowConstructor then it will be a RowConstructor
Contributor:

I guess we already have a test case handling this, right?

Member Author:

Yeah, there are a few tests that have rows as column values.

imback82 previously approved these changes Mar 26, 2021

imback82 (Contributor) left a comment:

LGTM (if tests pass), thanks @suhsteve!

imback82 (Contributor):

@suhsteve BTW, did you also test against the repro in #760?

suhsteve (Member Author):

> @suhsteve BTW, did you also test against the repro in #760?

Yeah, I ran it against the repro.

suhsteve (Member Author):

> @suhsteve BTW, did you also test against the repro in #760?

Hmm, I'm surprised it's passing. It was failing earlier for TestUdfWithDuplicateTimestamps.

suhsteve (Member Author):

> Hmm, I'm surprised it's passing. It was failing earlier for TestUdfWithDuplicateTimestamps.

The target machines must be using UTC as their timezone. It fails locally on my machine:

  Message: 
    Assert.Equal() Failure
    Expected: 1970-01-02 00:00:00.000000
    Actual:   1970-01-02 08:00:00.000000

imback82 (Contributor):

Can you push an empty commit?

@imback82 imback82 merged commit 33299cf into dotnet:main Mar 27, 2021
Labels: fixing bug
Linked issue that may be closed by merging: [BUG]: Applying UDFs to TimestampType causes occasional exception