[WIP][SPARK-27562][Shuffle]Complete the verification mechanism for shuffle transmitted data #24447
Conversation
@squito Could you help to review this?
Force-pushed from 6c7916e to b52e1aa.
Hi @turboFei, thanks for posting this. Have you looked at SPARK-26089 yet? That actually addresses two out of your three concerns:
I'm not saying there is no value here -- that change does not handle the 3rd one, "only detect the compressed or wrapped data." I have also always worried about whether it really makes sense to rely on the codec to detect corruption, so using a digest could also make sense. But this is a large change, so the case should be made clearly.
Oops, I just noticed you actually used that jira for this change. So clearly, if we were going to do this, you should open another jira and explain the case for having a checksum. Or I'm also happy to hear if there is some mistake in the change already committed for SPARK-26089: 688b0c0
Thanks, I will open another jira and make the case clearly.
@squito I have created a new jira [SPARK-27562] and described the scheme.
Force-pushed from 8e2baff to 853542e.
@cloud-fan Could you help to review this? I think this PR can guarantee the accuracy of shuffle data transmission efficiently.
Force-pushed from c0bb54a to e2a2fea.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
This seems to be quite a useful feature in heavy-load production environments. We're also suffering from such shuffle issues, and I think this is a valuable improvement. @turboFei, will you continue to work on this PR? If so, I'm going to reopen it and help to review it.
This is a nice feature. We need to think about how to integrate it with the batch shuffle fetch though.
I will try to finish it. Please help reopen it, thanks.
Not sure why this PR was closed again; I'm going to reopen it. @turboFei, it seems there are lots of conflicts in the code. Can you please bring it up to date with the latest code?
Can one of the admins verify this patch? |
I will. |
It seems that we have to recompute the CRC32 value when fetching continuous blocks of data.
The behavior of the bot is strange; I have created a new PR, #28525.
What changes were proposed in this pull request?
We've seen some shuffle data corruption during shuffle read phase.
As described in SPARK-26089, before PR #23453 (proposed by ankuriitg), Spark only checked small shuffle blocks.
There are two changes/improvements made in PR #23453:
1. Large blocks are checked up to maxBytesInFlight/3 in a similar way to smaller blocks, so if a large block is corrupt at the beginning, that block will be re-fetched, and if that also fails, a FetchFailureException will be thrown.
2. If a large block is corrupt beyond maxBytesInFlight/3, any IOException thrown while reading the stream will be converted to a FetchFailureException. This is slightly more aggressive than was originally intended, but since the consumer of the stream may have already read some records and processed them, we can't just re-fetch the block; we need to fail the whole task. Additionally, we also thought about adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction.
However, I think there are still some problems with the current verification mechanism for shuffle transmitted data:
This PR completes the verification mechanism for shuffle transmitted data:
First, CRC32 is chosen for the checksum verification of shuffle data.
CRC is also used for checksum verification in Hadoop; it is simple and fast.
During the shuffle write phase, after completing the partitioned file, we compute
the CRC32 value for each partition and then write these digests, together with the offsets, into the shuffle index file.
For SortShuffleWriter and the unsafe shuffle writer, there is only one partitioned file per ShuffleMapTask, so the computation of digests (one per partition, based on the offsets recorded for this partitioned file) is cheap.
For the bypass shuffle writer, the number of reduce partitions is less than bypassMergeThreshold, so the cost of digest computation is acceptable. A sketch of the write-side computation follows.
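For illustration only, here is a minimal Scala sketch (with hypothetical names such as `writeIndexWithDigests`, not the PR's actual IndexShuffleBlockResolver changes) of how per-partition CRC32 digests could be computed from the finished partitioned data file using the per-partition lengths and appended after the offsets in the index file:

```scala
import java.io.{DataOutputStream, EOFException, File, FileInputStream, FileOutputStream}
import java.util.zip.CRC32

// Sketch: stream the data file once, computing one CRC32 per partition,
// then write cumulative offsets followed by the digests into the index file.
def writeIndexWithDigests(dataFile: File, indexFile: File, partitionLengths: Array[Long]): Unit = {
  val in = new FileInputStream(dataFile)
  val out = new DataOutputStream(new FileOutputStream(indexFile))
  val buf = new Array[Byte](2048)
  try {
    var offset = 0L
    out.writeLong(offset)                       // leading zero offset, as in the existing index file layout
    val digests = new Array[Long](partitionLengths.length)
    for (i <- partitionLengths.indices) {
      val crc = new CRC32()
      var remaining = partitionLengths(i)
      while (remaining > 0) {
        val read = in.read(buf, 0, math.min(buf.length.toLong, remaining).toInt)
        if (read < 0) throw new EOFException("unexpected end of shuffle data file")
        crc.update(buf, 0, read)
        remaining -= read
      }
      digests(i) = crc.getValue
      offset += partitionLengths(i)
      out.writeLong(offset)                     // cumulative offset per partition
    }
    digests.foreach(d => out.writeLong(d))      // digests appended after the offsets
  } finally {
    in.close()
    out.close()
  }
}
```

Since the data file is read sequentially exactly once, the extra cost is a single pass over data that was just written, which is why the computation is cheap for the single-file writers.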
During the shuffle read phase, the digest value is passed along with the block data,
and we recompute the digest of the received data to compare it with the original digest value.
Recomputing the digest of the received data only needs an additional buffer (2048 bytes) for computing the CRC32 value.
After recomputing, we reset the input stream of the received data: if it supports mark/reset we only need to reset it; otherwise it is a FileSegmentManagedBuffer and we need to recreate it. A sketch of the read-side check follows.
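A minimal sketch of the read-side check, assuming the expected digest is shipped alongside the block (the method name `verifyAndReset` and its signature are illustrative, not the PR's actual API): recompute the CRC32 with a small buffer, compare it against the expected digest, and rewind the stream with mark/reset when supported; a file-backed buffer would instead be recreated from the underlying file segment.

```scala
import java.io.{IOException, InputStream}
import java.util.zip.CRC32

// Sketch: verify a fetched block against the digest shipped with it,
// then rewind the stream so the consumer can read the verified data.
def verifyAndReset(in: InputStream, expectedDigest: Long, blockSize: Int): Unit = {
  if (in.markSupported()) in.mark(blockSize)
  val crc = new CRC32()
  val buf = new Array[Byte](2048)   // the small extra buffer mentioned above
  var read = in.read(buf)
  while (read != -1) {
    crc.update(buf, 0, read)
    read = in.read(buf)
  }
  if (crc.getValue != expectedDigest) {
    throw new IOException(s"digest mismatch: expected $expectedDigest, got ${crc.getValue}")
  }
  if (in.markSupported()) {
    in.reset()                      // rewind the already-verified stream
  } else {
    // for a file-backed buffer (e.g. FileSegmentManagedBuffer) the stream
    // would be recreated from the file segment instead of reset
  }
}
```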
So I think the verification mechanism proposed here for shuffle transmitted data is efficient and complete.
How was this patch tested?
Unit test.