
Often coding shred count in an FEC block is not maximum #7428

Closed
pgarg66 opened this issue Dec 11, 2019 · 4 comments

pgarg66 (Contributor) commented Dec 11, 2019

Problem

Currently we use a 32:32 erasure ratio, which guarantees that for every 32 data shreds there will be 32 coding shreds.

However, the number of data shreds in an FEC block can be less than 32. For example, at the end of a slot the last entries may not yield 32 data shreds. Or, in an idling cluster, we generate 64 ticks per 400 ms slot, which yields 32 data shreds only every 200 ms, and we do not wait to accumulate 32 data shreds before generating the coding shreds.

The current code cannot generate more coding shreds than data shreds in a given FEC block, because the coding shreds share the same index space as the data shreds: the first coding shred in an FEC block has the same index as the first data shred in that block. If there were more coding shreds than data shreds in a block, the coding shred indices would overflow the data shred index space, which would either make the coding indices of the current block overlap the data indices of the next FEC block, or leave holes in the data index space. Neither of these outcomes works.

Due to this, the maximum number of coding shreds equals the number of data shreds in the FEC block. So occasionally only 1 coding shred is generated in an FEC block, which is not enough for erasure recovery of missing data. We need a mechanism that can still generate 32 coding shreds independent of the number of data shreds in the block.
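To make the constraint concrete, here is a minimal Rust sketch (not the actual broadcast code; the names are hypothetical) of why a shared index space caps the coding shred count at the data shred count:

```rust
/// Hypothetical: index assigned to coding shred `i` of the FEC set that
/// starts at data shred index `fec_set_index`. Coding shred i reuses the
/// index value of data shred i.
fn coding_index(fec_set_index: u32, i: u32) -> u32 {
    fec_set_index + i
}

/// Under shared indexing, the largest safe coding shred count: anything
/// beyond `num_data` would collide with the next FEC set's data indices,
/// or force holes in the data index space.
fn max_coding_shreds(num_data: u32) -> u32 {
    num_data
}

fn main() {
    let fec_set_index = 64; // first data index of this (hypothetical) set
    let num_data = 1;       // e.g. a tiny end-of-slot set
    // Only `num_data` coding shreds can be emitted, so small sets get
    // very little erasure protection.
    for i in 0..max_coding_shreds(num_data) {
        println!("coding shred {} -> index {}", i, coding_index(fec_set_index, i));
    }
}
```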

Proposed Solution

The following solutions were considered:

  1. Wait for 32 data shreds before generating coding shreds. Let data shreds be transmitted as they are created, but don't generate/transmit coding shreds until we have 32 data shreds. At the end of a slot, generate padding data shreds to fill out the 32 data shred limit.
  2. Decouple the coding shred index space from the data shred index space, i.e. don't assume the first coding shred index equals the first data shred index. Wait a certain time (e.g. N msec) to accumulate 32 data shreds; if not enough data shreds are created by then, generate 32 coding shreds from the current set of data shreds.

The 1st solution would trigger more repairs in an idling network with packet drops, since we would wait almost half the slot before generating/transmitting coding shreds; repair would kick in even before the coding shreds are transmitted. This would adversely affect overall network traffic and confirmation times.

At this point, the 2nd seems the more comprehensive solution. It does have a design impact on the shred data structure and needs further analysis.
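For illustration only, here is a rough Rust sketch of what option 2 might look like (hypothetical field and function names; not necessarily the design that eventually landed): the coding shreds carry their own index counter plus a reference to the FEC set they protect, and the set is closed either when 32 data shreds have accumulated or when the timeout expires.

```rust
use std::time::{Duration, Instant};

/// Hypothetical coding shred metadata with an index space decoupled
/// from the data shreds it protects.
struct CodingShredHeader {
    slot: u64,
    coding_index: u32,      // independent counter, not a data shred index
    fec_set_index: u32,     // index of the first data shred in the protected set
    num_data_shreds: u16,   // may be < 32 at end of slot or in an idling cluster
    num_coding_shreds: u16, // can stay at 32 regardless of num_data_shreds
}

/// Close the FEC set once 32 data shreds have accumulated, or once the
/// "N msec" timeout from the proposal has elapsed, whichever comes first.
fn should_emit_coding(num_data: usize, set_started: Instant, timeout: Duration) -> bool {
    num_data >= 32 || set_started.elapsed() >= timeout
}

fn main() {
    let set_started = Instant::now();
    let header = CodingShredHeader {
        slot: 1,
        coding_index: 0,
        fec_set_index: 64,
        num_data_shreds: 2, // small end-of-slot set
        num_coding_shreds: 32,
    };
    println!(
        "slot {} set @{} ({} data, {} coding), first coding index {}, emit: {}",
        header.slot,
        header.fec_set_index,
        header.num_data_shreds,
        header.num_coding_shreds,
        header.coding_index,
        should_emit_coding(
            header.num_data_shreds as usize,
            set_started,
            Duration::from_millis(100)
        )
    );
}
```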

tag: @aeyakovenko

garious (Contributor) commented Dec 12, 2019

Beautifully written, thanks! More recon on option 2 sounds good to me. Seems worthy of a design proposal too. "Decoupled Coding Shreds"?

aeyakovenko (Member) commented Dec 12, 2019

The odd side effect of this bug is that we have higher availability under load.

This is why we are seeing much larger than expected numbers of lost blocks when a small amount of stake is missing from the network. If any small batch of shreds, like 2:2, fails in a block, that block will only succeed if repair succeeds.

pgarg66 (Contributor, Author) commented Dec 12, 2019

> Beautifully written, thanks! More recon on option 2 sounds good to me. Seems worthy of a design proposal too. "Decoupled Coding Shreds"?

Sounds good @garious

pgarg66 (Contributor, Author) commented Dec 17, 2019

The issue was fixed with #7474

pgarg66 closed this as completed Dec 17, 2019