[SPARK-49770][SS][ROCKSDB HARDENING] Improve RocksDB SST file mapping management, and fix issue with reloading same version with existing snapshot

### What changes were proposed in this pull request?
Currently, if version X is loaded while a snapshot for version X already exists, Spark reuses the SST files from that existing snapshot, which results in a VersionIdMismatch error. This PR fixes the issue and simplifies RocksDB state management by eliminating most of the state shared between the task thread and the maintenance thread.

With this change, the task thread is solely responsible for tracking the mapping from local files to DFS files; the maintenance thread no longer accesses this mapping. The DFS file names are generated at `commit()`. The generated snapshot, together with the mapping of the new SST files to their generated DFS names, is handed over to the maintenance thread: the latest generated snapshot is appended to a ConcurrentLinkedQueue named `snapshotsToUploadQueue`.
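The hand-off described above can be sketched as follows. This is a minimal illustration in Java, not the code from this PR: the class `CommitHandoffSketch`, the `Snapshot` type, the `localToDfs` field, and the UUID-based DFS naming scheme are all assumptions made for the example. Only the queue name `snapshotsToUploadQueue` and the use of a `ConcurrentLinkedQueue` come from the description.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentLinkedQueue;

final class CommitHandoffSketch {
    /** Hypothetical snapshot descriptor: a version plus local -> DFS names for new SST files only. */
    static final class Snapshot {
        final long version;
        final Map<String, String> newSstFiles;

        Snapshot(long version, Map<String, String> newSstFiles) {
            this.version = version;
            // Defensive, immutable copy so the maintenance thread never sees later mutations.
            this.newSstFiles = Collections.unmodifiableMap(new LinkedHashMap<>(newSstFiles));
        }
    }

    // Touched by the task thread only; the maintenance thread never reads it.
    private final Map<String, String> localToDfs = new LinkedHashMap<>();

    // The only structure shared between the task thread and the maintenance thread.
    final ConcurrentLinkedQueue<Snapshot> snapshotsToUploadQueue = new ConcurrentLinkedQueue<>();

    /** Runs on the task thread at commit time with the SST files present in the local checkpoint. */
    Snapshot commit(long version, Collection<String> localSstFiles) {
        Map<String, String> newFiles = new LinkedHashMap<>();
        for (String local : localSstFiles) {
            if (!localToDfs.containsKey(local)) {
                // Assumed naming scheme: local base name plus a random UUID.
                String base = local.endsWith(".sst")
                        ? local.substring(0, local.length() - 4)
                        : local;
                String dfsName = base + "-" + UUID.randomUUID() + ".sst";
                localToDfs.put(local, dfsName);
                newFiles.put(local, dfsName);
            }
        }
        Snapshot snapshot = new Snapshot(version, newFiles);
        snapshotsToUploadQueue.add(snapshot);
        return snapshot;
    }
}
```

Because each snapshot carries its own immutable copy of the new-file mapping, the maintenance thread has everything it needs for the upload without reading any task-thread state.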

The maintenance thread polls `snapshotsToUploadQueue` repeatedly until it is empty. For the last snapshot polled from the queue, the maintenance thread uploads the new SST files. It also deletes all removed snapshot objects from disk.
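The drain loop on the maintenance side can be sketched as below. This is an illustrative Java sketch under stated assumptions, not the PR's implementation: `MaintenanceDrainSketch`, `Result`, and `drain` are hypothetical names, and the upload and on-disk cleanup steps are left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

final class MaintenanceDrainSketch {
    /** Outcome of one drain pass: the snapshot to upload and the ones it superseded. */
    static final class Result<T> {
        final T latest;            // null when the queue was empty
        final List<T> superseded;  // older snapshots, candidates for local cleanup

        Result(T latest, List<T> superseded) {
            this.latest = latest;
            this.superseded = superseded;
        }
    }

    /** Polls until the queue is empty and keeps only the last snapshot polled. */
    static <T> Result<T> drain(ConcurrentLinkedQueue<T> queue) {
        T latest = null;
        List<T> superseded = new ArrayList<>();
        T next;
        // ConcurrentLinkedQueue rejects null elements, so a null from poll() means "empty".
        while ((next = queue.poll()) != null) {
            if (latest != null) {
                superseded.add(latest);
            }
            latest = next;
        }
        return new Result<>(latest, superseded);
    }
}
```

A caller would upload the new SST files of `latest` (when it is not null) and then remove the local files of the snapshots in `superseded`. Whether the uploaded snapshot is also removed locally afterwards is not stated in the description, so the sketch leaves that decision open.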

### Why are the changes needed?

These changes fix an issue where Spark Streaming fails with a RocksDB VersionIdMismatch error if there is an existing snapshot (not yet uploaded) for the RocksDB version being loaded. In this scenario, SST files from the existing snapshot are reused, resulting in the VersionIdMismatch error. In short, these changes:

1. Remove the shared fileMapping between the RocksDB maintenance thread and the task thread, simplifying the file mapping logic. After this change, the file mapping is modified only from the task thread.
2. Fix the issue where RocksDB SST files from the current snapshot (with the same version) are reused, resulting in a RocksDB VersionIdMismatch error.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

1. All existing test cases pass.
2. Added the new test cases suggested in https://github.com/apache/spark/pull/47850/files, and ensured they pass with these changes.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47875 from sahnib/master.

Lead-authored-by:  Bhuwan Sahni <bhuwan.sahni@databricks.com>
Co-authored-by: micheal-o <micheal.okutubo@gmail.com>
Co-authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2 people authored and HeartSaVioR committed Oct 17, 2024
1 parent 74dbd9a commit 8405c9b
Showing 3 changed files with 405 additions and 281 deletions.