Compaction cleans input files instead of output files on failure #3633

Closed
evenyag opened this issue Apr 3, 2024 · 1 comment · Fixed by #3635
Labels: C-bug Category Bugs

evenyag commented Apr 3, 2024

What type of bug is this?

Data corruption

What subsystems are affected?

Distributed Cluster

Minimal reproduce step

This only occurs in distributed mode. Trigger a region failover while the region is running a compaction.

After that, querying the region may return a "File not found" error.

What did you expect to see?

Query results.

What did you see instead?

Error

DataFusion error: NotFound (persistent) at  => File not found: data/nowawfppg0gtgreptime_example/public/72338/72338_0000000000/eab814da-d854-4bda-9cf8-7b0c04469de3.parquet

What operating system did you use?

Unrelated

What version of GreptimeDB did you use?

0.7.1

Relevant log output and stack trace

2024-04-01T08:48:04.916105Z  INFO mito2::worker::handle_flush: Region 310689344258048(72338, 0) flush finished, tries to bump wal to 317074
2024-04-01T08:48:04.916189Z  INFO mito2::compaction::twcs: Compaction window for region 310689344258048(72338, 0) is not present, inferring from files: 31536000
2024-04-01T08:48:05.243598Z  INFO mito2::compaction::twcs: Compaction region 310689344258048(72338, 0) output [41302e2e-7924-422b-901e-b52416428897,d224976e-9bf1-4919-b2ad-985c1ca1e7ea,a5618090-f38f-4b17-a78d-8636cd15ac55,0012b018-42fc-43ec-84f5-5dee1736833d,eab814da-d854-4bda-9cf8-7b0c04469de3]-> 7ad55aed-a418-40e5-ace0-ab1924d16ecd
2024-04-01T08:48:06.294727Z  INFO mito2::compaction::twcs: Compacted SST files, input: [FileMeta { region_id: 310689344258048(72338, 0), file_id: FileId(41302e2e-7924-422b-901e-b52416428897), time_range: (Timestamp { value: 1711173787999, unit: Millisecond }, Timestamp { value: 1711407969817, unit: Millisecond }), level: 0, file_size: 155255, available_indexes: [], index_file_size: 0 }, FileMeta { region_id: 310689344258048(72338, 0), file_id: FileId(d224976e-9bf1-4919-b2ad-985c1ca1e7ea), time_range: (Timestamp { value: 1711407972817, unit: Millisecond }, Timestamp { value: 1711637759478, unit: Millisecond }), level: 0, file_size: 183539, available_indexes: [], index_file_size: 0 }, FileMeta { region_id: 310689344258048(72338, 0), file_id: FileId(a5618090-f38f-4b17-a78d-8636cd15ac55), time_range: (Timestamp { value: 1711009012082, unit: Millisecond }, Timestamp { value: 1711173784999, unit: Millisecond }), level: 0, file_size: 114050, available_indexes: [], index_file_size: 0 }, FileMeta { region_id: 310689344258048(72338, 0), file_id: FileId(0012b018-42fc-43ec-84f5-5dee1736833d), time_range: (Timestamp { value: 1711637762479, unit: Millisecond }, Timestamp { value: 1711858653434, unit: Millisecond }), level: 0, file_size: 185360, available_indexes: [], index_file_size: 0 }, FileMeta { region_id: 310689344258048(72338, 0), file_id: FileId(eab814da-d854-4bda-9cf8-7b0c04469de3), time_range: (Timestamp { value: 1711858656434, unit: Millisecond }, Timestamp { value: 1711961282058, unit: Millisecond }), level: 0, file_size: 119211, available_indexes: [], index_file_size: 0 }], output: [FileMeta { region_id: 310689344258048(72338, 0), file_id: FileId(7ad55aed-a418-40e5-ace0-ab1924d16ecd), time_range: (Timestamp { value: 1711009012082, unit: Millisecond }, Timestamp { value: 1711961282058, unit: Millisecond }), level: 1, file_size: 748267, available_indexes: [], index_file_size: 0 }], window: Some(31536000)
2024-04-01T08:48:21.023492Z  WARN datanode::alive_keeper: The region 310689344258048(72338, 0) lease is expired, set region to readonly.
2024-04-01T08:48:47.075311Z  WARN mito2::request: Cleaning region 310689344258048(72338, 0) compaction output file: 41302e2e-7924-422b-901e-b52416428897
2024-04-01T08:48:47.075500Z  WARN mito2::request: Cleaning region 310689344258048(72338, 0) compaction output file: d224976e-9bf1-4919-b2ad-985c1ca1e7ea
2024-04-01T08:48:47.075599Z  WARN mito2::request: Cleaning region 310689344258048(72338, 0) compaction output file: a5618090-f38f-4b17-a78d-8636cd15ac55
2024-04-01T08:48:47.075708Z  WARN mito2::request: Cleaning region 310689344258048(72338, 0) compaction output file: 0012b018-42fc-43ec-84f5-5dee1736833d
2024-04-01T08:48:47.075898Z  WARN mito2::request: Cleaning region 310689344258048(72338, 0) compaction output file: eab814da-d854-4bda-9cf8-7b0c04469de3
2024-04-01T08:48:47.425921Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 41302e2e-7924-422b-901e-b52416428897, region: 310689344258048(72338, 0)
2024-04-01T08:48:47.438670Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: d224976e-9bf1-4919-b2ad-985c1ca1e7ea, region: 310689344258048(72338, 0)
2024-04-01T08:48:47.447091Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 0012b018-42fc-43ec-84f5-5dee1736833d, region: 310689344258048(72338, 0)
2024-04-01T08:48:47.448284Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: a5618090-f38f-4b17-a78d-8636cd15ac55, region: 310689344258048(72338, 0)
2024-04-01T08:48:47.461953Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: eab814da-d854-4bda-9cf8-7b0c04469de3, region: 310689344258048(72338, 0)
evenyag self-assigned this Apr 3, 2024
evenyag added the C-bug Category Bugs label Apr 3, 2024

evenyag commented Apr 3, 2024

The region lease expired during compaction.

2024-04-01T08:48:21.023492Z  WARN datanode::alive_keeper: The region 310689344258048(72338, 0) lease is expired, set region to readonly.

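The region was therefore no longer writable when the compaction finished, so writable_region_or() in handle_compaction_finished() invoked the on_failure() method of CompactionFinished.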

pub(crate) async fn handle_compaction_finished(
    &mut self,
    region_id: RegionId,
    mut request: CompactionFinished,
) {
    let Some(region) = self.regions.writable_region_or(region_id, &mut request) else {
        return;
    };

In on_failure(), we remove all compacted files:

for file in &self.compacted_files {
    warn!(
        "Cleaning region {} compaction output file: {}",
        self.region_id, file.file_id
    );
    self.file_purger.send_request(PurgeRequest {
        file_meta: file.clone(),
    });
}

However, compacted_files are compaction inputs that we shouldn't remove on failure. We should remove compaction_outputs instead.

pub(crate) struct CompactionFinished {
    /// Region id.
    pub(crate) region_id: RegionId,
    /// Compaction output files that are to be added to region version.
    pub(crate) compaction_outputs: Vec<FileMeta>,
    /// Compacted files that are to be removed from region version.
    pub(crate) compacted_files: Vec<FileMeta>,
