Ability to backup/restore metadata files for DiskS3 #18377

Merged: 13 commits merged into ClickHouse:master on Feb 5, 2021

Conversation

@Jokser (Contributor) commented Dec 22, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add the ability to backup/restore metadata files for DiskS3

Detailed description / Documentation draft:
We want to add the possibility to restore metadata files on the local FS for DiskS3 using only S3. What do we need to implement this feature?

  1. Add a 'path' key to the object metadata of each S3 object we write through DiskS3. This is the path to the metadata file on the local file system.
  2. For each file/directory rename, store a special S3 object with 'old_path' and 'new_path' keys in its object metadata. Each rename is an atomic commit of insert/merge/remove operations.
  3. Additionally, for each hard link creation we need to store a similar S3 object. This is needed to properly restore mutations. (A minimal sketch of these auxiliary objects follows this list.)
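A minimal sketch of what these auxiliary objects could look like, based on the key format and metadata keys described above; the helper names are hypothetical, not the actual DiskS3 API:

#include <map>
#include <string>

using ObjectMetadata = std::map<std::string, std::string>;

/// Data object metadata: 'path' points at the metadata file on the local file system.
ObjectMetadata makeFileMetadata(const std::string & local_metadata_path)
{
    return {{"path", local_metadata_path}};
}

/// Data objects are named r<REV>-file-<RANDOM_NAME>.
std::string makeFileKey(const std::string & revision_str, const std::string & random_name)
{
    return "r" + revision_str + "-file-" + random_name;
}

/// A rename (or hard link creation) is recorded as a separate S3 object whose
/// metadata carries the old and new local paths.
ObjectMetadata makeRenameMetadata(const std::string & old_path, const std::string & new_path)
{
    return {{"old_path", old_path}, {"new_path", new_path}};
}

/// Operation objects live under operations/, e.g. operations/r<REV>-rename.
std::string makeOperationKey(const std::string & revision_str, const std::string & operation_name)
{
    return "operations/r" + revision_str + "-" + operation_name;
}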

To restore metadata files, we list all S3 objects that have 'path' in their metadata and create a metadata file on the local FS at the given path.
Then we apply all renames and hard link creations. Renames and hard links must be time-ordered, so we need logical clocks to linearize all operations (file write, rename, hardlink).
We can introduce a simple implementation of such a clock, called a 'revision': an atomic counter incremented by each write / rename / hardlink operation. Each object carries its revision in its name.
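A minimal sketch of such a counter, assuming a 64-bit atomic value and a zero-padded encoding so that keys of the same width sort lexicographically; the exact encoding used by the PR is not shown here:

#include <atomic>
#include <cstdint>
#include <string>

/// Illustrative logical clock for DiskS3 operations. The zero-padded decimal
/// encoding below is an assumption; the PR only requires that every
/// write / rename / hardlink gets a monotonically growing revision.
class RevisionCounter
{
public:
    explicit RevisionCounter(uint64_t last_known_revision) : counter(last_known_revision) {}

    /// Each write / rename / hardlink operation takes the next revision.
    uint64_t next() { return ++counter; }

    static std::string revisionToString(uint64_t revision)
    {
        std::string s = std::to_string(revision);
        return std::string(20 - s.size(), '0') + s;  /// pad to fixed width
    }

private:
    std::atomic<uint64_t> counter;
};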

Revisions can also be used to make a snapshot of the current state of S3 (a backup). To make a backup we need to freeze all tables and remember the current revision counter. Part files won't be deleted, because they are backed up in the shadow folder, and the revision counter cuts the set of files / operations needed to restore a particular state.
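In other words, the remembered revision acts as a cutoff during restore: only objects and operations with a revision not greater than it are considered, and operations are replayed in revision order. A hedged sketch of that selection step (the struct and key parsing are illustrative, not the PR's code):

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct ListedObject
{
    std::string key;        /// e.g. "r<REV>-file-<RANDOM>" or "operations/r<REV>-rename"
    uint64_t revision = 0;  /// parsed from the key
};

/// Keep only objects that existed at (or before) the backup revision and
/// order them so that operations can be replayed chronologically.
std::vector<ListedObject> selectForRestore(std::vector<ListedObject> objects, uint64_t backup_revision)
{
    objects.erase(
        std::remove_if(objects.begin(), objects.end(),
            [backup_revision](const ListedObject & obj) { return obj.revision > backup_revision; }),
        objects.end());

    std::sort(objects.begin(), objects.end(),
        [](const ListedObject & a, const ListedObject & b) { return a.revision < b.revision; });

    return objects;
}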

@robot-clickhouse added the doc-alert and pr-feature (Pull request with new product feature) labels on Dec 22, 2020
@Jokser marked this pull request as ready for review on January 13, 2021 13:46
@vdimir self-assigned this on Jan 25, 2021
@vdimir (Member) left a comment

Thank you for your work!

I'll put some details about the implementation here; correct me if I understand something wrong:

  • All new data files on S3 will have a name like /r<REV>-file-<RANDOM_NAME> and a 'path' key in their metadata with the path to the metadata file on the local disk.
  • If we want to restore a file, we create a metadata file for the corresponding data file on the local disk at the location given in 'path'. Metadata files themselves are not stored on S3; each one is created from scratch for its data file, and we only need to know the data file name and the local path of its metadata file.
  • At the S3 paths /operations/r<REV>-rename and /operations/r<REV>-hardlink we store information about operations performed on metadata files and replay them for the restored metadata files. (A small sketch of extracting the revision from such keys follows this list.)
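An illustrative helper for the restore side, extracting the revision from such keys so that operations can be ordered and replayed; it assumes the decimal <REV> encoding from the earlier sketch and is not the PR's actual code:

#include <cstdint>
#include <optional>
#include <string>

/// Parse the revision out of keys like "r<REV>-file-<NAME>" or "operations/r<REV>-rename".
/// Assumes <REV> is a run of digits terminated by '-'.
std::optional<uint64_t> parseRevision(const std::string & key)
{
    const std::string prefix = "operations/";
    size_t pos = (key.rfind(prefix, 0) == 0) ? prefix.size() : 0;

    if (pos >= key.size() || key[pos] != 'r')
        return std::nullopt;

    size_t dash = key.find('-', pos);
    if (dash == std::string::npos)
        return std::nullopt;

    return std::stoull(key.substr(pos + 1, dash - pos - 1));
}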

I also left some general questions in the code.

src/Disks/S3/DiskS3.cpp

LOG_INFO(&Poco::Logger::get("DiskS3"), "Starting up disk {}", name);

/// Find last revision.
vdimir (Member):

Why can't we keep this information in a file on the local disk?

Jokser (Contributor Author):

Because we can lose all metadata information on the local disk, including the last revision stored in a separate file.
We can rely only on the data stored in S3.

vdimir (Member):

What about storing the latest revision in an S3 object called latest_revision? Or is searching for the revision this way expected to be cheap?

Jokser (Contributor Author):

Storing latest_revision in a separate object on every write would cause overhead on each file write.
Searching for the revision at start-up is cheap: just ~64 head-object requests (binary search).
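A hedged sketch of how such a start-up search could work. It relies on two assumptions: the revision counter is contiguous (every revision up to the last one left at least one object behind), and objectExistsWithRevision is a hypothetical callback that issues one request per probe (e.g. a head/list request with the prefix "r" + revisionToString(mid)):

#include <cstdint>
#include <functional>

/// Binary search for the last used revision over the 64-bit revision space,
/// assuming revisions are assigned contiguously by the counter.
uint64_t findLastRevision(const std::function<bool(uint64_t)> & objectExistsWithRevision)
{
    uint64_t low = 0;               /// revision 0 means nothing has been written yet
    uint64_t high = UINT64_MAX;

    while (low < high)
    {
        /// Upper midpoint: avoids overflow and guarantees progress.
        uint64_t mid = low + (high - low) / 2 + 1;
        if (objectExistsWithRevision(mid))
            low = mid;
        else
            high = mid - 1;
    }

    return low;
}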

information.source_path = s3_root_path;

readRestoreInformation(information);
if (information.revision == 0)
vdimir (Member):

What happens if we try to restore from a revision, but some data files from it have been deleted?
Do we actually need to restore from an arbitrary revision rather than from the last one?

Jokser (Contributor Author):

This is actually needed if we want to have backups and to restore metadata to an arbitrary backup state.
To ensure that data is not deleted, we should perform 'ALTER TABLE ... FREEZE' queries and remember the revision counter, which is dumped into revision.txt.

vdimir (Member):

Got it. Are there plans to write a piece of documentation about how to use this feature?

Jokser (Contributor Author):

Yes, of course

object_metadata = {
    {"path", path}
};
s3_path = "r" + revisionToString(revision) + "-file-" + s3_path;
vdimir (Member):

What happens if we enable this feature for a disk created without it earlier? Will restoring fail, or will it restore information only from newly created files that have this metadata?

Jokser (Contributor Author):

At the moment we can restore the disk state only for newly created files; other files that don't have 'path' in their metadata will be ignored.
To be able to restore old files we should introduce a migration utility.
This utility should update the metadata of existing files, storing the 'path' key.
This is a very important thing, but I prefer to do it in a separate PR with separate tests.
At the moment we can enable/disable the restore possibility using the 'send_metadata' option for DiskS3, so we can easily test this utility.

vdimir (Member), Jan 28, 2021:

Should we reject restoring if there are objects without 'path', to ensure that we don't restore an inconsistent state?

UPD: this feature cannot be changed for existing disks, so we can only enable it for newly created disks, and a partial state is impossible. Am I right?

Jokser (Contributor Author):

Yes, it makes sense. I'll change this piece of code to throw an exception during restore.
I think the behaviour can be changed for an existing disk if we migrate existing files, adding 'path' to their metadata.
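A minimal sketch of such a check during restore; the exception type and helper name are illustrative, not the PR's final code:

#include <map>
#include <stdexcept>
#include <string>

using ObjectMetadata = std::map<std::string, std::string>;

/// Refuse to restore from an object that carries no 'path' key, since that would
/// mean the bucket contains data written without send_metadata enabled.
std::string getLocalMetadataPath(const std::string & key, const ObjectMetadata & metadata)
{
    auto it = metadata.find("path");
    if (it == metadata.end())
        throw std::runtime_error("Object " + key + " has no 'path' in metadata, restore is not possible");
    return it->second;
}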


void DiskS3::restore()
{
    if (!exists(restore_file_name))
vdimir (Member):

Can we put this information into the config? If some metadata files exist, do nothing, and otherwise restore using information from the config. Is that suitable, or are there other reasons why we need to put it in a file?

Jokser (Contributor Author):

If we have an existing cluster with backups and we want to restore a backup to another cluster with a different configured bucket/path, we need to transfer this information somehow.
This is the usual case for ClickHouse clusters in Yandex.Cloud.
I haven't thought of anything better than just putting this information into a separate file.

src/Disks/S3/DiskS3.h
{
    if (send_metadata)
        return (DiskS3::ObjectMetadata){{"path", path}};

const String key = "operations/r" + revisionToString(revision) + "-" + operation_name;
vdimir (Member):

Should it support remove* operations? Right now it can only be move or hardlink. Why can we omit other operations?

Jokser (Contributor Author):

Yes, it might seem that we need to support remove operations as well, but we can actually skip them.
Before every remove operation, CH renames the part directory to '_delete_tmp_XXX'.
So we only need to restore this rename to '_delete_tmp_XXX'.
When CH starts, it will just remove all these '_delete_tmp_XXX' directories by itself.

@vdimir (Member) commented Jan 28, 2021

> I'll put here some details about implementation, correct me if I understand something wrong ...

Could you also comment on this from my previous message?

@Jokser (Contributor Author) commented Feb 3, 2021

> I'll put here some details about implementation, correct me if I understand something wrong ...
>
> Could you also comment on this from my previous message?

Your thoughts about how it should work are generally right.

@vdimir (Member) commented Feb 4, 2021

Looks like the failed tests have also been failing in master recently.

CI Info
SELECT
    any(check_name),
    max(check_start_time),
    splitByString('::', test_name)[1],
    groupUniqArray(test_status),
    count(pull_request_number)
FROM `gh-data`.checks
WHERE
    test_status != 'OK'
    AND check_name LIKE 'Integration%'
    AND (
        test_name LIKE 'test_storage_kerberized_kafka%'
        OR test_name LIKE 'test_adaptive_granularity_different_settings%'
        OR test_name LIKE 'test_format_avro_confluent%'
        OR test_name LIKE 'test_old_versions%'
        OR test_name LIKE 'test_testkeeper_back_to_back%'
    )
    AND pull_request_number = 0
    AND check_start_time > now() - INTERVAL '4' DAY
GROUP BY splitByString('::', test_name)[1]
ORDER BY max(check_start_time) DESC;

┌─any(check_name)────────────┬─max(check_start_time)─┬─arrayElement(splitByString('::', test_name), 1)──────┬─groupUniqArray(test_status)─┬─count(pull_request_number)─┐
│ Integration tests (memory) │   2021-02-04 09:30:12 │ test_storage_kerberized_kafka/test.py                │ ['ERROR','FAIL']            │                         38 │
│ Integration tests (memory) │   2021-02-03 15:27:57 │ test_format_avro_confluent/test.py                   │ ['ERROR']                   │                          1 │
│ Integration tests (thread) │   2021-02-03 12:36:52 │ test_testkeeper_back_to_back/test.py                 │ ['FAIL']                    │                         10 │
│ Integration tests (thread) │   2021-02-03 12:36:52 │ test_old_versions/test.py                            │ ['ERROR']                   │                          3 │
│ Integration tests (thread) │   2021-02-03 12:36:52 │ test_adaptive_granularity_different_settings/test.py │ ['ERROR']                   │                          1 │
└────────────────────────────┴───────────────────────┴──────────────────────────────────────────────────────┴─────────────────────────────┴────────────────────────────┘

@vdimir merged commit 1e2c302 into ClickHouse:master on Feb 5, 2021
@sevirov (Contributor) commented Nov 19, 2021

Internal documentation ticket: DOCSUP-18488

Labels: pr-feature (Pull request with new product feature)
4 participants