Ability to backup/restore metadata files for DiskS3 #18377

Merged: 13 commits merged into ClickHouse:master on Feb 5, 2021

Conversation

@Jokser (Contributor) commented Dec 22, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add the ability to backup/restore metadata files for DiskS3

Detailed description / Documentation draft:
We want to add the possibility to restore metadata files on the local FS for DiskS3 using only S3. What do we need to implement this feature?

  1. Add a 'path' key to the object metadata of each S3 object we write through DiskS3. This is the path to the metadata file on the local file system.
  2. For each file/directory rename, store a special S3 object with 'old_path' and 'new_path' keys in its object metadata. Each rename is an atomic commit of insert/merge/remove operations.
  3. Additionally, for each hard link creation we need to store a similar S3 object. This is needed to properly restore mutations. (A minimal sketch of these auxiliary objects follows this list.)
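A minimal sketch of what these auxiliary objects could look like, based on the key format and metadata keys described above; the helper names are hypothetical, not the actual DiskS3 API:

#include <map>
#include <string>

using ObjectMetadata = std::map<std::string, std::string>;

/// Data object metadata: 'path' points at the metadata file on the local file system.
ObjectMetadata makeFileMetadata(const std::string & local_metadata_path)
{
    return {{"path", local_metadata_path}};
}

/// Data objects are named r<REV>-file-<RANDOM_NAME>.
std::string makeFileKey(const std::string & revision_str, const std::string & random_name)
{
    return "r" + revision_str + "-file-" + random_name;
}

/// A rename (or hard link creation) is recorded as a separate S3 object whose
/// metadata carries the old and new local paths.
ObjectMetadata makeRenameMetadata(const std::string & old_path, const std::string & new_path)
{
    return {{"old_path", old_path}, {"new_path", new_path}};
}

/// Operation objects live under operations/, e.g. operations/r<REV>-rename.
std::string makeOperationKey(const std::string & revision_str, const std::string & operation_name)
{
    return "operations/r" + revision_str + "-" + operation_name;
}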

To restore metadata files, we list all S3 objects that have 'path' in their metadata and create a metadata file on the local FS at the given path.
Then we apply all renames and hard link creations. Renames and hard links must be time-ordered, so we need logical clocks to linearize all operations (file write, rename, hardlink).
We can introduce a simple implementation of such a clock, called a 'revision': an atomic counter incremented by each write / rename / hardlink operation. Each object carries its revision in its name.
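A minimal sketch of such a counter, assuming a 64-bit atomic value and a zero-padded encoding so that keys of the same width sort lexicographically; the exact encoding used by the PR is not shown here:

#include <atomic>
#include <cstdint>
#include <string>

/// Illustrative logical clock for DiskS3 operations. The zero-padded decimal
/// encoding below is an assumption; the PR only requires that every
/// write / rename / hardlink gets a monotonically growing revision.
class RevisionCounter
{
public:
    explicit RevisionCounter(uint64_t last_known_revision) : counter(last_known_revision) {}

    /// Each write / rename / hardlink operation takes the next revision.
    uint64_t next() { return ++counter; }

    static std::string revisionToString(uint64_t revision)
    {
        std::string s = std::to_string(revision);
        return std::string(20 - s.size(), '0') + s;  /// pad to fixed width
    }

private:
    std::atomic<uint64_t> counter;
};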

Revisions can also be used to make a snapshot of the current state of S3 (a backup). To make a backup we need to freeze all tables and remember the current revision counter. Part files won't be deleted, because they are backed up in the shadow folder, and the revision counter cuts the set of files / operations needed to restore a particular state.
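In other words, the remembered revision acts as a cutoff during restore: only objects and operations with a revision not greater than it are considered, and operations are replayed in revision order. A hedged sketch of that selection step (the struct and key parsing are illustrative, not the PR's code):

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct ListedObject
{
    std::string key;        /// e.g. "r<REV>-file-<RANDOM>" or "operations/r<REV>-rename"
    uint64_t revision = 0;  /// parsed from the key
};

/// Keep only objects that existed at (or before) the backup revision and
/// order them so that operations can be replayed chronologically.
std::vector<ListedObject> selectForRestore(std::vector<ListedObject> objects, uint64_t backup_revision)
{
    objects.erase(
        std::remove_if(objects.begin(), objects.end(),
            [backup_revision](const ListedObject & obj) { return obj.revision > backup_revision; }),
        objects.end());

    std::sort(objects.begin(), objects.end(),
        [](const ListedObject & a, const ListedObject & b) { return a.revision < b.revision; });

    return objects;
}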

@robot-clickhouse added the doc-alert and pr-feature (Pull request with new product feature) labels on Dec 22, 2020
@Jokser marked this pull request as ready for review on January 13, 2021 13:46
@vdimir self-assigned this on Jan 25, 2021
@vdimir (Member) left a comment

Thank you for your work!

I'll put some details about the implementation here; correct me if I understand something wrong:

  • All new data files on S3 will have a name like /r<REV>-file-<RANDOM_NAME> and a 'path' key in their metadata with the path to the metadata file on the local disk.
  • If we want to restore a file, we create a metadata file for the corresponding data file on the local disk at the location given in 'path'. Metadata files themselves are not stored on S3; each one is created from scratch for its data file, and we only need to know the data file name and the local path of its metadata file.
  • At the S3 paths /operations/r<REV>-rename and /operations/r<REV>-hardlink we store information about operations performed on metadata files and replay them for the restored metadata files. (A small sketch of extracting the revision from such keys follows this list.)
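An illustrative helper for the restore side, extracting the revision from such keys so that operations can be ordered and replayed; it assumes the decimal <REV> encoding from the earlier sketch and is not the PR's actual code:

#include <cstdint>
#include <optional>
#include <string>

/// Parse the revision out of keys like "r<REV>-file-<NAME>" or "operations/r<REV>-rename".
/// Assumes <REV> is a run of digits terminated by '-'.
std::optional<uint64_t> parseRevision(const std::string & key)
{
    const std::string prefix = "operations/";
    size_t pos = (key.rfind(prefix, 0) == 0) ? prefix.size() : 0;

    if (pos >= key.size() || key[pos] != 'r')
        return std::nullopt;

    size_t dash = key.find('-', pos);
    if (dash == std::string::npos)
        return std::nullopt;

    return std::stoull(key.substr(pos + 1, dash - pos - 1));
}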

I also left some general questions in the code.

src/Disks/S3/DiskS3.cpp

LOG_INFO(&Poco::Logger::get("DiskS3"), "Starting up disk {}", name);

/// Find last revision.
vdimir (Member):

Why can't we keep this information in a file on the local disk?

Jokser (Contributor Author):

Because we can lose all metadata information on the local disk, including the last revision stored in a separate file.
We can rely only on the data stored in S3.

vdimir (Member):

What about storing the latest revision in an S3 object called latest_revision? Or is searching for the revision this way expected to be cheap?

Jokser (Contributor Author):

Storing latest_revision in a separate object on every write would cause overhead on each file write.
Searching for the revision at start-up is cheap: just ~64 head-object requests (binary search).
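A hedged sketch of how such a start-up search could work. It relies on two assumptions: the revision counter is contiguous (every revision up to the last one left at least one object behind), and objectExistsWithRevision is a hypothetical callback that issues one request per probe (e.g. a head/list request with the prefix "r" + revisionToString(mid)):

#include <cstdint>
#include <functional>

/// Binary search for the last used revision over the 64-bit revision space,
/// assuming revisions are assigned contiguously by the counter.
uint64_t findLastRevision(const std::function<bool(uint64_t)> & objectExistsWithRevision)
{
    uint64_t low = 0;               /// revision 0 means nothing has been written yet
    uint64_t high = UINT64_MAX;

    while (low < high)
    {
        /// Upper midpoint: avoids overflow and guarantees progress.
        uint64_t mid = low + (high - low) / 2 + 1;
        if (objectExistsWithRevision(mid))
            low = mid;
        else
            high = mid - 1;
    }

    return low;
}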

information.source_path = s3_root_path;

readRestoreInformation(information);
if (information.revision == 0)
vdimir (Member):

What happens if we try to restore from a revision, but some data files from it have been deleted?
Do we actually need to restore from an arbitrary revision rather than from the last one?

Jokser (Contributor Author):

This is actually needed if we want to have backups and to restore metadata to an arbitrary backup state.
To ensure that data is not deleted, we should perform 'ALTER TABLE ... FREEZE' queries and remember the revision counter, which is dumped into revision.txt.

vdimir (Member):

Got it. Are there plans to write a piece of documentation about how to use this feature?

Jokser (Contributor Author):

Yes, of course

object_metadata = {
    {"path", path}
};
s3_path = "r" + revisionToString(revision) + "-file-" + s3_path;
vdimir (Member):

What happens if we enable this feature for a disk created without it earlier? Will restoring fail, or will it restore information only from newly created files that have this metadata?

Jokser (Contributor Author):

At the moment we can restore the disk state only for newly created files; other files that don't have 'path' in their metadata will be ignored.
To be able to restore old files we should introduce a migration utility.
This utility should update the metadata of existing files, storing the 'path' key.
This is a very important thing, but I prefer to do it in a separate PR with separate tests.
At the moment we can enable/disable the restore possibility using the 'send_metadata' option for DiskS3, so we can easily test this utility.

vdimir (Member), Jan 28, 2021:

Should we reject restoring if there are objects without 'path', to ensure that we don't restore an inconsistent state?

UPD: this feature cannot be changed for existing disks, so we can only enable it for newly created disks, and a partial state is impossible. Am I right?

Jokser (Contributor Author):

Yes, it makes sense. I'll change this piece of code to throw an exception during restore.
I think the behaviour can be changed for an existing disk if we migrate existing files, adding 'path' to their metadata.
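A minimal sketch of such a check during restore; the exception type and helper name are illustrative, not the PR's final code:

#include <map>
#include <stdexcept>
#include <string>

using ObjectMetadata = std::map<std::string, std::string>;

/// Refuse to restore from an object that carries no 'path' key, since that would
/// mean the bucket contains data written without send_metadata enabled.
std::string getLocalMetadataPath(const std::string & key, const ObjectMetadata & metadata)
{
    auto it = metadata.find("path");
    if (it == metadata.end())
        throw std::runtime_error("Object " + key + " has no 'path' in metadata, restore is not possible");
    return it->second;
}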


void DiskS3::restore()
{
    if (!exists(restore_file_name))
vdimir (Member):

Can we put this information into the config? If some metadata files exist, do nothing, and otherwise restore using information from the config. Is that suitable, or are there other reasons why we need to put it in a file?

Jokser (Contributor Author):

If we have an existing cluster with backups and we want to restore a backup to another cluster with a different configured bucket/path, we need to transfer this information somehow.
This is the usual case for ClickHouse clusters in Yandex.Cloud.
I haven't thought of anything better than just putting this information into a separate file.

src/Disks/S3/DiskS3.h
{
    if (send_metadata)
        return (DiskS3::ObjectMetadata){{"path", path}};

const String key = "operations/r" + revisionToString(revision) + "-" + operation_name;
vdimir (Member):

Should it support remove* operations? Right now it can only be move or hardlink. Why can we omit other operations?

Jokser (Contributor Author):

Yes, it might seem that we need to support remove operations as well, but we can actually skip them.
Before every remove operation, CH renames the part directory to '_delete_tmp_XXX'.
So we only need to restore this rename to '_delete_tmp_XXX'.
When CH starts, it will just remove all these '_delete_tmp_XXX' directories by itself.

@vdimir (Member) commented Jan 28, 2021

> I'll put here some details about implementation, correct me if I understand something wrong ...

Could you also comment on this from my previous message?

@Jokser (Contributor Author) commented Feb 3, 2021

> I'll put here some details about implementation, correct me if I understand something wrong ...
>
> Could you also comment on this from my previous message?

Your thoughts about how it should work are generally right.

@vdimir (Member) commented Feb 4, 2021

Looks like the failed tests have also been failing in master recently.

CI Info
SELECT
    any(check_name),
    max(check_start_time),
    splitByString('::', test_name)[1],
    groupUniqArray(test_status),
    count(pull_request_number)
FROM `gh-data`.checks
WHERE
    test_status != 'OK'
    AND check_name LIKE 'Integration%'
    AND (
        test_name LIKE 'test_storage_kerberized_kafka%'
        OR test_name LIKE 'test_adaptive_granularity_different_settings%'
        OR test_name LIKE 'test_format_avro_confluent%'
        OR test_name LIKE 'test_old_versions%'
        OR test_name LIKE 'test_testkeeper_back_to_back%'
    )
    AND pull_request_number = 0
    AND check_start_time > now() - INTERVAL '4' DAY
GROUP BY splitByString('::', test_name)[1]
ORDER BY max(check_start_time) DESC;

┌─any(check_name)────────────┬─max(check_start_time)─┬─arrayElement(splitByString('::', test_name), 1)──────┬─groupUniqArray(test_status)─┬─count(pull_request_number)─┐
│ Integration tests (memory) │   2021-02-04 09:30:12 │ test_storage_kerberized_kafka/test.py                │ ['ERROR','FAIL']            │                         38 │
│ Integration tests (memory) │   2021-02-03 15:27:57 │ test_format_avro_confluent/test.py                   │ ['ERROR']                   │                          1 │
│ Integration tests (thread) │   2021-02-03 12:36:52 │ test_testkeeper_back_to_back/test.py                 │ ['FAIL']                    │                         10 │
│ Integration tests (thread) │   2021-02-03 12:36:52 │ test_old_versions/test.py                            │ ['ERROR']                   │                          3 │
│ Integration tests (thread) │   2021-02-03 12:36:52 │ test_adaptive_granularity_different_settings/test.py │ ['ERROR']                   │                          1 │
└────────────────────────────┴───────────────────────┴──────────────────────────────────────────────────────┴─────────────────────────────┴────────────────────────────┘

@vdimir merged commit 1e2c302 into ClickHouse:master on Feb 5, 2021
@sevirov (Contributor) commented Nov 19, 2021

Internal documentation ticket: DOCSUP-18488

Labels: pr-feature (Pull request with new product feature)
4 participants