
Incremental backup failing due to missing part error #462

Closed
piyushsriv opened this issue Jun 30, 2022 · 30 comments · Fixed by #481

Comments

@piyushsriv

Hi Team,

We are facing this issue with our incremental backup.

"version": "1.3.2",
"clickhouse_version": "v22.1.3.7-stable",

We have taken backups like the ones below:

clickhouse-backup list remote --config=config_v2.yml

daily-backup-v2-data-full-2022-06-28     2.78TiB     28/06/2022 15:26:15   remote                                             tar

daily-backup-v2-data-incr-2022-06-29     43.13GiB    29/06/2022 08:11:58   remote   +daily-backup-v2-data-full-2022-06-28     tar

daily-backup-v2-data-incr-2022-06-30     28.96GiB    30/06/2022 08:10:11   remote   +daily-backup-v2-data-incr-2022-06-29     tar

We also sync our daily backups to our dev cluster (and restore them when needed), like below:

clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30

Now we are facing an issue with download, where it throws an error like the one below:

2022/06/30 08:18:47.871292 error one of Download go-routine return error: one of downloadDiffParts go-routine return error: <TABLE_NAME> 72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 not found on daily-backup-v2-data-incr-2022-06-29 and all required backups sequence

This table exists in our full backup, and we confirmed the same part is present there. The table hasn't changed since the full backup. The problem did not occur with daily-backup-v2-data-incr-2022-06-29 but appears with daily-backup-v2-data-incr-2022-06-30.

What could be the reason and how can we fix it?

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.
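For reference, a minimal sketch of how the skip_tables setting looks in the config (the table patterns here are illustrative, not from this issue):

```yaml
clickhouse:
  skip_tables:
    - system.*
    - information_schema.*
    - mydb.table_to_skip   # hypothetical pattern to exclude
```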

@k0t3n

k0t3n commented Jun 30, 2022

Same issue

@k0t3n

k0t3n commented Jun 30, 2022

One more error during that:
error can't acquire semaphore during Download: context canceled backup=2022-06-30T14-37-04 operation=download

@k0t3n

k0t3n commented Jun 30, 2022

@piyushsriv in my environment the issue reproduces only when downloading a backup whose parent does not exist locally, so pulling the parent backups first could bypass the error.

@piyushsriv (Author)

One more error during that:
error can't acquire semaphore during Download: context canceled backup=2022-06-30T14-37-04 operation=download

Yeah, right. When we retry it after deleting the folder, we also get this error.

@piyushsriv in my environment the issue reproduces only when downloading a backup whose parent does not exist locally, so pulling the parent backups first could bypass the error.

In our case, we sync our backups daily, so both parent backups (28 and 29) were already present locally when we started downloading the 30th.

@k0t3n

k0t3n commented Jun 30, 2022

In our case, we sync our backups daily, so both parent backups (28 and 29) were already present locally when we started downloading the 30th.

Did you pull them manually? I mean, try downloading them without the automatic tree traversal:

# delete local backups first, they might be damaged
clickhouse-backup delete local daily-backup-v2-data-incr-2022-06-28
clickhouse-backup delete local daily-backup-v2-data-incr-2022-06-29
clickhouse-backup delete local daily-backup-v2-data-incr-2022-06-30

# download manually
clickhouse-backup download  daily-backup-v2-data-incr-2022-06-28
clickhouse-backup download  daily-backup-v2-data-incr-2022-06-29
clickhouse-backup download  daily-backup-v2-data-incr-2022-06-30

@piyushsriv (Author)

We can do that, and it may solve the problem, but then you have to download the full backup again, which takes many hours.

This problem shouldn't happen in the first place. What if this happens just after you download all the data from S3 to restore the cluster in an outage situation? Then you have to do it all again. :(

@Slach (Collaborator)

Slach commented Jul 4, 2022

@piyushsriv
did you try the download on the same server or on a different server?

I tried to reproduce

CREATE TABLE t1(id UInt64) ENGINE=MergeTree() PARTITION BY id ORDER BY id;
INSERT INTO t1 SELECT number FROM numbers(100);

source server

clickhouse-backup create full
clickhouse-backup create increment1
clickhouse-backup create increment2

clickhouse-backup upload full
clickhouse-backup upload increment1 --diff-from=full
clickhouse-backup upload increment2 --diff-from=increment1

destination server

clickhouse-backup download increment2 

all 100 data parts were downloaded from the full backup successfully

could you share the following files from your S3:

s3://<backup_bucket>/<backup_path>/daily-backup-v2-data-full-2022-06-28/metadata/<database>/<TABLE_NAME>.json
s3://<backup_bucket>/<backup_path>/daily-backup-v2-data-incr-2022-06-29/metadata/<database>/<TABLE_NAME>.json
s3://<backup_bucket>/<backup_path>/daily-backup-v2-data-incr-2022-06-30/metadata/<database>/<TABLE_NAME>.json

We need to ensure that data part 72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 is present in daily-backup-v2-data-full-2022-06-28 or daily-backup-v2-data-incr-2022-06-29.

Moreover, could you share the output of
clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

@Slach (Collaborator)

Slach commented Jul 4, 2022

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.

Currently, only the filter --table=db.prefix* can be applied to the download and restore commands.
No "exclude" mechanism is implemented; feel free to make a pull request.

@Slach (Collaborator)

Slach commented Jul 4, 2022

in my environment the issue reproduces only when downloading a backup whose parent does not exist locally, so pulling the parent backups first could bypass the error.

@k0t3n
could you share your clickhouse-backup --version and clickhouse-backup print-config output?

@Slach (Collaborator)

Slach commented Jul 4, 2022

In our case, we sync our backups daily, so both parent backups (28 and 29) were already present locally when we started downloading the 30th.

I also tried to reproduce this case
source server

CREATE TABLE t1(id UInt64) ENGINE=MergeTree() PARTITION BY id ORDER BY id;
INSERT INTO t1 SELECT number FROM numbers(100);
clickhouse-backup create full
clickhouse-backup create increment1
clickhouse-backup create increment2

clickhouse-backup upload full
clickhouse-backup upload increment1 --diff-from=full
clickhouse-backup upload increment2 --diff-from=increment1

clickhouse-backup delete local increment2
clickhouse-backup download increment2

in logs

2022/07/04 05:22:24.669348  info done                      diff_parts=0 duration=66ms operation=downloadDiffParts

this is expected, because all 100 data parts are already present in another backup during the download

are you sure the folders /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 and
/var/lib/clickhouse/backup/daily-backup-v2-data-incr-2022-06-29/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0
were present when you called clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

@Slach (Collaborator)

Slach commented Jul 4, 2022

Yeah, right. When we retry it after deleting the folder, we also get this error.

Could you please clarify which folder exactly you mean?

Moreover, could you share the results of the following command:

LOG_LEVEL=debug clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30

@piyushsriv (Author)

@Slach

did you try the download on the same server or on a different server?

As I mentioned in the issue, we do regular downloads (each day) on a different dev cluster. So we take a backup every day on the production cluster, upload it to S3, and then download it to the dev cluster.

could you share the following files from your S3:

Please find them below:
table_2022-06-28.txt
table_2022-06-29.txt
table_2022-06-30.txt

Could you please clarify which folder exactly you mean?

The local backup folder, i.e. daily-backup-v2-data-incr-2022-06-30 (in /var/lib/clickhouse/backup).

are you sure the folders /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 and
/var/lib/clickhouse/backup/daily-backup-v2-data-incr-2022-06-29/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0
were present when you called clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

Unfortunately, I can't confirm this now. Every month we delete all old backups and start fresh, so we have already deleted those backup folders.

@piyushsriv (Author)

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.

Currently, only the filter --table=db.prefix* can be applied to the download and restore commands. No "exclude" mechanism is implemented; feel free to make a pull request.

OK.

@piyushsriv (Author)

are you sure the folders /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 and
/var/lib/clickhouse/backup/daily-backup-v2-data-incr-2022-06-29/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0
were present when you called clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

Is it possible that if the file goes missing for any reason, the download will break?
Shouldn't it download the part from S3 if it is missing on the local disk?

@Slach (Collaborator)

Slach commented Jul 4, 2022

@piyushsriv
The diff algorithm has the following steps:

  1. Download backup_name/metadata/db/table.json and read it into memory.
  2. Download all data parts which are not marked as required to the local disk.
  3. For each part which is marked as required:
  4. Download required_backup_name/metadata/db/table.json to the local disk if it does not already exist (required_backup_name comes from backup_name/metadata.json).
  5. If the part in required_backup_name/metadata/db/table.json is also marked as required, download required_backup_name/metadata.json, get the new parent required_backup_name, and go to step 4 until a backup is found where the part is not marked as required. Otherwise, download the part to /var/lib/clickhouse/backup/required_backup_name/shadow/db/table/<part_name> and make hardlinks from /var/lib/clickhouse/backup/backup_name/shadow/db/table/<part_name> to the downloaded files.
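As a rough illustration (a hypothetical sketch, not the real clickhouse-backup source), the required-part resolution in steps 3-5 can be modeled like this, with invented metadata structures:

```python
# Hypothetical sketch (not the real clickhouse-backup code) of the recursive
# "required part" resolution described in steps 3-5 above.

def resolve_part(part_name, backup_name, table_meta):
    """Walk the required-backup chain until we reach the backup that
    actually stores part_name (i.e. where it is not marked required)."""
    current = backup_name
    while True:
        meta = table_meta[current]              # backup/metadata/db/table.json
        part = meta["parts"].get(part_name)
        if part is None:
            raise RuntimeError(f"{part_name} not found on {current} "
                               "and all required backups sequence")
        if not part.get("required"):
            return current                      # this backup holds the data
        current = meta["required_backup"]       # parent backup name

# Toy metadata chain: incr2 -> incr1 -> full
table_meta = {
    "incr2": {"required_backup": "incr1", "parts": {"p_1_1_0": {"required": True}}},
    "incr1": {"required_backup": "full",  "parts": {"p_1_1_0": {"required": True}}},
    "full":  {"parts": {"p_1_1_0": {}}},
}
print(resolve_part("p_1_1_0", "incr2", table_meta))  # → full
```

In the real tool the chain walk also downloads each parent's metadata on demand and hardlinks the resolved part back into the requested backup.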

@Slach (Collaborator)

Slach commented Jul 4, 2022

Please find them below:
table_2022-06-28.txt
table_2022-06-29.txt
table_2022-06-30.txt

This looks weird; everything should work.

table_2022-06-28.txt contains

	"files": {
		"default": [
			"default_72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0.tar",
...
		]
	},
	"table": "<MASKED>",
	"database": "<MASKED>",
	"parts": {
		"default": [
...
			{
				"name": "72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0"
			},
...
		]
	},

this is expected for a full backup

table_2022-06-29.txt and table_2022-06-30.txt
have the same content (also expected, since you said the table didn't change between backups):

{
	"table": "<MASKED>",
	"database": "<MASKED>",
	"parts": {
		"default": [
...
			{
				"name": "72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0",
				"required": true
			},
...
		]
	},
...
}

and the part is properly marked as required

Is it possible that if the file goes missing for any reason, the download will break?
Shouldn't it download the part from S3 if it is missing on the local disk?

So, if the download of the full backup was interrupted,
then maybe you have /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/metadata/db/table.json
but not /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/db/table/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0.

But even in this corner case, clickhouse-backup should download part 72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 from daily-backup-v2-data-full-2022-06-28.

@Slach (Collaborator)

Slach commented Jul 4, 2022

@k0t3n

k0t3n commented Jul 5, 2022

@Slach still reproduces on v1.4.6
config:

general:
  remote_storage: s3
  max_file_size: 0
  disable_progress_bar: true
  backups_to_keep_local: 0
  backups_to_keep_remote: 0
  log_level: info
  allow_empty_backups: true
  download_concurrency: 3
  upload_concurrency: 3
  restore_schema_on_cluster: ""
  upload_by_part: true
  download_by_part: true
clickhouse:
  username: default
  password: ""
  host: 127.0.0.1
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  - INFORMATION_SCHEMA.*
  - information_schema.*
  timeout: 5m
  freeze_by_part: false
  freeze_by_part_where: ""
  secure: false
  skip_verify: false
  sync_replicated_tables: false
  log_sql_queries: true
  config_dir: /etc/clickhouse-server/
  restart_command: systemctl restart clickhouse-server
  ignore_not_exists_error_during_freeze: false
  tls_key: ""
  tls_cert: ""
  tls_ca: ""
  debug: false
s3:
  access_key: secret
  secret_key: secret
  bucket: secret
  endpoint: ""
  region: eu-central-1
  acl: private
  assume_role_arn: ""
  force_path_style: false
  path: ""
  disable_ssl: false
  compression_level: 1
  compression_format: tar
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 1
  part_size: 0
  max_parts_count: 10000
  allow_multipart_download: false
  debug: true
gcs:
  credentials_file: ""
  credentials_json: ""
  bucket: ""
  path: ""
  compression_level: 1
  compression_format: tar
  debug: false
  endpoint: ""
cos:
  url: ""
  timeout: 2m
  secret_id: ""
  secret_key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  debug: false
api:
  listen: localhost:7171
  enable_metrics: true
  enable_pprof: false
  username: ""
  password: ""
  secure: false
  certificate_file: ""
  private_key_file: ""
  create_integration_tables: false
  integration_tables_host: ""
  allow_parallel: false
ftp:
  address: ""
  timeout: 2m
  username: ""
  password: ""
  tls: false
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 3
  debug: false
sftp:
  address: ""
  port: 22
  username: ""
  password: ""
  key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 1
  debug: false
azblob:
  endpoint_suffix: core.windows.net
  account_name: ""
  account_key: ""
  sas: ""
  use_managed_identity: false
  container: ""
  path: ""
  compression_level: 1
  compression_format: tar
  sse_key: ""
  buffer_size: 0
  buffer_count: 3
  max_parts_count: 10000

S3 download debug log:

-----------------------------------------------------
2022/07/05 08:35:53.133046  info DEBUG: Request s3/HeadObject Details:
---[ REQUEST POST-SIGN ]-----------------------------
HEAD /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104-103_714_714_0.tar HTTP/1.1

Host: secret-bucket.s3.eu-central-1.amazonaws.com

User-Agent: aws-sdk-go/1.43.0 (go1.18.3; linux; amd64)

Authorization: AWS4-HMAC-SHA256 Credential=secret/20220705/eu-central-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=68b72010c9df4972f2cc05a7676e6bf9887bbf904219f5154935e17ad2843832

X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

X-Amz-Date: 20220705T083553Z




-----------------------------------------------------
2022/07/05 08:35:53.139582  info DEBUG: Response s3/HeadObject Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK

Content-Length: 17920

Accept-Ranges: bytes

Content-Type: binary/octet-stream

Date: Tue, 05 Jul 2022 08:35:54 GMT

Etag: "98f9ae0d02f679c1a6333fe06b5644ae"

Last-Modified: Tue, 05 Jul 2022 08:17:30 GMT

Server: AmazonS3

X-Amz-Id-2: DNUmwVZ38kdu7VmRhx7B7DJx+D026pr19hqMD+VpExDj0ib+ATrEwinfQkGbZtLoF+LGhqK3BlM=

X-Amz-Request-Id: XYY8PWRH19XJ9KD5

error:

2022/07/05 08:37:10.481681 error one of Download go-routine return error: one of downloadDiffParts go-routine return error: secret.eventlogtable 202104-103_714_714_0 not found on 2022-07-05T08-14-44 and all required backups sequence

@Slach (Collaborator)

Slach commented Jul 5, 2022

@k0t3n please remove

s3:
  debug: true

and set

general:
  log_level: debug

After that, please share the full log for the download command.

@Slach (Collaborator)

Slach commented Jul 5, 2022

@k0t3n does your S3_DEBUG log contain

GET /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104-103_714_714_0.tar

?

@piyushsriv (Author)

Thanks, @Slach for all the information and help.
I hope we hit this problem again so that I can confirm what you are asking me to check.

It seems @k0t3n can help you better, as he has a reproducible scenario.

@k0t3n

k0t3n commented Jul 23, 2022

@Slach sorry for the long response.

Yes, my log contains
GET /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104-103_714_714_0.tar
but the correct path is
GET /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104%2D103_714_714_0.tar

So the URL encoding was missed. The error reproduces only with incremental backup download, and only in the findDiffRecursive, findDiffOnePart and findDiffOnePartArchive actions.
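For context: ClickHouse percent-encodes characters outside [A-Za-z0-9_] in on-disk part and partition directory names (roughly its escapeForFileName function), which is why the expected object key contains %2D for the dash in partition ID 202104-103. A rough Python approximation of that escaping (a sketch, not the exact ClickHouse code):

```python
def escape_for_file_name(s: str) -> str:
    # Keep ASCII letters, digits and '_'; percent-encode everything else,
    # approximating ClickHouse-style file-name escaping.
    out = []
    for c in s:
        if c == '_' or (c.isascii() and c.isalnum()):
            out.append(c)
        else:
            out.append('%{:02X}'.format(ord(c)))
    return ''.join(out)

print(escape_for_file_name("202104-103_714_714_0"))
# → 202104%2D103_714_714_0
```

A partition key like (toYYYYMM(time), action) produces composite partition IDs containing a dash, so any code path that builds remote object keys from raw part names without this escaping will look up the wrong key.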

@k0t3n

k0t3n commented Jul 23, 2022

@Slach (Collaborator)

Slach commented Jul 24, 2022

@k0t3n it looks like the gist with logs is private; could you make it public?

@k0t3n

k0t3n commented Jul 24, 2022

@Slach sorry, fixed

@piyushsriv (Author)

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.

Currently, only the filter --table=db.prefix* can be applied to the download and restore commands. No "exclude" mechanism is implemented; feel free to make a pull request.

I was looking at the code to implement skip tables for restore, but found that it is already implemented here.

I tested it and it works. When it was not working for me earlier, I was on an old version; the latest version has this capability.

@Slach (Collaborator)

Slach commented Jul 26, 2022

@k0t3n could you share

SHOW CREATE TABLE secret.eventlogtable;

We need to know the PARTITION BY clause.

@k0t3n

k0t3n commented Jul 26, 2022

@Slach

CREATE TABLE secret.eventlogtable
(
    `time` DateTime DEFAULT '0000000000',
    `action` Int32,
    `page_from` Int64,
    `page_to` Int64,
    `rows` UInt16,
    `sent` UInt16,
    `is_success` Int8,
    `duration` Float64
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(time), action)
ORDER BY (page_from, action, time)
SETTINGS index_granularity = 8192

@Slach (Collaborator)

Slach commented Jul 26, 2022

@k0t3n reproduced on my side.
Thanks a lot for reporting!

@k0t3n

k0t3n commented Jul 26, 2022

@Slach thank you! Looking forward to the fix 🚀

Slach mentioned this issue Jul 27, 2022