
Incremental backup failing due to missing part error #462

Closed
piyushsriv opened this issue Jun 30, 2022 · 30 comments · Fixed by #481

Comments

@piyushsriv

Hi Team,

We are facing this issue with our incremental backup.

"version": "1.3.2",
"clickhouse_version": "v22.1.3.7-stable",

We have taken backups like the ones below:

clickhouse-backup list remote --config=config_v2.yml

daily-backup-v2-data-full-2022-06-28     2.78TiB     28/06/2022 15:26:15   remote                                             tar

daily-backup-v2-data-incr-2022-06-29     43.13GiB    29/06/2022 08:11:58   remote   +daily-backup-v2-data-full-2022-06-28     tar

daily-backup-v2-data-incr-2022-06-30     28.96GiB    30/06/2022 08:10:11   remote   +daily-backup-v2-data-incr-2022-06-29     tar

We also sync our daily backups to our dev cluster (and restore them when needed), like below:

clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30

Now we are facing an issue with download, where it throws an error like the one below:

2022/06/30 08:18:47.871292 error one of Download go-routine return error: one of downloadDiffParts go-routine return error: <TABLE_NAME> 72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 not found on daily-backup-v2-data-incr-2022-06-29 and all required backups sequence

This table exists in our full backup, and we confirmed the same part is present there. The table hasn't changed since the full backup. The problem did not occur with daily-backup-v2-data-incr-2022-06-29 but appears with daily-backup-v2-data-incr-2022-06-30.

What could be the reason and how can we fix it?

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.
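For reference, a minimal sketch of how the skip_tables setting looks in the config (the table patterns here are illustrative, not from this issue):

```yaml
clickhouse:
  skip_tables:
    - system.*
    - information_schema.*
    - mydb.table_to_skip   # hypothetical pattern to exclude
```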

@k0t3n

k0t3n commented Jun 30, 2022

Same issue

@k0t3n

k0t3n commented Jun 30, 2022

One more error during that:
error can't acquire semaphore during Download: context canceled backup=2022-06-30T14-37-04 operation=download

@k0t3n

k0t3n commented Jun 30, 2022

@piyushsriv in my environment the issue reproduces only when downloading a backup whose parent does not exist locally, so pulling the parent backups first could bypass the error.

@piyushsriv (Author)

One more error during that:
error can't acquire semaphore during Download: context canceled backup=2022-06-30T14-37-04 operation=download

Yeah, right. When we retry it after deleting the folder, we also get this error.

@piyushsriv in my environment the issue reproduces only when downloading a backup whose parent does not exist locally, so pulling the parent backups first could bypass the error.

In our case, we sync our backups daily, so both parent backups (28 and 29) were already present locally when we started downloading the 30th.

@k0t3n

k0t3n commented Jun 30, 2022

In our case, we sync our backups daily, so both parent backups (28 and 29) were already present locally when we started downloading the 30th.

Did you pull them manually? I mean, try downloading them without the automatic tree traversal:

# delete local backups first, they might be damaged
clickhouse-backup delete local daily-backup-v2-data-incr-2022-06-28
clickhouse-backup delete local daily-backup-v2-data-incr-2022-06-29
clickhouse-backup delete local daily-backup-v2-data-incr-2022-06-30

# download manually
clickhouse-backup download  daily-backup-v2-data-incr-2022-06-28
clickhouse-backup download  daily-backup-v2-data-incr-2022-06-29
clickhouse-backup download  daily-backup-v2-data-incr-2022-06-30

@piyushsriv (Author)

We can do that, and it may solve the problem, but then you have to download the full backup again, which takes many hours.

This problem shouldn't happen in the first place. What if this happens just after you download all the data from S3 to restore the cluster in an outage situation? Then you have to do it all again. :(

@Slach (Collaborator)

Slach commented Jul 4, 2022

@piyushsriv
did you try the download on the same server or on a different server?

I tried to reproduce

CREATE TABLE t1(id UInt64) ENGINE=MergeTree() PARTITION BY id ORDER BY id;
INSERT INTO t1 SELECT number FROM numbers(100);

source server

clickhouse-backup create full
clickhouse-backup create increment1
clickhouse-backup create increment2

clickhouse-backup upload full
clickhouse-backup upload increment1 --diff-from=full
clickhouse-backup upload increment2 --diff-from=increment1

destination server

clickhouse-backup download increment2 

all 100 data parts were downloaded from the full backup successfully

could you share the following files from your S3:

s3://<backup_bucket>/<backup_path>/daily-backup-v2-data-full-2022-06-28/metadata/<database>/<TABLE_NAME>.json
s3://<backup_bucket>/<backup_path>/daily-backup-v2-data-incr-2022-06-29/metadata/<database>/<TABLE_NAME>.json
s3://<backup_bucket>/<backup_path>/daily-backup-v2-data-incr-2022-06-30/metadata/<database>/<TABLE_NAME>.json

We need to ensure that data part 72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 is present in daily-backup-v2-data-full-2022-06-28 or daily-backup-v2-data-incr-2022-06-29.

Moreover, could you share the output of
clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

@Slach (Collaborator)

Slach commented Jul 4, 2022

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.

Currently, only the filter --table=db.prefix* can be applied to the download and restore commands.
No "exclude" mechanism is implemented; feel free to make a pull request.

@Slach (Collaborator)

Slach commented Jul 4, 2022

in my environment the issue reproduces only when downloading a backup whose parent does not exist locally, so pulling the parent backups first could bypass the error.

@k0t3n
could you share your clickhouse-backup --version and clickhouse-backup print-config output?

@Slach (Collaborator)

Slach commented Jul 4, 2022

In our case, we sync our backups daily, so both parent backups (28 and 29) were already present locally when we started downloading the 30th.

I also tried to reproduce this case
source server

CREATE TABLE t1(id UInt64) ENGINE=MergeTree() PARTITION BY id ORDER BY id;
INSERT INTO t1 SELECT number FROM numbers(100);
clickhouse-backup create full
clickhouse-backup create increment1
clickhouse-backup create increment2

clickhouse-backup upload full
clickhouse-backup upload increment1 --diff-from=full
clickhouse-backup upload increment2 --diff-from=increment1

clickhouse-backup delete local increment2
clickhouse-backup download increment2

in logs

2022/07/04 05:22:24.669348  info done                      diff_parts=0 duration=66ms operation=downloadDiffParts

this is expected, because all 100 data parts are already present in another backup during the download

are you sure the folders /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 and
/var/lib/clickhouse/backup/daily-backup-v2-data-incr-2022-06-29/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0
were present when you called clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

@Slach (Collaborator)

Slach commented Jul 4, 2022

Yeah, right. When we retry it after deleting the folder, we also get this error.

Could you please clarify which folder exactly you mean?

Moreover, could you share the results of the following command:

LOG_LEVEL=debug clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30

@piyushsriv (Author)

@Slach

did you try the download on the same server or on a different server?

As I mentioned in the issue, we do regular downloads (each day) on a different dev cluster. So we take a backup every day on the production cluster, upload it to S3, and then download it to the dev cluster.

could you share the following files from your S3:

Please find them below:
table_2022-06-28.txt
table_2022-06-29.txt
table_2022-06-30.txt

Could you please clarify which folder exactly you mean?

The local backup folder, i.e. daily-backup-v2-data-incr-2022-06-30 (in /var/lib/clickhouse/backup).

are you sure the folders /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 and
/var/lib/clickhouse/backup/daily-backup-v2-data-incr-2022-06-29/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0
were present when you called clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

Unfortunately, I can't confirm this now. Every month we delete all old backups and start fresh, so we have already deleted those backup folders.

@piyushsriv (Author)

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.

Currently, only the filter --table=db.prefix* can be applied to the download and restore commands. No "exclude" mechanism is implemented; feel free to make a pull request.

OK.

@piyushsriv (Author)

are you sure the folders /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 and
/var/lib/clickhouse/backup/daily-backup-v2-data-incr-2022-06-29/shadow/<db>/<table>/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0
were present when you called clickhouse-backup download --config=config_v2.yml daily-backup-v2-data-incr-2022-06-30?

Is it possible that if the file goes missing for any reason, the download will break?
Shouldn't it download the part from S3 if it is missing on the local disk?

@Slach (Collaborator)

Slach commented Jul 4, 2022

@piyushsriv
The diff algorithm has the following steps:

  1. Download backup_name/metadata/db/table.json and read it into memory.
  2. Download all data parts which are not marked as required to the local disk.
  3. For each part which is marked as required:
  4. Download required_backup_name/metadata/db/table.json to the local disk if it does not already exist (required_backup_name comes from backup_name/metadata.json).
  5. If the part in required_backup_name/metadata/db/table.json is also marked as required, download required_backup_name/metadata.json, get the new parent required_backup_name, and go to step 4 until a backup is found where the part is not marked as required. Otherwise, download the part to /var/lib/clickhouse/backup/required_backup_name/shadow/db/table/<part_name> and make hardlinks from /var/lib/clickhouse/backup/backup_name/shadow/db/table/<part_name> to the downloaded files.
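As a rough illustration (a hypothetical sketch, not the real clickhouse-backup source), the required-part resolution in steps 3-5 can be modeled like this, with invented metadata structures:

```python
# Hypothetical sketch (not the real clickhouse-backup code) of the recursive
# "required part" resolution described in steps 3-5 above.

def resolve_part(part_name, backup_name, table_meta):
    """Walk the required-backup chain until we reach the backup that
    actually stores part_name (i.e. where it is not marked required)."""
    current = backup_name
    while True:
        meta = table_meta[current]              # backup/metadata/db/table.json
        part = meta["parts"].get(part_name)
        if part is None:
            raise RuntimeError(f"{part_name} not found on {current} "
                               "and all required backups sequence")
        if not part.get("required"):
            return current                      # this backup holds the data
        current = meta["required_backup"]       # parent backup name

# Toy metadata chain: incr2 -> incr1 -> full
table_meta = {
    "incr2": {"required_backup": "incr1", "parts": {"p_1_1_0": {"required": True}}},
    "incr1": {"required_backup": "full",  "parts": {"p_1_1_0": {"required": True}}},
    "full":  {"parts": {"p_1_1_0": {}}},
}
print(resolve_part("p_1_1_0", "incr2", table_meta))  # → full
```

In the real tool the chain walk also downloads each parent's metadata on demand and hardlinks the resolved part back into the requested backup.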

@Slach (Collaborator)

Slach commented Jul 4, 2022

Please find them below:
table_2022-06-28.txt
table_2022-06-29.txt
table_2022-06-30.txt

This looks weird; everything should work.

table_2022-06-28.txt contains

	"files": {
		"default": [
			"default_72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0.tar",
...
		]
	},
	"table": "<MASKED>",
	"database": "<MASKED>",
	"parts": {
		"default": [
...
			{
				"name": "72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0"
			},
...
		]
	},

this is expected for a full backup

table_2022-06-29.txt and table_2022-06-30.txt
have the same content (also expected, since you said the table didn't change between backups):

{
	"table": "<MASKED>",
	"database": "<MASKED>",
	"parts": {
		"default": [
...
			{
				"name": "72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0",
				"required": true
			},
...
		]
	},
...
}

and the part is properly marked as required

Is it possible that if the file goes missing for any reason, the download will break?
Shouldn't it download the part from S3 if it is missing on the local disk?

So, if the download of the full backup was interrupted,
then maybe you have /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/metadata/db/table.json
but not /var/lib/clickhouse/backup/daily-backup-v2-data-full-2022-06-28/shadow/db/table/72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0.

But even in this corner case, clickhouse-backup should download part 72ea255cc4b9d84bc294ae4ac6daf1b7_2_2_0 from daily-backup-v2-data-full-2022-06-28.

@Slach (Collaborator)

Slach commented Jul 4, 2022

@k0t3n

k0t3n commented Jul 5, 2022

@Slach still reproduces on v1.4.6
config:

general:
  remote_storage: s3
  max_file_size: 0
  disable_progress_bar: true
  backups_to_keep_local: 0
  backups_to_keep_remote: 0
  log_level: info
  allow_empty_backups: true
  download_concurrency: 3
  upload_concurrency: 3
  restore_schema_on_cluster: ""
  upload_by_part: true
  download_by_part: true
clickhouse:
  username: default
  password: ""
  host: 127.0.0.1
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  - INFORMATION_SCHEMA.*
  - information_schema.*
  timeout: 5m
  freeze_by_part: false
  freeze_by_part_where: ""
  secure: false
  skip_verify: false
  sync_replicated_tables: false
  log_sql_queries: true
  config_dir: /etc/clickhouse-server/
  restart_command: systemctl restart clickhouse-server
  ignore_not_exists_error_during_freeze: false
  tls_key: ""
  tls_cert: ""
  tls_ca: ""
  debug: false
s3:
  access_key: secret
  secret_key: secret
  bucket: secret
  endpoint: ""
  region: eu-central-1
  acl: private
  assume_role_arn: ""
  force_path_style: false
  path: ""
  disable_ssl: false
  compression_level: 1
  compression_format: tar
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 1
  part_size: 0
  max_parts_count: 10000
  allow_multipart_download: false
  debug: true
gcs:
  credentials_file: ""
  credentials_json: ""
  bucket: ""
  path: ""
  compression_level: 1
  compression_format: tar
  debug: false
  endpoint: ""
cos:
  url: ""
  timeout: 2m
  secret_id: ""
  secret_key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  debug: false
api:
  listen: localhost:7171
  enable_metrics: true
  enable_pprof: false
  username: ""
  password: ""
  secure: false
  certificate_file: ""
  private_key_file: ""
  create_integration_tables: false
  integration_tables_host: ""
  allow_parallel: false
ftp:
  address: ""
  timeout: 2m
  username: ""
  password: ""
  tls: false
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 3
  debug: false
sftp:
  address: ""
  port: 22
  username: ""
  password: ""
  key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 1
  debug: false
azblob:
  endpoint_suffix: core.windows.net
  account_name: ""
  account_key: ""
  sas: ""
  use_managed_identity: false
  container: ""
  path: ""
  compression_level: 1
  compression_format: tar
  sse_key: ""
  buffer_size: 0
  buffer_count: 3
  max_parts_count: 10000

S3 download debug log:

-----------------------------------------------------
2022/07/05 08:35:53.133046  info DEBUG: Request s3/HeadObject Details:
---[ REQUEST POST-SIGN ]-----------------------------
HEAD /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104-103_714_714_0.tar HTTP/1.1

Host: secret-bucket.s3.eu-central-1.amazonaws.com

User-Agent: aws-sdk-go/1.43.0 (go1.18.3; linux; amd64)

Authorization: AWS4-HMAC-SHA256 Credential=secret/20220705/eu-central-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=68b72010c9df4972f2cc05a7676e6bf9887bbf904219f5154935e17ad2843832

X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

X-Amz-Date: 20220705T083553Z




-----------------------------------------------------
2022/07/05 08:35:53.139582  info DEBUG: Response s3/HeadObject Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK

Content-Length: 17920

Accept-Ranges: bytes

Content-Type: binary/octet-stream

Date: Tue, 05 Jul 2022 08:35:54 GMT

Etag: "98f9ae0d02f679c1a6333fe06b5644ae"

Last-Modified: Tue, 05 Jul 2022 08:17:30 GMT

Server: AmazonS3

X-Amz-Id-2: DNUmwVZ38kdu7VmRhx7B7DJx+D026pr19hqMD+VpExDj0ib+ATrEwinfQkGbZtLoF+LGhqK3BlM=

X-Amz-Request-Id: XYY8PWRH19XJ9KD5

error:

2022/07/05 08:37:10.481681 error one of Download go-routine return error: one of downloadDiffParts go-routine return error: secret.eventlogtable 202104-103_714_714_0 not found on 2022-07-05T08-14-44 and all required backups sequence

@Slach (Collaborator)

Slach commented Jul 5, 2022

@k0t3n please remove

s3:
  debug: true

and set

general:
  log_level: debug

After that, please share the full log for the download command.

@Slach (Collaborator)

Slach commented Jul 5, 2022

@k0t3n does your S3_DEBUG log contain

GET /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104-103_714_714_0.tar

?

@piyushsriv (Author)

Thanks, @Slach for all the information and help.
I hope we hit this problem again so that I can confirm what you are asking me to check.

It seems @k0t3n can help you better, as he has a reproducible scenario.

@k0t3n

k0t3n commented Jul 23, 2022

@Slach sorry for the long response.

Yes, my log contains
GET /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104-103_714_714_0.tar
but the correct path is
GET /2022-07-05T08-14-44/shadow/secret/eventlogtable/default_202104%2D103_714_714_0.tar

So the URL encoding was missed. The error reproduces only with incremental backup download, and only in the findDiffRecursive, findDiffOnePart and findDiffOnePartArchive actions.
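For context: ClickHouse percent-encodes characters outside [A-Za-z0-9_] in on-disk part and partition directory names (roughly its escapeForFileName function), which is why the expected object key contains %2D for the dash in partition ID 202104-103. A rough Python approximation of that escaping (a sketch, not the exact ClickHouse code):

```python
def escape_for_file_name(s: str) -> str:
    # Keep ASCII letters, digits and '_'; percent-encode everything else,
    # approximating ClickHouse-style file-name escaping.
    out = []
    for c in s:
        if c == '_' or (c.isascii() and c.isalnum()):
            out.append(c)
        else:
            out.append('%{:02X}'.format(ord(c)))
    return ''.join(out)

print(escape_for_file_name("202104-103_714_714_0"))
# → 202104%2D103_714_714_0
```

A partition key like (toYYYYMM(time), action) produces composite partition IDs containing a dash, so any code path that builds remote object keys from raw part names without this escaping will look up the wrong key.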

@k0t3n

k0t3n commented Jul 23, 2022

@Slach (Collaborator)

Slach commented Jul 24, 2022

@k0t3n it looks like the gist with logs is private; could you make it public?

@k0t3n

k0t3n commented Jul 24, 2022

@Slach sorry, fixed

@piyushsriv (Author)

Currently, the implementation is such that if we have a problem with one table, then the whole process stops. Can we have some exclusion-list mechanism in such a process that skips those tables?
Like what we have for backup, where we can skip tables using skip_tables, which I found only works for backup and not for restore-like operations.

Currently, only the filter --table=db.prefix* can be applied to the download and restore commands. No "exclude" mechanism is implemented; feel free to make a pull request.

I was looking at the code to implement skip tables for restore, but found that it is already implemented here.

I tested it and it works. When it was not working for me earlier, I was on an old version; the latest version has this capability.

@Slach (Collaborator)

Slach commented Jul 26, 2022

@k0t3n could you share

SHOW CREATE TABLE secret.eventlogtable;

We need to know the PARTITION BY clause.

@k0t3n

k0t3n commented Jul 26, 2022

@Slach

CREATE TABLE secret.eventlogtable
(
    `time` DateTime DEFAULT '0000000000',
    `action` Int32,
    `page_from` Int64,
    `page_to` Int64,
    `rows` UInt16,
    `sent` UInt16,
    `is_success` Int8,
    `duration` Float64
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(time), action)
ORDER BY (page_from, action, time)
SETTINGS index_granularity = 8192

@Slach (Collaborator)

Slach commented Jul 26, 2022

@k0t3n reproduced on my side.
Thanks a lot for reporting!

@k0t3n

k0t3n commented Jul 26, 2022

@Slach thank you! Looking forward to the fix 🚀

Slach mentioned this issue Jul 27, 2022