
clickhoue-backup 1.3.1 upload freezing to S3 #404

Closed
vanyasvl opened this issue Feb 25, 2022 · 22 comments · Fixed by #426

vanyasvl commented Feb 25, 2022

Hello.
I have an issue with uploads to S3 freezing on the new 1.3.1. There is nothing interesting in the logs:

2022/02/25 14:20:09.677384 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_214300_214333_3.tar
2022/02/25 14:20:09.677453 debug start upload 9 files to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_214399_214399_0.tar
2022/02/25 14:20:09.999781 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_214399_214399_0.tar
2022/02/25 14:20:09.999843 debug start upload 127 files to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220224_194748_196095_5.tar
2022/02/25 14:20:12.235436 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_213215_214299_5.tar
2022/02/25 14:20:12.235507 debug start upload 127 files to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_212174_213214_5.tar
2022/02/25 14:20:18.308771 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_hdd_20211112_131389_132172_4.tar
2022/02/25 14:20:18.308830 debug start upload 127 files to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_214371_214386_2.tar
2022/02/25 14:20:19.470721 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_214371_214386_2.tar
2022/02/25 14:20:19.470850 debug start upload 9 files to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220224_196714_196738_2.tar
2022/02/25 14:20:19.792712 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220224_196714_196738_2.tar
2022/02/25 14:20:29.796079 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_212174_213214_5.tar
2022/02/25 14:20:39.739611 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220224_194748_196095_5.tar
2022/02/25 14:21:21.147805 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_203478_211101_6.tar
2022/02/25 14:21:37.585912 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220224_186015_193309_6.tar
2022/02/25 14:21:41.767821 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_hdd_20211008_165853_166642_4.tar
2022/02/25 14:21:47.237032 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_ssd_20220225_196675_203477_6.tar
2022/02/25 14:21:49.817974 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_hdd_20220108_69124_69448_4.tar
2022/02/25 14:22:08.082189 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_hdd_20211222_3_96001_4.tar
2022/02/25 14:24:27.048444 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_hdd_20220131_5_44202_4.tar
2022/02/25 14:33:56.092509 debug finish upload to backup_25-02-2022_09:39:45/shadow/flows/flows/disk_hdd_20211118_99009_99854_4.tar

There are no new messages after that for several hours and no network activity from clickhouse-backup; it's just stuck.
1.3.0 worked fine for uploads.

If I enable the progress bar, it shows the upload progress of a single file.

Slach (Collaborator) commented Feb 25, 2022

Looks weird, thanks for reporting.
Could you set
S3_DEBUG=1
or add

s3:
  debug: true

Moreover, if you run clickhouse-backup server, could you add

api:
  enable_pprof: true
  listen: 0.0.0.0:7171

and try to get a pprof profile when it is stuck:

go tool pprof http://clickhouse-backup-ip:7171/debug/pprof/profile?seconds=30 > cpuprofile.log

and share cpuprofile.log?

vanyasvl (Author) commented Mar 1, 2022

OK, here is a stuck upload. See the attached files.

clickhouse-backup list remote | grep 08:31
clickhouse_28-02-2022_08:31:19   ???         01/01/0001 00:00:00   remote                                       broken (can't stat metadata.json)
curl clickhouse:7171/backup/actions
{"command":"create clickhouse_28-02-2022_08:31:19","status":"success","start":"2022-02-28 08:31:19","finish":"2022-02-28 08:31:26"}
{"command":"upload clickhouse_28-02-2022_08:31:19","status":"in progress","start":"2022-02-28 08:31:27"}

clickhuse-backup.log.zip
pprof.clickhouse-backup.samples.cpu.003.pb.gz

Slach (Collaborator) commented Mar 1, 2022

@vanyasvl you are the man! Thanks a lot, I will try to figure it out.

Slach self-assigned this Mar 1, 2022
Slach added this to the 1.3.2 milestone Mar 1, 2022
Slach (Collaborator) commented Mar 1, 2022

Hmm, the pb.gz contains only 2 samples.
[screenshot]

Are you sure you captured 30 seconds?

Slach (Collaborator) commented Mar 1, 2022

maybe strace -p $(pgrep clickhouse-backup) will help?

vanyasvl (Author) commented Mar 1, 2022

OK, I'll try to reproduce it tomorrow and make a new capture and strace.

Slach (Collaborator) commented Mar 1, 2022

Thanks a lot, your debugging is really helpful.

vanyasvl (Author) commented Mar 2, 2022

go tool pprof "http://clickhouse:7171/debug/pprof/profile?seconds=300"
Fetching profile over HTTP from http://clickhouse:7171/debug/pprof/profile?seconds=300
Saved profile in pprof.clickhouse-backup.samples.cpu.008.pb.gz
File: clickhouse-backup
Type: cpu
Time: Mar 2, 2022 at 9:15am (EET)
Duration: 300s, Total samples = 20ms (0.0067%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) exit

strace.zip

pprof.clickhouse-backup.samples.cpu.008.pb.gz

Slach (Collaborator) commented Mar 2, 2022

Could you try:
go tool pprof "http://clickhouse:7171/debug/pprof/goroutine?seconds=300"
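
For a hang like this, a full goroutine stack dump is usually more revealing than a sampled profile. The standard net/http/pprof handlers are registered together, so assuming the same API port as above, something like

curl -o goroutines.txt "http://clickhouse:7171/debug/pprof/goroutine?debug=2"

should dump the stack of every goroutine and show exactly where the upload workers are blocked.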

vanyasvl (Author) commented Mar 2, 2022

Only 2 samples

Duration: 300s, Total samples = 2

pprof.clickhouse-backup.goroutine.001.pb.gz

Slach (Collaborator) commented Mar 2, 2022

According to the shared strace.zip:

let's run strace -p $(pgrep clickhouse-backup) again, and when
you find something like

write(5, "\0", 1)                       = 1

inside the strace output, run
lsof -d <number_of_file_descriptor_from_strace>
We need to figure out why the golang goroutines are "stuck".
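
As a shortcut, strace can resolve the descriptor by itself: the -y flag annotates every fd number with the file or pipe behind it, so

strace -y -p $(pgrep clickhouse-backup)

shows the pipe target directly, without cross-referencing lsof (this is what the -y capture further down in this thread does).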

vanyasvl (Author) commented Mar 2, 2022

futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1)                       = 1
# lsof -d 5
COMMAND       PID             USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
systemd         1             root    5u  a_inode               0,14         0      13847 [signalfd]
systemd-j     789             root    5u     unix 0xffff8900546f0400       0t0      33749 /run/systemd/journal/socket type=DGRAM
systemd-u     837             root    5u     unix 0xffff89005707b800       0t0      30886 type=DGRAM
systemd-n     953  systemd-network    5u  a_inode               0,14         0      13847 [timerfd]
systemd-r    1148  systemd-resolve    5u  a_inode               0,14         0      13847 [signalfd]
systemd-t    1149 systemd-timesync    5u  a_inode               0,14         0      13847 [signalfd]
accounts-    1173             root    5u     unix 0xffff88f85404dc00       0t0      65604 type=STREAM
dbus-daem    1175       messagebus    5u  netlink                          0t0      49597 AUDIT
irqbalanc    1182             root    5u     unix 0xffff88f856b40c00       0t0      44071 /run/irqbalance//irqbalance1182.sock type=STREAM
networkd-    1183             root    5u     unix 0xffff88f853df2c00       0t0      54293 type=STREAM
systemd-l    1192             root    5u  a_inode               0,14         0      13847 [signalfd]
polkitd      1211             root    5u  a_inode               0,14         0      13847 [eventfd]
container   81501             root    5u     unix 0xffff88f840e9b800       0t0    2525210 /run/containerd/containerd.sock type=STREAM
dockerd     81995             root    5u  a_inode               0,14         0      13847 [eventpoll]
docker      83587             root    5u     unix 0xffff88f7f24a3800       0t0    2503563 type=STREAM
container   83703             root    5u  a_inode               0,14         0      13847 [eventpoll]
dmesg_exp   83769             root    5u     IPv4            2528308       0t0        TCP localhost:29091 (LISTEN)
docker     102243             root    5u     unix 0xffff88f70a871400       0t0    3244093 type=STREAM
container  102270             root    5u  a_inode               0,14         0      13847 [eventpoll]
node_expo  102291           nobody    5r     FIFO               0,13       0t0    3235283 pipe
rsyslogd   136793           syslog    5r      REG                0,5         0 4026532095 /proc/kmsg
atopacctd  396335             root    5u  netlink                          0t0   19273936 GENERIC
clckhouse  620042       clickhouse    5w      REG              253,1   6355466   60820183 /var/log/clickhouse-server/clickhouse-server.log
clickhous  620052       clickhouse    5w      REG              253,1   6355466   60820183 /var/log/clickhouse-server/clickhouse-server.log
systemd    653940              sid    5u  a_inode               0,14         0      13847 [signalfd]
clickhous  752490              sid    5u     IPv4           50377592       0t0        TCP localhost:53754->localhost:9000 (CLOSE_WAIT)
docker     755293             root    5u     unix 0xffff88fd95f24000       0t0   54323128 type=STREAM
container  755322             root    5u  a_inode               0,14         0      13847 [eventpoll]
docker     766980             root    5u     unix 0xffff88f7f0fce800       0t0   72400250 type=STREAM
container  767008             root    5u  a_inode               0,14         0      13847 [eventpoll]
nginx      767030             root    5w     FIFO               0,13       0t0   72414436 pipe
nginx      767072 systemd-timesync    5w     FIFO               0,13       0t0   72414436 pipe
journalct  970631             root    5r      REG              253,1 134217728   60820337 /var/log/journal/a9e3b97df44946de90b17a91111b9cc1/system@531c7752133f4d57b137be6b69f4a8f1-0000000000199d60-0005d90ff9001522.journal
clickhous 1022448       clickhouse    5w     FIFO               0,13       0t0  115363772 pipe
atop      1026739             root    5u  a_inode               0,14         0      13847 [perf_event]
sshd      1029361             root    5u      CHR                5,2       0t0         89 /dev/ptmx
sshd      1029498              sid    5u     unix 0xffff88f00dd76000       0t0  116404217 type=STREAM
sshd      1030829             root    5u      CHR                5,2       0t0         89 /dev/ptmx
sshd      1030948              sid    5u     unix 0xffff88f101b39400       0t0  116581885 type=STREAM
lsof      1030981             root    5w     FIFO               0,13       0t0  116581020 pipe

You can message me on Telegram (@vanyasvl) for real-time troubleshooting.

vanyasvl (Author) commented Mar 2, 2022

And strace with the -y flag:

strace -p 1022448 -y
strace: Process 1022448 attached
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1)     = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1)     = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1)     = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1)     = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1)     = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1)     = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1)     = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0

vanyasvl (Author) commented Mar 2, 2022

write(10<socket:[115423663]>, "HTTP/1.1 200 OK\r\nContent-Encodin"..., 2443) = 2443
futex(0xc000680150, FUTEX_WAKE_PRIVATE, 1) = 1
read(10<socket:[115423663]>, 0xc0003ca000, 4096) = -1 EAGAIN (Resource temporarily unavailable)

And there are active TCP sockets in the ss output.

Slach (Collaborator) commented Mar 2, 2022

Looks like there is a strange pipe; I don't know what this pipe means or why it looks broken:

clickhous 1022448       clickhouse    5w     FIFO               0,13       0t0  115363772 pipe
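
That FIFO is most likely an internal Go runtime pipe (the netpoll wake-up descriptor), and the repeating futex/write pattern is what an idle Go process looks like: the upload goroutines are parked rather than doing I/O. Below is a minimal sketch of that class of hang; it assumes nothing about the actual clickhouse-backup code, and the port, channel and file names are illustrative only.

package main

import (
    "fmt"
    "net/http"
    "sync"
)

// Illustration only, not clickhouse-backup code: a worker pool whose task
// channel is never closed. The workers park on `range tasks`, wg.Wait() never
// returns, and because the HTTP listener goroutine keeps the process alive the
// runtime never reports a deadlock. From the outside it looks exactly like a
// hung upload: no CPU, no network activity, only futex waits in strace.
func main() {
    go http.ListenAndServe("127.0.0.1:7171", nil) // stands in for the REST API server

    tasks := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for t := range tasks { // blocks forever if tasks is never closed
                fmt.Println("upload", t)
            }
        }()
    }

    tasks <- "disk_ssd_20220225_214300_214333_3.tar" // one task from the log above
    // close(tasks) // a close missed on some error path leaves every worker parked
    wg.Wait()
}

A goroutine dump (goroutine?debug=2) of such a process shows the workers sitting in chan receive, which is the quickest way to confirm this kind of hang.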

mikezsin commented Mar 29, 2022

We see the same situation.

general:
  remote_storage: s3
  max_file_size: 1073741824
  disable_progress_bar: true
  backups_to_keep_local: 6
  backups_to_keep_remote: 6
  log_level: info
  allow_empty_backups: false
  download_concurrency: 12
  upload_concurrency: 12
  restore_schema_on_cluster: ""
  upload_by_part: true
  download_by_part: true
clickhouse:
  username: gc_rw
  password:    *******
  host: 127.0.0.1
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  timeout: 5m
  freeze_by_part: false
  secure: false
  skip_verify: false
  sync_replicated_tables: true
  log_sql_queries: false
  config_dir: /etc/clickhouse-server/
  restart_command: systemctl restart clickhouse-server
  ignore_not_exists_error_during_freeze: true
  debug: false
s3:
  access_key: *****
  secret_key: ******
  bucket: cephs3-front-1
  endpoint: ***/clickhouse-backup
  region: default
  acl: private
  assume_role_arn: ""
  force_path_style: false
  path: ch-1
  disable_ssl: true
  compression_level: 1
  compression_format: zstd
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 1
  part_size: 0
  debug: false

Slach changed the title from clickhoue-backup 1.3.1 upload freezing to S3 to clickhoue-backup 1.3.1 upload freezing to S3, for CEPH and SWIFT s3 backends Apr 7, 2022
Slach changed the title from clickhoue-backup 1.3.1 upload freezing to S3, for CEPH and SWIFT s3 backends to clickhoue-backup 1.3.1 upload freezing to S3 Apr 10, 2022
dkokot commented Apr 13, 2022

Hey guys, we have faced the same situation and tried twice; it gets stuck at the end. Any workaround so far?

our config:

general:
  remote_storage: s3
  max_file_size: 1073741824
  disable_progress_bar: true
  backups_to_keep_local: 0
  backups_to_keep_remote: 0
  log_level: info
  allow_empty_backups: false
  download_concurrency: 16
  upload_concurrency: 16
  restore_schema_on_cluster: ""
  upload_by_part: true
  download_by_part: true
clickhouse:
  username: default
  password: ""
  host: localhost
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  - INFORMATION_SCHEMA.*
  - information_schema.*
  timeout: 5m
  freeze_by_part: false
  secure: false
  skip_verify: false
  sync_replicated_tables: false
  log_sql_queries: false
  config_dir: /etc/clickhouse-server/
  restart_command: systemctl restart clickhouse-server
  ignore_not_exists_error_during_freeze: true
  debug: false
s3:
  access_key: ***
  secret_key: ***
  bucket: ***
  endpoint: fra1.digitaloceanspaces.com
  region: us-east-1
  acl: private
  assume_role_arn: ""
  force_path_style: false
  path: /backup
  disable_ssl: false
  compression_level: 1
  compression_format: tar
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 2
  part_size: 0
  debug: false

Slach (Collaborator) commented Apr 14, 2022

@dkokot could you remove

general:
  max_file_size: 1073741824

from your config and try the altinity/clickhouse-backup:latest docker image?

dkokot commented Apr 14, 2022

@dkokot could you remove

general:
  max_file_size: 1073741824

from your config and try the altinity/clickhouse-backup:latest docker image?

That's the default value; which one should I use to override it?

Currently I'm running it like this:

docker run  --rm -it --network host -v "/var/lib/clickhouse:/var/lib/clickhouse" \
-e REMOTE_STORAGE="s3" \
-e DISABLE_PROGRESS_BAR="true" \
-e S3_ENDPOINT="fra1.digitaloceanspaces.com" \
-e S3_BUCKET="***" \
-e S3_ACCESS_KEY="***" \
-e S3_SECRET_KEY="***" \
-e S3_PATH="/backup" \
-e S3_CONCURRENCY="2" \
   alexakulov/clickhouse-backup

I also tried master and version 1.3.0.

dkokot commented Apr 16, 2022

I tried to compile from source (using the master branch) with no luck, same freeze. My DB is quite big (around 50k files in the backup folder, 950 GB). Can I somehow help to debug?

Slach mentioned this issue Apr 19, 2022
dkokot commented Apr 19, 2022

Hello there

After carefully checking the DigitalOcean Spaces limitations, I found the main reason for the freeze: their API doesn't fully support ListObjectsV2, which is used when querying S3 storage.
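
clickhouse-backup talks to S3-compatible storage through aws-sdk-go, so a backend that only partially implements ListObjectsV2 (for example, mishandling continuation tokens) can stall the listing. A rough way to probe such a backend, assuming aws-sdk-go v1; the bucket and prefix below are placeholders and the endpoint is the one from the config above:

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

// Hypothetical probe, not clickhouse-backup code: list the same prefix with
// ListObjectsV2 and with the older ListObjects call. Credentials come from the
// usual AWS environment variables or shared config.
func main() {
    sess := session.Must(session.NewSession(&aws.Config{
        Region:   aws.String("us-east-1"),
        Endpoint: aws.String("https://fra1.digitaloceanspaces.com"),
    }))
    svc := s3.New(sess)

    v2 := 0
    err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
        Bucket: aws.String("my-bucket"),
        Prefix: aws.String("backup/"),
    }, func(page *s3.ListObjectsV2Output, last bool) bool {
        v2 += len(page.Contents)
        return true
    })
    if err != nil {
        log.Println("ListObjectsV2 failed:", err)
    }
    fmt.Println("ListObjectsV2 objects:", v2)

    v1 := 0
    err = svc.ListObjectsPages(&s3.ListObjectsInput{
        Bucket: aws.String("my-bucket"),
        Prefix: aws.String("backup/"),
    }, func(page *s3.ListObjectsOutput, last bool) bool {
        v1 += len(page.Contents)
        return true
    })
    if err != nil {
        log.Println("ListObjects failed:", err)
    }
    fmt.Println("ListObjects (v1) objects:", v1)
}

If the two counts disagree, or the V2 call never returns, the backend's V2 listing support is the problem rather than clickhouse-backup itself.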

Anyway, I found Altinity's version 1.2.4 more stable and am now using it with the FTP backend on Hetzner. This setup works like a charm in our situation (2 replicated servers, around 1 TB of raw data in ClickHouse).

I also found the documentation regarding incremental backups really poor and will be happy to add some notes to the README soon. Hope it will be helpful for others.

BR, Dmytro

Slach (Collaborator) commented Apr 19, 2022

@dkokot thanks a lot for clarifying.

I also found the documentation regarding incremental backups really poor and will be happy to add some notes to the README soon.

Which part of the documentation exactly did you miss?
Did you see https://github.com/Altinity/clickhouse-backup/blob/master/Examples.md#how-do-incremental-backups-work-to-remote-storage?

Anyway, feel free to make a Pull Request; let's make clickhouse-backup better together.

sushraju added a commit to muxinc/clickhouse-backup that referenced this issue Jun 27, 2022
* fix Altinity#311

* fix Altinity#312

* fix https://github.com/Altinity/clickhouse-backup/runs/4385266807

* fix wrong amd64 `libc` dependency

* change default skip_tables pattern to exclude INFORMATION_SCHEMA database for clickhouse 21.11+

* actualize GET /backup/actions, and fix config.go `CLICKHOUSE_SKIP_TABLES` definition

* add COS_DEBUG separate setting, wait details in Altinity#316

* try to resolve Altinity#317

* Allow using OIDC token for AWS credentials

* update ReadMe.md add notes about INFORMATION_SCHEMA.*

* fix Altinity#220, allow total_bytes as uint64 fetching
fix allocations for `tableMetadataForDownload`
fix getTableSizeFromParts behavior only for required tables
fix Error handling on some suggested cases

* fix Altinity#331, corner case when `Table`  and `Database` have the same name.
update clickhouse-go to 1.5.1

* fix Altinity#331

* add SFTP_DEBUG to try debug Altinity#335

* fix bug, recursuve=>recursive

* BackUPList use 'recursive=true', and other codes do not change, hope this can pass CI

* Force recursive equals true locally

* Reset recursive flag to false

* fix Altinity#111

* add inner Interface for COS

* properly fix for recursive delimiter, fix Altinity#338

* Fix bug about metadata.json, we should check the file name first, instead of appending metadata.json arbitrary

* add ability to restore schema ON CLUSTER, fix Altinity#145

* fix bug about clickhouse-backup list remote which shows no backups info, clickhouse-backup create_remote which will not delete the right backups

* fix `Address: NULL pointer` when DROP TABLE ... ON CLUSTER, fix Altinity#145

* try to fix `TestServerAPI` https://github.com/Altinity/clickhouse-backup/runs/4727526265

* try to fix `TestServerAPI` https://github.com/Altinity/clickhouse-backup/runs/4727754542

* Give up using metaDataFilePath variable

* fix bug

* Add support encrypted disk (include s3 encrypted disks), fix [Altinity#260](Altinity#260)
add 21.12 to test matrix
fix FTP MkDirAll behavior
fix `restore --rm` behavior for 20.12+ for tables which have dependent objects (like dictionary)

* try to fix failed build https://github.com/Altinity/clickhouse-backup/runs/4749276032

* add S3 only disks check for 21.8+

* fix Altinity#304

* fix Altinity#309

* try return GCP_TESTS back

* fix run GCP_TESTS

* fix run GCP_TESTS, again

* split build-artifacts and build-test-artifacts

* try to fix https://github.com/Altinity/clickhouse-backup/runs/4757549891

* debug workflows/build.yaml

* debug workflows/build.yaml

* debug workflows/build.yaml

* final download atrifacts for workflows/build.yaml

* fix build docker https://github.com/Altinity/clickhouse-backup/runs/4758167628

* fix integration_tests https://github.com/AlexAkulov/clickhouse-backup/runs/4758357087

* Improve list remote speed via local metadata cache, fix Altinity#318

* try to fix https://github.com/Altinity/clickhouse-backup/runs/4763790332

* fix test after fail https://github.com/Altinity/clickhouse-backup/runs/4764141333

* fix concurrency MkDirAll for FTP remote storage, improve `invalid compression_format` error message

* fix TestLongListRemote

* Clean code, do not name variables so sloppily, names should be meaningful

* Update clickhouse.go

Change partitions => part

* Not change Files filed in json file

* Code should be placed in proper position

* Update server.go

* fix bug

* Invoke SoftSelect should begin with ch.

* fix error, clickhouse.common.TablePathEncode => common.TablePathEncode

* refine code

* try to commit

* fix bug

* Remove unused codes

* Use NewReplacer

* Add `CLICKHOUSE_IGNORE_NOT_EXISTS_ERROR_DURING_FREEZE`, fix Altinity#319

* fix test fail https://github.com/Altinity/clickhouse-backup/runs/4825973411?check_suite_focus=true

* run only TestSkipNotExistsTable on Github actions

* try to fix TestSkipNotExistsTable

* try to fix TestSkipNotExistsTable

* try to fix TestSkipNotExistsTable, for ClickHouse version v1.x

* try to fix TestSkipNotExistsTable, for ClickHouse version v1.x

* add microseconds to log, try to fix TestSkipNotExistsTable, for ClickHouse version v20.8

* add microseconds to log, try to fix TestSkipNotExistsTable, for ClickHouse version v20.8

* fix connectWithWait, some versions of clickhouse accept connections during process /entrypoint-initdb.d, need wait to continue

* add TestProjections

* rename dropAllDatabases to more mental and clear name

* skip TestSkipNotExistsTable

* Support specified partition backup (Altinity#356)

* Support specify partition during backup create

Authored-by: wangzhen <wangzhen@growingio.com>

* fix PROJECTION restore Altinity#320

* fix TestProjection fail after https://github.com/Altinity/clickhouse-backup/actions/runs/1712868840

* switch to `altinity-qa-test` bucket in GCS test

* update github.com/mholt/archiver/v3 and github.com/ClickHouse/clickhouse-go to latest version, remove old github.com/mholt/archiver usage

* fix `How to convert MergeTree to ReplicatedMergeTree` instruction

* fix `FTP` connection usage in MkDirAll

* optimize ftp.go connection pool

* Add `UPLOAD_BY_PART` config settings for improve upload/download concurrency fix Altinity#324

* try debug https://github.com/AlexAkulov/clickhouse-backup/runs/4920777422

* try debug https://github.com/AlexAkulov/clickhouse-backup/runs/4920777422

* fix vsFTPd 500 OOPS: vsf_sysutil_bind, maximum number of attempts to find a listening port exceeded, fix https://github.com/AlexAkulov/clickhouse-backup/runs/4921182982

* try to fix race condition in GCP https://github.com/AlexAkulov/clickhouse-backup/runs/4924432841

* update clickhouse-go to 1.5.3, properly handle `--schema` parameter for show local backup size after `download`

* add `Database not exists` corner case for `IgnoreNotExistsErrorDuringFreeze` option

* prepare release 1.3.0
- Add implementation `--diff-from-remote` for `upload` command and properly handle `required` on download command, fix Altinity#289
- properly `REMOTE_STORAGE=none` error handle, fix Altinity#375
- Add support for `--partitions` on create, upload, download, restore CLI commands and API endpoint fix Altinity#378, properly implementation of Altinity#356
- Add `print-config` cli command fix Altinity#366
- API Server optimization for speed of `last_backup_size_remote` metric calculation to make it async during REST API startup and after download/upload, fix Altinity#309
- Improve `list remote` speed via local metadata cache in `$TEMP/.clickhouse-backup.$REMOTE_STORAGE`, fix Altinity#318
- fix Altinity#375, properly `REMOTE_STORAGE=none` error handle
- fix Altinity#379, will try to clean `shadow` if `create` fail during `moveShadow`
- more precise calculation backup size during `upload`, for backups created with `--partitions`, fix bug after Altinity#356
- fix `restore --rm` behavior for 20.12+ for tables which have dependent objects (like dictionary)
- fix concurrency by `FTP` creation directories during upload, reduce connection pool usage
- properly handle `--schema` parameter for show local backup size after `download`
- add ClickHouse 22.1 instead of 21.12 to test matrix

* fix build https://github.com/Altinity/clickhouse-backup/runs/5033550335

* Add `API_ALLOW_PARALLEL` to support multiple parallel execution calls for, WARNING, control command names don't try to execute multiple same commands and be careful, it could allocate much memory during upload / download, fix Altinity#332

* apt-get update too slow today on github ;(

* fix TestLongListRemote

* fix Altinity#340, properly handle errors on S3 during Walk() and delete old backup

* Add TestFlows tests to GH workflow (Altinity#5)

* add actions tests

* Update test.yaml

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* updated

* added config.yml

* added config.yml

* update

* updated files

* added tests for views

* updated tests

* updated

* fixed snapshots

* updated tests in response to @Slach

* upload new stuff

* rerun

* fix

* fix

* remove file

* added requirements

* fix fails

* ReRun actions

* Moved credentials

* added secrets

* ReRun actions

* Edited test.yaml

* Edited test.yaml

* ReRun actions

* removed TE flag

* update

* update

* update

* fix type

* update

* try to reanimate ugly github actions and ugly python tests

* try to reanimate ugly config_rbac.py

* fix Altinity#300
fix WINDOW VIEW restore
fix restore for different compression_format than backup created
fix most of xfail in regression.py
merge test.yaml and build.yaml in github actions
Try to add experimental support for backup `MaterializedMySQL` and `MaterializedPostgeSQL` tables, restore MySQL tables not impossible now without replace `table_name.json` to `Engine=MergeTree`, PostgreSQL not supported now, see ClickHouse/ClickHouse#32902

* return format back

* fix build.yaml after https://github.com/Altinity/clickhouse-backup/actions/runs/1800312966

* fix build.yaml after https://github.com/Altinity/clickhouse-backup/actions/runs/1800312966

* build fixes after https://github.com/Altinity/clickhouse-backup/runs/5079597138

* build fixes after https://github.com/Altinity/clickhouse-backup/runs/5079630559

* build fixes after https://github.com/Altinity/clickhouse-backup/runs/5079669062

* fix tfs report

* fix upload artifact for tfs report

* fix upload artifact for clickhouse logs, remove unused BackupOptions

* suuka

* fix upload `clickhouse-logs` artifacts and tfs `report.html`

* fix upload `clickhouse-logs` artifacts

* fix upload `clickhouse-logs` artifacts, fix tfs reports

* fix tfs reports

* change retention to allow upload-artifacts work

* fix ChangeLog.md

* skip gcs and aws remote storage tests if secrets not set

* remove short output

* increase timeout to allow download images during pull

* remove upload `tesflows-clickhouse-logs` artifacts to avoid 500 error

* fix upload_release_assets action for properly support arm64

* switch to mantainable `softprops/action-gh-release`

* fix Unexpected input(s) 'release_name'

* move internal, config, util into `pkg` refactoring

* updated test requirements

* refactoring `filesystemhelper.Chown` remove unnecessary getter/setter, try to reproduce access denied for Altinity#388 (comment)

* resolve Altinity#390, for 1.2.3 hotfix branch

* backport 1.3.x Dockerfile and Makefile to allow 1.2.3 docker ARM support

* fix Altinity#387 (comment), improve documentation related to memory and CPU usage

* fix Altinity#388, improve restore ON CLUSTER for VIEW with TO clause

* fix Altinity#388, improve restore ATTACH ... VIEW ... ON CLUSTER, GCS golang sdk updated to latest

* fix Altinity#385, properly handle multiple incremental backup sequences + `BACKUPS_TO_KEEP_REMOTE`

* fix Altinity#392, correct download for recursive sequence of diff backups when `DOWNLOAD_BY_PART` true
fix integration_test.go, add RUN_ADVANCED_TESTS environment, fix minio_nodelete.sh

* try to reduce upload artifact jobs, look actions/upload-artifact#171 and https://github.com/Altinity/clickhouse-backup/runs/5229552384?check_suite_focus=true

* try to docker-compose up from first time https://github.com/AlexAkulov/clickhouse-backup/runs/5231510719?check_suite_focus=true

* disable telemetry for GCS related to googleapis/google-cloud-go#5664

* update aws-sdk-go and GCS storage SDK

* DROP DATABASE didn't clean S3 files, DROP TABLE clean!

* - fix Altinity#406, properly handle `path` for S3, GCS for case when it begin from "/"

* fix getTablesWithSkip

* fix Altinity#409

* cherry pick release.yaml from 1.3.x to 1.2.x

* fix Altinity#409, for 1.3.x avoid delete partially uploaded backups via `backups_keep_remote` option

* Updated requirements file

* fix Altinity#409, for 1.3.x avoid delete partially uploaded backups via `backups_keep_remote` option

* fix testflows test

* fix testflows test

* restore tests after update minio

* Fix incorrect in progress check on the example of Kubernetes CronJob

* removeOldBackup error log from fatal to warning, to avoid race-condition deletion during multi-shard backup

* switch to golang 1.18

Signed-off-by: Slach <bloodjazman@gmail.com>

* add 22.3 to test matrix, fix Altinity#422, avoid cache broken (partially uploaded) remote backup metadata.

* add 22.3 to test matrix

* fix Altinity#404, switch to 22.3 by default

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix Altinity#404, update to archiver/v4, properly support context during upload / download and correct error handler, reduce `SELECT * system.disks` calls

Signed-off-by: Slach <bloodjazman@gmail.com>

* cleanup ChangeLog.md, finally before 1.3.2 release

Signed-off-by: Slach <bloodjazman@gmail.com>

* continue fix Altinity#404

Signed-off-by: Slach <bloodjazman@gmail.com>

* continue fix Altinity#404, properly calculate max_parts_count

Signed-off-by: Slach <bloodjazman@gmail.com>

* continue fix Altinity#404, properly calculate max_parts_count

Signed-off-by: Slach <bloodjazman@gmail.com>

* add multithreading GZIP implementation

Signed-off-by: Slach <bloodjazman@gmail.com>

* add multithreading GZIP implementation

Signed-off-by: Slach <bloodjazman@gmail.com>

* add multithreading GZIP implementation

Signed-off-by: Slach <bloodjazman@gmail.com>

* Updated Testflows README.md

* add `S3_ALLOW_MULTIPART_DOWNLOAD` to config, to improve download speed, fix Altinity#431

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix snapshot after change default config

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix testflows healthcheck for slow internet connection during `clickhouse_backup` start

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix snapshot after change defaultConfig

Signed-off-by: Slach <bloodjazman@gmail.com>

* - add support backup/restore user defined functions https://clickhouse.com/docs/en/sql-reference/statements/create/function, fix Altinity#420

Signed-off-by: Slach <bloodjazman@gmail.com>

* Updated README.md in testflows tests

* remove unnecessary SQL query for calculateMaxSize, refactoring test to allow restoreRBAC with restart on 21.8 (strange bug, clickhouse stuck after try to run too much distributed DDL queries from ZK), update LastBackupSize metric during API call /list/remote, add healthcheck to docker-compose in integration tests

Signed-off-by: Slach <bloodjazman@gmail.com>

* try to fix GitHub actions

Signed-off-by: Slach <bloodjazman@gmail.com>

* try to fix GitHub actions, WTF, why testflows failed?

Signed-off-by: Slach <bloodjazman@gmail.com>

* add `clickhouse_backup_number_backups_remote`, `clickhouse_backup_number_backups_local`, `clickhouse_backup_number_backups_remote_expected`,`clickhouse_backup_number_backups_local_expected` prometheus metric, fix Altinity#437

Signed-off-by: Slach <bloodjazman@gmail.com>

* add ability to apply `system.macros` values to `path` field in all types of `remote_storage`, fix Altinity#438

Signed-off-by: Slach <bloodjazman@gmail.com>

* use all disks for upload and download for mutli-disk volumes in parallel when `upload_by_part: true` fix Altinity#400

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix wrong warning for .gz, .bz2, .br archive extensions during download, fix Altinity#441

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix Altinity#441, again ;(

Signed-off-by: Slach <bloodjazman@gmail.com>

* try to improve strange parts long tail during test

Signed-off-by: Slach <bloodjazman@gmail.com>

* update actions/download-artifact@v3 and actions/upload-artifact@v2, after actions fail

Signed-off-by: Slach <bloodjazman@gmail.com>

* downgrade actions/upload-artifact@v2.2.4, actions/upload-artifact#270, after actions fail https://github.com/AlexAkulov/clickhouse-backup/runs/6481819375

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix upload data go routines wait, expect improve upload speed the same as 1.3.2

Signed-off-by: Slach <bloodjazman@gmail.com>

* prepare 1.4.1

Signed-off-by: Slach <bloodjazman@gmail.com>

* Fix typo in Example.md

* Set default value for max_parts_count in Azure config

* fix `--partitions` parameter parsing, fix Altinity#425

Signed-off-by: Slach <bloodjazman@gmail.com>

* remove unnecessary logs, fix release.yaml to mark properly tag in GitHub release

Signed-off-by: Slach <bloodjazman@gmail.com>

* add `API_INTEGRATION_TABLES_HOST` option to allow use DNS name in integration tables system.backup_list, system.backup_actions

Signed-off-by: Slach <bloodjazman@gmail.com>

* add `API_INTEGRATION_TABLES_HOST` fix for tesflows fails

Signed-off-by: Slach <bloodjazman@gmail.com>

* fix `upload_by_part: false` max file size calculation, fix Altinity#454

* upgrade actions/upload-artifact@v3, actions/upload-artifact#270, after actions fail https://github.com/Altinity/clickhouse-backup/runs/6962550621

* [clickhouse-backup] fixes on top of upstream

* upstream versions

Co-authored-by: Slach <bloodjazman@gmail.com>
Co-authored-by: Vilmos Nebehaj <vilmos@sprig.com>
Co-authored-by: Eugene Klimov <eklimov@altinity.com>
Co-authored-by: root <root@SLACH-MINI.localdomain>
Co-authored-by: wangzhen <wangzhen@growingio.com>
Co-authored-by: W <wangzhenaaa7@gmail.com>
Co-authored-by: Andrey Zvonov <32552679+zvonand@users.noreply.github.com>
Co-authored-by: zvonand <azvonov@altinity.com>
Co-authored-by: benbiti <wangshouben@hotmail.com>
Co-authored-by: Vitaliis <vsviderskyi@altinity.com>
Co-authored-by: Toan Nguyen <hgiasac@gmail.com>
Co-authored-by: Guido Iaquinti <guido@posthog.com>
Co-authored-by: ricoberger <mail@ricoberger.de>