br: failed to backup to s3 due to the limit of request rate. #30087
Comments
It seems like the rate limiting needs to happen in TiKV, since TiKV is the component actually interacting with S3. Or maybe you think the limiting should happen in BR, since BR is telling each of the TiKV nodes when and how many regions to back up?
Yes, BR builds tasks (one table or index corresponds to one task) and controls how each task interacts with TiKV. Each task corresponds to one or more regions in TiKV, depending on the table or index size.
But in our experience, it is hard to believe a normal backup can reach 3500 PUT requests per second.
For the long term, we need a flow controller that keeps the request rate from exceeding the cloud provider's limit, and after that we may abandon such workarounds.
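As a rough illustration of what such a flow controller could look like, here is a minimal Go sketch using golang.org/x/time/rate: a shared token bucket that every S3 PUT waits on before being issued. The 3000/s limit and the uploadFile helper are assumptions for illustration, not actual BR or TiKV code.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// uploadFile is a hypothetical stand-in for the call that PUTs one
// backup file to S3; it is not a real BR or TiKV function.
func uploadFile(ctx context.Context, name string) error {
	fmt.Println("uploading", name)
	return nil
}

func main() {
	// Allow at most 3000 PUT requests per second with a small burst,
	// staying under the 3500/s-per-prefix quota that S3 documents.
	limiter := rate.NewLimiter(rate.Limit(3000), 100)

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	files := []string{"backup/1.sst", "backup/2.sst", "backup/3.sst"}
	for _, f := range files {
		// Wait blocks until the token bucket allows another request,
		// so the aggregate PUT rate never exceeds the configured limit.
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("rate limiter:", err)
			return
		}
		if err := uploadFile(ctx, f); err != nil {
			fmt.Println("upload failed:", err)
			return
		}
	}
}
```

If the limiter is shared per S3 prefix (or per bucket), every TiKV node backing up through it stays within the quota without BR having to coordinate the schedule.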
This causes backup failures for us :( It is not in production (yet), but we don't have a workaround.
After the tikv/tikv#11666 merge, we do have a workaround by setting … In our experience, with 100+ TiKV nodes, set … So I leave this issue open here.
You can try the workaround first; if it works, I'll mark this issue as major instead of critical.
Thanks, we will try this and get back to you soon!
Bad news: this option worked at the start, but after an hour or so a lot of retrying started, so we are having to set the backup … To give you an idea of the impact:
Because TiDB doesn't have PITR, we have to rely on full backups. I consider it reasonable to require daily backups to complete in 3-4 hours, which limits our maximum TiDB cluster size to 4TB (a serious limitation: it's not much larger than MySQL). Part of the reason we are hitting this is that the backup is not following S3 best practices (additional link). There is a quota of 3,500 PUT/COPY/POST/DELETE requests per second per prefix, but no limit on the number of prefixes. So if the writes can be distributed across prefixes, we can scale to clusters larger than 4TB with no problem.
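A sketch of the prefix-distribution idea above: derive a shard prefix from a hash of each backup file key so PUT traffic spreads across many prefixes, each with its own per-second quota. The shard count and key layout here are illustrative assumptions, not the structure BR actually writes.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// shardedKey prepends one of shardCount prefixes to an object key so that
// PUT traffic is spread across multiple S3 prefixes, each with its own
// 3,500 requests-per-second quota. The layout is illustrative only.
func shardedKey(shardCount uint32, key string) string {
	shard := crc32.ChecksumIEEE([]byte(key)) % shardCount
	return fmt.Sprintf("shard-%02d/%s", shard, key)
}

func main() {
	files := []string{"backup/t_100/1.sst", "backup/t_100/2.sst", "backup/t_101/1.sst"}
	for _, f := range files {
		// Objects land under different prefixes, e.g. shard-07/backup/t_100/1.sst,
		// so no single prefix has to absorb the whole PUT rate.
		fmt.Println(shardedKey(16, f))
	}
}
```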
Thanks for the feedback. I realized the same issue after we discussed it internally. We can change the backup structure; the detailed solution is as below, and we are working on it.
@3pointer is there a PR associated with a final fix for this issue?
The PR is in TiKV: tikv/tikv#12958
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Didn't reproduce it for now; just an image of the minimal reproduce steps.
2. What did you expect to see? (Required)
BR or TiKV should slow down the backup requests; the backup should still succeed after other services stopped.
3. What did you see instead (Required)
TiKV received a 503 Slow Down error, and BR did not handle this error properly, which interrupted the backup.
4. What is your TiDB version? (Required)
all
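For context on what handling this error properly could look like (question 3 above): a common pattern is to retry a throttled PUT with exponential backoff and jitter instead of aborting the whole backup. The Go sketch below uses hypothetical putObject and errSlowDown placeholders; it is not BR's actual retry logic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errSlowDown stands in for the 503 Slow Down error S3 returns when the
// request rate for a prefix is exceeded; it is a placeholder, not the
// real SDK error type.
var errSlowDown = errors.New("SlowDown: please reduce your request rate")

// putObject is a hypothetical upload call used only for illustration; it
// fails with errSlowDown about a third of the time.
func putObject(ctx context.Context, key string) error {
	if rand.Intn(3) == 0 {
		return errSlowDown
	}
	return nil
}

// putWithBackoff retries a throttled upload with exponential backoff and
// jitter instead of failing the whole backup on the first 503 Slow Down.
func putWithBackoff(ctx context.Context, key string, maxRetries int) error {
	backoff := 100 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := putObject(ctx, key)
		if err == nil || !errors.Is(err, errSlowDown) || attempt >= maxRetries {
			return err
		}
		// Sleep for the current backoff plus jitter, then double it,
		// capping the wait so a long throttle doesn't stall forever.
		sleep := backoff + time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff < 10*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	if err := putWithBackoff(context.Background(), "backup/1.sst", 8); err != nil {
		fmt.Println("upload failed:", err)
		return
	}
	fmt.Println("upload succeeded")
}
```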