
br: failed to backup to s3 due to the limit of request rate. #30087

Closed
3pointer opened this issue Nov 24, 2021 · 11 comments

Labels: affects-4.0, affects-5.0, affects-5.1, affects-5.3, affects-6.0, affects-6.1, component/br, may-affects-5.2, may-affects-5.4, severity/critical, type/bug

Comments

3pointer (Contributor) commented Nov 24, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Not reproduced yet; the steps below are a sketch of the minimal reproduction.

  1. Back up to a specified bucket and prefix.
  2. Have another service read/put objects under the same bucket prefix, driving the PUT rate on that prefix to 3500/s.

2. What did you expect to see? (Required)

BR or TiKV should slow down the backup requests, and the backup should still succeed after the other service stops.

3. What did you see instead (Required)

TiKV received a 503 Slow Down error, and BR didn't handle this error properly, which interrupted the backup.

4. What is your TiDB version? (Required)

all

3pointer added the type/bug and component/br labels on Nov 24, 2021
kolbe (Contributor) commented Nov 24, 2021

It seems like the rate limiting needs to happen in TiKV, since TiKV is the component actually interacting with S3. Or maybe you think the limiting should happen in BR, since BR is telling each of the TiKV nodes when and how many regions to back up?

3pointer (Contributor, Author) commented Nov 25, 2021

Yes, BR builds tasks (one table or index corresponds to one task) and controls how the tasks interact with TiKV; each task corresponds to one or many regions in TiKV, depending on the table or index size.
So we have two levels of concurrency:

  1. task level (controlled by BR's --concurrency flag)
  2. thread level (controlled by TiKV's num-threads config)

But in our experience, it's hard to believe a normal backup can reach 3500 PUTs per second.
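As a rough model (illustrative, not a measurement): the aggregate PUT rate is bounded by roughly stores × num-threads × requests_per_upload ÷ seconds_per_upload, and multipart uploads issue one PUT-class request per part. So the quota only comes under threat when many stores, many threads, and small multipart parts combine.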

3pointer (Contributor, Author) commented:

For the long term, we need a flow controller that keeps the request rate from exceeding the cloud provider's limit. After that, we may be able to abandon configurations such as concurrency/num-threads/multi_part_size and implement automatic concurrency easily. @IANTHEREAL
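A flow controller along these lines could be built on a shared token bucket. A minimal sketch in Go, assuming a hypothetical uploader wrapper (rateLimitedUploader and its Put signature are illustrative, not BR's or TiKV's actual API); the token bucket comes from golang.org/x/time/rate:

// Sketch of a flow controller as a shared token bucket.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// rateLimitedUploader bounds the aggregate PUT rate so it stays under the
// cloud provider's per-prefix quota, regardless of how many backup threads
// are running concurrently.
type rateLimitedUploader struct {
	limiter *rate.Limiter
	put     func(ctx context.Context, key string, data []byte) error
}

func (u *rateLimitedUploader) Put(ctx context.Context, key string, data []byte) error {
	// Wait blocks until a token is available, smoothing bursts from many
	// concurrent upload threads into a bounded request rate.
	if err := u.limiter.Wait(ctx); err != nil {
		return err
	}
	return u.put(ctx, key, data)
}

func main() {
	up := &rateLimitedUploader{
		// Stay under S3's 3500 PUT/s per-prefix quota, leaving headroom
		// for other services writing to the same prefix.
		limiter: rate.NewLimiter(rate.Limit(3000), 100),
		put: func(ctx context.Context, key string, data []byte) error {
			fmt.Println("PUT", key) // stand-in for the real S3 request
			return nil
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = up.Put(ctx, "backup/backup-xxx.sst", nil)
}

With one such limiter shared by all upload threads on a node, the per-thread and per-task concurrency knobs would no longer determine the request rate, which is what would make auto-concurrency feasible.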

jebter added the affects-4.0, affects-5.0, affects-5.1, affects-5.2, and affects-5.3 labels on Jan 11, 2022
VelocityLight removed the affects-5.2 label on Apr 12, 2022
VelocityLight added the affects-6.1 label on May 20, 2022
ti-chi-bot added a commit to tikv/tikv that referenced this issue Jun 14, 2022

ref #11666, ref pingcap/tidb#30087

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Signed-off-by: 3pointer <luancheng@pingcap.com>

Co-authored-by: 3pointer <luancheng@pingcap.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
morgo (Contributor) commented Jun 20, 2022

This causes backup failures for us :( It is not in production (yet), but we don't have a workaround.

ti-chi-bot added the may-affects-5.2 and may-affects-5.4 labels on Jun 20, 2022
3pointer (Contributor, Author) commented Jun 21, 2022

> This causes backup failures for us :( It is not in production (yet), but we don't have a workaround.

After tikv/tikv#11666 was merged, we do have a workaround: setting the s3_multi_part_size config to disable multipart requests to S3.

In our experience, a cluster with 100+ TiKV nodes can finish a backup with s3_multi_part_size set to 30M.
But in the future, for clusters with 600+ TiKV nodes, this may not work, and at that point we may need to build a high-level task scheduler mechanism.

So I'll leave this issue open.
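For anyone trying the workaround, a hedged sketch of the setting, assuming it lives under TiKV's [backup] section as s3-multi-part-size (verify the exact key name and placement against your TiKV version's config reference):

# tikv.toml (sketch only; check your TiKV version's docs)
[backup]
# With a 30MB threshold, typical backup SST files go up in a single PUT
# instead of many multipart requests, reducing pressure on the
# per-prefix request-rate quota.
s3-multi-part-size = "30MB"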

3pointer (Contributor, Author) commented:

> This causes backup failures for us :( It is not in production (yet), but we don't have a workaround.

You can try the workaround first. If it works, I'll mark this issue as major rather than critical.

morgo (Contributor) commented Jun 21, 2022

Thanks, we will try this and get back to you soon!

morgo (Contributor) commented Jun 23, 2022

Bad news: this option worked at the start, but after an hour or so a lot of retrying started, so we are having to set the backup num-threads=4.

To give you an idea of the impact:

  • With multi-part disabled, we can back up 16TB in 1hr.
  • With num-threads=4, it's only about 1TB/hr (or 700GB, somewhere around there).

Because TiDB doesn't have PITR, we have to rely on full backups. I consider it reasonable to require daily backups to complete in 3-4 hours, so that limits our maximum TiDB cluster size to 4TB (which is a serious limitation: it's not much larger than MySQL).

Part of the reason we are hitting this is that the backup does not follow S3 best practices (additional link). There is a quota of 3500 PUT/COPY/POST/DELETE requests per second per prefix, but no limit on the number of prefixes. So if the writes can be distributed across prefixes, we can scale to clusters larger than 4TB with no problem.
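Concretely, since the quota applies per prefix, N distinct prefixes allow roughly N × 3500 PUT/s in aggregate; with one prefix per TiKV store, a 100-store cluster would have around 350,000 PUT/s of headroom.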

3pointer (Contributor, Author) commented:

Thanks for the feedback.

I realized the same issue after we discussed this internally. We can change the backup structure; the detailed solution is below, and we are working on it.

Solution

- Adjust the backup organization structure: add a store_id-related prefix under the backup path.
  - According to https://aws.amazon.com/cn/premiumsupport/knowledge-center/s3-troubleshoot-503-cross-region/, only requests under the same prefix share the 3500/s rate limit.
  - So giving different stores different prefixes within one backup is natural.
- For example, after running

./br backup --pd "127.0.0.1:2379" -s "s3://backup/20220621"

  we will have the structure below:

➜  backup tree .
.
└── 20220621
    ├── backupmeta
    ├── store1
    │   └── backup-xxx.sst
    ├── store100
    │   └── backup-yyy.sst
    ├── store2
    │   └── backup-zzz.sst
    ├── store3
    ├── store4
    └── store5

- Then we only need to set the correct rate limit for each store to solve the issue.
- For restore, we need to handle the backup path carefully so the download reaches the correct SST files.
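To make the layout concrete, a minimal sketch in Go (storePrefix and uploadSST are hypothetical names for illustration only; the actual change landed in TiKV, see tikv/tikv#12958):

package main

import (
	"fmt"
	"path"
)

// storePrefix returns the per-store prefix under the backup root, e.g.
// "20220621/store42". Because S3's 3500 PUT/s quota applies per prefix,
// giving each TiKV store its own prefix lets each store be rate-limited
// independently instead of all stores contending for one shared quota.
func storePrefix(backupRoot string, storeID uint64) string {
	return path.Join(backupRoot, fmt.Sprintf("store%d", storeID))
}

// uploadSST is a stand-in for the real upload call; each store would apply
// its own limiter (capped below 3500 PUT/s) before issuing the PUT.
func uploadSST(key string) {
	fmt.Println("PUT", key)
}

func main() {
	for _, storeID := range []uint64{1, 2, 100} {
		key := path.Join(storePrefix("20220621", storeID), "backup-xxx.sst")
		uploadSST(key)
	}
}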

MoCuishle28 added a commit to MoCuishle28/tikv that referenced this issue Jul 5, 2022
Signed-off-by: Gaoming <zhanggaoming028@gmail.com>
MoCuishle28 added a commit to MoCuishle28/tikv that referenced this issue Jul 5, 2022
Signed-off-by: Gaoming <zhanggaoming028@gmail.com>
MoCuishle28 added a commit to MoCuishle28/tikv that referenced this issue Jul 16, 2022
Signed-off-by: zhanggaoming <gaoming.zhang@pingcap.com>
MoCuishle28 added a commit to MoCuishle28/tikv that referenced this issue Jul 16, 2022
Signed-off-by: zhanggaoming <gaoming.zhang@pingcap.com>
MoCuishle28 added a commit to MoCuishle28/tikv that referenced this issue Jul 18, 2022
Signed-off-by: zhanggaoming <gaoming.zhang@pingcap.com>
kolbe (Contributor) commented Jul 20, 2022

@3pointer is there a PR associated with a final fix for this issue?

3pointer (Contributor, Author) commented:

> @3pointer is there a PR associated with a final fix for this issue?

The PR is in TiKV: tikv/tikv#12958
