
Introduces a quota subsystem #420
Merged: 8 commits into master, Oct 5, 2021

Conversation

@bom-d-van (Member) commented on Jul 22, 2021

The quota subsystem is made to improve control, reliability, and visibility of go-carbon.
It is not a standard graphite component, but it is backward compatible and
can be enabled optionally.

Caveat: the current implementation only supports the concurrent/realtime trie index.

The quota subsystem allows users to control how many resources can be consumed
on a pattern-matching basis.

Implemented controls include: data points (based on retention policy), disk size
(logical and physical), throughput, metric count, and namespaces (i.e. immediate
sub-directory count).

More details can be found in doc/quotas.md in the PR.

An example configuration:

```ini
# This controls all the namespaces under root
[*]
metrics       =       1,000,000
logical_size  = 250,000,000,000
physical_size =  50,000,000,000
# max means practically no limit
data_points   =             max
throughput    =             max

[sys.app.*]
metrics       =         3,000,000
logical_size  = 1,500,000,000,000
physical_size =   100,000,000,000
data_points   =   130,000,000,000

# This controls the root/global limits
[/]
namespaces    =                20
metrics       =        10,000,000
logical_size  = 2,500,000,000,000
physical_size = 2,500,000,000,000
data_points   =   200,000,000,000
dropping_policy = new
```

Throttling control is implemented in carbonserver, while quota config
is implemented in persister (mainly for convenience).
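
For illustration only, here is a minimal Go sketch of how the human-friendly values in the example above (comma-separated digits and the special value `max`) could be parsed; `parseQuotaValue` is a hypothetical helper, not the actual go-carbon persister code:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseQuotaValue is a hypothetical helper: it accepts values like
// "1,000,000" and "max" as used in the example config above.
func parseQuotaValue(s string) (int64, error) {
	s = strings.TrimSpace(s)
	if s == "max" {
		// "max" means practically no limit; MaxInt64 is one way to model that.
		return math.MaxInt64, nil
	}
	return strconv.ParseInt(strings.ReplaceAll(s, ",", ""), 10, 64)
}

func main() {
	v, _ := parseQuotaValue("250,000,000,000")
	fmt.Println(v) // 250000000000
}
```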

@bom-d-van force-pushed the quota branch 3 times, most recently from 851011f to 3775b56 on July 22, 2021 12:55
@grzkv requested review from grzkv and grlvrl on July 22, 2021 15:30
Review comments on doc/quotas.md and deploy/storage-quota.conf (resolved)
@bom-d-van force-pushed the quota branch 2 times, most recently from a95134c to 30cd0cd on July 23, 2021 07:09
@bom-d-van self-assigned this on Jul 28, 2021
@bom-d-van force-pushed the quota branch 4 times, most recently from 0607c62 to 4a173f5 on July 29, 2021 08:56
@bom-d-van marked this pull request as ready for review on September 7, 2021 06:47
@bom-d-van (Member, Author)

Hi guys, these changes are finally ready for review (it took a few force-pushes to make DeepSource happy). Can you take a look when you have time and let me know if we can get it merged?

The changes have been deployed on prod for a few weeks and no new issues have been identified so far. As always, the subsystem is intended to be backward compatible and should introduce no breaking changes.

cc @deniszh @azhiltsov @Civil

@azhiltsov (Member)

How many stat metrics should I expect with this in the quota config?

[*]

or with this:

[sys.app.*]

As many as the quantity of metrics inside, or is this not going to resolve the * to the actual names?
What will be the naming scheme for those metrics?

@bom-d-van (Member, Author)

> How many stat metrics should I expect with this in the quota config?
>
> [*]
>
> or with this:
>
> [sys.app.*]
>
> As many as the quantity of metrics inside, or is this not going to resolve the * to the actual names?
> What will be the naming scheme for those metrics?

Hi @azhiltsov , it's like writing a graphite query. For example, suppose there are 3 top-level namespaces in the go-carbon instance: sys, user, and net. Then * is the same as writing 3 separate quota rules with the same config:

```ini
[*]
metrics = 1,000,000
```

The config above is equivalent, for this example, to the config below:

```ini
[user]
metrics = 1,000,000

[sys]
metrics = 1,000,000

[net]
metrics = 1,000,000
```

The same goes for sys.app.*: it matches the namespaces under sys.app.

Hope this explains it.
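
To make the expansion concrete, here is a small Go sketch, under the assumption that quota patterns behave like per-node Graphite globs as described above; `matchQuotaPattern` is a hypothetical helper, not the go-carbon API:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchQuotaPattern is a hypothetical helper: it returns the namespaces a
// quota pattern like "*" or "sys.app.*" would expand to. path.Match is
// applied per dot-separated node, so "*" does not cross namespace boundaries.
func matchQuotaPattern(pattern string, namespaces []string) []string {
	var matched []string
	p := strings.ReplaceAll(pattern, ".", "/")
	for _, ns := range namespaces {
		if ok, _ := path.Match(p, strings.ReplaceAll(ns, ".", "/")); ok {
			matched = append(matched, ns)
		}
	}
	return matched
}

func main() {
	// "*" expands to one rule per top-level namespace, as in the example above.
	fmt.Println(matchQuotaPattern("*", []string{"sys", "user", "net"}))
	// [sys user net]
}
```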

> simplifyPathError returns an error without the path in the error message. This
> simplifies the log a bit, as the path is usually printed separately, and the
> new error message is easier to filter in Elasticsearch and other tools.
WHY: with the concurrent and realtime index, the disk scan should be set at
an interval like 2 hours or longer. Counting the files in the trie index
gives us more timely visibility into how many metrics are known right now.
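
A sketch only, assuming a simplified node shape (`trieNode` is hypothetical and far simpler than go-carbon's actual compact trie): counting metrics from the in-memory index instead of waiting for the next disk scan is just a traversal that counts file-terminating nodes.

```go
package index

// trieNode is a hypothetical, simplified node for illustration.
type trieNode struct {
	children []*trieNode
	isFile   bool // terminates a metric, i.e. corresponds to a .wsp file
}

// countFiles walks the in-memory index, giving an up-to-date metric count
// without waiting for the next (infrequent) disk scan.
func countFiles(n *trieNode) int {
	if n == nil {
		return 0
	}
	total := 0
	if n.isFile {
		total++
	}
	for _, c := range n.children {
		total += countFiles(c)
	}
	return total
}
```
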
The original implementation would keep resending the quota and usage metrics after the
initial flush, which is not helpful. A channel is now used to avoid duplicate flushes.
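
As a hedged illustration of the "flush once per update" idea (a generic Go pattern sketch, not the exact code in this PR; `statFlusher` is a made-up name):

```go
package stats

// statFlusher coalesces update signals in a buffered channel of capacity 1,
// so quota/usage metrics are flushed once per update instead of being resent.
type statFlusher struct {
	updated chan struct{}
}

func newStatFlusher() *statFlusher {
	return &statFlusher{updated: make(chan struct{}, 1)}
}

// markUpdated signals that fresh quota/usage stats are available.
func (f *statFlusher) markUpdated() {
	select {
	case f.updated <- struct{}{}:
	default: // a flush is already pending; don't queue a duplicate
	}
}

// loop flushes only when something new has been produced.
func (f *statFlusher) loop(flush func()) {
	for range f.updated {
		flush()
	}
}
```
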
When it comes to point lookups, a map is very fast, much faster than traversing the trie.
What's more, the current trie implementation adopts a space-efficient strategy and children
are not sorted, which means much higher CPU usage. We were seeing up to 700% CPU usage growth
in a cluster receiving 1 million+ data points.

By adopting a map-based implementation, we are able to cut up to 600% of CPU usage in
that cluster, which makes it much more efficient to collect and control throughput usage.
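
A minimal sketch of the map-based accounting idea (a generic pattern, not the PR's exact data structures; `throughputCounter` is hypothetical): per-namespace counters behind a read-mostly map make each incoming data point an O(1) lookup plus an atomic increment, instead of a trie traversal.

```go
package stats

import (
	"sync"
	"sync/atomic"
)

// throughputCounter counts received data points per quota namespace
// with cheap point lookups.
type throughputCounter struct {
	mu     sync.RWMutex
	counts map[string]*int64
}

func newThroughputCounter() *throughputCounter {
	return &throughputCounter{counts: map[string]*int64{}}
}

// inc records one received data point for the given namespace.
func (c *throughputCounter) inc(namespace string) {
	c.mu.RLock()
	p, ok := c.counts[namespace]
	c.mu.RUnlock()
	if !ok {
		c.mu.Lock()
		if p, ok = c.counts[namespace]; !ok {
			p = new(int64)
			c.counts[namespace] = p
		}
		c.mu.Unlock()
	}
	atomic.AddInt64(p, 1)
}
```
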
@bom-d-van (Member, Author) commented on Oct 5, 2021

Hi everyone, the subsystem has been on our production for some time and it appears to be stable.

If there are no objections, I will proceed to merge the changes today. Feel free to let me know if you have concerns.

@deniszh (Member) commented on Oct 5, 2021

Yes, looks good, @bom-d-van, please proceed.
Will make a new release soon after merging this too; we have a bunch of changes thanks to you!

@bom-d-van (Member, Author)

Thank you Denis. Merging it.

@bom-d-van merged commit 72d78a3 into master on Oct 5, 2021
@bom-d-van deleted the quota branch on October 7, 2021 12:01