Docs: interactive capacity planning tool #1988

pracucci · 2022-06-01T16:05:48Z

We're hearing feedback from the OSS community (e.g. this Slack thread) that the capacity planning doc apparently shows more resources than are probably required. I think one reason is that proper capacity planning depends on multiple factors, while the doc is an oversimplification.

We could provide an interactive capacity planning tool where given some input (e.g. active series, samples/sec, queries/sec, retention, ...) we compute a more accurate capacity plan.

One option could be to build a Google Spreadsheet and embed it in the doc.

replay · 2022-06-01T16:34:49Z

I think that a spreadsheet seems like the easiest solution. OTOH, if we'd create a CLI tool to do the capacity planning, then all Mimir contributors could contribute to it and possibly improve it based on their own experience. A spreadsheet would likely have to be restricted in some way because otherwise there is no review process for changes to the spreadsheet.

rojas-diego · 2022-06-01T16:47:01Z

Let me know if you're looking for contributions on this!

pracucci · 2022-06-01T16:53:02Z

a spreadsheet would likely have to be restricted in some way because otherwise there is no review process for changes to the spreadsheet

That's right. And it's also more difficult to test. Writing unit tests in golang is way easier.

if we'd create a CLI tool

In this case we wouldn't have to create a new tool. We already have mimirtool: we could just add a command there.

Let me know if you're looking for contributions on this!

We do! Let's just reach a consensus on how it should work (e.g. spreadsheet vs CLI tool). Let me ping the rest of the Mimir maintainers / squad to get a quick feedback loop.

Logiraptor · 2022-06-01T17:09:12Z

I would vote for a CLI tool, because it would allow reviews, change history, etc. in a more familiar format for Mimir contributors. Having the logic implemented as Go code also makes it easier to eventually extend into more sophisticated use cases (far in the future) like an auto-scaling operator, generating a Helm values file automatically, etc. We can start to get feedback on the formulas used and build on that knowledge later.

For now, something simple and straightforward like a new command in mimirtool with a simple text output seems like the best place to start to me.

osg-grafana · 2022-06-01T17:49:37Z

cc @osg-grafana

pstibrany · 2022-06-02T07:10:34Z

What would a CLI tool look like? I have a hard time imagining a command-line interface that would beat the spreadsheet (or a simple webpage with some JavaScript to do the calculation) in terms of ease of use.

pracucci · 2022-06-03T08:02:58Z

What would a CLI tool look like?

Something like:

mimirtool capacity-planning --active-series=100000 --samples-per-second=15000 --queries-per-second=100
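As a rough illustration of how such a command could read its inputs, here is a minimal sketch using Go's standard `flag` package. The flag names come from the example above; the `planInput` type, the `parsePlanInput` function, and the validation rule are hypothetical, and no such command exists in mimirtool today (mimirtool also uses a different CLI library).

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// planInput holds the estimated usage passed on the command line.
// The type and its fields are hypothetical.
type planInput struct {
	ActiveSeries     uint64
	SamplesPerSecond uint64
	QueriesPerSecond uint64
}

// parsePlanInput parses the flags shown in the example command.
func parsePlanInput(args []string) (planInput, error) {
	var in planInput
	fs := flag.NewFlagSet("capacity-planning", flag.ContinueOnError)
	fs.Uint64Var(&in.ActiveSeries, "active-series", 0, "estimated number of active series")
	fs.Uint64Var(&in.SamplesPerSecond, "samples-per-second", 0, "estimated ingested samples per second")
	fs.Uint64Var(&in.QueriesPerSecond, "queries-per-second", 0, "estimated queries per second")
	if err := fs.Parse(args); err != nil {
		return planInput{}, err
	}
	// Assumed rule: a plan without any active series is not meaningful.
	if in.ActiveSeries == 0 {
		return planInput{}, fmt.Errorf("--active-series must be greater than zero")
	}
	return in, nil
}

func main() {
	in, err := parsePlanInput(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", in)
}
```

Keeping the parsing in a function that takes `args` makes it straightforward to unit test without running the binary.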

I have a hard time imagining a command-line interface that would beat the spreadsheet (or a simple webpage with some JavaScript to do the calculation) in terms of ease of use.

From an ease-of-use perspective, I agree a web UI would be easier to use. On the other hand, collaborating on a web UI may be more complicated (e.g. no code reviews and no external contributors on a spreadsheet, not much JS experience, not even enough tooling like unit tests, ...).

Given we publish the mimirtool binary for multiple platforms, and assuming that you can run a CLI tool if you want to operate Mimir, I don't see mimirtool as a significant source of friction.

pstibrany · 2022-06-03T14:24:11Z

On the other hand, collaborating on a web UI may be more complicated (e.g. no code reviews and no external contributors on a spreadsheet, not much JS experience, not even enough tooling like unit tests, ...).

I don't think a single HTML page with some JavaScript would be too difficult to review and collaborate on, but you're right that we don't have tooling prepared for it. (Maybe writing it in Go and compiling it to WebAssembly would work just fine? 😄 I have 0 experience with that though.)

Your example isn't too bad just yet, but it gets more complex with more parameters very quickly.

replay · 2022-06-03T15:48:26Z

Your example isn't too bad just yet, but it gets more complex with more parameters very quickly.

If it gets too complex we could consider providing the tool with a configuration file that defines all the relevant parameters. Then we could deliver the tool together with an example configuration file, so a user could just copy the example configuration file and adjust all the defined parameters there. I think this will be easier to use than looking up lots of CLI args from --help and adding them to the CLI command.
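To make the configuration-file idea concrete, here is a minimal sketch that loads the parameters from an example file. It uses JSON only because the Go standard library supports it; a real tool might well prefer YAML, which needs a third-party package. The field names, the `planConfig` type, and the example values (including the retention) are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// planConfig mirrors the parameters a user would edit in the example file.
// All field names are hypothetical.
type planConfig struct {
	ActiveSeries     uint64 `json:"active_series"`
	SamplesPerSecond uint64 `json:"samples_per_second"`
	QueriesPerSecond uint64 `json:"queries_per_second"`
	RetentionDays    uint64 `json:"retention_days"`
}

// exampleConfig stands in for the example file shipped with the tool.
// The values are illustrative only.
const exampleConfig = `{
  "active_series": 100000,
  "samples_per_second": 15000,
  "queries_per_second": 100,
  "retention_days": 30
}`

// loadPlanConfig decodes the configuration and rejects unknown keys,
// so a typo in a parameter name fails loudly instead of being ignored.
func loadPlanConfig(data string) (planConfig, error) {
	var cfg planConfig
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return planConfig{}, fmt.Errorf("parsing capacity plan config: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := loadPlanConfig(exampleConfig)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```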

pstibrany · 2022-06-03T15:53:10Z

Then we could deliver the tool together with an example configuration file, so a user could just copy the example configuration file and adjust all the defined parameters there.

/half-joke: We can distribute a jsonnet file with example values and all the math, and let people edit and render that :)

replay · 2022-06-03T16:35:52Z

We can distribute a jsonnet file with example values and all the math, and let people edit and render that :)

Nice idea, but I kind of suspect that most users will stick to Helm and won't know how to use jsonnet.

pracucci · 2022-06-06T06:45:46Z

One of the requirements is that we need to use a language for which it's not complicated to write unit tests. I think jsonnet doesn't fit it.

pstibrany · 2022-06-07T07:16:14Z

One of the requirements is that we need to use a language for which it's not complicated to write unit tests.

I don't see a big benefit of unit-testability in this specific case, given that the feature is basically a set of formulas that show some numbers to the user.

As a user of this feature, I want to:

enter my input parameters
see how math works and understand why it's used
be able to adjust formulas to suit my needs

I see these needs covered better by tools like Google Sheets or Jsonnet than by a tool with hardcoded formulas in it.

If we wanted to go the jsonnet route, we could embed a jsonnet interpreter library into mimirtool capacity plan and not require it as a separate dependency. We could even parse the jsonnet output and pretty-print it nicely. I suggested jsonnet as a joke, but I don't think it's such a terrible idea.

And we have plenty of tests for our jsonnet config in the Mimir repo already.

pracucci · 2022-06-16T16:03:49Z

My idea is to build two tools:

Run a command which runs a bunch of queries against the metrics of running Mimir cluster(s) and generates a file containing "constants" (e.g. 1 core every 1M series per ingester, etc.). This tool could be run to extract intelligence from all Mimir clusters running at Grafana Cloud and share it with the rest of the world, committed to the Mimir repo.
Add a capacity planning command to mimirtool, taking as input your estimated usage (e.g. active series, samples per second, queries per second, retention, ...). It reads the constants file and computes the capacity plan based on that.
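A minimal sketch of the second step, assuming the constants have already been extracted, could look like the following. The `constants`, `usage`, and `plan` types and the `computePlan` function are hypothetical. The two figures used in `main` are only the examples quoted in this thread (1 core per 1M series per ingester, 1 core per 25,000 samples per second), not measured recommendations, and the sketch deliberately ignores replication, memory, disk, and the read path.

```go
package main

import (
	"fmt"
	"math"
)

// constants holds per-core throughput figures that, in the proposal,
// would be generated from the metrics of running clusters.
type constants struct {
	IngesterSeriesPerCore     float64
	DistributorSamplesPerCore float64
}

// usage is the estimated load supplied by the user.
type usage struct {
	ActiveSeries     float64
	SamplesPerSecond float64
}

// plan is the computed number of CPU cores per component.
type plan struct {
	IngesterCores    int
	DistributorCores int
}

// computePlan divides the estimated load by the per-core constants and
// rounds up, because a fraction of a core still requires a whole core.
func computePlan(c constants, u usage) plan {
	return plan{
		IngesterCores:    int(math.Ceil(u.ActiveSeries / c.IngesterSeriesPerCore)),
		DistributorCores: int(math.Ceil(u.SamplesPerSecond / c.DistributorSamplesPerCore)),
	}
}

func main() {
	// Example figures quoted in this thread, used here for illustration only.
	c := constants{IngesterSeriesPerCore: 1_000_000, DistributorSamplesPerCore: 25_000}
	u := usage{ActiveSeries: 2_500_000, SamplesPerSecond: 60_000}
	p := computePlan(c, u)
	fmt.Printf("ingester cores: %d, distributor cores: %d\n", p.IngesterCores, p.DistributorCores)
}
```

Because the formulas are pure functions of the constants and the usage, the constants file can be regenerated without touching the code, and each formula can be covered by a small table-driven test.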

wilfriedroset · 2022-06-16T19:57:26Z

What would be the output of mimirtool? I reckon that core/memory/disk per mimir module should be enough.
With that, users should have enough information to decide how many pods/instances to deploy per module. It also factors in the fact that Mimir can be deployed on bare metal as well.

I've been working on a similar tool which addresses this question the other way around.
The input is the flavor/count of instances per module with 3 additional factors:

replication factor
performance factor, not all CPUs are equal. When the documentation says "1 core every 25,000 samples per second", this depends on the underlying CPU
maximum capacity, the deployment of Mimir depends on how full you want your cluster to be. Should it be at 50% capacity? 60%?

Here is an example of the output

{
    "performance": {
        "write path": {
            "distributor samples/sec": 120000,
            "ingester active series": 1920000
        },
        "read path": {
            "query-frontend queries/sec": 1200,
            "query-scheduler queries/sec": 2400,
            "querier queries/sec": 48,
            "store-gateway queries/sec": 192,
            "active series": 36923077
        },
        "compaction": {
            "compactable active series": 60000000
        }
    },
    "specs": {
        "write path": {
            "distributor": {
                "count": 3,
                "flavor": "b2-15"
            },
            "ingester": {
                "count": 3,
                "flavor": "b2-60"
            },
            "compactor": {
                "count": 3,
                "flavor": "b2-60"
            }
        },
        "read path": {
            "query-frontend": {
                "count": 3,
                "flavor": "b2-15"
            },
            "query-scheduler": {
                "count": 3,
                "flavor": "b2-15"
            },
            "querier": {
                "count": 3,
                "flavor": "b2-15"
            },
            "store-gateway": {
                "count": 3,
                "flavor": "b2-60"
            }
        }
    }
}

(the flavors are based on OVHcloud Public Cloud)
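For readers who want to experiment with this inverse calculation, here is a minimal sketch of how the three factors could be combined. It is not the tool described above: the formulas, type names, and core counts are assumptions made for illustration, and it is not meant to reproduce the numbers in the example output. It assumes that capacity scales linearly with cores and that the replication factor only divides the ingester series capacity.

```go
package main

import "fmt"

// component describes a deployed module: how many instances and how many
// CPU cores each instance has. Core counts here are hypothetical.
type component struct {
	Count            int
	CoresPerInstance int
}

// factors are the three additional inputs listed above.
type factors struct {
	ReplicationFactor float64 // copies of each series kept by ingesters
	PerformanceFactor float64 // 1.0 means the CPU matches the documented baseline
	MaxCapacity       float64 // fraction of the cluster you are willing to fill, e.g. 0.5
}

// usableCores is the number of baseline-equivalent cores that may be used.
func usableCores(c component, f factors) float64 {
	return float64(c.Count*c.CoresPerInstance) * f.PerformanceFactor * f.MaxCapacity
}

// distributorSamplesPerSec estimates the sustainable ingestion rate.
func distributorSamplesPerSec(c component, f factors, samplesPerCore float64) float64 {
	return usableCores(c, f) * samplesPerCore
}

// ingesterActiveSeries estimates the number of unique active series,
// dividing by the replication factor (an assumption of this sketch).
func ingesterActiveSeries(c component, f factors, seriesPerCore float64) float64 {
	return usableCores(c, f) * seriesPerCore / f.ReplicationFactor
}

func main() {
	f := factors{ReplicationFactor: 3, PerformanceFactor: 1.0, MaxCapacity: 0.5}
	distributors := component{Count: 3, CoresPerInstance: 4}
	ingesters := component{Count: 3, CoresPerInstance: 4}
	fmt.Println("distributor samples/sec:", distributorSamplesPerSec(distributors, f, 25000))
	fmt.Println("ingester active series:", ingesterActiveSeries(ingesters, f, 1000000))
}
```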

pracucci · 2022-06-17T08:44:45Z

What would be the output of mimirtool? I reckon that core/memory/disk per mimir module should be enough.

I would also add number of replicas per Mimir component. Output format should be configurable, ideally supporting:

Plain English
Jsonnet config to copy-paste
Helm config to copy-paste

maximum capacity, the deployment of Mimir depends on how full you want your cluster to be. Should it be at 50% capacity? 60%?

Right. At Grafana Labs we call it "target capacity" and that should be another input factor too.

osg-grafana · 2022-06-29T13:06:23Z

Scoping estimation high because this doc ticket is large and unactionable at its current stage in development.

osg-grafana · 2023-10-18T12:54:58Z

Removing from Docs Squad backlog because @cristiangsp and @osg-grafana agree that it is in Engineering’s hands.

pracucci added the type/docs and ease-of-use labels Jun 1, 2022

09jvilla mentioned this issue Jun 10, 2022

Docs: improve Planning capacity page #1469

Open

mdisibio mentioned this issue Jul 7, 2022

Tempo cluster sizing / capacity planning grafana/tempo#1540

Open

osg-grafana removed the type/docs label Oct 18, 2023
