Docs: interactive capacity planning tool #1988

Open
pracucci opened this issue Jun 1, 2022 · 18 comments
Comments

@pracucci
Collaborator

pracucci commented Jun 1, 2022

We're hearing feedback from the OSS community (e.g. this Slack thread) that the capacity planning doc apparently shows more resources than are probably required. I think one reason is that proper capacity planning depends on multiple factors, while the doc is an oversimplification.

We could provide an interactive capacity planning tool that, given some inputs (e.g. active series, samples/sec, queries/sec, retention, ...), computes a more accurate capacity plan.

One option would be to build a Google Spreadsheet and embed it in the doc.

@pracucci pracucci added type/docs Improvements or additions to documentation ease-of-use labels Jun 1, 2022
@replay
Contributor

replay commented Jun 1, 2022

I think that a spreadsheet seems like the easiest solution. OTOH, if we'd create a CLI tool to do the capacity planning then all Mimir contributors could contribute to it and possibly improve it based on their own experience. A spreadsheet would likely have to be restricted in some way because otherwise there is no review process for changes to the spreadsheet.

@rojas-diego
Contributor

Let me know if you're looking for contributions on this!

@pracucci
Collaborator Author

pracucci commented Jun 1, 2022

a spreadsheet would likely have to be restricted in some way because otherwise there is no review process for changes to the spreadsheet

That's right. And it's also more difficult to test. Writing unit tests in golang is way easier.

if we'd create a CLI tool

In this case we wouldn't have to create a new tool. We already have mimirtool: we could just add a command there.

Let me know if you're looking for contributions on this!

We do! Let's just reach a consensus on how it should work (e.g. spreadsheet vs CLI tool). Let me ping the rest of the Mimir maintainers / squad to get a quick feedback loop.

@Logiraptor
Contributor

I would vote for a CLI tool because it would allow reviews, change history, etc. in a more familiar format for Mimir contributors. Having the logic implemented as Go code also makes it easier to eventually extend into more sophisticated use cases (far in the future) like an auto-scaling operator, generating a Helm values file automatically, etc. We can start to get feedback on the formulas used and build on that knowledge later.

For now, something simple and straightforward, like a new command in mimirtool with a simple text output, seems like the best place to start to me.

@osg-grafana
Contributor

cc @osg-grafana

@pstibrany
Member

pstibrany commented Jun 2, 2022

What would the CLI tool look like? I have a hard time imagining a command-line interface that would beat a spreadsheet (or a simple webpage with some JavaScript to do the calculation) in terms of ease of use.

@pracucci
Collaborator Author

pracucci commented Jun 3, 2022

What would the CLI tool look like?

Something like:

mimirtool capacity-planning --active-series=100000 --samples-per-second=15000 --queries-per-second=100
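As a rough sketch of how the flags in the command above could be parsed and validated in Go: the flag names are taken from the proposal, while the `Inputs` type and `parseInputs` function are hypothetical names for illustration. The standard library `flag` package is used here only to keep the example self-contained; it is not a claim about how mimirtool parses its arguments.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Inputs holds the estimated usage given on the command line.
// The flag names mirror the proposal above; this is not an existing mimirtool API.
type Inputs struct {
	ActiveSeries     int
	SamplesPerSecond int
	QueriesPerSecond int
}

// parseInputs parses and validates the proposed capacity-planning flags.
func parseInputs(args []string) (Inputs, error) {
	var in Inputs
	fs := flag.NewFlagSet("capacity-planning", flag.ContinueOnError)
	fs.IntVar(&in.ActiveSeries, "active-series", 0, "estimated number of active series")
	fs.IntVar(&in.SamplesPerSecond, "samples-per-second", 0, "estimated ingested samples per second")
	fs.IntVar(&in.QueriesPerSecond, "queries-per-second", 0, "estimated queries per second")
	if err := fs.Parse(args); err != nil {
		return Inputs{}, err
	}
	if in.ActiveSeries <= 0 || in.SamplesPerSecond <= 0 {
		return Inputs{}, fmt.Errorf("--active-series and --samples-per-second must be positive")
	}
	return in, nil
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Fall back to the values from the example command.
		args = []string{"--active-series=100000", "--samples-per-second=15000", "--queries-per-second=100"}
	}
	in, err := parseInputs(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", in)
}
```

Keeping parsing separate from the calculation means the formulas can be unit tested without going through the command line.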

I have a hard time imagining a command-line interface that would beat a spreadsheet (or a simple webpage with some JavaScript to do the calculation) in terms of ease of use.

From an ease of use perspective, I agree a web UI would be easier to use. On the other hand, collaborating on a web UI may be more complicated (e.g. no code reviews and no external contributors on a spreadsheet, not much JS experience, not even enough tooling like unit tests, ...).

Given that we publish the mimirtool binary for multiple platforms, and assuming that you can run a CLI tool if you want to operate Mimir, I don't see mimirtool as a significant source of friction.

@pstibrany
Member

On the other hand, collaborating on a web UI may be more complicated (e.g. no code reviews and no external contributors on a spreadsheet, not much JS experience, not even enough tooling like unit tests, ...).

I don't think a single HTML page with some JavaScript would be too difficult to review and collaborate on, but you're right that we don't have tooling prepared for it. (Maybe writing it in Go and compiling to WebAssembly would work just fine? 😄 I have 0 experience with that though.)

Your example isn't too bad just yet, but it gets more complex with more parameters very quickly.

@replay
Contributor

replay commented Jun 3, 2022

Your example isn't too bad just yet, but it gets more complex with more parameters very quickly.

If it gets too complex we could consider providing the tool with a configuration file that defines all the relevant parameters. Then we could deliver the tool together with an example configuration file, so a user could just copy the example configuration file and adjust all the defined parameters there. I think this would be easier to use than looking up lots of CLI args from --help and adding them to the command.
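To illustrate the configuration file idea, here is a minimal sketch that loads planner inputs from a JSON document. The file format, the `Config` type, and every field name (including `retention_days`) are assumptions made up for this example; nothing here has been agreed on for mimirtool.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config lists the planner inputs. All field names are hypothetical.
type Config struct {
	ActiveSeries     int `json:"active_series"`
	SamplesPerSecond int `json:"samples_per_second"`
	QueriesPerSecond int `json:"queries_per_second"`
	RetentionDays    int `json:"retention_days"`
}

// exampleConfig plays the role of the example file shipped with the tool.
const exampleConfig = `{
  "active_series": 100000,
  "samples_per_second": 15000,
  "queries_per_second": 100,
  "retention_days": 30
}`

// loadConfig decodes a configuration document and rejects unknown
// parameters, so that a typo in a field name is reported instead of ignored.
func loadConfig(data string) (Config, error) {
	var cfg Config
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return Config{}, fmt.Errorf("parsing config: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(exampleConfig)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

A user would copy the example document, change the numbers, and pass the file to the tool instead of a long list of flags.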

@pstibrany
Member

Then we could deliver the tool together with an example configuration file, so a user could just copy the example configuration file and adjust all the defined parameters there.

/half-joke: We can distribute a jsonnet file with example values and all the math, and let people edit and render that :)

@replay
Contributor

replay commented Jun 3, 2022

We can distribute a jsonnet file with example values and all the math, and let people edit and render that :)

Nice idea, but I kind of suspect that most users will stick to Helm and don't know how to use jsonnet.

@pracucci
Collaborator Author

pracucci commented Jun 6, 2022

One of the requirements is that we need to use a language for which it's not complicated to write unit tests. I think jsonnet doesn't fit it.

@pstibrany
Member

pstibrany commented Jun 7, 2022

One of the requirements is that we need to use a language for which it's not complicated to write unit tests.

I don't see a big benefit of unit-testability in this specific case, given that the feature is basically a set of formulas that show some numbers to the user.

As a user of this feature, I want to:

  • enter my input parameters
  • see how math works and understand why it's used
  • be able to adjust formulas to suit my needs

I see these needs covered better by tools like Google Sheets or Jsonnet than by a tool with hardcoded formulas in it.

If we wanted to go the jsonnet route, we could embed a jsonnet interpreter library into mimirtool capacity plan and not require it as a separate dependency. We could even parse the jsonnet output and pretty-print it nicely. I suggested jsonnet as a joke, but I don't think it's such a terrible idea.

And we have plenty of tests for our jsonnet config in the Mimir repo already.

@pracucci
Collaborator Author

pracucci commented Jun 16, 2022

My idea is to build two tools:

  1. A command that runs a bunch of queries against the metrics of running Mimir cluster(s) and generates a file containing "constants" (e.g. 1 core every 1M series per ingester, etc.). This tool could be run to extract intelligence from all Mimir clusters running at Grafana Cloud and share it with the rest of the world, committed to the Mimir repo.
  2. A capacity planning command added to mimirtool, taking as input your estimated usage (e.g. active series, samples per second, queries per second, retention, ...). It reads the constants file and computes the capacity plan based on that.
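The split between a constants file and a planning command could look roughly like the sketch below. The `Constants`, `Usage`, and `Plan` types and the `computePlan` function are hypothetical. The per-core figures in `main` are only illustrative: the 1M series per ingester core comes from the example above, and the 25,000 samples per second per core is the documentation figure quoted elsewhere in this thread, which this sketch assumes applies to the distributor. Replication and target capacity are deliberately ignored.

```go
package main

import (
	"fmt"
	"math"
)

// Constants is what the first tool would extract from running clusters
// and commit to the repo. Field names are hypothetical.
type Constants struct {
	SeriesPerIngesterCore     float64 // active series one ingester core can handle
	SamplesPerDistributorCore float64 // samples/sec one distributor core can handle
}

// Usage is the estimated load supplied by the user.
type Usage struct {
	ActiveSeries     float64
	SamplesPerSecond float64
}

// Plan is the computed result, in whole CPU cores per component.
type Plan struct {
	IngesterCores    int
	DistributorCores int
}

// computePlan divides the estimated usage by the per-core constants and
// rounds up, because a fraction of a core still needs to be provisioned.
func computePlan(c Constants, u Usage) Plan {
	return Plan{
		IngesterCores:    int(math.Ceil(u.ActiveSeries / c.SeriesPerIngesterCore)),
		DistributorCores: int(math.Ceil(u.SamplesPerSecond / c.SamplesPerDistributorCore)),
	}
}

func main() {
	// Illustrative constants only; real values would come from the constants file.
	c := Constants{SeriesPerIngesterCore: 1_000_000, SamplesPerDistributorCore: 25_000}
	p := computePlan(c, Usage{ActiveSeries: 10_000_000, SamplesPerSecond: 200_000})
	fmt.Printf("ingester cores: %d, distributor cores: %d\n", p.IngesterCores, p.DistributorCores)
}
```

Because the formulas only depend on the `Constants` value, updating the shared constants file changes the plan without touching the code.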

@wilfriedroset
Collaborator

What would be the output of mimirtool? I reckon that core/memory/disk per mimir module should be enough.
With that, users should have enough information to decide how many pods/instances to deploy per module. It also factors in the fact that Mimir can be deployed on bare metal as well.

I've been working on a similar tool which addresses this question the other way around.
The input is the flavor/count of instances per module with 3 additional factors:

  • replication factor
  • performance factor: not all CPUs are equal. When the documentation says "1 core every 25,000 samples per second", this kind of depends on the underlying CPU
  • maximum capacity: the deployment of Mimir depends on how full you want your cluster to be. Should it be at 50% capacity? 60%?

Here is an example of the output

{
    "performance": {
        "write path": {
            "distributor samples/sec": 120000,
            "ingester active series": 1920000
        },
        "read path": {
            "query-frontend queries/sec": 1200,
            "query-scheduler queries/sec": 2400,
            "querier queries/sec": 48,
            "store-gateway queries/sec": 192,
            "active series": 36923077
        },
        "compaction": {
            "compactable active series": 60000000
        }
    },
    "specs": {
        "write path": {
            "distributor": {
                "count": 3,
                "flavor": "b2-15"
            },
            "ingester": {
                "count": 3,
                "flavor": "b2-60"
            },
            "compactor": {
                "count": 3,
                "flavor": "b2-60"
            }
        },
        "read path": {
            "query-frontend": {
                "count": 3,
                "flavor": "b2-15"
            },
            "query-scheduler": {
                "count": 3,
                "flavor": "b2-15"
            },
            "querier": {
                "count": 3,
                "flavor": "b2-15"
            },
            "store-gateway": {
                "count": 3,
                "flavor": "b2-60"
            }
        }
    }
}

(The flavors are based on OVHcloud Public Cloud.)
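The inverse calculation described above (deployment specs in, sustainable load out) might be sketched as follows. The comment does not give the formula, so the multiplicative way the three factors are combined here is purely an assumption, as are the `Spec` and `Factors` types, the `capacity` function, and all numbers in `main`.

```go
package main

import "fmt"

// Spec describes how one component is deployed.
type Spec struct {
	Replicas        int
	CoresPerReplica int
}

// Factors are the three additional inputs listed above.
type Factors struct {
	ReplicationFactor float64 // use 1 for components that do not replicate data
	PerformanceFactor float64 // 1.0 = the CPU the per-core figure was measured on
	MaxCapacity       float64 // how full the cluster may get, e.g. 0.5 for 50%
}

// capacity estimates the load a component can sustain.
// ASSUMPTION: the factors combine multiplicatively and replicated data
// divides the usable capacity; the thread does not state the formula.
func capacity(s Spec, f Factors, unitsPerCore float64) float64 {
	cores := float64(s.Replicas * s.CoresPerReplica)
	return cores * unitsPerCore * f.PerformanceFactor * f.MaxCapacity / f.ReplicationFactor
}

func main() {
	// Hypothetical deployment: 3 distributors with 4 cores each, no replication.
	distributor := capacity(Spec{Replicas: 3, CoresPerReplica: 4},
		Factors{ReplicationFactor: 1, PerformanceFactor: 0.75, MaxCapacity: 0.5}, 25_000)
	// Hypothetical deployment: 3 ingesters with 16 cores each, replication factor 3.
	ingester := capacity(Spec{Replicas: 3, CoresPerReplica: 16},
		Factors{ReplicationFactor: 3, PerformanceFactor: 0.75, MaxCapacity: 0.5}, 1_000_000)
	fmt.Printf("distributor samples/sec: %.0f\n", distributor)
	fmt.Printf("ingester active series: %.0f\n", ingester)
}
```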

@pracucci
Collaborator Author

What would be the output of mimirtool? I reckon that core/memory/disk per mimir module should be enough.

I would also add number of replicas per Mimir component. Output format should be configurable, ideally supporting:

  • Plain English
  • Jsonnet config to copy-paste
  • Helm config to copy-paste
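A configurable output format could be as simple as one render function that switches on the requested format. The sketch below covers only two of the formats listed above. The `ComponentPlan` type and `render` function are hypothetical, and the key layout of the `helm` output is an illustrative assumption that would have to be checked against the actual chart values before anyone copy-pastes it.

```go
package main

import (
	"fmt"
	"strings"
)

// ComponentPlan is the planner result for one Mimir component.
type ComponentPlan struct {
	Name      string
	Replicas  int
	CPUCores  int
	MemoryGiB int
}

// render formats the plan as plain English ("text") or as a YAML fragment ("helm").
// ASSUMPTION: the YAML key layout is illustrative, not the verified chart schema.
func render(format string, plan []ComponentPlan) (string, error) {
	var b strings.Builder
	switch format {
	case "text":
		for _, c := range plan {
			fmt.Fprintf(&b, "%s: %d replicas, %d cores and %d GiB of memory each\n",
				c.Name, c.Replicas, c.CPUCores, c.MemoryGiB)
		}
	case "helm":
		for _, c := range plan {
			fmt.Fprintf(&b, "%s:\n  replicas: %d\n  resources:\n    requests:\n      cpu: %d\n      memory: %dGi\n",
				c.Name, c.Replicas, c.CPUCores, c.MemoryGiB)
		}
	default:
		return "", fmt.Errorf("unsupported output format %q", format)
	}
	return b.String(), nil
}

func main() {
	plan := []ComponentPlan{{Name: "ingester", Replicas: 3, CPUCores: 4, MemoryGiB: 16}}
	for _, format := range []string{"text", "helm"} {
		out, err := render(format, plan)
		if err != nil {
			panic(err)
		}
		fmt.Print(out)
	}
}
```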

maximum capacity: the deployment of Mimir depends on how full you want your cluster to be. Should it be at 50% capacity? 60%?

Right. At Grafana Labs we call it "target capacity" and that should be another input factor too.

@osg-grafana
Contributor

osg-grafana commented Jun 29, 2022

Scoping estimation high because this doc ticket is largely unactionable at its current stage in development.

@osg-grafana
Contributor

Removing from Docs Squad backlog because @cristiangsp and @osg-grafana agree that it is in Engineering’s hands.

@osg-grafana osg-grafana removed the type/docs Improvements or additions to documentation label Oct 18, 2023