This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

update configs and docs #449

Merged
merged 10 commits into master from configs
Feb 7, 2017

Conversation

Dieterbe
Contributor

@Dieterbe Dieterbe commented Jan 5, 2017

No description provided.

@Dieterbe
Contributor Author

Dieterbe commented Jan 5, 2017

@woodsaj @replay please have a good look at the commit "add script to make maintaining configs easier". this script embodies the approach I've been using and how I believe we should do it.

Some things left to do:

  • numchunks has always been described as "number of raw chunks to keep in ring buffer. should be at least 1 more than what's needed to satisfy aggregation rules". I don't remember what the 2nd sentence was supposed to mean. Can we remove it? Can we just use numchunks 1 everywhere, even with long aggregations?

  • docs/data-knobs.md currently focuses on numchunks and chunkspan settings and their tradeoffs. I propose we rework this document into a "memory server" document that talks about the two main approaches used: the ringbuffer and the chunk-cache. We should briefly explain both and the pros and cons of each (which is probably all pros for the chunk-cache and mostly cons for the ringbuffer). We can however still talk about the tradeoffs in tuning numchunks and chunkspan in another section, contrasting it with the chunk-cache. @replay what do you think, can you give this a pass?

  • we have an inconsistent chunkspan setting: most configs use 10min, even though the default is 2h (and that's also what's in the sample config). I want to make it consistent. Short chunks (e.g. 10min) are good for experimenting and benchmarking: you'll know soon if cassandra is a problem, and if you have a single instance and restart it, data loss is limited. (I think it will be common enough for newcomers to run a single instance, restart it, and be surprised when they lose data; with 2h chunkspans the data loss would be too crazy.)
    OTOH, longer chunks are a best practice for better compression and resource utilisation, and legit deployments should typically use >10min.
    We can't optimize for both, I guess, but I'm currently leaning towards defaulting to 10min (sketched below). Your thoughts @woodsaj @replay ?
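
For illustration, the two candidate chunkspan defaults would look roughly like this in an ini config (a sketch using the setting name discussed in this thread; the comments are mine, not taken from the actual sample config):

# duration of raw chunks
# option A: short chunks, handy for experimenting and benchmarking, limited data loss when a single instance restarts
chunkspan = 10min
# option B: long chunks, better compression and resource utilisation for long-running production deployments
# chunkspan = 2h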

@Dieterbe Dieterbe added this to the hosted-metrics-alpha milestone Jan 5, 2017
@replay
Contributor

replay commented Jan 5, 2017

  • I could write a short text in data-knobs.md to mention that query patterns are very important in determining cache efficiency; what do you think?
  • In the Basic Guideline, the recommendation of chunkspan = 20min & numchunks = 1 no longer matches what the text says (worked through after this list):

The standard recommendation is 120 points per chunk and keep at least as much in RAM as what your commonly query for (+1 extra chunk, see below)
E.g. if your most common interval is 10s and most of your dashboards query for 2h worth of data, then the recommendation is:

  • Looks good to me. The scripts/sync-configs.sh script looks useful; it's probably only going to be used by the 3 of us, but it's good to have.
  • I agree that many new users might be annoyed if they lose 2h of data when they restart metrictank, especially because a large share of new users would probably use it as a drop-in replacement for graphite/whisper: they would send data via the carbon input, and then there's no log to replay.
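
To spell out why the quoted recommendation (120 points per chunk, a 10s interval, 2h of commonly-queried data in RAM) conflicts with numchunks = 1, here is a rough worked example using only the numbers from the quote:

# 120 points per chunk at a 10s interval:
#   120 * 10s = 1200s = 20min          => chunkspan = 20min
# keep 2h of commonly-queried data in RAM, plus 1 extra chunk:
#   2h / 20min = 6 chunks, + 1 extra   => numchunks = 7, not 1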

@Dieterbe
Contributor Author

Dieterbe commented Jan 6, 2017

In the Basic Guideline, the recommendation of chunkspan = 20min & numchunks = 1 no longer matches what the text says

because as of now we should be recommending to use the chunk-cache instead of large ringbuffers. a lot of that page needs to be reworked to take into account the chunk-cache.

@Dieterbe
Contributor Author

Dieterbe commented Jan 6, 2017

I just realized we can't just use numchunks 1 everywhere, because that leaves no margin to save a chunk: on a boundary and shortly after, nodes will keep hitting cassandra looking for chunks that may not be there yet. This reminds me of another reason why we had numchunks 5 in the past: should a primary crash or be temporarily unable to do its job, then secondaries can keep serving data for up to 5*chunkspan until they start hitting cassandra repeatedly.
So while the ringbuffers are now less effective as a general-purpose in-memory cache (the chunk-cache should be better at that), they still serve a purpose in remaining HA when primaries have issues. For this reason I'm going to set it to 5 again everywhere.

@Dieterbe
Contributor Author

Dieterbe commented Jan 6, 2017

@woodsaj @replay I just pushed the changes which correspond to the above reasoning.
please review and let me know if any objections or comments. thanks.

@replay
Contributor

replay commented Jan 6, 2017

Those are two interesting reasons. I'm not sure I agree that more numchunks is the best solution for them, but I guess for now increasing numchunks takes some pressure off the two problems you described, assuming that most queries request a range whose oldest ts is not older than the oldest chunk in the ring buffer.

As a better solution I'd suggest this:

  1. If saving a chunk takes more than one second (the smallest granularity of timestamps) then, as you described, metrictanks that get queried for a range which includes the not-yet-saved chunk would keep hitting cassandra.
    I can see no reason why we couldn't set numchunks to 2, but persist each chunk at the time it's complete instead of when it gets evicted from the ring buffer. That way we would always have one chunk in the ring buffer -and- in cassandra, and at the time it gets evicted from the ring buffer it has already been in cassandra for (chunkspan - time it takes to save) seconds (see the rough timeline after this list).
  2. I don't think the purpose of the ring buffer should be to "cover up" HA problems; by doing that we reduce our own flexibility while removing the pressure to solve a real problem. The solution described at point #1 would help with this too.
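
A rough timeline of what point 1 proposes, with illustrative numbers (assuming chunkspan = 10min and a persist that finishes well within one chunkspan):

# t = 0:      chunk N completes, chunk N+1 starts filling, persist of chunk N begins
# t = save:   chunk N is in cassandra and still in the ring buffer (numchunks = 2)
# t = 10min:  chunk N+1 completes and chunk N is evicted, having already been in
#             cassandra for roughly (chunkspan - save) time
chunkspan = 10min
numchunks = 2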

@Dieterbe
Contributor Author

Dieterbe commented Jan 9, 2017

If saving a chunk takes more than one second

We've seen environments where it takes >=20minutes. Hence I added to the docs ".. Based on your deployment this could take anywhere between milliseconds or many minutes..."

persist each chunk at the time it's complete instead of when it gets evicted from the ring buffer. That way we would always have one chunk in the ring buffer -and- in cassandra, and at the time it gets evicted from the ring buffer it has already been in cassandra for (chunkspan - time it takes to save) seconds

this describes how it is now. This is why we need >1 numchunks to combat the first problem.

I don't think the purpose of the ring buffer should be to "cover up" HA problems,

I don't think we're covering anything up. In my view the ringbuffer is simply the mechanism by which we implement (this particular aspect of) HA. It's tunable through numchunks so that people can make a tradeoff that makes sense for them.

@woodsaj
Member

woodsaj commented Jan 10, 2017

As soon as chunks are complete we add them to the write queue. However, as all chunks complete at around the same time, the write queue can take a while to be processed. This is by design, so that we don't overwhelm cassandra.
E.g. with 1 million series and agg-settings=10min:6h:2:3mon,1h:6h:2:1y, every 6 hours you will have a burst of 9 million writes. At a write throughput of 10k/s it would take 15 minutes to write the chunks to cassandra.
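
Spelling that estimate out (the 9 million writes figure is taken from the example above):

# 9,000,000 chunk writes queued every 6 hours, drained at 10,000 writes/s:
#   9,000,000 / 10,000 = 900s = 15 minutes to work through the write queue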

If you need to manually fail over a primary, then numchunks should cover more than the amount of time it takes you to respond to the failure of the primary. That could be anywhere from 5 minutes to 8 hours, depending on the user's own response SLA for faults.

In our k8s deployments, where we have dedicated read/write nodes, we just use numchunks=2.

perhaps we should just recommend a numchunks >= 2?

done

echo "updating docs/config.md"
./scripts/config-to-doc.sh > docs/config.md
Member


This assumes that you are running the script from $GOPATH/src/github.com/raintank/metrictank, which won't always be true. What if a user is in scripts/ and runs ./sync-configs.sh?

We handle this in all other scripts with

# Find the directory we exist within
DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
cd ${DIR}

@Dieterbe
Contributor Author

Dieterbe commented Jan 10, 2017

perhaps we should just recommend a numchunks >= 2?

I think we should recommend something that will give people some time to respond to incidents.
currently this PR introduces a default/recommendation of numchunks 5, which for 10min chunks gives you a time window of 40 to 50 minutes. I think this is more reasonable than 10 to 20 minutes.
perhaps we should even pick 7 instead; then we can say they have an hour (worked out below).
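
Working that out (a sketch; the newest chunk in the ring buffer is the in-progress one, so only numchunks - 1 chunks are guaranteed to be full):

# with chunkspan = 10min:
#   numchunks = 5  =>  (5 - 1) * 10min = 40min guaranteed, up to 50min
#   numchunks = 7  =>  (7 - 1) * 10min = 60min guaranteed, up to 70min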

@woodsaj
Member

woodsaj commented Jan 10, 2017

I don't think we need to get too hung up on the carbon use case, as having numchunks > 2 is only important if users are replicating metrics to 2 or more MT instances, with 1 marked as primary and the others not. I doubt this will be a common deployment model and we should not be encouraging it. If HA is important, users should use kafka.

@Dieterbe
Contributor Author

It's not just carbon though? We've been in this situation ourselves a couple of times with our worldping infra (which uses kafka): we run a cluster, the primary dies, so chunks are not going to cassandra, and the time you have to manually promote a new primary is determined by your numchunks, because your nodes can provide gapless responses to render requests as long as they have enough data in the ringbuffer to merge with what's in cassandra.

Contributor

@replay replay left a comment


Sorry for all the spelling/English pickiness... I figured if I already read through it then I might as well point those out.

# 5 min of data, store in a chunk that lasts 1hour, keep 2 chunks in in-memory ring buffer, keep for 3months in cassandra
# 1hr worth of data, in chunks of 6 hours, 2 chunks in in-memory ring buffer, keep for 1 year, but this series is not ready yet for querying.
# When running a cluster of metrictank instances, all instances should have the same agg-settings.
# chunk spans must be valid values as described here https://github.com/raintank/metrictank/blob/master/docs/memory-server.md
Contributor


The bookmark #valid-chunk-spans could be appended to the link; then we just need to remember to update all the links to it if we ever rename it. On the other hand, we'll have to do that anyway because there are already references to it.

retry-interval = 10m
# max number of concurrent connections to ES
max-conns = 20
# max numver of docs to keep in the BulkIndexer buffer
Contributor


v in number

max-conns = 20
# max numver of docs to keep in the BulkIndexer buffer
max-buffer-docs = 1000
# max delay befoer the BulkIndexer flushes its buffer
Contributor


befoer

## clustering transports ##
## basic clustering settings ##
[cluster]
# The primary node writes data to cassandra. There should only be 1 primary node per shardGroup.
Contributor


Some comments end with a . and some don't. I'm fine either way, but maybe consistency would make a better impression.


Note:
* the last (current) chunk is always a "work in progress", so depending on what time it is, it may be anywhere between empty and full.
* when metrictank starts up, it will not refill the ring buffer with data from Cassandra. They only fill based on data that comes in. But once data has been seen, the buffer
Contributor


metrictank is a name so I think it should be upper case.
there are two spaces before the But.

Contributor Author


we use metrictank uncapitalized in a bunch of places. but we also use Metrictank in a bunch of places. company-wise we used to treat no-caps as part of our branding (see raintank logo). we haven't really discussed this for metrictank yet.
Now that we're "GrafanaLabs" maybe we should start capitalizing everything ... ?
thoughts @bulletfactory ?


#### Warmup and becoming ready for promotion to primary

longer chunk sizes means a longer backfill of more older data (e.g. with kafka oldest offset),
Contributor


the l in longer should be uppercase because it's the beginning of a sentence

In principle, you need just 1 chunk for each series.
However:
* when the data stream moves into a new chunk, secondary nodes would drop the previous chunk and query Cassandra. But the primary needs some time to save the chunk to Cassandra. Based on your deployment this could take anywhere between milliseconds or many minutes. As you don't want to slam Cassandra with requests at each chunk clear, you should probably use a numchunks of 2, or a numchunks that lets you retain data in memory for however long it takes to flush data to cassandra.
* The ringbuffers are a great tool to let you deal with crashes or outages of your primary node. If your primary went down, or for whatever reason cannot save data to Cassandra, then you won't even feel it if the ringbuffers can "clear the gap" between in memory data and older data in cassandra. So we advise to think about how fast your organisation could resolve a potential primary outage, and then set your parameters such that `(numchunks-1) * chunkspan` is more then that.
Contributor


should be more than instead of more then.

### Configuration examples

E.g. if your most common data interval is 10s, then your chunks should be at least `120*10s=20min` long.
If you think your organisation will need up to 2 hours to resolve a primary failure, then you need at always at least 6 such chunks in memory,
Contributor


There's an "at" too many in "need at always at".


echo "first make sure metrictank-sample.ini is up to date. its values should match the defaults used by metrictank. and comments should match the descriptions provided by metrictank help menus"
echo "now we will run vimdiff to manually synchronize updates from sample config to other configs:"
echo "try to make every config as closely resembling the sample config as possible, while retaining the customisations that makes each config unique"
Contributor


I'm not a native English speaker, but wouldn't this feel a little more natural:

try to make every config resemble the sample config as closely as possible

and is customisations British spelling? my spell check says it should be customizations

@Dieterbe
Contributor Author

@woodsaj any thoughts re #449 (comment)? I want to make sure we're on the same page re numchunks (in particular, recommending numchunks of 7).

@woodsaj
Member

woodsaj commented Jan 23, 2017

It's not just carbon though?

Once PR #485 is merged it will just be carbon that is affected by numchunks, as the recommended topology when using Kafka will be to use dedicated write nodes. With this topology the cluster will self-heal after a failure without the operator needing to do anything. So numchunks only needs to give enough time for the write node to replay the kafka log. On modest hardware MT can do a few hundred thousand metrics/s, so replaying the backlog won't take long.

@Dieterbe
Contributor Author

Dieterbe commented Jan 30, 2017

But you may have a cassandra outage, or a networking problem between MT and cassandra. There's a wide variety of issues that can happen (not just MT itself failing), and that's where numchunks comes in: irrespective of which input plugin you use, you need a timeframe to address these sorts of incidents, and it's nice that you can put a number on how long you have (and make it configurable).

Can we agree that there's a valid use case here, and that it makes sense to recommend a sensible numchunks that lets you cover at least an hour's worth of whatever issue may appear (e.g. numchunks 7 for chunkspan 10min)? I hope we can agree, so that this PR can be merged (I will address the minor points you guys brought up, but first want us to agree on the larger picture described in the doc changes).

@woodsaj
Member

woodsaj commented Jan 30, 2017

Just set numchunks to 7.

But for the record: numchunks has no bearing on fault tolerance when you are using kafka.
  • if MT crashes, you just replay from kafka
  • if cassandra dies, the chunks will sit in the write queue until it comes back; chunks can be bumped out of the ring buffer and still remain in the write queue. If MT dies before the write queue is flushed to Cassandra, then the data will be replayed from kafka.

Dieterbe and others added 8 commits February 1, 2017 14:46
* numchunks = 1 everywhere, refer to chunk-cache as better method
* make sure all configs have the correct chunk-cache, stats and other recent updates.
* standardize on default raw chunkspan 10min and numchunks 5
* improve descriptions
reorganize things better:
* a memory-server doc that describes ringbuffer and chunk cache, and
then goes into specifics of configuring chunkspan and numchunks.
Move the huge list of considerations closer to the setting they apply
to.
* move compression tips elsewhere
this leaves 60min of data for all series.
+ make the description of the ringbuffer and chunk cache more nuanced.
@Dieterbe
Contributor Author

Dieterbe commented Feb 1, 2017

I think it's very important that we agree on what the docs say; we should all stand behind the recommendations that we make. I think the misunderstanding between me and aj is sufficiently cleared up, and I gave the docs another pass (see f0794cf). I think this represents the tradeoff around numchunks and how it complements the chunk-cache much better. I also changed the default to numchunks 7; if it turns out to be too wasteful for people, they can lower it.
so @replay and @woodsaj, if you guys don't mind, could you check out that commit or just give https://github.com/raintank/metrictank/blob/configs/docs/memory-server.md a read-through? thanks :)

@Dieterbe Dieterbe changed the title WIP: update configs and docs update configs and docs Feb 7, 2017
@Dieterbe
Contributor Author

Dieterbe commented Feb 7, 2017

@woodsaj per the above comment, can I get a signoff please? Thanks :)

@Dieterbe Dieterbe merged commit e726ea5 into master Feb 7, 2017
@Dieterbe Dieterbe deleted the configs branch September 18, 2018 09:00