Metrictank crashes fetching base index #668
Experiencing the same issue while trying to test the limits of metrictank. Not sure if it's due to corruption; in my case each tank instance (6 total in our cluster) has ~3.5 million metrics in the index, and only the master dies.
Ended up adding a nil check right before the offending line and rolling it out to my cluster. That catches the symptom, but not the problem.
In my case, the problem came from a metric with a leading dot. Issue: https://github.com/raintank/metrictank/blob/master/idx/memory/memory.go#L401 I don't know if there are other scenarios where this can happen, but changing that block to do the lookup and continue if it doesn't find the node (maybe with a warning) seems like a reasonable workaround.
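For illustration, a minimal sketch of the lookup-and-continue shape both workarounds describe. The types and names here are stand-ins, not the actual code from idx/memory/memory.go:

```go
package main

import "log"

// Node is a stand-in for the tree node type in idx/memory/memory.go.
type Node struct {
	Path     string
	Children []*Node
}

// walkBranches mirrors the shape of the offending block: for each branch of
// a metric path, look the node up in the tree and follow it. Instead of
// dereferencing a missing (nil) node and panicking, skip it with a warning.
func walkBranches(items map[string]*Node, branches []string) []*Node {
	var found []*Node
	for _, branch := range branches {
		node, ok := items[branch]
		if !ok || node == nil {
			log.Printf("memory-idx: no node found for branch %q, skipping", branch)
			continue
		}
		found = append(found, node)
	}
	return found
}

func main() {
	items := map[string]*Node{"foo": {Path: "foo"}}
	// "bad" simulates the entry created by a metric with a leading dot:
	// it never got a tree node, so the lookup fails instead of panicking.
	log.Println(walkBranches(items, []string{"foo", "bad"}))
}
```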
@shanson7 and @tehlers320 can you guys try out #694?
It seems like #694 won't prevent crashing from already-indexed data. Let me spin it up and test locally.
that's correct. is wiping the index and starting over an option for you? otherwise we need to think of ways to upgrade/clean up the live index.
@Dieterbe This is how i resolved this issue for myself. I just wiped the index in cassandra.
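For anyone else landing here, a hedged sketch of what such a wipe can look like; the host, keyspace and table names below assume the cassandra-idx defaults and may differ in your config:

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Assumed defaults for the cassandra-idx plugin; adjust host,
	// keyspace and table to match your metrictank config.
	cluster := gocql.NewCluster("localhost")
	cluster.Keyspace = "metrictank"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Drop every index row; metrictank rebuilds the index from incoming
	// metric data after a restart (minus the invalid entries).
	if err := session.Query("TRUNCATE metric_idx").Exec(); err != nil {
		log.Fatal(err)
	}
}
```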
I can wipe the index. It should all just come back anyhow (minus the invalid metrics).
I tried to reproduce this with the docker-cluster and a simple bash script but I'm not seeing the issue re-appear.
With 0.7.3-63-g159320c
With the proposed patch on my own docker build: 0.7.3-65-g73cd8c5
I'm sorry, I must be missing something.
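For reference, a sketch of a comparable reproduction attempt against the docker-cluster. The host/ports are assumptions (carbon plaintext on :2003, metrictank API on :6060); the metric names are illustrative:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	// Assumes the docker-cluster setup with a carbon listener on :2003.
	conn, err := net.Dial("tcp", "localhost:2003")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	now := time.Now().Unix()
	// A normal metric plus one with a leading dot, the shape that
	// produced the bad index entry in this issue.
	fmt.Fprintf(conn, "some.test.metric 1 %d\n", now)
	fmt.Fprintf(conn, ".bad.test.metric.whatevs 4 %d\n", now)

	// After the index persists, query the base tree that crashed MT:
	//   curl 'http://localhost:6060/metrics/find?query=*'
}
```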
I figured this out while fiddling with the whisper-writer. I was checking to see if it was writing to the index in a unique... way... I changed the partitions to get a new table entry, thinking that I was clever. By having 2 entries in the index with different partitions but on the same key, MT crashes as mentioned in this ticket. Here is what my table looks like after adding everything to partition 0 on an import:
However, a crash did not occur; here is my error output. Did you implement something to recover from a crash, or is this a separate bug?
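As a toy illustration of the invariant being violated here (one key, one partition), a hedged sketch of detecting such duplicates when loading index entries; the type and field names are stand-ins, not metrictank's actual load path:

```go
package main

import "log"

// IdxEntry is a stand-in for a persisted index row: the same metric key
// stored under two partitions is exactly the state created above.
type IdxEntry struct {
	Key       string
	Partition int32
}

func main() {
	rows := []IdxEntry{
		{Key: "some.test.metric", Partition: 0},
		{Key: "some.test.metric", Partition: 1}, // duplicate under a 2nd partition
	}

	seen := map[string]int32{}
	for _, r := range rows {
		if p, ok := seen[r.Key]; ok && p != r.Partition {
			// An entry must live in exactly one partition; a duplicate
			// like this is what led to the crash reported in this ticket.
			log.Printf("duplicate key %q in partitions %d and %d", r.Key, p, r.Partition)
			continue
		}
		seen[r.Key] = r.Partition
	}
}
```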
Ok, I tried this patch out and it works as described. echo ".bad.test.metric.whatevs 4 ... ends up as expected. 👍 (Sorry for the delay)
@tehlers320 don't modify the index like that. an index entry should only live in 1 partition at a time. |
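A sketch of why that invariant holds, assuming partition assignment is a deterministic hash of the metric key (kafka-style partitioning); the exact scheme metrictank uses may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor derives a single, stable partition from a metric key, so the
// same metric can never legitimately appear under two partitions.
func partitionFor(key string, numPartitions int32) int32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int32(h.Sum32() % uint32(numPartitions))
}

func main() {
	fmt.Println(partitionFor("some.test.metric", 2)) // always the same value
}
```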
thanks for testing guys. fix is now merged into master. |
Version: 0.7.2-12-gf9f4389
Query performed:
curl metrictank.test.monitoring.internal.com/metrics/find?query=*
Note: if this is not what builds the base tree, the Python UI is also failing. Grafana can grab index entries excluding the base entry; you can, however, fill the path in manually, and the 2nd/3rd/4th/5th levels work.
All 16 masters and all 16 slaves crash in this example. It is repeatable, but only once the restarted masters are ready.