
MemoryError during 'run_all_rules' execution #1249

Open
Dmitry1987 opened this issue Jul 27, 2017 · 14 comments

@Dmitry1987

Hi,
I've been having weird crashes all the time lately:

INFO:elastalert:Queried rule Production Warning Spike from 2017-07-25 21:52 UTC to 2017-07-25 22:31 UTC: 10000 / 10000 hits (scrolling..)
Traceback (most recent call last):
  File "elastalert.py", line 1732, in <module>
    sys.exit(main(sys.argv[1:]))
  File "elastalert.py", line 1728, in main
    client.start()
  File "elastalert.py", line 978, in start
    self.run_all_rules()
  File "elastalert.py", line 1015, in run_all_rules
    self.handle_uncaught_exception(e, rule)
  File "elastalert.py", line 1614, in handle_uncaught_exception
    logging.error(traceback.format_exc())
  File "/usr/lib/python2.7/traceback.py", line 242, in format_exc
    return ''.join(format_exception(etype, value, tb, limit))
MemoryError

Any idea what it might be?

It runs in Docker, and the server is barely at 10%-12% memory utilization; I can see in New Relic that it was never out of memory.

The crash lines point here (in run_all_rules(self)):
[screenshot of the run_all_rules code attached]

@Dmitry1987
Author

Any chance it's Elasticsearch that throws the "MemoryError" message? But then elastalert should continue execution and not crash. This one crashes the Docker container completely, so that it can't be restarted.

@Qmando
Member

Qmando commented Jul 27, 2017

That MemoryError is coming from Python, not Elasticsearch.

There are definitely some issues regarding unbounded memory usage. If you have too many documents in Elasticsearch, it's easy to shoot yourself in the foot as elastalert will try to load too much into memory.

You may be running into process-level memory limits, which is why the server is still at low usage. Try running ulimit (depending on your system, this may be a different command):

$ ulimit -m
6291456

That means I have a limit of 6 GB of memory. You can set this to a higher number, or to unlimited, for the elastalert process.
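
If it helps anyone debugging this, here is a minimal diagnostic sketch (not part of ElastAlert, just the stdlib resource module) that prints the limits the running process actually sees. Note that a Python MemoryError is usually governed by the address-space limit (ulimit -v / RLIMIT_AS) rather than ulimit -m (RLIMIT_RSS), which modern Linux kernels largely ignore:

import resource

# Print soft/hard limits for the two limits people usually mean by "memory limit".
for label, rlimit in [("RLIMIT_AS (ulimit -v)", resource.RLIMIT_AS),
                      ("RLIMIT_RSS (ulimit -m)", resource.RLIMIT_RSS)]:
    soft, hard = resource.getrlimit(rlimit)
    # Values are in bytes; resource.RLIM_INFINITY (-1) means unlimited.
    print("%s  soft=%s  hard=%s" % (label, soft, hard))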

In general I'd like to add some better sanity checking for when a query keeps scrolling and scrolling forever. max_query_size applies to a single query only :(

@Qmando Qmando added the bug label Jul 27, 2017
@Dmitry1987
Author

Hi @Qmando, thanks for another quick reply. Dude, you've got to start charging for commercial support; seriously, you support ElastAlert here on GitHub better than several paid services I have that charge for a support subscription ;) 👍

Can I try to add a simple counter that limits scrolling to a sane default like 15-20 (with an option to modify it)? Can you point me to the right place, so I can submit a PR? Does it all go through a single function that controls this retry behavior, or does each rule handle it separately?

@Qmando
Member

Qmando commented Jul 27, 2017

There's a quick fix for this and a more in depth fix.

The quick way would be to limit self.total_hits (https://github.com/Yelp/elastalert/blob/master/elastalert/elastalert.py#L362)

Something like

if self.total_hits > self.maximum_allowed_hits:
  self.total_hits = self.maximum_allowed_hits
  elastalert_logger.warning("There are more documents in this query than ElastAlert can handle, truncating to %s" % self.maximum_allowed_hits)

Alternatively, you can limit the number of scrolls; effectively the same thing.
Not sure what's a good default for that. A million? Several million?

The more in-depth fix is to restructure how we do scrolling. The issue right now is that because we use recursion, we keep the data from each query in memory, since the functions never return until the scroll ends. If we did it iteratively, we could forget each chunk of data after adding it to the rule type.
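
For illustration, a rough sketch of that iterative shape (a hypothetical helper, not ElastAlert's actual code): each page is handed to the rule type and then released before the next scroll request, so memory stays bounded by one page.

from elasticsearch import Elasticsearch

def scroll_and_process(es, index, query, on_page, scroll="30s", size=1000):
    # Fetch the first page, then keep scrolling until no hits come back.
    resp = es.search(index=index, body=query, scroll=scroll, size=size)
    scroll_id = resp["_scroll_id"]
    try:
        while resp["hits"]["hits"]:
            on_page(resp["hits"]["hits"])  # e.g. rule.add_data(page)
            scroll_id = resp["_scroll_id"]
            # Fetch the next page; the previous response is no longer referenced
            # after this assignment and can be garbage-collected.
            resp = es.scroll(scroll_id=scroll_id, scroll=scroll)
    finally:
        es.clear_scroll(scroll_id=scroll_id)

# Usage sketch: scroll_and_process(Elasticsearch(["http://localhost:9200"]),
#                                  "logstash-*", {"query": {"match_all": {}}},
#                                  on_page=lambda hits: None)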

@Dmitry1987
Author

Interesting... yeah, I thought about limiting the number of scrolls.
What would be the effect of not receiving ALL of the results, if I stop the scrolls and only get, say, half of what is out there?
Actually, I wonder what elastalert is loading during those scrolls; isn't only the number (count) of matched events important? (In my case, for example, those are percentage and spike rules... I think the number of events is all elastalert needs for all rule types, with maybe a small 'sample' chunk of full messages to populate alert notification fields from.)

I'll try what you suggest. Do you plan to implement something like that in the next version update?

@Qmando
Member

Qmando commented Jul 27, 2017

Regarding whether you need the count or the documents in full, there are three different scenarios:

  1. You need ALL the documents
    This might occur if you are using query_key or aggregation_key. In these cases, elastalert needs to inspect each document in order to sort it into a bucket for counting. use_terms_query can be used, but only if there is just a single query_key.

  2. You need at least one document
    Often the rule type only needs to know the count to determine whether to alert. But if someone has alert_text_args set, you need at least one document to populate the fields in the match.

  3. You need only the count
    If you're using the frequency type or spike type and not using alert_text_args or query_key, then only the count matters. In these cases you can use use_count_query (see the sketch after this comment).

It's definitely possible to make elastalert smart enough to choose whether use_terms_query or use_count_query can be used, but this would require a lot of logic around which config options are set; it would probably not work well with custom alerts and would break easily.

For case 2, we could be smart and make a single query after a match occurs, similar to what happens for top_count_keys.

As for the consequences of dropping documents after some number: the most obvious is that you could miss alerts if you had very high thresholds, or a very high cardinality on a query_key field.
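
To make case 3 concrete, here is a minimal sketch of a count-only request using the plain elasticsearch client (the index name, field and time window are placeholders, and this is not ElastAlert's exact query); only a number comes back, so nothing is scrolled through memory:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address
body = {"query": {"range": {"@timestamp": {"gte": "now-15m"}}}}
resp = es.count(index="logstash-*", body=body)
print("hits in window: %d" % resp["count"])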

@Qmando
Member

Qmando commented Jul 27, 2017

This doesn't seem to be a very commonly reported issue, so all of the "smart" features are probably not necessary at this point; they would be a lot of work.

I guess I'll try to do the refactor to make it more memory efficient at some point soon.

@Qmando Qmando closed this as completed Jul 27, 2017
@Qmando Qmando reopened this Jul 27, 2017
@Dmitry1987
Author

Hi, thanks for the details. I understand that in my case it has to fetch all results, because all our rules use "query_key" and there are 20+ variants of the key (our different environments).
I'm thinking of programmatically generating many YAML files of the same rule with a hardcoded env name value instead of using "query_key" (I know all possible variations, so I'll give it a list of all env names). This should completely solve the scroll problem.
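
For what it's worth, a hypothetical sketch of such a generator (the template path, output directory, field name, and env list are all made up), using PyYAML, which ElastAlert already depends on:

import copy
import yaml

ENVIRONMENTS = ["prod", "staging", "qa"]  # the 20+ known env names would go here

with open("rule_template.yaml") as f:
    template = yaml.safe_load(f)

for env in ENVIRONMENTS:
    rule = copy.deepcopy(template)
    rule["name"] = "%s - %s" % (template["name"], env)
    rule.pop("query_key", None)  # hardcode the env in a filter instead of bucketing on it
    rule.setdefault("filter", []).append({"term": {"environment": env}})
    with open("rules/%s.yaml" % env, "w") as out:
        yaml.safe_dump(rule, out, default_flow_style=False)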
Thanks a lot @Qmando !

@Qmando
Member

Qmando commented Jul 28, 2017

If you have a single query_key (not a list of query_keys), you can use use_terms_query. This should solve the memory issue.
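
Conceptually, use_terms_query replaces document scrolling with a terms aggregation, so only per-key counts come back. A rough sketch of that kind of request (the field name and index are placeholders, and this is not ElastAlert's exact query):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address
body = {
    "size": 0,  # no individual documents, only aggregation buckets
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
    "aggs": {"per_key": {"terms": {"field": "environment", "size": 50}}},
}
resp = es.search(index="logstash-*", body=body)
for bucket in resp["aggregations"]["per_key"]["buckets"]:
    print("%s: %d" % (bucket["key"], bucket["doc_count"]))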

@Dmitry1987
Author

oh, ok I'll try. Thanks!

@caleb15
Contributor

caleb15 commented Oct 25, 2019

"This doesn't seem to be a very commonly reported issue"

Some other people (myself included) are having problems with memory usage as well.

See #2399 and #2481

I haven't tried use_count_query: true yet, though; maybe that will help us out.

@Zachary-Zhao

Zachary-Zhao commented Jun 22, 2020

"Regarding whether you need the count or the documents in full, there are three different scenarios: [...]" (quoting @Qmando's three-scenario comment above in full)

@Qmando
Maybe we can add a 'del data' statement in the definition of 'run_query'?
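
Schematically, the idea would be something like this (a hypothetical helper, not the real run_query body); note that it only helps if the rule type doesn't keep its own reference to the whole list:

def process_one_page(rule, fetch_page):
    data = fetch_page()   # one page of hits from Elasticsearch
    rule.add_data(data)   # the rule type extracts what it needs
    del data              # drop the local reference so this page can be reclaimed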

@kdszh

kdszh commented Aug 22, 2020

"Maybe we can add a 'del data' statement in the definition of 'run_query'?" (quoting @Zachary-Zhao's comment above, which itself quotes @Qmando's three scenarios)

Any thoughts on the 'del data' statement? @nsano-rururu @Qmando

@nsano-rururu
Contributor

@kdszh

Why not try incorporating the following change and see if it improves things?

refactor run_query
#2345
