
MemoryError during 'run_all_rules' execution #1249

Open
Dmitry1987 opened this issue Jul 27, 2017 · 14 comments

@Dmitry1987

Hi,
I've been having weird crashes all the time lately:

INFO:elastalert:Queried rule Production Warning Spike from 2017-07-25 21:52 UTC to 2017-07-25 22:31 UTC: 10000 / 10000 hits (scrolling..)
Traceback (most recent call last):
  File "elastalert.py", line 1732, in <module>
    sys.exit(main(sys.argv[1:]))
  File "elastalert.py", line 1728, in main
    client.start()
  File "elastalert.py", line 978, in start
    self.run_all_rules()
  File "elastalert.py", line 1015, in run_all_rules
    self.handle_uncaught_exception(e, rule)
  File "elastalert.py", line 1614, in handle_uncaught_exception
    logging.error(traceback.format_exc())
  File "/usr/lib/python2.7/traceback.py", line 242, in format_exc
    return ''.join(format_exception(etype, value, tb, limit))
MemoryError

Any idea what it might be?

It runs in Docker, and the server is barely at 10%-12% memory utilization; I can see in New Relic that it was never out of memory.

The crash lines point here (in run_all_rules(self)):
[screenshot of the run_all_rules code attached]

@Dmitry1987
Author

Any chance it's Elasticsearch that throws the "MemoryError" message? But then elastalert should continue execution and not crash. This one crashes the Docker container completely, so that it can't be restarted.

@Qmando
Member

Qmando commented Jul 27, 2017

That MemoryError is coming from Python, not Elasticsearch.

There are definitely some issues regarding unbounded memory usage. If you have too many documents in Elasticsearch, it's easy to shoot yourself in the foot as elastalert will try to load too much into memory.

You may be running into process-level memory limits, which is why the server is still at low usage. Try running ulimit (depending on your system, this may be a different command):

$ ulimit -m
6291456

That means I have a limit of 6 GB of memory. You can set this to a higher number, or to unlimited, for the elastalert process.
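
If it helps anyone debugging this, here is a minimal diagnostic sketch (not part of ElastAlert, just the stdlib resource module) that prints the limits the running process actually sees. Note that a Python MemoryError is usually governed by the address-space limit (ulimit -v / RLIMIT_AS) rather than ulimit -m (RLIMIT_RSS), which modern Linux kernels largely ignore:

import resource

# Print soft/hard limits for the two limits people usually mean by "memory limit".
for label, rlimit in [("RLIMIT_AS (ulimit -v)", resource.RLIMIT_AS),
                      ("RLIMIT_RSS (ulimit -m)", resource.RLIMIT_RSS)]:
    soft, hard = resource.getrlimit(rlimit)
    # Values are in bytes; resource.RLIM_INFINITY (-1) means unlimited.
    print("%s  soft=%s  hard=%s" % (label, soft, hard))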

In general I'd like to add some better sanity checking for when a query keeps scrolling and scrolling forever. max_query_size applies to a single query only :(

@Qmando Qmando added the bug label Jul 27, 2017
@Dmitry1987
Author

Hi @Qmando, thanks for another quick reply. Dude, you've got to start charging for commercial support; seriously, you support ElastAlert here on GitHub better than several paid services I have that charge for a support subscription ;) 👍

Can I try to add a simple counter that limits scrolling to a sane default like 15-20 (with an option to modify it)? Can you point me to the right place, so I can submit a PR? Does it all go through a single function that controls this retry behavior, or does each rule handle it separately?

@Qmando
Member

Qmando commented Jul 27, 2017

There's a quick fix for this and a more in depth fix.

The quick way would be to limit self.total_hits (https://github.com/Yelp/elastalert/blob/master/elastalert/elastalert.py#L362)

Something like

if self.total_hits > self.maximum_allowed_hits:
  self.total_hits = self.maximum_allowed_hits
  elastalert_logger.warning("There are more documents in this query than ElastAlert can handle, truncating to %s" % self.maximum_allowed_hits)

Alternatively, you can limit the number of scrolls; effectively the same thing.
Not sure what's a good default for that. A million? Several million?

The more in-depth fix is to restructure how we do scrolling. The issue right now is that because we use recursion, we keep the data from each query in memory, since the functions never return until the scroll ends. If we did it iteratively, we could forget each chunk of data after adding it to the rule type.
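
For illustration, a rough sketch of that iterative shape (a hypothetical helper, not ElastAlert's actual code): each page is handed to the rule type and then released before the next scroll request, so memory stays bounded by one page.

from elasticsearch import Elasticsearch

def scroll_and_process(es, index, query, on_page, scroll="30s", size=1000):
    # Fetch the first page, then keep scrolling until no hits come back.
    resp = es.search(index=index, body=query, scroll=scroll, size=size)
    scroll_id = resp["_scroll_id"]
    try:
        while resp["hits"]["hits"]:
            on_page(resp["hits"]["hits"])  # e.g. rule.add_data(page)
            scroll_id = resp["_scroll_id"]
            # Fetch the next page; the previous response is no longer referenced
            # after this assignment and can be garbage-collected.
            resp = es.scroll(scroll_id=scroll_id, scroll=scroll)
    finally:
        es.clear_scroll(scroll_id=scroll_id)

# Usage sketch: scroll_and_process(Elasticsearch(["http://localhost:9200"]),
#                                  "logstash-*", {"query": {"match_all": {}}},
#                                  on_page=lambda hits: None)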

@Dmitry1987
Author

Interesting... yeah, I thought about limiting the number of scrolls.
What would be the effect of not receiving ALL of the results, if I stop the scrolls and only get, say, half of what is out there?
Actually, I wonder what elastalert is loading during those scrolls; isn't only the number (count) of matched events important? (In my case, for example, those are percentage and spike rules... I think the number of events is all elastalert needs for all rule types, with maybe a small 'sample' chunk of full messages to populate alert notification fields from.)

I'll try what you suggest. Do you plan to implement something like that in the next version update?

@Qmando
Member

Qmando commented Jul 27, 2017

Regarding whether you need the count or the documents in full, there are three different scenarios:

  1. You need ALL the documents
    This might occur if you are using query_key or aggregation_key. In these cases, elastalert needs to inspect each document in order to sort it into a bucket for counting. use_terms_query can be used, but only if there is just a single query_key.

  2. You need at least one document
    Often the rule type only needs to know the count to determine whether to alert. But if someone has alert_text_args set, you need at least one document to populate the fields in the match.

  3. You need only the count
    If you're using the frequency type or spike type and not using alert_text_args or query_key, then only the count matters. In these cases you can use use_count_query (see the sketch after this comment).

It's definitely possible to make elastalert smart enough to choose whether use_terms_query or use_count_query can be used, but this would require a lot of logic around which config options are set; it would probably not work well with custom alerts and would break easily.

For case 2, we could be smart and make a single query after a match occurs, similar to what happens for top_count_keys.

As for the consequences of dropping documents after some number: the most obvious is that you could miss alerts if you had very high thresholds, or a very high cardinality on a query_key field.
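
To make case 3 concrete, here is a minimal sketch of a count-only request using the plain elasticsearch client (the index name, field and time window are placeholders, and this is not ElastAlert's exact query); only a number comes back, so nothing is scrolled through memory:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address
body = {"query": {"range": {"@timestamp": {"gte": "now-15m"}}}}
resp = es.count(index="logstash-*", body=body)
print("hits in window: %d" % resp["count"])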

@Qmando
Member

Qmando commented Jul 27, 2017

This doesn't seem to be a very commonly reported issue, so all of the "smart" features are probably not necessary at this point; they would be a lot of work.

I guess I'll try to do the refactor to make it more memory efficient at some point soon.

@Qmando Qmando closed this as completed Jul 27, 2017
@Qmando Qmando reopened this Jul 27, 2017
@Dmitry1987
Author

Hi, thanks for the details. I understand that in my case it has to fetch all results, because all our rules use "query_key" and there are 20+ variants of the key (our different environments).
I'm thinking of programmatically generating many YAML files of the same rule with a hardcoded env name value instead of using "query_key" (I know all possible variations, so I'll give it a list of all env names). This should completely solve the scroll problem.
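
For what it's worth, a hypothetical sketch of such a generator (the template path, output directory, field name, and env list are all made up), using PyYAML, which ElastAlert already depends on:

import copy
import yaml

ENVIRONMENTS = ["prod", "staging", "qa"]  # the 20+ known env names would go here

with open("rule_template.yaml") as f:
    template = yaml.safe_load(f)

for env in ENVIRONMENTS:
    rule = copy.deepcopy(template)
    rule["name"] = "%s - %s" % (template["name"], env)
    rule.pop("query_key", None)  # hardcode the env in a filter instead of bucketing on it
    rule.setdefault("filter", []).append({"term": {"environment": env}})
    with open("rules/%s.yaml" % env, "w") as out:
        yaml.safe_dump(rule, out, default_flow_style=False)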
Thanks a lot @Qmando !

@Qmando
Member

Qmando commented Jul 28, 2017

If you have a single query_key (not a list of query_keys), you can use use_terms_query. This should solve the memory issue.
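
Conceptually, use_terms_query replaces document scrolling with a terms aggregation, so only per-key counts come back. A rough sketch of that kind of request (the field name and index are placeholders, and this is not ElastAlert's exact query):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address
body = {
    "size": 0,  # no individual documents, only aggregation buckets
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
    "aggs": {"per_key": {"terms": {"field": "environment", "size": 50}}},
}
resp = es.search(index="logstash-*", body=body)
for bucket in resp["aggregations"]["per_key"]["buckets"]:
    print("%s: %d" % (bucket["key"], bucket["doc_count"]))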

@Dmitry1987
Author

oh, ok I'll try. Thanks!

@caleb15
Contributor

caleb15 commented Oct 25, 2019

"This doesn't seem to be a very commonly reported issue"

Some other people (myself included) are having problems with memory usage as well.

See #2399 and #2481

I haven't tried use_count_query: true yet, though; maybe that will help us out.

@Zachary-Zhao

Zachary-Zhao commented Jun 22, 2020

"Regarding whether you need the count or the documents in full, there are three different scenarios: [...]" (quoting @Qmando's three-scenario comment above in full)

@Qmando
Maybe we can add a 'del data' statement in the definition of 'run_query'?
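
Schematically, the idea would be something like this (a hypothetical helper, not the real run_query body); note that it only helps if the rule type doesn't keep its own reference to the whole list:

def process_one_page(rule, fetch_page):
    data = fetch_page()   # one page of hits from Elasticsearch
    rule.add_data(data)   # the rule type extracts what it needs
    del data              # drop the local reference so this page can be reclaimed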

@kdszh

kdszh commented Aug 22, 2020

"Maybe we can add a 'del data' statement in the definition of 'run_query'?" (quoting @Zachary-Zhao's comment above, which itself quotes @Qmando's three scenarios)

Any thoughts on the 'del data' statement? @nsano-rururu @Qmando

@nsano-rururu
Contributor

@kdszh

Why not try incorporating the following change and see if it improves things?

refactor run_query
#2345
