CI tests are hitting OOM #312

Closed
metral opened this issue Jan 27, 2020 · 2 comments

metral commented Jan 27, 2020

Problem description

We're seeing intermittent fatal error: runtime: out of memory errors in Travis CI, due to what seem to be leaky tests.

These leaks seem tied specifically to the pulumi/eks test surface rather than to the sheer quantity of tests: eks currently has fewer than 15 tests, while pulumi/examples has ~90.

One theory is that this repo makes heavier use of dynamic providers than any other repo. Another is that once test failures occur, they compound and trigger further failures, leading to further resource starvation.

Tests are run in parallel, with a current maximum of 20 concurrent jobs.
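For context, a minimal sketch of how that parallelism knob is typically plumbed into Go's test runner, assuming the integration tests are driven by go test (the exact Makefile target and package path in this repo may differ):

```sh
# Sketch: cap how many Go test cases run concurrently. Note -parallel only
# limits tests that call t.Parallel(); 20 matches the default mentioned above,
# and the package path here is an assumption.
TESTPARALLELISM=20
go test -v -count=1 -timeout 2h -parallel ${TESTPARALLELISM} ./nodejs/eks/...
```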

  • We've started testing in a slimmed-down EC2 VM (t2.medium) to mimic the Travis CI runtime with fewer resources than Travis
  • AWS Region: us-west-2
  • Swap is not enabled by default in the VM (a sketch for temporarily enabling it while diagnosing follows this list)
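Because swap is off, the kernel OOM-kills processes as soon as physical memory runs out. A minimal sketch for temporarily adding a swap file on the test VM while diagnosing (run as root; the 4 GiB size is arbitrary):

```sh
# Temporarily add a 4 GiB swap file for diagnosis.
fallocate -l 4G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
swapon --show                  # confirm swap is active
```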

Errors & Logs

  • Output of /var/log/kern.log: (attached: kern.log)

  • Output of ps aux | grep node && ps aux | grep pulumi after repro: (attached screenshot: ps-node-pulumi)

  • Output of top after repro: (attached screenshot)

Reproducing the issue

  • Run all tests using make test_all in the nodejs/eks directory
  • After a few failures occur, Ctrl-C to interrupt tests
  • Leaked processes should accumulate, and SSH responsiveness and general use of the VM should become noticeably slow (a few diagnostic commands are sketched after this list)
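A few standard commands (not specific to this repo) that help confirm the leak and quantify memory pressure after interrupting the tests:

```sh
# Top memory consumers -- leaked node / pulumi processes show up here.
ps aux --sort=-%mem | head -n 20

# Count leftover test-related processes.
pgrep -c node; pgrep -c pulumi

# Overall memory/swap usage and any OOM-killer activity.
free -h
dmesg | grep -i 'out of memory'
```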

Related Issues


metral commented Jan 29, 2020

Recent update, mirroring the Slack thread:

We repro’d the starvation issue on a test VM with no failures occurring; it seems that just running all tests concurrently is enough to do the machine in. Node processes initially shot up to consume most of the CPU, and just now kswapd0, snapd, and a couple of pulumi-language processes are together accounting for over 150% CPU usage (see the screenshots below for data).

OTOH, in a separate Travis run I’ve set TESTPARALLELISM=3 (vs. the current default of 20), and that is humming along with no failures for now, but at this pace it will inevitably hit Travis's 2-hour test run limit.
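For reference, that reduced-parallelism run amounts to overriding the variable for a one-off invocation, roughly as follows (assuming TESTPARALLELISM is read by the Makefile as in the sketch above):

```sh
# One-off run with reduced test parallelism (default is 20).
cd nodejs/eks
TESTPARALLELISM=3 make test_all
```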

ps aux | grep node: (screenshot)

ps aux | grep pulumi: (screenshot)

top: (screenshot)
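As a rough way to quantify what the screenshots show, the resident memory held by the node and pulumi processes can be summed with standard tools (a sketch, not the exact command used here):

```sh
# Approximate resident memory (MiB) held by processes named exactly "node"
# and "pulumi"; language-host processes would need their own pattern.
ps -o rss= -C node   | awk '{sum += $1} END {print sum/1024, "MiB (node)"}'
ps -o rss= -C pulumi | awk '{sum += $1} END {print sum/1024, "MiB (pulumi)"}'
```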


metral commented Feb 3, 2020

Closed with pulumi/pulumi-kubernetes#974

metral closed this as completed on Feb 3, 2020