Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

macOS+rust e2e tests in CI are pinned to an over two year old version of Docker #145

Closed
codefromthecrypt opened this issue Mar 27, 2021 · 10 comments · Fixed by #190
Closed

Comments

@codefromthecrypt
Copy link
Contributor

codefromthecrypt commented Mar 27, 2021

macOS+rust e2e tests in CI are pinned to an over two year old version of Docker, handled in ./ci/e2e/macos/install_docker.sh. The rest of the matrix uses latest Docker per #159.

The impact is that we are blind to any issues that present only on recent versions of Docker + OS/x, such as #147. To mitigate this, people should manually run make e2e on a macbook on occasion.

The underlying problem is not that tests break, it is that they take too long to complete. The rust performance problem is being tracked in #152 and this issue should be revisited on any significant progress. However, this isn't a sure thing. For example, special casing in #158 resulted in half the time spent in macOS+rust e2e (on a modern MacBook Pro). However, GH actions run of those same tests failed on timeout after an hour.

There means there could be some possibilities that only change CI. For example, the descending performance could be related to memory or other environment details that differ between CI and a macbook pro. In other words, even if work on the root problem is slow progressing, there could be some changes to make things "fast enough" (40m?) to use normal Docker again.

@lizan
Copy link
Contributor

lizan commented Mar 29, 2021

IIRC it is pinged to older version because newer Docker Desktop doesn't install on CI machines, if we can address that it would be great.

@codefromthecrypt
Copy link
Contributor Author

@lizan thx. to clarify the script I mention above does install, build images, run etc. Just we have some tests that are failing (in a bad way, just hanging). Hopefully the general cleanup work on tests can get us to a point where we can understand why and put the unpinned script back in (or some variant)

@codefromthecrypt
Copy link
Contributor Author

I made a branch try-again off master, now that #130 is merged, then clicked in the actions pane to run the build against it (without a PR)

The good news is that tests are running (on the unpinned OS/X docker). The bad news is it is less a run and more a crawl: the rust compilation is super super slow
image

While we have work to do anyway on that topic (see #152), @mathetake had a hunch that writes to volumes could be slower on OS/x and there could be other topics to investigate also.

Meanwhile, I'll leave the branch there in case someone wants to experiment. Hope this helps.
https://github.com/tetratelabs/getenvoy/compare/try-again
https://github.com/tetratelabs/getenvoy/runs/2235491506?check_suite_focus=true

@codefromthecrypt
Copy link
Contributor Author

the go test -timeout flag kicked in to kill the tests at 45m. Since locally I had a problem resolved by #147, it is possible we are only down to runtime slowness.

I'll bump the timeout on that branch to see the baseline of an unpinned docker+macos.

Screen Shot 2021-04-01 at 7 10 08 AM

@codefromthecrypt
Copy link
Contributor Author

I found that a plugin I used in the past supports OS/x with the same approach as here. Since we are not in the business of hosting shell scripts, I changed my branch to use this instead https://github.com/marketplace/actions/setup-docker

I've hesitated on running against it until #156 as it takes a long time to run both extension languages sequentially. Running these isolated will allow us to better understand the extent of things such as cargo flags.

@codefromthecrypt
Copy link
Contributor Author

codefromthecrypt commented Apr 1, 2021

the setup-docker action works, but there's a serious performance problem building rust on osx+docker at least in GH actions. I suggest we do anything we can to address perf issues on rust based "getenvoy extension build" and revisit when that occurs.

Specifically, when not caching tinygo is 20m and rust fails after 2hrs

codefromthecrypt pushed a commit that referenced this issue Apr 2, 2021
…ocker

This moves special-casing to rust+macOS cell in a normal build matrix.
Issue #152 still needs investigation on how to get around the slowness.

Fixes #153
Fixes #145

Signed-off-by: Adrian Cole <adrian@tetrate.io>
@codefromthecrypt
Copy link
Contributor Author

#159 isolates the two year old docker install to only the rust+macOS cell. This should help future PRs test improvements as all they have to do is remove that special casing.

@codefromthecrypt codefromthecrypt changed the title macos CI build pinned to old version of Docker macOS+rust CI build pinned to an over two year old version of Docker Apr 3, 2021
@codefromthecrypt codefromthecrypt changed the title macOS+rust CI build pinned to an over two year old version of Docker macOS+rust e2e tests in CI are pinned to an over two year old version of Docker Apr 3, 2021
@codefromthecrypt
Copy link
Contributor Author

I revised the description here because the facts have become more clear over the last week of investigation.

@codefromthecrypt
Copy link
Contributor Author

macOS is a primary development environment and old docker is near impossible to install in real life. No one would. I found some developers have a linux workstation to avoid "macos problems" but that isn't something we can expect users to do.

I spent a lot of time with a large amount of review help by @mathetake on getting CI to a place where rust might be able to work in CI on macOS like tinygo does. This included upgrading our docker image, rewrote the scripts to cache outside the volume, reviewed every dependency removed one and rewrote out another... many iterations on this topic all week and every weekend also.

This displaced time from useful work, not just myself but other people's time reviewing it. I have no idea why this language is so vitally important and also ok to leave so slow it is broken on osx. This is one, but not the only reason I opened #189 However, it is fair to say this topic consumed more time than any sane person would spend and we are lucky I am insane or would have given up months ago.

Here's the latest run which uses only the dependencies absolutely required: https://github.com/tetratelabs/getenvoy/runs/2397652908?check_suite_focus=true

codefromthecrypt pushed a commit that referenced this issue Apr 21, 2021
This removes the custom script for rust at expense of longer runtimes,
last checked at 1hr for the e2e.

Fixes #145

Signed-off-by: Adrian Cole <adrian@tetrate.io>
@codefromthecrypt
Copy link
Contributor Author

saw this pass once so raised a PR to close this topic, though it will be painful to wait an hour #190

mathetake pushed a commit that referenced this issue Apr 21, 2021
This removes the custom script for rust at expense of longer runtimes,
last checked at 1hr for the e2e.

Fixes #145

Signed-off-by: Adrian Cole <adrian@tetrate.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants