Github Actions Limitations #4876

Open
ariskotsomitopoulos opened this issue Jan 7, 2022 · 13 comments
Labels
T-Task Refactoring, enabling or disabling functionality, other engineering tasks X-DevOps Issues that require some infrastructure support

Comments

@ariskotsomitopoulos
Contributor

ariskotsomitopoulos commented Jan 7, 2022

Following up on the "Trying to fix integration tests" PR: after the fixes, the tests run most of the time and the results are published manually as a PR comment. Splitting the tests into smaller chunks also helped.

Running an emulator on a Linux GitHub Actions runner has a lot of problems and limitations (a macOS runner is much better):
ReactiveCircus/android-emulator-runner#62

This is a solution, but not a stable one. It would be good if we could increase the runners' hardware.
Other than that, we could use another CI/CD tool such as Jenkins specifically for the integration tests, so the flow would be as follows:
GitHub Actions triggers Jenkins -> Jenkins runs the integration tests and posts the results back to GitHub
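
As a rough sketch of the first half of that flow, a workflow step could call Jenkins' remote build endpoint (`/job/<name>/buildWithParameters`) with curl. Everything specific here is an assumption for illustration: the Jenkins URL, the job name, the `GIT_REF` build parameter, and the `JENKINS_USER` / `JENKINS_TOKEN` secrets are all hypothetical. Posting results back to GitHub is not shown.

```shell
# Builds the remote-trigger URL for a parameterized Jenkins job.
# $1 = Jenkins base URL, $2 = job name, $3 = git ref to test
# GIT_REF is a hypothetical build parameter, not one defined in this thread.
jenkins_trigger_url() {
  printf '%s/job/%s/buildWithParameters?GIT_REF=%s\n' "${1%/}" "$2" "$3"
}

# Sends the trigger request. JENKINS_USER and JENKINS_TOKEN are assumed to be
# provided as CI secrets; --fail makes curl exit non-zero on an HTTP error.
trigger_jenkins() {
  url=$(jenkins_trigger_url "$1" "$2" "$3")
  curl --fail --silent --show-error -X POST \
    --user "$JENKINS_USER:$JENKINS_TOKEN" "$url"
}
```

A step would then call something like `trigger_jenkins https://ci.example.org integration-tests "$GITHUB_SHA"`, where the host and job name are placeholders.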

@manuroe manuroe added the X-DevOps Issues that require some infrastructure support label Jan 7, 2022
@ouchadam ouchadam added the T-Task Refactoring, enabling or disabling functionality, other engineering tasks label Jan 13, 2022
@ouchadam
Contributor

To add some more ideas from previous discussions:

  • Run the device tests on Firebase Test Lab
  • Create abstractions to avoid needing the Android runtime

@michaelkaye
Contributor

We are using the macOS environment for the element-ios builds, so just turning that on (with the hardware acceleration that comes with it) seems reasonable. It's more expensive, but if it works we just need to pay for it.

How was it not working? Unreliable tests, timeouts, or similar? (Do we have an example of a GHA-based test that is failing at the moment? A lot seem to be green in the Actions tab.)

We're also using Buildkite as our main non-GHA CI tool. We could see whether running on its Linux instances is better than on the GHA ones, but they will still (I believe) not have hardware acceleration available, so they will likely still run slowly. We might, however, be able to run the tests for an extended period.

The last option is to use a local farm (or a single machine) of real machines with hardware acceleration and run these actions on them as custom runners, but that's a bit of an investment.

@ouchadam
Contributor

ouchadam commented Feb 9, 2022

Mainly unreliable, see the sanity test workflow:
https://github.com/vector-im/element-android/actions/workflows/sanity_test.yml
These same tests pass consistently locally without issue.

We're also using the macOS runner for the nightly UI test suite; the Android emulator is notoriously picky when running headless without a GPU.

My recommendation would be to avoid using VMs altogether and use a dedicated service like Firebase Test Lab, but that would require an externally accessible Synapse instance.
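
For reference, a Firebase Test Lab run of instrumentation tests is usually started through the `gcloud firebase test android run` command. The sketch below only wraps that command; the APK paths and the default device (`model=Pixel2,version=29`) are placeholders, not values taken from this project, and the Synapse connectivity question is left open.

```shell
# run_on_test_lab APP_APK TEST_APK [DEVICE]
# Set DRY_RUN=1 to print the command instead of running it.
run_on_test_lab() {
  app_apk=$1
  test_apk=$2
  # Placeholder device; valid IDs can be listed with
  # `gcloud firebase test android models list`.
  device=${3:-model=Pixel2,version=29}
  ${DRY_RUN:+echo} gcloud firebase test android run \
    --type instrumentation \
    --app "$app_apk" \
    --test "$test_apk" \
    --device "$device" \
    --timeout 30m
}
```

Running it with `DRY_RUN=1` first is a cheap way to check the arguments before spending Test Lab quota.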

@michaelkaye
Contributor

Yeah; I was going to see how easy it would be to move Synapse outside of the build process first, in case Synapse itself is causing some overhead. If that works, then moving further onto Firebase or elsewhere would be fairly easy.

@michaelkaye michaelkaye self-assigned this Feb 9, 2022
@ariskotsomitopoulos
Contributor Author

ariskotsomitopoulos commented Feb 9, 2022

The main problem is something like this:


> Task :app:validateSigningDevDebugAndroidTest UP-TO-DATE
> Task :app:packageDevDebugAndroidTest UP-TO-DATE
[PropertyFetcher]: ShellCommandUnresponsiveException getting properties for device emulator-5554: null
[PropertyFetcher]: ShellCommandUnresponsiveException getting properties for device emulator-5554: null

> Task :app:connectedDevDebugAndroidTest
Skipping device 'test(AVD)' for 'app:DEV': Unknown API Level

DEV > : No compatible devices connected.[TestRunner] FAILED 
Found 1 connected device(s), 0 of which were compatible.

This is caused mainly by missing hardware acceleration. I believe it will work much better on the macOS runners used for the iOS builds; can you verify whether we can also use macOS runners for the Android builds? If that's not possible, maybe we should look at your other suggested solutions.
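
One way to guard against the "Unknown API Level" skip in the log above is to wait until the emulator actually answers property queries before starting Gradle. This is a minimal sketch, assuming the skip happens because the device properties are not yet readable; `wait_for_boot` is a hypothetical helper name, while `sys.boot_completed` and `ro.build.version.sdk` are standard Android system properties.

```shell
# wait_for_boot MAX_POLLS INTERVAL_SECONDS [SERIAL]
# Polls the device until it reports boot completion and a readable API level.
wait_for_boot() {
  max_polls=$1
  interval=$2
  serial=${3:-emulator-5554}   # serial as it appears in the log above
  poll=1
  while [ "$poll" -le "$max_polls" ]; do
    # getprop output can carry a trailing carriage return; strip it
    booted=$(adb -s "$serial" shell getprop sys.boot_completed 2>/dev/null | tr -d '\r')
    sdk=$(adb -s "$serial" shell getprop ro.build.version.sdk 2>/dev/null | tr -d '\r')
    if [ "$booted" = "1" ] && [ -n "$sdk" ]; then
      echo "device $serial ready (API level $sdk)"
      return 0
    fi
    sleep "$interval"
    poll=$((poll + 1))
  done
  echo "device $serial not ready after $max_polls polls" >&2
  return 1
}
```

For example, `wait_for_boot 60 5` would poll every 5 seconds for up to 5 minutes and return non-zero if the emulator never becomes usable.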

@michaelkaye
Contributor

https://github.com/vector-im/element-android/pull/5193/files

So I tried this: I took the settings for the integration test (which seems to start the emulator reliably) and moved them across to the sanity test. (I also forced the sanity test to run each time we push to my branch, so don't merge.)

It seems to be OK: other than some flaky tests, I haven't seen an emulator start error here. Perhaps the problem was Android API level 29 or the exact version of the Pixel, etc. Is that something we explicitly wanted to test with, or is it independent?

@ariskotsomitopoulos
Contributor Author

ariskotsomitopoulos commented Feb 11, 2022

Thanks for your update Michael, nice changes! The main issue is that the errors are intermittent, so we can't rely on the results. For example, in your branch here there are 3 failures (I guess that is after your changes). The emulator error, for example, happened to me in about 20% of runs with the previous settings.

The Android API level is not that important. I tried a lot of different settings, API levels and emulator builds, and settled on the settings we have because they produced the fewest errors. But I am still not sure GHA is made for this kind of run; maybe using hardware-accelerated macOS runners will help.

Maybe we can apply your changes and check for improvements in our everyday builds.

@michaelkaye
Contributor

Yeah, this fixes the "it reliably starts an emulator and runs" part. We need to make more changes to make it actually fail the build on a failure (for instance, the integration tests also don't fail the build).

I'll tidy the branch up into a real PR and offer it for review.
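
On failing the build: one common reason a test step stays green is that a later reporting command replaces the test command's exit status. The thread does not say that this is the cause here, so the following is only a sketch of the general pattern. `publish_results` is a hypothetical placeholder for whatever publishes the report.

```shell
# publish_results is a hypothetical placeholder for the reporting step
# (for example, posting a summary as a PR comment).
publish_results() {
  echo "publishing results, test exit status was $1"
}

# run_and_report COMMAND [ARGS...]
# Runs the tests, always publishes, then returns the tests' own exit status.
run_and_report() {
  status=0
  "$@" || status=$?
  # A failure while reporting is logged but must not mask the test status.
  publish_results "$status" || echo "warning: publishing results failed" >&2
  return "$status"
}
```

A step running `run_and_report ./gradlew connectedDevDebugAndroidTest` would then still publish results on failure, but exit non-zero so the job is marked as failed.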

@michaelkaye
Contributor

https://github.com/vector-im/element-android/actions/workflows/sanity_test.yml is what I was actually trying to get working, btw.

I think the integration tests might fail because the Synapse that demo.sh starts does not have the additional configuration to enable the threading logic on the server side, but there are tests that use threading, and those are failing.

@michaelkaye
Contributor

Received this in a sanity test run:

/bin/sh -c adb root
adb: unable to connect for root: closed
Error: The process '/bin/sh' failed with exit code 1

Adding a loop around adb root to see if it just needs retrying.
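
A retry loop of that kind could look roughly like this. It is a sketch rather than the change that went into the workflow; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

In the workflow script this would be used as `retry 5 2 adb root`, which fails the step only if all five attempts fail.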

@michaelkaye
Contributor

michaelkaye commented Feb 15, 2022

So there are various manifestations of the unreliability on the runners recently:

  • the emulator starts (adb set-property completes) but then adb root fails a little later [macos]. Occurred in 1 of 9 attempts. I haven't yet seen whether the adb root loop helps (last seen 2022-02-14).
  • the emulator starts, including adb root, but halfway through a test the emulator just dies [ubuntu]. Occurs irregularly. Could be the same symptom as the macos failure above, but it hasn't been seen on macos so far.

@michaelkaye
Contributor

The emulator dying is possibly due to a C++-level failure in (e.g.) the Realm code, which causes a signal 9, which in turn makes the emulator stop responding mid-run. This has become visible now that the logcat logs are available.

So it's possible that a bunch of the errors that we thought were the emulator failing mid-run are actually the tests doing the right thing and highlighting a real code failure.
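
To tell these two cases apart in CI, the saved logcat output can be searched for native crash reports. The sketch below assumes such crashes appear as lines containing `Fatal signal`, which is how Android's crash handler commonly reports signals like SIGSEGV or SIGABRT; a process killed with signal 9 may not leave such a line, so an empty result proves nothing. `find_native_crashes` is a hypothetical helper name.

```shell
# find_native_crashes LOGCAT_FILE
# Prints matching lines with their line numbers.
# Exit status is 0 if at least one match was found, 1 if none.
find_native_crashes() {
  grep -n 'Fatal signal' "$1"
}
```

After dumping the log with `adb logcat -d > logcat.txt`, running `find_native_crashes logcat.txt` lists candidate crash lines to compare against the point where the test run stopped.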

@ariskotsomitopoulos
Contributor Author

Hmm, interesting. I wonder why this is not happening locally.

@michaelkaye michaelkaye removed their assignment Dec 4, 2023