
[RFC] Add a few basic integration tests #13

Open
wants to merge 15 commits into main from primitive-integration-testing

Conversation

daniel-noland
Contributor

@daniel-noland daniel-noland commented Jun 5, 2021

I created a few basic integration tests to complement the bindgen auto-generated test suite.

I'm not sure if this is the best approach but it is a place to start.

@daniel-noland
Contributor Author

Leaving this in draft for the moment because I expect it to fail in CI. Working on that now.

@daniel-noland daniel-noland force-pushed the primitive-integration-testing branch 13 times, most recently from 44a2ab0 to 8160f99 on June 5, 2021 at 19:35
@daniel-noland
Contributor Author

I managed to make this compile and run some basic integration tests in CI.

Unfortunately, even the newest version of Focal with group: edge still fails with an illegal instruction error in the loopback test.

I'm promoting this change set from draft because I think it can be meaningfully discussed at this point.

@daniel-noland daniel-noland marked this pull request as ready for review June 5, 2021 19:42
@daniel-noland daniel-noland changed the title from "Add a few basic integration tests" to "[RFC] Add a few basic integration tests" on Jun 5, 2021
@jonhoo
Owner

jonhoo commented Jun 6, 2021

This is a really interesting approach to integration testing on CI, and I'd love to land something like this! I wonder if you're getting bitten by the fact that Rust runs all tests in parallel by default. You may need to pass -- --test-threads=1 to cargo test to get it to work? The illegal instruction is usually just Rust's way of signaling that it panicked somewhere, though not always.
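Concretely, that would be something like:

```bash
# Run the tests one at a time instead of in parallel.
cargo test -- --test-threads=1

# Equivalent, via an environment variable the test harness respects.
RUST_TEST_THREADS=1 cargo test
```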

If you try to break the failing test into many smaller ones that have increasingly larger prefixes of the failing test, what does it point to as the failing operation?

@jonhoo jonhoo mentioned this pull request Jun 6, 2021
@daniel-noland daniel-noland force-pushed the primitive-integration-testing branch 2 times, most recently from b69ae0d to ea82a90 on June 8, 2021 at 04:16
@daniel-noland
Contributor Author

daniel-noland commented Jun 8, 2021

I did as you suggested and I limited the number of concurrent tests to 1.

Regrettably, this did not solve the core problem.

I went ahead and ran the integration tests under Valgrind (somewhat muddling this PR in the process).

The result here seems to indicate that the VM run by Travis CI simply isn't equipped to run all of the instructions in the generated binary.

For what it is worth, this test consistently passes under valgrind on my home machine.

Not sure how to proceed here, as I am not convinced that the test failure is the fault of the test or the library; it seems to be a limitation of the CI environment.

Thoughts?

@daniel-noland
Contributor Author

I may try setting up an integration test suite on an AWS free tier VM. Then we could run the integration tests during development and before cutting a release.

I have one ConnectX-6 VPI card and a bunch of ConnectX-5 VPI cards as well. I may also have some ConnectX-3 cards lying around. I think I also have at least one iWARP-capable NIC in a server somewhere. Thus, I could set up a containerized test and run some integration tests on real hardware every once in a while as well.

I also think AWS Elastic Fabric Adapter supports RoCE (although I have not used it yet). I don't think the AWS machines with EFA are freely available, though.

@daniel-noland daniel-noland force-pushed the primitive-integration-testing branch from ea82a90 to 07a5c7a on June 8, 2021 at 15:15
@jonhoo
Owner

jonhoo commented Jun 8, 2021

Hmm, that's interesting. I'd be curious to see if moving to something like Azure might let us do what's needed? Alternatively, we may have to run our own CI infrastructure on a host that supports the required instructions, though that quickly becomes expensive and complex. Hmm...

@daniel-noland
Contributor Author

Are you open to adjusting the way CI works for this project?

I was thinking I would generate an AMI with the linux kernel config necessary for SoftRoCE (I checked and the stock community AMIs don't have it as far as I can see).

Then I can set up something like Terraform Cloud to spin up a spot instance of that AMI on PR updates to this repo. It can be a t2.micro spot instance, so it can be really inexpensive (or more likely free). That EC2 instance can just run the existing ci.sh script to spin up a SoftRoCE device and run things like the loopback test in this PR. I think it should work fine.
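For concreteness, the SoftRoCE part of that script would amount to something like this (interface and device names are placeholders, and it assumes the AMI's kernel was built with CONFIG_RDMA_RXE):

```bash
# Load the rxe driver and attach a SoftRoCE device to the instance's NIC.
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev eth0

# Sanity check that the device shows up, then run the integration tests.
rdma link show
cargo test -- --test-threads=1
```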

That also opens the possibility of spinning up SoftRoCE devices on a Geneve tunnel connecting two different EC2 instances in the same VPC. That would let us do a "host to host" test (I suggest the Geneve tunnel because I'm not 100% sure how AWS will respond to Ethernet frames with RoCEv2's EtherType coming out of a system which isn't "supposed" to be emitting them. If you wrap it in a point-to-point Geneve tunnel then it will be invisible to AWS).
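A rough sketch of that tunnel setup, run on both instances with the peer address swapped (all addresses and names here are made up):

```bash
# Point-to-point Geneve tunnel to the other instance's private IP.
sudo ip link add gnv0 type geneve id 100 remote 10.0.1.20
sudo ip addr add 192.168.100.1/24 dev gnv0
sudo ip link set gnv0 up

# Put the SoftRoCE device on the tunnel interface rather than the raw ENI,
# so everything AWS sees is ordinary UDP-encapsulated Geneve traffic.
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev gnv0
```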

I have a partial implementation of this plan now. I generated a machine image with the necessary kernel, iproute2, and rdma-core. The entire machine image build process is containerized, so it should work on any machine with Docker. I just need to convert it to an AMI with Packer and set up some simple Terraform to deploy it.

I'm completely happy to do all of that (I really like rust and I really like rdma and this project is of considerable interest to me).

I just don't want to do it if that isn't the direction you want to take things. I am open to alternatives; I am just running out of simple ideas and I guess I'm reaching for somewhat more complex ones at this point 🤷.

@jonhoo
Owner

jonhoo commented Jun 19, 2021

Sorry again for taking so long to get back to you!

I am open to changing CI, though I worry about having to maintain dedicated CI infrastructure because I know that realistically I won't be able to stay on top of that. If you'd be willing to host it though, I would be super happy to point the repo at it! I'm also looking for maintainers for this crate since I'm not using it myself, and you'd make a good candidate if you're also responsible for the integration tests!

On your proposal more generally, I think the "right" CI for this is indeed dedicated CI. I do think there's value in also having a test suite that users can reasonably run locally, so we should find ways of mocking or using SoftRoCE where possible, but some end-to-end testing is extremely valuable. You'll have to be mindful of abuse once you're hosting your own CI though. Chances are someone will come along with a PR that adds a bitcoin mining script or something, and suddenly your EC2 use goes through the roof 😅

@daniel-noland
Contributor Author

No worries on the response time; after all, I didn't get back to you promptly either :)

Sorry about that, work has been a challenge lately.

I agree with your thoughts regarding the CI. I'm actually working on a solution to that problem in the background a bit. The big problem so far is that I'm not having great luck with SoftRoCE. I can make a lot of test cases pass on my hardware but those tests fail every time (or sometimes ~70% of the time) with the SoftRoCE device swapped in. Not really sure why honestly.

In any case, you are right about the danger of hosting your own CI (especially if it is running on a powerful VM or host). Will need to think about that carefully. Maybe only run basic tests in something like Travis CI and then run a full integration test on hardware after the PR is reviewed (and doesn't obviously run a mining rig or a botnet or something).

As for maintaining the crate, I would be pleased and honored to do so. I think this is a pretty solid foundation with a lot of room for improvement. I also use RoCE at work so this is, at minimum, a good excuse for me to study it further :D

Daniel Noland and others added 14 commits February 6, 2022 15:29
I created a few basic integration tests to complement the bindgen auto-generated test suite.
Trying @jonhoo's suggestion of reducing the test thread count to make CI run the integration tests.
I am concerned that the test failure is transient.  That is very irritating.
Trying to parse test output more cleanly.

Trying with sudo now.
WIP patch.  This is an incomplete attempt at building an AMI which can
run integration tests.
I still need to make the process automatic and I still need to ensure
that the integration tests actually pass on the generated VM.
We were generating multiple images with the same /etc/machine-id parameter.  This causes multiple machines to get the same MAC address in virtual systems which is very annoying.  Just remove /etc/machine-id in the build and it will be generated automatically on first system boot.
@rajachan

rajachan commented Jul 5, 2022

On your proposal more generally, I think the "right" CI for this is indeed dedicated CI. I do think there's value in also having a test suite that users can reasonably run locally, so we should find ways of mocking or using SoftRoCE where possible, but some end-to-end testing is extremely valuable.

FWIW, SoftRoCE (the rxe driver in the RDMA subsystem) is not limited to a single node. SoftRoCE encapsulates IB packets inside UDP, so you can run it across instances within a VPC just fine. From a CI/integration testing perspective for this project, rxe gets you all that you need for end-to-end testing.
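As a sketch (placeholder names and addresses), each instance just needs an rxe device on its VPC-facing interface and the RoCEv2 UDP port reachable between them:

```bash
# On each instance: attach SoftRoCE to the primary ENI.
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev eth0

# The security group needs UDP 4791 (RoCEv2) open between the instances.
# ibv_rc_pingpong from rdma-core makes a quick cross-node smoke test:
ibv_rc_pingpong -d rxe0 -g 0            # on the server instance
ibv_rc_pingpong -d rxe0 -g 0 10.0.1.20  # on the client, pointing at the server
```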

Now, if we want to test libibverbs on a more "real" RDMA NIC, you could run with the AWS EFA provider natively (on the "n" instance types that support 100 Gbps networking and EFA), but that might be overkill for your needs here. AWS does not support RoCE, but it does use libibverbs as the interface to work with EFA.

What troubles did you have with rxe? Happy to help debug the RDMA side of things. I am not a Rust expert or even a novice, so can't help with Rust<->C API binding issues, but I'm guessing you all have that covered.

@rajachan

rajachan commented Jul 5, 2022

Also, with Travis having shuttered free-tier support for OSS projects, do you want to move over to GitHub Actions or the like?

@daniel-noland
Contributor Author

@rajachan

I actually gave that exact thing a pretty good effort a little while back (you can see the record of my flailing attempt here).

The problem I ran into was that the VMs spun up by GitHub Actions are running a Linux kernel (and modules) signed by Microsoft.

Unfortunately the rxe (SoftRoCE) module is not provided by default, nor could I find it anywhere in the repos they provide.

I tried compiling it out of band and loading it via insmod but I have yet to make that work.
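For the record, the out-of-band attempt was roughly the following (paths are illustrative, and it assumes a source tree matching the runner's kernel, which is the part I haven't gotten right):

```bash
# Build only the rxe module against the (hypothetical) matching source tree.
make -C /path/to/linux-source M=drivers/infiniband/sw/rxe modules

# insmod does not resolve dependencies, so load the RDMA core and UDP
# tunnel modules first, then insert the freshly built rxe module.
sudo modprobe -a ib_core ib_uverbs udp_tunnel ip6_udp_tunnel
sudo insmod /path/to/linux-source/drivers/infiniband/sw/rxe/rdma_rxe.ko
```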

I actually have a (perhaps too ambitious) plan to work around this limitation using Kata containers with a custom-compiled Linux kernel (with rxe compiled in).

You can see some work I did in this direction here: https://github.com/daniel-noland/docker-in-kata

The basic idea looks like:

  1. make the CI (possibly GitHub Actions) run in response to a PR/branch (as normal)
  2. the runner compiles rust-ibverbs (as normal)
  3. the runner launches two or more Kata containers connected to a common Linux bridge, with the compiled rust-ibverbs binary volume-mounted in
  4. the runner then checks that the rust-ibverbs test suite passes using the rxe interfaces in the containers.

This plan has the upshot of allowing for proper integration testing in addition to checking the functionality of the bindings.
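To make step 3 above a bit more concrete, here is a very rough sketch; the network name, runtime name, and image are all assumptions, and the image would need iproute2/rdma-core installed:

```bash
# Two Kata containers on a shared Docker network, with the compiled test
# binaries volume-mounted in read-only.
docker network create rxe-net
for i in 1 2; do
  docker run -d --privileged --name "node$i" --runtime kata --network rxe-net \
    -v "$PWD/target/debug:/tests:ro" ubuntu:22.04 sleep infinity
done

# Each Kata guest boots the custom kernel with rxe built in, so a SoftRoCE
# device can be created on the container's interface and the mounted tests
# run against it.
docker exec node1 rdma link add rxe0 type rxe netdev eth0
docker exec node2 rdma link add rxe0 type rxe netdev eth0
```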

That said, this is a pretty involved thing to set up.

If you (or anybody) have a simpler plan I'm all about it. I have a bunch of the components of this plan working but it really is no small thing to make it all work together.

@daniel-noland
Contributor Author

Also, @rajachan

What troubles did you have with rxe? Happy to help debug the RDMA side of things. I am not a Rust expert or even a novice, so can't help with Rust<->C API binding issues, but I'm guessing you all have that covered.

Would you know about the state of rxe (or RDMA in general) and network namespaces? I have read the docs (man rdma dev), but it seems that I need to pass the kernel some boot-time parameter to get it to namespace RDMA devices from very early on. The manual suggests that this configuration is possible at runtime, but after an RDMA device has been created I no longer seem to be able to put the kernel into a mode where RDMA devices are permitted to move between namespaces.

This limitation has persistently foiled my attempts to containerize the test suite for this project.
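For anyone following along, the sequence I keep getting stuck on looks roughly like this (device and namespace names are placeholders):

```bash
rdma system show                      # reports the current netns mode (shared by default)
sudo rdma system set netns exclusive  # the step I can no longer perform once an
                                      # RDMA device has already been created
sudo ip netns add rdma-test
sudo rdma dev set rxe0 netns rdma-test
```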

@rajachan

rajachan commented Jul 6, 2022

I've had success bringing up RDMA devices in a container namespace and running datapath tests on them. It has been a while, but some of the key things I had to plumb through via the container runtime were: exposing the uverbs command device from /dev into the container to allow control-path commands to go through, and setting up (or rather removing) locked-memory limits for the container to allow pinning/mlock()'ing of pages that get registered via ibv_reg_mr().
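With Docker, for instance, that plumbing looks roughly like this (a sketch rather than my exact invocation; device paths vary by system):

```bash
# Expose the verbs command devices and lift the locked-memory limit so that
# ibv_reg_mr() can pin pages from inside the container.
docker run --rm -it \
  --device /dev/infiniband/uverbs0 \
  --device /dev/infiniband/rdma_cm \
  --ulimit memlock=-1 \
  ubuntu:22.04 bash
```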

For the rxe driver, I know the Amazon Linux AMI did not have it, but we might be able to get it to work with Canonical's newer Ubuntu AMIs? I can take a look at the GitHub Actions VM environments and see if I have any success. Will find some time in the coming week to look into it.

jonhoo pushed a commit that referenced this pull request Jan 19, 2024
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>