Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: libp2p implementation interoperability dashboard on various dimensions. Uses testplans-dsl. #85

Closed
wants to merge 9 commits into from

Conversation

MarcoPolo
Copy link
Contributor

@MarcoPolo MarcoPolo commented Nov 30, 2022

This PR adds interop testing across various dimensions (transports, muxer, secure channel) against the Go and Rust libp2p implementations. This is the foundation work on this project, and enables adding more libp2p implementations to the versions.ts file. This also renders the result into a markdown file to give an overview of the test results. Example result here: https://gist.github.com/MarcoPolo/965e1f6fd16df004173131199444c62d

Recommendation for reviewer:

  1. Look at ping-with-dsl/testplans.ts first to get an idea of what the DSL looks like.
  2. Review the DSL implementation in dsl/.
  3. Look at the multidimensional testing:
    a. The version.ts file defines the capabilities of each version.
    b. This file is consumed by the composer and loaded into an in-memory sqlite table.
    c. The table is queried to produce the various tests we run in the form of the test-plans dsl.
    d. The DSL is then transformed to the testground files and then run.
    Feedback appreciated!

@@ -0,0 +1,5 @@
image.json: Dockerfile main.go go.mod
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term, I see CI building these images, uploading them to some container registry, then committing this json file directly into the repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share the same vision apart from the committing to the repo part. Can you elaborate why do you think that is necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I don't quite understand the purpose of the JSON files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why do you think that is necessary?

Depends on our use case, but maybe for this it's not worth it. It would be nice if you could pull the images built in CI if your code doesn't affect these things. It's like using it as a cache.

In general, I don't quite understand the purpose of the JSON files.

It's the simplest way to record the container image id. Since it's json we can import it in our typescript code directly as opposed to reading and parsing a file manually.

ping-with-dsl/testplans.ts Outdated Show resolved Hide resolved
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this. I am in favor of the simpler API.

Do I understand correctly that the (automated) steps now are:

  1. User builds images resulting in images and image.json files.

  2. Composer generating test plans based on versions.ts which itself references above image.json files. It does so by loading the information into a sqlite database, querying it and returning the test matrix.

  3. User runs npm run tests which triggers the dsl package.

  4. The dsl package takes the generated test matrix, transforms it into a file format that Testground understands, sets up the symlinks, and then calls the Testground binary.

  5. From here Testground takes over, itself walking through its various abstraction layers.

I might be missing something. From the above, I am not sure introducing yet another layer of abstraction (dsl) is the way forward.

I understand the strategy of creating a simple API (dsl) on top of the much more complex Testground API first, and only then iteratively simplifying the Testground API. I am afraid that we do the former but never get to the latter, resulting in yet another layer.

In my eyes this project is important but not urgent. At this point I think we know of multiple changes (no symlinking, prebuild images, ...) that we want inside of Testground for sure.

Instead of adding another layer (dsl) what do you think of pushing these changes upstream (Testground) right away?

I am sure I don't have the full picture at this point. I suggest we discuss this in our 1on1 on Monday @MarcoPolo and then follow up here.

ping-with-dsl/testplans.ts Outdated Show resolved Hide resolved
@MarcoPolo MarcoPolo force-pushed the marco/testplans-dsl branch 3 times, most recently from 94f4390 to 0c5cd29 Compare December 8, 2022 17:10
@MarcoPolo
Copy link
Contributor Author

In my eyes this project is important but not urgent. At this point I think we know of multiple changes (no symlinking, prebuild images, ...) that we want inside of Testground for sure.

I think this project is urgent in that we've committed to doing this this year, and adding more time may lead to a rabbit hole rather than solving the actual problem of creating this multidimensional test cases.

Instead of adding another layer (dsl) what do you think of pushing these changes upstream (Testground) right away?

I want to try this API first before committing to it. I also don't want to block on first changing testground before being able to advance with this project.

@MarcoPolo
Copy link
Contributor Author

This is ready for review. I'll keep iterating on making the CI a bit better, but this does run in CI.

Recommendation for reviewer:

  1. Look at ping-with-dsl/testplans.ts first to get an idea of what the DSL looks like.
  2. Review the DSL implementation in dsl/.
  3. Look at the multidimensional testing:
    a. The version.ts file defines the capabilities of each version.
    b. This file is consumed by the composer and loaded into an in-memory sqlite table.
    c. The table is queried to produce the various tests we run in the form of the test-plans dsl.
    d. The DSL is then transformed to the testground files and then run.

@MarcoPolo MarcoPolo marked this pull request as ready for review December 8, 2022 18:37
@MarcoPolo
Copy link
Contributor Author

@mxinden / @laurentsenta Could I get a review here when you get a chance please :)

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MarcoPolo for pushing this forward.

Do I understand correctly that:

  • This would entirely replace the ping/ folder? I would be in favor of that.
  • ping-with-dsl/ is only here to showcase the dsl and would be removed before merging as well?

multidim-interop/go/v0.22/Makefile Show resolved Hide resolved
multidim-interop/go/v0.23/Makefile Show resolved Hide resolved
multidim-interop/rust/manifest.toml Outdated Show resolved Hide resolved
multidim-interop/versions.ts Show resolved Hide resolved
.github/workflows/run-testplans-dsl.yml Outdated Show resolved Hide resolved
dsl/src/testground-runner.ts Show resolved Hide resolved
]
}
]
}
Copy link
Collaborator

@laurentsenta laurentsenta Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're creating a DSL / configuration abstraction above Compositions:

This PR does two things: move the build step out of Testground and introduce a new configuration format.

The new configuration looks MUCH simpler because it is: we removed the essential complexity of the building configuration.

With a regular composition (no change to Testground required), this is what the same workflow would look like:
https://gist.github.com/laurentsenta/bde72a8ad2e8cfdcecfb3add75338a44

Turned into yaml which is more familiar than toml:
https://gist.github.com/laurentsenta/f85b2249b1598586a01071b61862e68a

I would recommend not going for a complete "DSL" abstraction yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a regular composition (no change to Testground required), this is what the same workflow would look like:

I know this is an example, but this doesn't work today. You still need a plan and a test case. These are vestiges that I think unnecessarily complicated.

If we want to iterate to something closer to this DSL (minimal config) and change this DSL later, that's fine by me. But I don't want to block this work on first changing testground.

Copy link
Contributor Author

@MarcoPolo MarcoPolo Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple other things when comparing this to a regular composition file:

  • The DSL leverages typescript to typecheck the runtime params
  • The composition file has the idea of "groups" and "runs" as separate things. This is vestige of having testground build things. But really we only need "runs" here.

Here are both configs in JSON: https://gist.github.com/MarcoPolo/e4488ce57c58469298918af896ded192. This is subjective, but I find the DSL version simpler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is an example, but this doesn't work today. You still need a plan and a test case. These are vestiges that I think unnecessarily complicated.

The example would work, except for a typo with artifacts maybe.
I'm confused because we could do:

params -- (generator) --> composition

instead of:

params -- (generator) --> "dsl" -- (transformer) --> composition
  • typing is supported just as well, type the boundaries of your generator, you could even have fun with generics there,
  • the main change in the DSL is that we're going to duplicate every group, so code generation will be the default, no one is going to manually edit this file.
  • there's nothing to block for in Testground.

It seems more efficient. Let's have a chat async on why we can't do this and why we call this a DSL instead of a new configuration format, Feel free to resolve this whenever you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example would work, except for a typo with artifacts maybe.

At the very least you currently need a manifest.toml that defines a the plan name, defaults, builders, runners, and testcases.

the main change in the DSL is that we're going to duplicate every group, so code generation will be the default, no one is going to manually edit this file.

I don't follow, sorry.

there's nothing to block for in Testground.

What's blocking now?

why we call this a DSL instead of a new configuration format,

A rose by any other name. Happy to call this a new configuration format.

@laurentsenta
Copy link
Collaborator

laurentsenta commented Dec 9, 2022

Thanks for sharing and pinging for review @MarcoPolo,
I left a few comments on dropping the builds and the configuration part,

The rest looks great, especially that move from templates to ts generation!

(created features requests from that PR in testground/testground#1546 feel free to add what I missed).

@MarcoPolo
Copy link
Contributor Author

This would entirely replace the ping/ folder? I would be in favor of that.

Not yet. I think that would be a great next step, but that's not a goal right now.

ping-with-dsl/ is only here to showcase the dsl and would be removed before merging as well?

I wasn't planning on removing it since it's useful as a stepping stone to the DSL, but I can remove it if requested.

@MarcoPolo
Copy link
Contributor Author

I think some blockers right now for this are:

There's a question of whether we should merge the DSL changes into testground first. I think our goal here is to have a straightforward interop testing system. To accomplish this we are only blocked on the things above. There's no reason to block on making the upstream changes. This doesn't mean that we will never upstream these changes. Likely other users of testground want a simpler configuration. But we should prioritize the things we all agreed are important in the roadmap first.

@MarcoPolo MarcoPolo changed the title Example test-plans DSL for testground feat: libp2p implementation interoperability dashboard on various dimensions. Uses testplans-dsl. Dec 9, 2022
@p-shahi
Copy link
Member

p-shahi commented Dec 12, 2022

@Jorropo caught a bug in go-libp2p v0.24.0 where a node configured to use Noise could dial a node configured to use TLS (libp2p/go-libp2p#1946)

In a subsequent pull request, we should extend the test cases in here to test the non happy-path (either as a part of the same github action as these or as a separate one that runs as a regression test in a nightly job.)

Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction this is going, left a couple of comments :)

Comment on lines +45 to +92
// Lots of duplication because I, Marco, couldn't figure out how to make the type system happy.
let boxed_transport = match transport_param.as_str() {
"tcp" => {
let builder = TcpTransport::new(GenTcpConfig::new())
.upgrade(libp2p::core::upgrade::Version::V1Lazy);

let authenticated = match secure_channel_param.as_str() {
"noise" => builder
.authenticate(libp2pv0490::noise::NoiseAuthenticated::xx(&local_key).unwrap()),
_ => panic!("Unsupported secure channel"),
};

match muxer_param.as_str() {
"yamux" => authenticated
.multiplex(yamux::YamuxConfig::default())
.timeout(Duration::from_secs(5))
.boxed(),
"mplex" => authenticated
.multiplex(mplex::MplexConfig::default())
.timeout(Duration::from_secs(5))
.boxed(),
_ => panic!("Unsupported muxer"),
}
}
"ws" => {
let builder = WsConfig::new(TcpTransport::new(GenTcpConfig::new()))
.upgrade(libp2p::core::upgrade::Version::V1Lazy);

let authenticated = match secure_channel_param.as_str() {
"noise" => builder
.authenticate(libp2pv0490::noise::NoiseAuthenticated::xx(&local_key).unwrap()),
_ => panic!("Unsupported secure channel"),
};

match muxer_param.as_str() {
"yamux" => authenticated
.multiplex(yamux::YamuxConfig::default())
.timeout(Duration::from_secs(5))
.boxed(),
"mplex" => authenticated
.multiplex(mplex::MplexConfig::default())
.timeout(Duration::from_secs(5))
.boxed(),
_ => panic!("Unsupported muxer"),
}
}
_ => panic!("Unsupported"),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is, do we want to have a node that only speaks the requested configuration or should it just be compatible with it.

Even in the first case, the easier option here is probably to use OptionalTransport and OptionalUpgrade. That allows you move the conditional code into a value that you can compose as if it were unconditional!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want something that only speaks the requested configuration. That way we are sure we are testing the requested configuration and not accidentally making something else work.

This code is definitely hacked together. I'll take a look at the OptionalTransport, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jxs Do you think you can help me clean this up? You'll certainly do this faster than me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to using an EitherUpgrade

import { Version } from "../versions";


export async function buildTestplans(versions: Array<Version>): Promise<TestPlans> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a test for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. This whole thing is the test, isn't it?

@@ -0,0 +1,5 @@
image.json: Dockerfile main.go go.mod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share the same vision apart from the committing to the repo part. Can you elaborate why do you think that is necessary?

await fs.mkdir(dir)
await fs.mkdir(path.join(dir, "testplans"))

const filesToCreate = transform(testplans)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within transform we assume that all containers have been built already (which they are currently via the Makefile). What do you think of moving the building of the images also into here?

I am asking because a next step for this could be to pull images instead of building them but that would probably be a bit clunky to express in the Makefile so it might be better done in here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testground/testground#1522 is what I am aiming at!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best if we don't try to build here. I think that's a sizable bit of complexity in testground. The Makefile is just one strategy to build the container image. Using Nix can be another. Maybe we want to use warpforge at some point.

It's nice that this just has to reference some existing image id, and then it'll work.

multidim-interop/versions.ts Outdated Show resolved Hide resolved
@@ -0,0 +1,5 @@
image.json: Dockerfile main.go go.mod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I don't quite understand the purpose of the JSON files.

ping-with-dsl/testplans.ts Outdated Show resolved Hide resolved
ping-with-dsl/testplans.ts Outdated Show resolved Hide resolved
@MarcoPolo
Copy link
Contributor Author

MarcoPolo commented Dec 30, 2022

@thomaseizinger I haven’t started investigating this issue yet, but you might have a lot more context here and can probably help me out a ton. I’m seeing flakiness in the Rust tests. This shows up mostly in the rust x go, but also a rust x rust test (https://github.com/libp2p/test-plans/actions/runs/3798668092 scroll to summary below). And has shown up before this test here.

At first I suspect it’s a testground networking issue, but it’s curious that it doesn’t affect the Go side. One total guess is that the rust test isn’t properly waiting for the network to be ready.

edit:

this is due to a race condition in the Go test that wouldn't actually wait until it got all connections before emitting "connected". Fixed now.

)
await fs.writeFile(path.join(dir, ".env.toml"), `
[daemon.scheduler]
task_timeout_min = 2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better for this to be configured in the DSL rather than hardcoded here.

AND ha.hash < hb.hash
-- quic only uses its own muxer/securechannel
AND a.transport != "quic"
AND a.transport != "quic-v1";`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also applies to webtransport, doesn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. Although the quic-v1/webtransport vs quic/webtransport change is why I haven't added webtransport yet here.

@marten-seemann
Copy link
Contributor

ping-with-dsl/ is only here to showcase the dsl and would be removed before merging as well?

I wasn't planning on removing it since it's useful as a stepping stone to the DSL, but I can remove it if requested.

I think I share @mxinden's question here. The multidimensional configuration looks, whereas ping-with-dsl contains the ping-with-dsl/testplans.ts file, which in my understanding should be an intermediate step and not committed to VCS (I might be misunderstanding things here).

@MarcoPolo Can you clarify the relationship between ping-with-dsl and multidim?

@MarcoPolo
Copy link
Contributor Author

Lets get rid of ping-with-dsl. It was mostly used as an example for now and I think it complicates things.

@MarcoPolo
Copy link
Contributor Author

@laurentsenta / @mxinden CI is passing and I've done a pass at some cleanup and addressed comments. Could I get another review here please?

In a follow up PR I'm going to:

  • Add the distinction of A dials B and B dials A.
  • Add a test where A dials A.

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the missing go-libp2p and rust-libp2p versions, this looks good to me.

@jxs in case you have some capacity, can you help with the outstanding refactorings in the Rust transport construction? If not, I am happy to jump in.

@@ -50,6 +53,7 @@ runs:
shell: bash

- name: Run the daemon or configure the client
if: ${{ inputs.start_testground == 'true' }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, with this merged, nothing would require this to be true anymore, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing ping tests use the run-composition GH action which depends on this being true.

The multidim tests starts its own testground daemon so that we can control the home location and the .env.toml file. In this case we don't want the setup-testground action to start the testground daemon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing ping tests use the run-composition GH action which depends on this being true.

Do I understand correctly that long term that GH action would go away? Do I also understand correctly that we can not yet remove it here given that that would break CIs in rust-libp2p and go-libp2p?

For the sake of cleaning it up right away, would that breakage of our CI be that bad?

Copy link
Contributor Author

@MarcoPolo MarcoPolo Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that long term that GH action would go away?

I wonder if there's still value in these existing ping tests. We get most of the value from them in the new multidim interop tests. The other value is having a fast test that checks if current master works against the other implementations. This GH action would go away when we stop using the ping tests. Right now there's still some value in the ping tests that isn't captured by this PR.

Do I also understand correctly that we can not yet remove it here given that that would break CIs in rust-libp2p and go-libp2p?

Yeah, and we wouldn't have anything to replace them.

For the sake of cleaning it up right away, would that breakage of our CI be that bad?

We don't have a plan right now to replace the existing ping tests. I'd rather focus on replacing it and then update the other repo's CI rather than breaking CI and then forcing us to fix it in a reactive way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other value is having a fast test that checks if current master works against the other implementations.

That would as well happen via the multidimensional tests, no?

Right now there's still some value in the ping tests that isn't captured by this PR.

What value would they provide long term?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would as well happen via the multidimensional tests, no?

Long term yes. Right now, no.

What value would they provide long term?

Long term, we should get rid of them. Right now though, I think the priority is to have the multidim interop tests. Replacing the ping tests imo is lower priority.

What the ping test gives us that multidim doesn't right now:

  • CI testing in the impl repo.
  • Interop testing with Nim (I haven't spent any time getting Nim into the multidim)

Long term, multidim will do this and we can get rid of the ping tests. I would save that for a future PR rather than add to this one. And I would avoid removing the existing ping tests until we're ready to replace them.

@MarcoPolo
Copy link
Contributor Author

MarcoPolo commented Jan 4, 2023

I'm a bit worried on the scalability of testground for this. Looking at the last run: https://github.com/libp2p/test-plans/actions/runs/3833933397/jobs/6525831766 each test took and average of 5.5 seconds.

I can reduce the time a bit here by only doing a single ping instead of the two currently. and there are probably some other optimizations I can make. What would be really nice is to keep the simplicity of two instances where A pings B (not a test where we have many instances all pinging each other), have these run faster, and maybe have these run in parallel. What can we do to reduce the 5.5 seconds? I'm not sure why this ping test would take more than a second to complete. So my guess here is that there is some overhead with testground for each test.

exit 1
else
exit 0
fi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That check might break silently if we change the dashboard format, we could use a different exit code in renderResults to mark the case "the render completed, but it contains an error".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer this to a future change.

@Jorropo
Copy link

Jorropo commented Jan 5, 2023

caught a bug in go-libp2p v0.24.0 where a node configured to use Noise could dial a node configured to use TLS
@p-shahi

No the dial didn't worked, the bug I caught was the error was shadowed (so it was returning a nil error even tho it failed.)

@p-shahi
Copy link
Member

p-shahi commented Jan 10, 2023

closing in favor of #97

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants