Private VPC cloud runners? #1472

Open
act-mreeves opened this issue Aug 7, 2024 · 3 comments

Comments

@act-mreeves

First of all, apologies: this is going to be a very "AWS-focused" comment.
I was wondering whether there are any plans to support runners in private subnets, or at least a way to specify an Elastic IP.

My core issue is that I want my runner to connect to our MLflow server, which sits behind a security group that only allows access from certain IPs and security groups. I can't use complementary security groups (e.g. allow the runner SG to connect to the MLflow SG on port 443) because the runner EC2 instance is public.
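
For readers unfamiliar with the "complementary security groups" idea, here is a minimal sketch of the rule the author would like to rely on once the runner sits in a private subnet. It assumes boto3's EC2 client and its `authorize_security_group_ingress` call; the function names, group IDs, and description text are hypothetical. Rules that reference a security group generally match traffic arriving from the source instance's private addresses, which is why this does not help while the runner reaches MLflow over a public IP.

```python
def https_from_group_rule(source_group_id: str, port: int = 443) -> list:
    """Build an EC2 IpPermissions list allowing TCP `port` from another security group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # Reference the source group instead of a CIDR range.
            "UserIdGroupPairs": [
                {"GroupId": source_group_id, "Description": "CML runner to MLflow"}
            ],
        }
    ]


def allow_runner_to_mlflow(ec2_client, mlflow_group_id: str, runner_group_id: str) -> None:
    """Add the ingress rule to the MLflow group.

    `ec2_client` is assumed to be the object returned by boto3.client("ec2").
    """
    ec2_client.authorize_security_group_ingress(
        GroupId=mlflow_group_id,
        IpPermissions=https_from_group_rule(runner_group_id),
    )
```

The payload builder is kept separate from the AWS call so the rule can be inspected or unit-tested without credentials.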

I see that cml runner launch uses Terraform, so if you can point me to the correct repo for the runner client and the Terraform generation code, I could try to carry my own water.

Ideally I'd like to see a "private VPC" runner mode where, instead of needing SSH to connect to the runner, we could use aws ssm start-session or some other callback or API, so that no direct network access over the public internet from the GitHub Actions endpoints is required. Is there any reason for this direct network access besides the initial health check?

@0x2b3bfa0
Member

You can probably use cml runner launch --cloud-aws-subnet to choose a subnet in a private VPC:

cloudAwsSubnet: {

See the SDK code we run here.

Your mileage may vary if you intend to use CML without a publicly reachable IP address and SSH server, but it might be possible.

@act-mreeves
Author

> You can probably use cml runner launch --cloud-aws-subnet to choose a subnet in a private VPC:
>
> cloudAwsSubnet: {
>
> See the SDK code we run here.
>
> Your mileage may vary if you intend to use CML without a publicly reachable IP address and SSH server, but it might be possible.

I may be mistaken, but I think I tried that initially, and the GitHub Action never recognized the machine as healthy/ready when it was on a private subnet.
Instead of this success on a public subnet with a security group open to the world:

{"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m20s elapsed]"}
{"level":"info","message":"iterative_cml_runner.runner: Creation complete after 1m25s [id=cml-8bsk91decf-ztewkt24-3nng3tlz]"}

I got:

{"level":"info","message":"iterative_cml_runner.runner: Creation errored after 19m22s"}
{"level":"error","message":"terraform error: Error: Error checking the runner status"}

With Terraform logging set to DEBUG, I saw:

2024-08-06T00:12:58.610Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:12:58 [TRACE] Waiting 10s before next try: timestamp=2024-08-06T00:12:58.609Z
2024-08-06T00:13:10.610Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:10 [DEBUG] Connection errors: &net.OpError{Op:"dial", Net:"tcp", Source:net.Addr(nil), Addr:(*net.TCPAddr)(0xc000afa720), Err:(*net.timeoutError)(0x594aa00)}: timestamp=2024-08-06T00:13:10.610Z
2024-08-06T00:13:10.610Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:10 [TRACE] Waiting 10s before next try: timestamp=2024-08-06T00:13:10.610Z
2024-08-06T00:13:18.832Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:18 [WARN] WaitForState timeout after 19m0s: timestamp=2024-08-06T00:13:18.832Z
2024-08-06T00:13:18.832Z [INFO]  provider.terraform-provider-iterative: 2024/08/06 00:13:18 [WARN] WaitForState starting 30s refresh grace period: timestamp=2024-08-06T00:13:18.832Z
2024-08-06T00:13:19.719Z [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
2024-08-06T00:13:19.722Z [INFO]  provider: plugin process exited: plugin=.terraform/providers/registry.terraform.io/iterative/iterative/0.11.20/linux_amd64/terraform-provider-iterative id=2186
2024-08-06T00:13:19.722Z [DEBUG] provider: plugin exited
{"level":"error","message":"terraform apply error","stack":"Error: terraform apply error\n    at Object.apply (/usr/local/lib/node_modules/@dvcorg/cml/src/terraform.js:55:11)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runTerraform (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:184:5)\n    at async runCloud (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:193:19)\n    at async run (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:433:14)\n    at async exports.handler (/usr/local/lib/node_modules/@dvcorg/cml/bin/cml/runner/launch.js:446:5)"}

So it seems the GitHub Action is doing something to check whether the runner is ready.
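
To tell the two outcomes above apart in a script, the launch output can be scanned for the "Creation complete" and "Creation errored" messages. This is only a sketch based on the log lines quoted in this comment; the function name is hypothetical, and the message wording is taken from this one run and may differ between versions.

```python
import json


def launch_outcome(lines):
    """Return 'complete', 'errored', or 'unknown' for cml runner launch output."""
    outcome = "unknown"
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Terraform DEBUG lines are not JSON; skip them
        if not isinstance(record, dict):
            continue
        message = str(record.get("message", ""))
        if "Creation complete" in message:
            outcome = "complete"
        elif "Creation errored" in message or record.get("level") == "error":
            outcome = "errored"
    return outcome
```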

@0x2b3bfa0
Member

Yes, it is doing something, and it does require being able to reach the EC2 instance's SSH server.
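
A quick way to confirm whether the machine running cml runner launch can reach the instance at all is a plain TCP connection attempt to the SSH port. The sketch below is not the provider's actual check; it only reproduces the kind of dial that timed out in the DEBUG log above. The function name and the default port 22 are assumptions.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unroutable addresses.
        return False
```

From a GitHub-hosted job, a `False` result for an instance in a private subnet would be consistent with the timeout reported above.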

The quickest/hackiest workaround could be to ignore the exit code of cml runner launch completely and use the GitHub/GitLab/... API to wait for (or check whether) the runner has registered correctly. This would eliminate the need for SSH access.
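
A minimal sketch of that workaround for GitHub, assuming the REST endpoint that lists a repository's self-hosted runners (`GET /repos/{owner}/{repo}/actions/runners`) and a token that is allowed to call it. The function names, the label value, and the retry settings are hypothetical, and only the first page of results is inspected.

```python
import json
import time
import urllib.request

API = "https://api.github.com"


def list_runners(repo: str, token: str) -> list:
    """Fetch the first page of self-hosted runners for `repo` ("owner/name")."""
    request = urllib.request.Request(
        f"{API}/repos/{repo}/actions/runners",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["runners"]


def runner_is_online(runners, label: str) -> bool:
    """True if any runner carries `label` and reports status "online"."""
    for runner in runners:
        labels = {item["name"] for item in runner.get("labels", [])}
        if label in labels and runner.get("status") == "online":
            return True
    return False


def wait_for_runner(fetch, label, attempts=60, delay=10.0, sleep=time.sleep) -> bool:
    """Poll `fetch()` until a matching runner is online or attempts run out."""
    for _ in range(attempts):
        if runner_is_online(fetch(), label):
            return True
        sleep(delay)
    return False


# Hypothetical usage after running cml runner launch with its exit code ignored:
#   ok = wait_for_runner(lambda: list_runners("owner/repo", token), "cml-private")
```

Passing `fetch` and `sleep` as parameters keeps the polling loop testable without network access; the same loop could wrap a GitLab API call instead.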
