Remote controllers #105

Closed
wlandau opened this issue Aug 9, 2023 · 5 comments

wlandau (Owner) commented Aug 9, 2023

Similar to mirai's dispatcher, what if crew could run its controller in a separate local R process? This could have significant advantages:

  1. Auto-scaling will no longer need polling or block the local R session.
  2. Certain cloud architectures would become easier. For example: a remote controller on the local machine, an SSH connection to a sentinel EC2 instance which runs the actual controller inside a VPC, and then AWS Batch jobs that act as workers and run inside the same VPC (so ports can be open for mirai).

[Screenshot: diagram of the proposed setup, with a remote controller on the local machine, an SSH connection to a sentinel EC2 instance running the actual controller inside a VPC, and AWS Batch workers inside the same VPC]

Thoughts about the design:

  • A remote controller could have the same interface as a regular controller but forward all its requests (e.g. push() and pop()) over an NNG req/rep abstract/inproc socket to a process where the actual controller is running (see the sketch after this list).
  • The process could use a non-polling mechanism just like mirai: sleep/wait on a condition variable until there is something to do or it is time to exit.
  • We might want to run the controller process locally or over an SSH connection, so there needs to be some modularity/flexibility to expose options for how best to launch it.
  • We should delay (un)serializing objects until they reach their final destination. @shikokuchuo, you recently implemented this at the mirai level. Is it legal/safe to send resolved mirai objects to another process and then initiate the download there?
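A minimal sketch of the forwarding idea from the first bullet, assuming nanonext req/rep sockets over an IPC URL (inproc only works within a single process). The message format, socket path, and dispatch mechanism are hypothetical, not existing crew code:

```r
library(crew)
library(nanonext)

# Process hosting the real controller: block on recv() until a request
# arrives, so no polling is needed.
controller <- crew_controller_local()
controller$start()
rep <- socket("rep", listen = "ipc:///tmp/crew-controller")
repeat {
  request <- recv(rep, block = TRUE)  # e.g. list(method = "push", args = list(...))
  if (identical(request$method, "terminate")) {
    send(rep, TRUE, block = TRUE)
    break
  }
  result <- do.call(controller[[request$method]], request$args)
  send(rep, result, block = TRUE)     # reply with the method's return value (e.g. the task from pop())
}
controller$terminate()
close(rep)

# User session: the proxy "remote controller" forwards push()/pop() over the socket.
req <- socket("req", dial = "ipc:///tmp/crew-controller")
send(req, list(method = "push", args = list(command = quote(1 + 1))), block = TRUE)
acknowledgement <- recv(req, block = TRUE)
close(req)
```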
wlandau self-assigned this Aug 9, 2023
shikokuchuo (Contributor) commented

It seems you want to implement a network / SSH / cloud launcher but have conflated the idea with having your controller run in a separate process.

  • the diagram definitely doesn't help. It just shows a networked worker scenario with an SSH connection. All this, and the permutations of where to put the controller == host and/or dispatcher, are already built into mirai's capabilities
  • it did make me realise that crew currently only has local and cluster launchers but not a network launcher - and this is probably the piece you are missing
  • mirai's launch_remote() launches remote workers on EC2 connecting back to localhost, or a host within the cloud, or a dispatcher instance running in the cloud (see the sketch after this list). It does the job. But it obviously doesn't do what crew does
  • if you want to put controller == host in the cloud and control it locally, using NNG messages to supplant SSH commands is not a good idea
  • if what you actually want is to manipulate SSH commands, there are probably R packages that can help you without having to re-invent the wheel
  • passing task objects unnecessarily through another process is obviously to be avoided. I shouldn't think this prevents you from putting the controller logic in a separate process.
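Roughly, the mirai-level equivalent of that launcher piece is below. Host, port, and SSH details are placeholders, and the launch_remote()/ssh_config() arguments shown follow more recent mirai documentation, so they may differ from the version discussed here:

```r
library(mirai)

# Host session: listen for daemons on a TLS-over-TCP URL the remote machine can reach.
daemons(n = 2, url = "tls+tcp://myhost:5555")

# Launch daemons over SSH so that each one dials back to the host.
launch_remote(
  url = "tls+tcp://myhost:5555",
  remote = ssh_config(remotes = "ssh://user@remote-host")
)
```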

wlandau (Owner, Author) commented Aug 10, 2023

My first choice would be to implement an SSH cloud launcher without running the controller in a separate process. If that's possible, it would be super exciting. But I have trouble seeing how. Workers from the cloud have to dial into the user's local machine, which would mean each worker has to initiate the SSH connection. That would require exposing the local machine's IP address and setting it up like a server, which does not seem secure even with TLS. Even if it is secure, it's an uphill battle with AWS. When you SSH into an EC2 instance, you spin up the instance synchronously and then SSH from the client into the instance. That seems at odds with the host/daemon model we want, and AWS makes it hard or impossible to do this another way.

But if we were to spin up a single sentinel EC2 in advance inside a VPC where ports were exposed inside the barrier, then you could submit Batch jobs or other EC2s that would connect back to the sentinel. The connection from host to sentinel could be initiated by the host. Make sense?
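To make the host-to-sentinel leg concrete, I am picturing plain SSH port forwarding initiated from the host. Host names and ports below are placeholders:

```r
# Forward a local port to the sentinel instance so the host can reach the
# controller/dispatcher socket inside the VPC. The host initiates the
# connection; nothing in the VPC needs to dial the local machine.
system2(
  "ssh",
  c("-N", "-L", "5555:localhost:5555", "ec2-user@sentinel.example.com"),
  wait = FALSE
)
# The host can now dial tcp://127.0.0.1:5555, which tunnels to the sentinel,
# while Batch jobs inside the VPC connect to the sentinel directly.
```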

shikokuchuo (Contributor) commented

Maybe it's the diagram that's confusing me, but it still seems what you need is an SSH launcher.

Assume you're just within the VPC box, or what you deem to be the equivalent - or alternatively within your local corporate network. You want to spin up workers on other machines - you SSH in and run your Rscript command creating your TLS connection back to your localhost. Instead of doing that manually on the command line, I imagine you'd have a crew launcher that does that.
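Manually, that would be something along these lines. Host names are placeholders, and daemon() is the worker-side function in current mirai (named server() in older versions):

```r
# What a crew SSH launcher would automate: log in to the worker machine and
# start a mirai daemon that dials back to the host over TLS.
system2("ssh", c(
  "user@worker-host",
  shQuote("Rscript -e 'mirai::daemon(\"tls+tcp://myhost:5555\")'")
))
```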

Moving back out to a hypothetical AWS case, there are mitigants that mean you could SSH in and TLS back out to your machine. But let's assume you can't get comfortable, and need to enforce a cordon as you describe. You would SSH into a machine within this VPC and simply run the crew launcher from there. If you want to abstract away even this step, then it seems to be a question of manipulating SSH commands rather than creating some kind of NNG solution which doesn't sound secure.

Again, this all seems orthogonal to having your controllers run non-blocking in a background process.

wlandau (Owner, Author) commented Aug 10, 2023

> Assume you're just within the VPC box, or what you deem to be the equivalent - or alternatively within your local corporate network. You want to spin up workers on other machines - you SSH in and run your Rscript command creating your TLS connection back to your localhost. Instead of doing that manually on the command line, I imagine you'd have a crew launcher that does that.

One of my colleagues prototyped this at wlandau/crew.cluster#17. It's a start, but it's completely synchronous: the host sends an API call to start an EC2, then initiates an SSH connection to that EC2 after it starts. This process could take several minutes, depending on the size of the AMI and the instance, and in the meantime other launches etc. are blocked. It would be better if I could set up one end of an SSH connection and return control to the host while the EC2 instance is starting, then allow that instance to asynchronously connect through the tunnel when it is ready. But I am not sure this is possible.
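The synchronous flow looks roughly like the sketch below. The package, function, and argument names here are purely illustrative, not the actual prototype code:

```r
library(paws)
ec2 <- ec2()

# Step 1: request the instance. This call returns quickly.
launched <- ec2$run_instances(
  ImageId = "ami-12345678", InstanceType = "t3.medium",
  MinCount = 1, MaxCount = 1
)

# Step 2: wait for the instance to boot, which can take minutes while the
# host session blocks (e.g. polling describe_instances() until the state is
# "running"), so no other launches or auto-scaling can happen in the meantime.

# Step 3: only then SSH in and start the worker.
system2("ssh", c(
  "ec2-user@new-instance",
  shQuote("Rscript -e 'mirai::daemon(\"tls+tcp://myhost:5555\")'")
))
```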

> Moving back out to a hypothetical AWS case, there are mitigants that mean you could SSH in and TLS back out to your machine.

Any asynchronous ones you know of?

> But let's assume you can't get comfortable, and need to enforce a cordon as you describe. You would SSH into a machine within this VPC and simply run the crew launcher from there.

That's what I was going for with the diagram, except that sending just the launcher seems a bit harder to understand because that would put the mirai aio objects on the host.

> If you want to abstract away even this step, then it seems to be a question of manipulating SSH commands rather than creating some kind of NNG solution which doesn't sound secure.

I guess there might be a way to have the controller and launcher on the host if all the network programming happens automatically through the SSH tunnel. The most similar model I know of is clustermq's SSH connector: https://mschubert.github.io/clustermq/articles/userguide.html#ssh-connector
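For reference, that model looks roughly like this on the user's side (the host name is a placeholder, and the options come from the linked user guide):

```r
# clustermq routes everything through an SSH connector on the remote host;
# the local session only needs password-less SSH access to it.
options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host = "user@login-node"
)
library(clustermq)
Q(function(x) x + 1, x = 1:3, n_jobs = 1)
```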

wlandau (Owner, Author) commented Aug 21, 2023

> I guess there might be a way to have the controller and launcher on the host if all the network programming happens automatically through the SSH tunnel. The most similar model I know of is clustermq's SSH connector: https://mschubert.github.io/clustermq/articles/userguide.html#ssh-connector

Maybe I should learn more about how to do that before I prototype remote controllers.

Repository owner locked and limited conversation to collaborators Aug 21, 2023
wlandau converted this issue into discussion #106 Aug 21, 2023
