
More options for using AiiDA with locked-down supercomputers #3929

Open
ltalirz opened this issue Apr 14, 2020 · 24 comments

@ltalirz
Member

ltalirz commented Apr 14, 2020

In cases where HPC centers don't offer access via SSH keys (e.g. requiring 2-factor authentication), the only way to use AiiDA currently is to install it on the cluster (which is possible and has become a lot easier since the introduction of the aiida-core and aiida-core.services conda packages).

However, there are also alternative routes we could explore, which I list below such that they don't get lost:

  1. opening an SSH connection once and reusing it (e.g. using ControlPersist yes and ControlPath ~/.ssh/cm_socket/%r@%h:%p in the ~/.ssh/config file). This is currently not supported by paramiko, but there are alternative Python bindings, such as ssh2-python and parallel-ssh, that we could look into.

  2. the team at NERSC has developed a small set of scripts called sshproxy that grants temporary access to SSH keys on the cluster side. That is, you authenticate once, then something on the cluster "enables" your key for a period of time (say 24h), after which the key is disabled again. It seems they haven't put it on their GitHub yet, but if asked they might be fine with open-sourcing it. Of course, this route would always require action from the cluster administrator.
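As a sketch of option 1 (the host name and user are placeholders, and the socket directory must exist), the relevant `~/.ssh/config` entry would look something like:

```
Host hpc
    HostName cluster.example.org
    User myuser
    ControlMaster auto
    ControlPath ~/.ssh/cm_socket/%r@%h:%p
    ControlPersist yes
```

With this in place, the first `ssh hpc` authenticates interactively and opens a master connection; subsequent `ssh hpc` invocations reuse its socket without re-authenticating. Note this only works for the OpenSSH client - paramiko ignores these options, which is the limitation described in option 1.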

P.S. We might in any case want to look into ssh2-python for performance reasons. See also here and here for a comparison with paramiko.

Mentioning @sphuber for info

@ConradJohnston
Contributor

Just to comment - I think this could become an increasingly important issue if current trends continue. There was a security incident affecting Tier 1 and 2 HPC machines in the UK this week, and it looks possible that SSH-only access may no longer be an option. If these sorts of highly targeted (and possibly state-sponsored?) attacks continue, it seems highly likely that many more HPC machines will be locked down with 2FA and similar measures. This is quite a significant threat to the usefulness and impact of AiiDA, if not addressed.

@sphuber
Contributor

sphuber commented May 15, 2020

I fully agree with @ConradJohnston that we are likely to see more centres adopting 2FA rather than fewer, and this potentially poses a great problem for us. The model may have to shift to running AiiDA on the cluster, but this will likely face resistance from admins as well, given the database and AMQP services we require. In addition, AiiDA would lose one of its strong suits: making it easy to target multiple compute resources from one central machine.

The options mentioned by @ltalirz are certainly worth looking into, but they have serious restrictions. One of the nice mechanisms of AiiDA 1.0 is that it can gracefully recover from temporary connection problems. By going for persistent connections that have to be manually authorized, we severely undermine the automation of AiiDA. It will require a lot more human interaction, which is clearly undesirable.

Option number 2 runs the risk that we will end up with many custom solutions for the various centers if a standard does not emerge over time. This will likely make the configuration of machines even more complicated and burdensome for users (never mind the development overhead for the AiiDA team).

I guess what I am saying is that we should start addressing these issues with the big computational centers themselves. If we think that operational schemes and tools like AiiDA are going to become more and more common, we should reach out to them to make them aware of these use cases, so they can estimate what the impact of their changes might be. @ConradJohnston, if you would be willing to help us out as the AiiDA ambassador and envoy for the UK 😉 that'd be great.

@ltalirz
Member Author

ltalirz commented May 15, 2020

I was just about to write to the AiiDA mailing list concerning this.
We'll need to wait until the reports are out, but it may indeed be that European rather than US HPC centers were attacked because US centers have already adopted 2FA. In that case there will likely be a strong push for it over here as well.

Archer mentions you will need "SSH key and a password" https://www.archer.ac.uk/status/
Not sure what this means - if it is a password-protected SSH key one could still work with ssh-agent. Let's see...

Also pinging @giovannipizzi for info

@ConradJohnston
Contributor

@ltalirz So previously it was the case that you could access the ARCHER service with just a password. You could have then installed your own SSH key to be able to use AiiDA. My understanding is that going forward, and also for ARCHER2, you will now need both an SSH key (password protected or not) and to enter a password in the shell.

@sphuber - Happy to help on that front! However, if this is going to be an ongoing issue across different HPC services, we should perhaps draft a standard letter outlining what AiiDA is and what the problem is. Otherwise, the risk is that it's dismissed as a niche use case.

@aturner-epcc

I was pointed at this thread by a researcher who wants to use AiiDA on ARCHER following the changes to access mechanisms. I think that we (by "we", I mean the community of HPC professionals, RSEs, tool developers and researchers) all need to work together to find solutions that let people use these tools while maintaining security on HPC systems. It is my opinion that the use of 2FA (likely with TOTP solutions) is going to become much more widespread on HPC systems.

One solution to this issue that some US centres have used is to provide dedicated workflow nodes (see, for example: https://docs.nersc.gov/jobs/workflow/workflow_nodes/). I appreciate that this does not allow users to use different resources but at least provides a way to allow them to run in some way.

Time-limited key access, as mentioned above, is also an option, and maybe a more attractive solution. This was how things used to work back in the days of the grid (with grid proxy certificates). We (ARCHER) are gathering requirements for the use of these tools in the world of 2FA, so I will feed the useful comments here into that wider discussion. If we were to put together a virtual event to discuss requirements and possible solutions, would the AiiDA team be interested in being involved?

@ltalirz
Member Author

ltalirz commented May 21, 2020

Hi @aturner-epcc , thanks for stopping by :-)
We would definitely be interested in participating in a discussion on how to best approach this issue. You can drop me an email at leopold.talirz@gmail.com and I will mention it at the next AiiDA team meeting.
At some point we may even want to mention this on the AiiDA mailing list, as I believe this will likely affect AiiDA users all around the world at some point in the future.

@aturner-epcc

@ltalirz Thanks, the virtual meeting is still just vapourware at the moment, but I think it is an important issue so I will try to find a way to make it happen. I will drop you a line once we are a bit further on with organising such a meeting.

@zhubonan
Contributor

zhubonan commented Jul 8, 2020

@ltalirz @aturner-epcc Any update on this? I would love to continue using ARCHER with AiiDA without any workarounds.
I am aware of a few other researchers in my group and other groups who are also looking forward to using AiiDA for running big calculations on ARCHER and ARCHER2 in the future.

@ezpzbz
Member

ezpzbz commented Jul 8, 2020

Our group at the University of Bath also has an increasing number of AiiDA users, and we would be happy to be able to use ARCHER for our calculations.

@aturner-epcc

@zhubonan @pzarabadip No update yet but this use case has been flagged to the service.

@tsohier

tsohier commented Jul 21, 2020

Sorry, nothing constructive... Just a heavy AiiDA user who is just starting a project on ARCHER and would really need this...

@ltalirz
Member Author

ltalirz commented Feb 19, 2021

I discussed with @pzarabadip this evening - he has a temporary solution for ARCHER2 that he discussed with the people responsible, but it may (or may not) still need some tweaks before being released publicly (feel free to contact him if you're interested).

As ARCHER2 starts to open up from next week, he will be in contact with the people responsible to get feedback on what an "official" version could look like.

@zhubonan
Contributor

FYI: I have made a solution for ARCHER2 (in the form of transport+scheduler plugins) available here: https://github.com/zhubonan/aiida-archer2-scheduler.
I hope this is OK with EPCC @aturner-epcc? One can encrypt the password with GPG and expose it as an environment variable temporarily when the daemon starts, or type it in manually.
This way, no unencrypted password is stored on disk.
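To illustrate the kind of flow described (the encrypted file name is a placeholder of my choosing; the ARCHER2_PASS variable name is the plugin's convention as I understand it - check its README for the exact details):

```shell
# One-off: store the password GPG-encrypted on disk (prompts for a passphrase)
echo -n 'cluster-password' | gpg --symmetric --output ~/.archer2_pass.gpg

# When starting the daemon: decrypt into an environment variable of this shell,
# so the plain-text password never touches the disk
export ARCHER2_PASS="$(gpg --quiet --decrypt ~/.archer2_pass.gpg)"
verdi daemon start
```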

@ConradJohnston
Contributor

Hi @zhubonan ,

Great to have another proposed work around to this issue. :)

However, I don't think this is a better solution than simply opening a connection to a locked-down supercomputer the canonical way using SSH and then forwarding the AiiDA traffic to it locally.
Forwarding in this way also allows for computers with 2FA such as Authenticator to be accessed.
It may be less convenient to have to maintain such a connection, but I think it's worth it for the security.
The thought of seeing my password in plain text next to "ARCHER2_PASS" on stdout upon running printenv is a bit alarming, and there are a host of reasons never to do this.

I very much appreciate the work that went into this, and we do need a solution for sure, but I wouldn't be supportive of this approach. There's a risk of reputational damage to the AiiDA project - if a lot of users adopt this kind of solution, HPC administrators may choose to disallow AiiDA on their systems. I'd rather be inconvenienced and respect the access that has been granted to a given system than gain convenience by plastering over the cracks.

@zhubonan
Contributor

zhubonan commented Nov 22, 2021

Hi @ConradJohnston, thanks for your reply. I think these are valid concerns. Having the password in plain text in an environment variable is certainly not ideal, so users should proceed at their own risk and minimise the exposure, as suggested in the README.md file of the plugin. If there is a better way to pass secret information to a long-running Python process, I am more than happy to implement it for the scheduler plugin.

better solution than simply opening a connection to a locked-down supercomputer the canonical way using SSH and then just forwarding the AiiDA traffic to it locally.

Can you please elaborate how to do this, is that already supported by AiiDA?

There's a risk of reputational damage to the AiiDA project - if a lot of users are using this kind of solution, HPC administrators may choose to disallow AiiDA on their systems.

I don't think this is quite true - most people manually launching SSH are likely to store passwords in plain text somewhere. If a password manager is used, then the password will leak to the clipboard/buffer anyway. If the system is compromised, then the chance of leaking the password is the same, if not higher.

@ConradJohnston
Contributor

ConradJohnston commented Nov 22, 2021

better solution than simply opening a connection to a locked-down supercomputer the canonical way using SSH and then just forwarding the AiiDA traffic to it locally.

Can you please elaborate how to do this, is that already supported by AiiDA?

It's supported by AiiDA natively in the sense that AiiDA doesn't know or care that you're doing it - it's some SSH config magic.
You make a connection to your supercomputer, forwarding its traffic to a spare local port, say 2222, for example.
Then you set up the computer object in AiiDA and tell it that the supercomputer is at localhost:2222.
Every time AiiDA needs to SSH, it instantly has access through this secure tunnel, without any need to authenticate again.
One could argue that this is bad because other nefarious processes could use it, but this would also be true of ssh-agent or passwordless SSH key pairs.
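A minimal sketch of this setup (the host name and port are illustrative; exactly how the inner authentication behaves depends on the site's sshd configuration):

```shell
# Open the tunnel once, completing 2FA interactively;
# local port 2222 now forwards to the cluster's SSH daemon
ssh -N -f -L 2222:localhost:22 myuser@cluster.example.org

# Then configure the AiiDA computer with hostname "localhost" and port 2222,
# so all of AiiDA's SSH traffic travels through the established tunnel
```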

There's a risk of reputational damage to the AiiDA project - if a lot of users are using this kind of solution, HPC administrators may choose to disallow AiiDA on their systems.

I don't think this is quite true - most people manually launching SSH are likely to store passwords in plain text somewhere. If a password manager is used, then the password will leak to the clipboard/buffer anyway. If the system is compromised, then the chance of leaking the password is the same, if not higher.

You may be right there. However, I still think we need to tread lightly when it comes to this business and engage the administrators as best we can. For example, the move to 2FA and keyboard-interactive on many systems came as a reaction to the 2020 attacks on HPC, which reportedly exploited users' passwordless SSH keys.
There's a quote in the article from an anonymous HPC insider in the UK:


“I work in HPC in the UK. Yesterday I had to revoke all the SSH certificates on our system because unfortunately some f’ing idiot users have been using private keys without passcodes. These have been used to hop from system to system as many HPC users have accounts on different systems. They are managing local privilege escalation on some systems and then looking for more unsecured SSH keys to jump to other systems."


I think this quote captures the tension that exists well. Admins have users who do not maintain best practice or even actively try to circumvent measures. At the same time, job and workflow managers exist and are growing in popularity and sophistication.

There's perhaps a need to arrange a short workshop on this issue and try to invite as many relevant system admins as possible. Do you think this would be feasible, @giovannipizzi ?

@ltalirz
Member Author

ltalirz commented Nov 22, 2021

PASC would have been one possible venue for this but the deadline for minisymposia suggestions just passed (Nov 13th).

Anyhow, I agree that a meeting on this with broad participation could be very useful and would probably be a good time investment

@aturner-epcc

@zhubonan From a service provider perspective, I do not think we would be able to endorse this as an appropriate approach for connecting to ARCHER2 from AiiDA. I think you are generally correct that the risk is low, but the precedent of coding your own workarounds to the security setup is definitely not something we can support.

At the moment, for this type of use case, we generally recommend the use of SSH multiplexing, which, I think, gets close to what you are trying to achieve. You set up the SSH connection using your credentials, and then all SSH traffic is routed through the already established connection. For us, the advantage over your approach is that you are using a standard feature of SSH rather than coding your own workaround.
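For concreteness, the same multiplexing can also be driven entirely from the command line instead of `~/.ssh/config` (host name and socket path are placeholders):

```shell
# Establish the master connection once, authenticating interactively
ssh -M -N -f -o ControlPath=~/.ssh/cm-%r@%h:%p myuser@cluster.example.org

# Subsequent sessions reuse the socket without re-authenticating
ssh -o ControlPath=~/.ssh/cm-%r@%h:%p myuser@cluster.example.org

# Check on, or tear down, the master connection
ssh -O check -o ControlPath=~/.ssh/cm-%r@%h:%p myuser@cluster.example.org
ssh -O exit  -o ControlPath=~/.ssh/cm-%r@%h:%p myuser@cluster.example.org
```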

In the longer term, we are looking at better ways to support workflow managers, given their rise in popularity (and given that MFA seems definitely here to stay for HPC access), so we would definitely be interested in participating in a workshop to look at this. We have a few ideas of how to go about this, and it would be really valuable to get input from AiiDA users and developers.

@ltalirz
Member Author

ltalirz commented Nov 23, 2021

Hi @aturner-epcc , thanks a lot for your input!

SSH multiplexing is the first option mentioned in the original post in this thread - unfortunately, the Python library AiiDA uses for handling SSH connections does not support ControlMaster / ControlPath. The issue has been open since 2016 without any work on it... which suggests to me that either we add it to paramiko ourselves or we look for alternative libraries that support it.

  • ssh2-python is designed to map 100% of the libssh2 C API to Python, so in principle it should support it. Might be worth a look, but here people mention "There are some issues installing this package that complicates their usage."

  • parallel-ssh builds on ssh2-python (and is supposed to focus on performance, which might be useful for us). It might provide more convenience but would likely suffer from the same installation issues as ssh2-python, if there are any.

  • fabric builds on top of paramiko, so it also does not support multiplexing.

I think it is worth it for someone on the AiiDA team (or outside!) to have a look at ssh2-python/parallel-ssh, see whether multiplexing is supported, and consider whether swapping out paramiko could be a way forward.

@aturner-epcc From your experience, can there be any performance issues with multiplexing compared to opening multiple connections?
And do you have any time limits on the server side currently for how long SSH connections can remain open?

If we organize a meeting/workshop around this question, we'll make sure to invite you.

@zhubonan
Contributor

zhubonan commented Nov 23, 2021

@ltalirz @ConradJohnston @aturner-epcc thanks for getting in touch and discussing this. @aturner-epcc I agree that having to "work around" the security setup is not good.

Multiplexing is probably the way to go moving forward. One potential issue is that the AiiDA daemon will not be able to "reconnect" to the cluster unattended. The master connection can perhaps be watched with autossh, but having to input the password still requires human interaction, by definition.
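A sketch of the autossh idea (placeholders throughout): autossh can restart a dropped master connection, but as noted, any interactive password/2FA prompt will stall until a human responds.

```shell
# -M 0 disables autossh's monitor ports; rely on SSH keep-alives instead
autossh -M 0 -f -N \
    -o "ControlMaster=yes" -o "ControlPath=~/.ssh/cm-%r@%h:%p" \
    -o "ServerAliveInterval=30" -o "ServerAliveCountMax=3" \
    myuser@cluster.example.org
```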

ssh2-python is designed to map 100% of the libssh2 C API to Python, so in principle it should support it. Might be worth giving a look, but here people mention "There are some issues installing this package that complicates their usage."

I just gave this a test: pip install ssh2-python actually works just fine, installing the binary wheel package. I think the problem is mainly when installing from the source distribution. On my WSL-based Ubuntu machine I needed to install both cmake and libssl-dev for this to work (pip install --no-binary ssh2-python ssh2-python).

I am not sure how much work would be needed to switch from paramiko to ssh2-python in aiida-core. With the plugin system, one can put together an ssh2 transport plugin without replacing the existing ssh transport that uses paramiko, although the command-line interface for setting up machines may also need to be updated.

@zhubonan
Contributor

I did some further research - it seems that the "ControlMaster" style of multiplexing is a feature of OpenSSH, and I did not find many other libraries that support it, e.g. by using an existing socket from OpenSSH's ControlMaster. I have tried parallel-ssh (which uses ssh2-python) and can confirm that it does not support ControlMaster multiplexing.

There is this PR for paramiko that has been there for a while: paramiko/paramiko#1341, which seems to add support for it, but it has not been merged.

@ltalirz
Member Author

ltalirz commented Nov 23, 2021

Thanks a lot for checking @zhubonan !

It looks to me like what is holding up the rebased version of the paramiko PR is just an issue in the CI setup.
Since later PRs have passing CI, perhaps another rebase will move things forward?
I'll add a comment in the PR, but don't get your hopes up too high - paramiko has 217 open pull requests...

@zhubonan
Contributor

Thanks for kick-starting that PR again. Hopefully it can be merged soon; I will do some experiments in the meantime and see if it works.

@giovannipizzi
Member

Thanks for the useful discussion here, everybody! I agree that the solution of @zhubonan, while practical, should be considered a workaround and not be used in production, or at least not suggested as an endorsed solution - it should be very clear in the readme that we actually suggest not to use it in practice. I agree with @ConradJohnston that when it comes to these things, people will think that AiiDA is doing it in an insecure way. I'm happy to have a discussion around how to move forward.

At the time I looked, it indeed seemed that by design it's hard to support the ControlMaster feature outside of the SSH executable. But let's see what happens with the paramiko PR, and we can discuss whether this would work! Luckily, AiiDA will pause the processes if there are connection issues (to be tested in case the connection issues come from the multiplexing no longer working because the underlying connection was closed), so we'll just have to live with someone re-opening the connection if it goes down and re-playing the processes.

Also, @aturner-epcc, have you already checked/discussed with the CSCS people the solution they want to provide for their supercomputers in Switzerland? E.g. https://products.cscs.ch/firecrest/ ? We're going to discuss whether we can add support for FirecREST in AiiDA at the next coding week, in 2 weeks.
