support for submitting jobs to Kubernetes #181

Open
rptaylor opened this issue Nov 22, 2023 · 4 comments

Comments

@rptaylor

Hello,

Do you envision that diracx might have support for submitting jobs to kubernetes clusters (as kubernetes-native batch/v1 jobs), similar to the kubernetes plugin of Harvester for Panda, along with submitting to traditional batch clusters?

Thanks!

@fstagni
Contributor

fstagni commented Nov 23, 2023

Hello, we have no experience with or knowledge of kubernetes-native batch. At first look, it seems to me that this could be just another plugin to add to https://github.com/DIRACGrid/DIRAC/tree/integration/src/DIRAC/Resources/Computing (whether in DIRAC or later in DiracX does not seem different to me).

What would the use case be?

@rptaylor
Author

rptaylor commented Nov 23, 2023

Hi @fstagni ,

Thanks for the info. Kubernetes is quite popular and, aside from providing a wide array of capabilities that are not possible in traditional batch systems, is also approaching feature parity with them for batch scheduling. There are a few ATLAS T2 sites that are native kubernetes batch clusters thanks to the k8s plugin of Harvester for Panda, which was developed around 2018 (for reference: CHEP2023 presentation, CHEP2023 paper). I was wondering if experiments adopting DIRAC would also be able to support kubernetes sites. Particularly for new experiments, it can be more feasible and attractive to start developing a distributed computing framework using modern cloud native technologies.

It looks like the development effort involved would mainly involve writing a KubernetesComputingElement.py file? Just curious at this point. Authentication to the Kubernetes API can be done with X509 certificates (not proxies) or OIDC and tokens, presumably Dirac already has some support for that?

Thanks!
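
To make the scope of such a plugin more concrete, here is a minimal sketch of the piece a Kubernetes computing element would need at its core: turning one pilot submission into a kubernetes-native batch/v1 Job manifest. The names `PilotJobSpec` and `build_job_manifest`, the image, and the resource defaults are all hypothetical and are not part of DIRAC, DiracX, or Harvester; only the manifest fields themselves come from the Kubernetes batch/v1 Job API.

```python
import json
from dataclasses import dataclass


@dataclass
class PilotJobSpec:
    """Hypothetical description of one pilot submission (not a DIRAC type)."""

    name: str
    image: str
    command: list
    namespace: str = "default"
    cpu: str = "1"        # assumed default CPU request
    memory: str = "2Gi"   # assumed default memory request


def build_job_manifest(spec: PilotJobSpec) -> dict:
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": spec.name, "namespace": spec.namespace},
        "spec": {
            # Do not let Kubernetes retry a failed pilot; leave retries
            # to the workload management system.
            "backoffLimit": 0,
            # Let the cluster garbage-collect finished Jobs after an hour.
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "pilot",
                            "image": spec.image,
                            "command": spec.command,
                            "resources": {
                                "requests": {
                                    "cpu": spec.cpu,
                                    "memory": spec.memory,
                                }
                            },
                        }
                    ],
                }
            },
        },
    }


# Example with placeholder values only.
manifest = build_job_manifest(
    PilotJobSpec(
        name="pilot-0001",
        image="example.org/pilot:latest",
        command=["/bin/sh", "-c", "echo hello from the pilot"],
    )
)
print(json.dumps(manifest, indent=2))
```

A dict of this shape is what the official Kubernetes Python client accepts as the `body` of `BatchV1Api.create_namespaced_job`; status polling, log retrieval, and cleanup would be the remaining work in an actual plugin.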

@fstagni
Contributor

fstagni commented Nov 23, 2023

> It looks like the development effort involved would mainly involve writing a KubernetesComputingElement.py file?

That would be the way to do it. Through these plugins, DIRAC supports the traditional HTCondor and ARC CEs as well as "SSH" CEs and computing clouds (https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/Resources/Computing/CloudComputingElement.py, which uses libcloud under the hood).
DiracX will very likely reuse this DIRAC code for the same purpose, so this could be implemented in DIRAC already.

> Authentication to the Kubernetes API can be done with X509 certificates (not proxies) or OIDC and tokens, presumably Dirac already has some support for that?

That should not pose an issue.
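
As an illustration of the two authentication options mentioned in the quote, the sketch below assembles a kubeconfig structure that carries either an X509 client certificate and key or a bearer token (for example, one obtained through OIDC). The function name `build_kubeconfig`, the entry names `site` and `pilot-submitter`, and all paths and URLs are hypothetical; only the kubeconfig keys (`clusters`, `users`, `contexts`, `client-certificate`, `client-key`, `token`, `certificate-authority`) are standard. It says nothing about how DIRAC itself stores credentials.

```python
import json


def build_kubeconfig(server, ca_file, *, client_cert=None, client_key=None, token=None):
    """Build a kubeconfig dict using either X509 client-cert or token auth."""
    if token is not None and (client_cert or client_key):
        raise ValueError("choose either a token or a client certificate, not both")
    if token is not None:
        # Bearer token, e.g. an OIDC ID token or a service account token.
        user = {"token": token}
    elif client_cert and client_key:
        # Plain X509 client certificate and key (not a proxy certificate).
        user = {"client-certificate": client_cert, "client-key": client_key}
    else:
        raise ValueError("no credentials given")
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {
                "name": "site",
                "cluster": {"server": server, "certificate-authority": ca_file},
            }
        ],
        "users": [{"name": "pilot-submitter", "user": user}],
        "contexts": [
            {"name": "site", "context": {"cluster": "site", "user": "pilot-submitter"}}
        ],
        "current-context": "site",
    }


# Example with placeholder values only.
cfg = build_kubeconfig(
    "https://k8s.example.org:6443",
    "/etc/example/ca.pem",
    client_cert="/etc/example/usercert.pem",
    client_key="/etc/example/userkey.pem",
)
print(json.dumps(cfg, indent=2))
```

Since JSON is valid YAML, the printed output can be saved and used directly as a kubeconfig file.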

Normally, since we are a small and busy group, we do not embark on developments without a requirement (from a VO using DIRAC). Questions:

  • Which other sites, apart from Victoria, do you know of that use Kubernetes as a batch system?
  • Which VOs do they support?
  • Do you know of other sites that would follow Victoria's example?

@rptaylor
Author

Okay, thanks. For now I was just gathering information to see how much work it would take, how much of a priority it might be, whether it would be straightforward for a potential contributor to work on, etc.

In ATLAS, NET2 in the US is also k8s native, as are the ATLAS Google Cloud project and a site in Taiwan. Several other sites are also interested and experimenting; in total there are 7 Panda queues for kubernetes in ATLAS. I'm not sure what other VOs they might support, but if ATLAS is the only VO using a workflow management system (Panda + Harvester) that supports Kubernetes (as far as I know, could be wrong), that would limit the options for adoption by other VOs. As for new experiments, SKAO is looking into a kubernetes-based approach and has considered using DIRAC.
