Experiencing issues with the PNS executor #1256
Update:
@Tomcli - is it possible to construct a smaller, portable workflow which can reproduce this? Also, there is a caveat to PNS that people need to be aware of: collection of artifacts from the base image layer is subject to race conditions when the main container exits too quickly. Basically, the main container needs to be running for a few seconds for the wait sidecar to reliably secure the file handle on its root filesystem. If the main container exits too quickly, the wait sidecar may not have been able to secure the file handle in time to successfully collect artifacts.
Yes, I don't expect privileged mode to help. However, an alternative workaround is to output the artifacts into an emptyDir volume mounted in the main container. In v2.3, when volumes are used, they are mirrored to the wait sidecar, which eliminates the race with artifact collection, because the wait sidecar has access to the volume long after the main container has completed.
Actually, I'm wrong. SYS_PTRACE is indeed needed when the user id of the main container is different from that of the wait sidecar.
I'm also experiencing this race condition. Trying to find a solution, but it does seem timing-related.
Hi @jessesuen, thanks for the reply. Since adding
Just to be clear, privileged is unnecessary, but SYS_PTRACE is. The latter is much more secure than running privileged pods.
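To make the distinction concrete, here is a minimal sketch of a plain Kubernetes pod that is granted only the SYS_PTRACE capability rather than privileged mode. The names are illustrative, and the PNS executor is meant to add this capability to the wait sidecar on its own, so this is just for orientation:

```yaml
# Illustrative pod: grants only SYS_PTRACE, far narrower than privileged: true
apiVersion: v1
kind: Pod
metadata:
  name: ptrace-demo              # hypothetical name
spec:
  shareProcessNamespace: true    # the shared process namespace PNS relies on
  containers:
  - name: main
    image: alpine:3.12
    command: [sh, -c, "sleep 60"]
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]      # lets this container's processes trace others in the shared namespace
```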
Thanks @jessesuen - we will give it a try. With respect to the k8sapi executor - do you have a viewpoint? Ideally that should be the solution to use with CRI-O?
@animeshsingh there are pros and cons to each executor:
IMO, PNS is the closest thing to the docker executor without the security concerns, and is what I recommend, except for the fact that it is the most immature.
Thanks @jessesuen for this comparison. Would the overhead of going through the k8s APIs outweigh the demerits introduced by the randomness of PNS? Given that workflows are expected to be long-running jobs, as opposed to a serverless model where bypassing the k8s API has its merits vis-à-vis response time, would it matter too much? Also, how important is it to store the artifacts in the base image layer?
My feeling is PNS is the best compromise between security and functionality.
The "randomness" of failing to collect artifacts is usually a non-issue unless containers are completing too quickly. Even then, you can mitigate this by outputting the artifact to an emptyDir, and then it would never be an issue.
Not necessary at all. It's just slightly more convenient not to have to define an emptyDir volume to collect artifacts. Closing bug since PNS has merged.
This is causing a bunch of race conditions in our stuff. Should we open a separate issue for this on PNS, or do you have any recommendations on how to deal with it properly?
@booninite yes. To ensure that the wait sidecar is able to collect outputs, instead of writing outputs into the base image layer (such as /tmp), output artifacts into an emptyDir volume (which gets mirrored into the wait sidecar). This ensures that the wait sidecar can collect the artifact without being subject to timing problems.
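A minimal sketch of that workaround, with hypothetical template and path names (not taken from this thread): declare an emptyDir volume, mount it in the main container, and point the output artifact at a path on that mount.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: emptydir-output-   # hypothetical name
spec:
  entrypoint: produce
  volumes:
  - name: out
    emptyDir: {}                   # mirrored into the wait sidecar (v2.3+)
  templates:
  - name: produce
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello > /out/result.txt"]
      volumeMounts:
      - name: out
        mountPath: /out            # write outputs here, not to the image layer
    outputs:
      artifacts:
      - name: result
        path: /out/result.txt      # remains readable after the main container exits
```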
@jessesuen we are still experiencing intermittent artifact-passing issues using emptyDir. Does the emptyDir additionally need to be mounted at a path that does not exist in the base image?
We are running ~5k workflows per month that all use PNS. We only see consistent issues with extremely short-duration steps, under 15 seconds.
Tying this to some other folks raising these issues in the Kubeflow community.
I see this same issue trying to pass a single file between my workflows; is the volume mount the solution?
yeah, seeing this with pns as well. Not sure what to do here...
Having the same issue. Running K3OS with CRI-O, so I can't use the docker executor. The other two, kubelet and k8sapi, simply won't work. Kubelet gives me a certificate error, which the helm chart doesn't give an option for ignoring, and k8sapi gives me errors like "function not found"...
@sarabala1979 is the workaround for this
I was finally able to get it running using the k8sapi executor.
Sadly this breaks the functionality of the built-in git solution, because apparently it cannot write into a volume. I had to write my own git clone script. Also, this kind of makes the artifact passing redundant, as I could just use this volume in every stage.
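For anyone else making the same switch: in v2.x the executor is selected in the workflow-controller configmap. A minimal sketch, assuming the default configmap name and namespace from the install manifests:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap   # default name in the install manifests
  namespace: argo
data:
  config: |
    # one of: docker, kubelet, k8sapi, pns
    containerRuntimeExecutor: k8sapi
```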
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Test it and see?
I'm experiencing a similar problem on some of my containers, using af03a74 with PNS. Other containers doing almost identical work succeed, and if I keep retrying the workflow everything succeeds eventually. Seems particular to PNS. Here's an example wait container log:
I think
We do not see
OK. Diagnosis: there is a timeout when trying to determine whether the pod has finished. We allow three attempts at 1-second intervals. The main container has completed (which we determine using the shared process namespace), but we ask the Kubernetes API for the actual result, and the API has not been updated yet. This could be mitigated by increasing the amount of time we allow the executor to poll for on line 375 of
@alexec Sure, I can play with the timing and see if I come up with a good PR-worthy solution. Thanks for the detailed analysis.
Thank you!
Hi guys, I was experiencing the same issue a lot recently. Following the comment from @alexec above, I tried installing a previous argo version, and everything works well as usual. The downgraded version I've installed is using
It appears to me today that in some cases you must grant privileged for PNS to work with output artifacts.
Maybe fixed in #4954. |
v3.0 will have a controller envvar name |
Hi @jessesuen, we are experimenting with the Argo PNS executor from PR #1214 and running it as the Kubeflow Pipelines backend. The workflow runs smoothly for most of the containers, except that we are experiencing a race condition with the last container in every workflow. Below are the workflow definition we have and the corresponding error logs from argoexec.
Failed wait container logs:
Workflow YAML file:
Related issues: #970
cc: @animeshsingh
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
The file handle was not secured before the main container exited.
What you expected to happen:
The file handle should be secured before the main container exits.
How to reproduce it (as minimally and precisely as possible):
Run the workflow definition above with the PNS executor.
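Since the original definition is not reproduced here, a hypothetical minimal workflow with the shape that triggers the race (a fast-exiting main container writing its output artifact into the base image layer) might look like this sketch:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pns-race-repro-   # hypothetical name
spec:
  entrypoint: quick-step
  templates:
  - name: quick-step
    container:
      image: alpine:3.12
      # exits almost immediately, before the wait sidecar can secure a
      # handle on the main container's root filesystem
      command: [sh, -c, "echo hi > /tmp/out.txt"]
    outputs:
      artifacts:
      - name: out
        path: /tmp/out.txt        # base image layer, not a volume
```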
Anything else we need to know?:
Environment: