-
Notifications
You must be signed in to change notification settings - Fork 635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect when container process fails to start #550
Detect when container process fails to start #550
Conversation
Hi @mcshooter. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc @jeremyje |
d87fd60
to
02854b3
Compare
"reason": "CorruptContainerImageLayer", | ||
"pattern": ".*failed to pull and unpack image.*failed to extract layer.*" | ||
}, | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering there are various specific errors that contain this pattern like shown in the description, it would be good to have a more generic pattern to catch what we might miss in the specific type of errors and have the error messages sent to the exporter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the examples I think wrong diff id calculated on extraction
is the best one. archive/tar: invalid tar header: unknown
is also a good candidate. These imply that the image was downloaded but a checksum failed or bad data was pulled.
Anything with unauthorized shouldn't raise a signal. That's generally user error (permissions) not an issue due to the node becoming damaged.
The system cannot find the file specified.
implies that the container image itself is broken but I'm curious on how you got this to repro.
400 Bad Request - Server message: unknown: 'digest' parameter is not a valid digest
almost feels like a node level problem but I'm thinking this can happen if a user creates an image and then deletes it and pushes a Deployment using the deleted image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with @jeremyje offline, and will remove this for now to avoid false positives where this error may also result from a user error. We do not necessarily want NPD to report this as an Unhealthy Node if that is the case, but only want to report what will need node remediation.
}, | ||
{ | ||
"type": "temporary", | ||
"reason": "ContainerDContainerCreationFailed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs to be called out as containerd.
"reason": "CorruptContainerImageLayer", | ||
"pattern": ".*failed to pull and unpack image.*failed to extract layer.*" | ||
}, | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the examples I think wrong diff id calculated on extraction
is the best one. archive/tar: invalid tar header: unknown
is also a good candidate. These imply that the image was downloaded but a checksum failed or bad data was pulled.
Anything with unauthorized shouldn't raise a signal. That's generally user error (permissions) not an issue due to the node becoming damaged.
The system cannot find the file specified.
implies that the container image itself is broken but I'm curious on how you got this to repro.
400 Bad Request - Server message: unknown: 'digest' parameter is not a valid digest
almost feels like a node level problem but I'm thinking this can happen if a user creates an image and then deletes it and pushes a Deployment using the deleted image.
02854b3
to
cc02ac6
Compare
"reason": "IncompatibleContainer", | ||
"pattern": ".*CreateComputeSystem.*" | ||
"reason": "ContainerProcessStartFailed", | ||
"pattern": ".*failed to start.*hcsshim::System::CreateProcess.*The system cannot find the file specified.*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this match if the customer tries to run a container with an entry point of a binary that doesn't exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same as attempting to deploy a container that does not exist? If so, the error that I see is this:
time="2021-04-28T18:06:11.499594200Z" level=error msg="PullImage \"gcr.io/michelletamdua\" failed" error="failed to pull and unpack image \"gcr.io/michelletamdua:latest\": failed to resolve reference \"gcr.io/michelletamdua:latest\": unexpected status code [manifests latest]: 400 Bad Request"
Here I attempted to deploy an non existent image called gcr.io/michelletamdua to the windows node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FROM mcr.microsoft.com/windows/nanoserver:1809
ENTRYPOINT ["/does-not-exist-lol.exe"]
Deploy an image like that what do you see?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this error
time="2021-04-29T19:01:41.047663300Z" level=error msg="StartContainer for \"50e3cda4fd5d6100ab71e8b55f49a31b20afc50030e2af61986dc85868c165ee\" failed" error="failed to start containerd task \"50e3cda4fd5d6100ab71e8b55f49a31b20afc50030e2af61986dc85868c165ee\": hcsshim::System::CreateProcess 50e3cda4fd5d6100ab71e8b55f49a31b20afc50030e2af61986dc85868c165ee: The system cannot find the file specified.\n(extra info: {\"CommandLine\":\"/does-not-exist-lol.exe\",\"User\":\"ContainerUser\",\"WorkingDirectory\":\"C:\\\\\",\"Environment\":{\"KUBERNETES_PORT\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP_ADDR\":\"10.0.0.1\",\"KUBERNETES_PORT_443_TCP_PORT\":\"443\",\"KUBERNETES_PORT_443_TCP_PROTO\":\"tcp\",\"KUBERNETES_SERVICE_HOST\":\"10.0.0.1\",\"KUBERNETES_SERVICE_PORT\":\"443\",\"KUBERNETES_SERVICE_PORT_HTTPS\":\"443\"},\"CreateStdOutPipe\":true,\"CreateStdErrPipe\":true}): unknown" time="2021-04-29T19:01:41.347059100Z" level=info msg="RemoveContainer for \"4bef6ef5458daf929e13dac1fc6b5ee571097d0c71d34378488bbc5e0d2e7c93\"" time="2021-04-29T19:01:41.370855400Z" level=info msg="RemoveContainer for \"4bef6ef5458daf929e13dac1fc6b5ee571097d0c71d34378488bbc5e0d2e7c93\" returns successfully"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so looks like it produces the same error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user can accidentally trigger this error with a bad image then I'd remove this condition.
"reason": "IncompatibleContainer", | ||
"pattern": ".*CreateComputeSystem.*" | ||
"reason": "ContainerProcessStartFailed", | ||
"pattern": ".*failed to start.*hcsshim::System::CreateProcess.*The system cannot find the file specified.*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user can accidentally trigger this error with a bad image then I'd remove this condition.
cc02ac6
to
da15eb9
Compare
/ok-to-test |
@mcshooter: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
@@ -13,13 +13,13 @@ | |||
"rules": [ | |||
{ | |||
"type": "temporary", | |||
"reason": "MissingPigz", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need the old problem any more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's being removed.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jeremyje, mcshooter, Random-Liu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This change adds another rule to detect from the containerd log file. It will detect when a containerD fails to start a process due to a corrupt image.
Sample logs:
"reason": "ContainerDContainerCreationFailed":
time="2021-04-28T00:01:18.901493400Z" level=error msg="CreateContainer within sandbox \"ce6f3a2bc2a13b86be77417271d4f5b72210418eb2277fd9bc27622cfbe34012\" for &ContainerMetadata{Name:testimage,Attempt:0,} failed" error="failed to create containerd container: error unpacking image: wrong diff id calculated on extraction \"sha256:9bddaf2b9de37115dd4657ce3d726e3d794158da99e107fbeb15df4c1ecc648b\""
"reason": "CorruptContainerImageLayer":
time="2021-04-28T00:14:50.181394800Z" level=error msg="PullImage \"gcr.io/jeremyje/chatterbox:head\" failed" error="failed to pull and unpack image \"gcr.io/jeremyje/chatterbox:head\": failed to extract layer sha256:cecc319109e39102773e8154e23ac2c672402c9e4c7e1a801a62208ddf9407c4: archive/tar: invalid tar header: unknown"
The following errors are not included to be detected either because they are errors that we do not want to detect to falsely identify that the node is unhealthy.
time="2021-04-28T00:14:15.311477800Z" level=error msg="PullImage \"gcr.io/jeremyje/chatterbox:head\" failed" error="failed to pull and unpack image \"gcr.io/jeremyje/chatterbox:head\": failed to copy: httpReaderSeeker: failed open: unexpected status code https://gcr.io/v2/jeremyje/chatterbox/blobs/sha256:5675a081083222834c96d41259098a1ae95d2822f5dd72f3e7d66cc4ef8d0f7ec: 400 Bad Request - Server message: unknown: 'digest' parameter is not a valid digest 'sha256:5675a081083222834c96d41259098a1ae95d2822f5dd72f3e7d66cc4ef8d0f7ec'."
time="2021-04-28T18:06:11.499594200Z" level=error msg="PullImage \"gcr.io/michelletamdua\" failed" error="failed to pull and unpack image \"gcr.io/michelletamdua:latest\": failed to resolve reference \"gcr.io/michelletamdua:latest\": unexpected status code [manifests latest]: 400 Bad Request"
time="2021-04-28T18:05:40.152050900Z" level=error msg="PullImage \"gcr.io/michelletandya-gke-dev/gowiki:v1\" failed" error="failed to pull and unpack image \"gcr.io/michelletandya-gke-dev/gowiki:v1\": failed to resolve reference \"gcr.io/michelletandya-gke-dev/gowiki:v1\": unexpected status code [manifests v1]: 401 Unauthorized"
time="2021-04-22T21:39:40.972814800Z" level=error msg="StartContainer for \"8559f453e383cfc3dfa960c4550bc1fd4a3809b20cbfd1a121f9d71c666cc5ce\" failed" error="failed to start containerd task \"8559f453e383cfc3dfa960c4550bc1fd4a3809b20cbfd1a121f9d71c666cc5ce\": hcsshim::System::CreateProcess 8559f453e383cfc3dfa960c4550bc1fd4a3809b20cbfd1a121f9d71c666cc5ce: The system cannot find the file specified.\n(extra info: {\"CommandLine\":\"/app/chatterbox.exe -addr :8080\",\"User\":\"ContainerUser\",\"WorkingDirectory\":\"C:\\\\\",\"Environment\":{\"KUBERNETES_PORT\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP_ADDR\":\"10.0.0.1\",\"KUBERNETES_PORT_443_TCP_PORT\":\"443\",\"KUBERNETES_PORT_443_TCP_PROTO\":\"tcp\",\"KUBERNETES_SERVICE_HOST\":\"10.0.0.1\",\"KUBERNETES_SERVICE_PORT\":\"443\",\"KUBERNETES_SERVICE_PORT_HTTPS\":\"443\",\"PATH\":\"c:\\\\Windows\\\\System32;c:\\\\Windows\"},\"CreateStdOutPipe\":true,\"CreateStdErrPipe\":true}): unknown"