Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect when container process fails to start #550

Merged

Conversation

mcshooter
Copy link
Contributor

@mcshooter mcshooter commented Apr 28, 2021

This change adds another rule to detect from the containerd log file. It will detect when a containerD fails to start a process due to a corrupt image.

Sample logs:
"reason": "ContainerDContainerCreationFailed":
time="2021-04-28T00:01:18.901493400Z" level=error msg="CreateContainer within sandbox \"ce6f3a2bc2a13b86be77417271d4f5b72210418eb2277fd9bc27622cfbe34012\" for &ContainerMetadata{Name:testimage,Attempt:0,} failed" error="failed to create containerd container: error unpacking image: wrong diff id calculated on extraction \"sha256:9bddaf2b9de37115dd4657ce3d726e3d794158da99e107fbeb15df4c1ecc648b\""

"reason": "CorruptContainerImageLayer":
time="2021-04-28T00:14:50.181394800Z" level=error msg="PullImage \"gcr.io/jeremyje/chatterbox:head\" failed" error="failed to pull and unpack image \"gcr.io/jeremyje/chatterbox:head\": failed to extract layer sha256:cecc319109e39102773e8154e23ac2c672402c9e4c7e1a801a62208ddf9407c4: archive/tar: invalid tar header: unknown"

The following errors are not included to be detected either because they are errors that we do not want to detect to falsely identify that the node is unhealthy.
time="2021-04-28T00:14:15.311477800Z" level=error msg="PullImage \"gcr.io/jeremyje/chatterbox:head\" failed" error="failed to pull and unpack image \"gcr.io/jeremyje/chatterbox:head\": failed to copy: httpReaderSeeker: failed open: unexpected status code https://gcr.io/v2/jeremyje/chatterbox/blobs/sha256:5675a081083222834c96d41259098a1ae95d2822f5dd72f3e7d66cc4ef8d0f7ec: 400 Bad Request - Server message: unknown: 'digest' parameter is not a valid digest 'sha256:5675a081083222834c96d41259098a1ae95d2822f5dd72f3e7d66cc4ef8d0f7ec'."

time="2021-04-28T18:06:11.499594200Z" level=error msg="PullImage \"gcr.io/michelletamdua\" failed" error="failed to pull and unpack image \"gcr.io/michelletamdua:latest\": failed to resolve reference \"gcr.io/michelletamdua:latest\": unexpected status code [manifests latest]: 400 Bad Request"

time="2021-04-28T18:05:40.152050900Z" level=error msg="PullImage \"gcr.io/michelletandya-gke-dev/gowiki:v1\" failed" error="failed to pull and unpack image \"gcr.io/michelletandya-gke-dev/gowiki:v1\": failed to resolve reference \"gcr.io/michelletandya-gke-dev/gowiki:v1\": unexpected status code [manifests v1]: 401 Unauthorized"
time="2021-04-22T21:39:40.972814800Z" level=error msg="StartContainer for \"8559f453e383cfc3dfa960c4550bc1fd4a3809b20cbfd1a121f9d71c666cc5ce\" failed" error="failed to start containerd task \"8559f453e383cfc3dfa960c4550bc1fd4a3809b20cbfd1a121f9d71c666cc5ce\": hcsshim::System::CreateProcess 8559f453e383cfc3dfa960c4550bc1fd4a3809b20cbfd1a121f9d71c666cc5ce: The system cannot find the file specified.\n(extra info: {\"CommandLine\":\"/app/chatterbox.exe -addr :8080\",\"User\":\"ContainerUser\",\"WorkingDirectory\":\"C:\\\\\",\"Environment\":{\"KUBERNETES_PORT\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP_ADDR\":\"10.0.0.1\",\"KUBERNETES_PORT_443_TCP_PORT\":\"443\",\"KUBERNETES_PORT_443_TCP_PROTO\":\"tcp\",\"KUBERNETES_SERVICE_HOST\":\"10.0.0.1\",\"KUBERNETES_SERVICE_PORT\":\"443\",\"KUBERNETES_SERVICE_PORT_HTTPS\":\"443\",\"PATH\":\"c:\\\\Windows\\\\System32;c:\\\\Windows\"},\"CreateStdOutPipe\":true,\"CreateStdErrPipe\":true}): unknown"

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @mcshooter. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 28, 2021
@mcshooter
Copy link
Contributor Author

/cc @jeremyje

@k8s-ci-robot k8s-ci-robot requested a review from jeremyje April 28, 2021 01:06
config/windows-containerd-monitor-filelog.json Outdated Show resolved Hide resolved
config/windows-containerd-monitor-filelog.json Outdated Show resolved Hide resolved
@mcshooter mcshooter force-pushed the addContainerDProblemsToDetect branch from d87fd60 to 02854b3 Compare April 28, 2021 19:03
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 28, 2021
"reason": "CorruptContainerImageLayer",
"pattern": ".*failed to pull and unpack image.*failed to extract layer.*"
},
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering there are various specific errors that contain this pattern like shown in the description, it would be good to have a more generic pattern to catch what we might miss in the specific type of errors and have the error messages sent to the exporter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the examples I think wrong diff id calculated on extraction is the best one. archive/tar: invalid tar header: unknown is also a good candidate. These imply that the image was downloaded but a checksum failed or bad data was pulled.

Anything with unauthorized shouldn't raise a signal. That's generally user error (permissions) not an issue due to the node becoming damaged.

The system cannot find the file specified. implies that the container image itself is broken but I'm curious on how you got this to repro.

400 Bad Request - Server message: unknown: 'digest' parameter is not a valid digest almost feels like a node level problem but I'm thinking this can happen if a user creates an image and then deletes it and pushes a Deployment using the deleted image.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @jeremyje offline, and will remove this for now to avoid false positives where this error may also result from a user error. We do not necessarily want NPD to report this as an Unhealthy Node if that is the case, but only want to report what will need node remediation.

},
{
"type": "temporary",
"reason": "ContainerDContainerCreationFailed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be called out as containerd.

"reason": "CorruptContainerImageLayer",
"pattern": ".*failed to pull and unpack image.*failed to extract layer.*"
},
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the examples I think wrong diff id calculated on extraction is the best one. archive/tar: invalid tar header: unknown is also a good candidate. These imply that the image was downloaded but a checksum failed or bad data was pulled.

Anything with unauthorized shouldn't raise a signal. That's generally user error (permissions) not an issue due to the node becoming damaged.

The system cannot find the file specified. implies that the container image itself is broken but I'm curious on how you got this to repro.

400 Bad Request - Server message: unknown: 'digest' parameter is not a valid digest almost feels like a node level problem but I'm thinking this can happen if a user creates an image and then deletes it and pushes a Deployment using the deleted image.

@mcshooter mcshooter force-pushed the addContainerDProblemsToDetect branch from 02854b3 to cc02ac6 Compare April 29, 2021 01:32
"reason": "IncompatibleContainer",
"pattern": ".*CreateComputeSystem.*"
"reason": "ContainerProcessStartFailed",
"pattern": ".*failed to start.*hcsshim::System::CreateProcess.*The system cannot find the file specified.*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this match if the customer tries to run a container with an entry point of a binary that doesn't exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as attempting to deploy a container that does not exist? If so, the error that I see is this:
time="2021-04-28T18:06:11.499594200Z" level=error msg="PullImage \"gcr.io/michelletamdua\" failed" error="failed to pull and unpack image \"gcr.io/michelletamdua:latest\": failed to resolve reference \"gcr.io/michelletamdua:latest\": unexpected status code [manifests latest]: 400 Bad Request"

Here I attempted to deploy an non existent image called gcr.io/michelletamdua to the windows node.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FROM mcr.microsoft.com/windows/nanoserver:1809
ENTRYPOINT ["/does-not-exist-lol.exe"]

Deploy an image like that what do you see?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this error
time="2021-04-29T19:01:41.047663300Z" level=error msg="StartContainer for \"50e3cda4fd5d6100ab71e8b55f49a31b20afc50030e2af61986dc85868c165ee\" failed" error="failed to start containerd task \"50e3cda4fd5d6100ab71e8b55f49a31b20afc50030e2af61986dc85868c165ee\": hcsshim::System::CreateProcess 50e3cda4fd5d6100ab71e8b55f49a31b20afc50030e2af61986dc85868c165ee: The system cannot find the file specified.\n(extra info: {\"CommandLine\":\"/does-not-exist-lol.exe\",\"User\":\"ContainerUser\",\"WorkingDirectory\":\"C:\\\\\",\"Environment\":{\"KUBERNETES_PORT\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP\":\"tcp://10.0.0.1:443\",\"KUBERNETES_PORT_443_TCP_ADDR\":\"10.0.0.1\",\"KUBERNETES_PORT_443_TCP_PORT\":\"443\",\"KUBERNETES_PORT_443_TCP_PROTO\":\"tcp\",\"KUBERNETES_SERVICE_HOST\":\"10.0.0.1\",\"KUBERNETES_SERVICE_PORT\":\"443\",\"KUBERNETES_SERVICE_PORT_HTTPS\":\"443\"},\"CreateStdOutPipe\":true,\"CreateStdErrPipe\":true}): unknown" time="2021-04-29T19:01:41.347059100Z" level=info msg="RemoveContainer for \"4bef6ef5458daf929e13dac1fc6b5ee571097d0c71d34378488bbc5e0d2e7c93\"" time="2021-04-29T19:01:41.370855400Z" level=info msg="RemoveContainer for \"4bef6ef5458daf929e13dac1fc6b5ee571097d0c71d34378488bbc5e0d2e7c93\" returns successfully"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so looks like it produces the same error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user can accidentally trigger this error with a bad image then I'd remove this condition.

"reason": "IncompatibleContainer",
"pattern": ".*CreateComputeSystem.*"
"reason": "ContainerProcessStartFailed",
"pattern": ".*failed to start.*hcsshim::System::CreateProcess.*The system cannot find the file specified.*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user can accidentally trigger this error with a bad image then I'd remove this condition.

@mcshooter mcshooter force-pushed the addContainerDProblemsToDetect branch from cc02ac6 to da15eb9 Compare April 29, 2021 23:47
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 29, 2021
@mcshooter
Copy link
Contributor Author

/ok-to-test

@k8s-ci-robot
Copy link
Contributor

@mcshooter: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jeremyje
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 30, 2021
@@ -13,13 +13,13 @@
"rules": [
{
"type": "temporary",
"reason": "MissingPigz",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the old problem any more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's being removed.

@Random-Liu
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jeremyje, mcshooter, Random-Liu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 3, 2021
@k8s-ci-robot k8s-ci-robot merged commit 5ef09ca into kubernetes:master May 3, 2021
@mcshooter mcshooter deleted the addContainerDProblemsToDetect branch May 3, 2021 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants