Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pull fails on unexpected EOF #31

Closed
wants to merge 111 commits into from
Closed

Fix pull fails on unexpected EOF #31

wants to merge 111 commits into from

Conversation

anmaxvl
Copy link
Collaborator

@anmaxvl anmaxvl commented Dec 27, 2021

Currently, containerd doesn't restart pull when it encounters unexpected EOF of
blob strem withtout error codes.
There are cases where this lead to pull failure.
This commit tries to fix this issue.

Signed-off-by: Kohei Tokunaga ktokunaga.mail@gmail.com
(cherry picked from commit 7bc5aa7)

kzys and others added 30 commits August 24, 2020 13:14
The rollback mechanism is implemented by calling deleteDevice() and
RemoveDevice(). But RemoveDevice() is internally calling
deleteDevice() as well.

Since a device will be deleted by first deleteDevice(),
RemoveDevice() always will see ENODATA. The specific error must be
ignored to remove the device's metadata correctly.

Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
[release/1.4 backport] snapshots/devmapper: fix rollback
Signed-off-by: Wei Fu <fuweid89@gmail.com>
(cherry picked from commit 73b1449)
Signed-off-by: Wei Fu <fuweid89@gmail.com>
[release/1.4] runtime: ignore ErrNotExist when remove rootfs
…SYSLOG

This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.

Relates to moby/moby#37897 "docker exposes dmesg to containers by default"

See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 267a0cf)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 7e7545e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.

Fixes: Getting the system time with ntptime returns an error in an unprivileged
container

To verify, inside a CentOS 7 container:

    yum install -y ntp
    ntptime
    # ntp_gettime() returns code 0 (OK)

    ntpdate -v time.nist.gov
    # ntpdate[84]: Can't adjust the time of day: Operation not permitted

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 1746a19)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit fc9e5d1)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
From personality(2):

    Have uname(2) report a 2.6.40+ version number rather than a 3.x version
    number.  Added as a stopgap measure to support broken applications that
    could not handle the  kernel  version-numbering  switch  from 2.6.x to 3.x.

This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".

Fixes: "setarch broken in docker packages from Debian stretch"

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 117d678)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):

     WARNING: could not flush dirty data: Operation not permitted.

A quick dig in postgres source code indicate it uses sync_file_range(2) to
flush data; which on ppe64le and arm64 is translated to sync_file_range2(2)
for alignements reasons.

The profile did not allow sync_file_range2(2), making postgres sad because
it can not flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 5862285)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This allows the quotactl syscall in the default seccomp profile, gated by
CAP_SYS_ADMIN.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 5cdb6e8)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 0a5ee7e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Adds the io-uring related system call introduced in kernel 5.1 to the
seccomp whitelist. With older kernels or older versions of libseccomp,
this configure will be omitted.

Note that io_uring will grow support for more syscalls in the future
so we should keep an eye on this.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 325bac7)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
…p_updates

[release/1.4] backport seccomp profile updates
When containerd is restarted, only v1 tasks are monitored again. This
leads to the lack of existing v2 task metrics.

Signed-off-by: Li Yuxuan <liyuxuan04@baidu.com>
(cherry picked from commit 4422ae3)
Signed-off-by: Wei Fu <fuweid89@gmail.com>
[release/1.4] tasks: Monitor v2 tasks in initFunc as well
Previously the signal loop can end up racing with the process exiting.
Intead of logging and continuing the loop, exit early.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 6650510)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Starting with go1.14, the go runtime hijacks SIGURG but with no way to
not send to other signal handlers.

In practice, we get this signal frequently.
I found this while testing out go1.15 with ctr and multiple execs with
only `echo hello`. When the process exits quickly, if the previous
commit is not applied, you end up with an error message that it couldn't
forward SIGURG to the container (due to the process being gone).

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 899b4e3)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
…d_signal_not_found

[release/1.4 backport] Fix some signal forwarder issues
This `Info` log shows up for all exec processes that use the v1 shim
with Docker because Docker deletes the process once it receives the exit
event from containerd.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 5f9d15e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
related to https://patchwork.kernel.org/patch/11167585/

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
(cherry picked from commit e28e55f)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
related to https://patchwork.kernel.org/patch/11545287/

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
(cherry picked from commit 6a915a1)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
bump cni dependencies so we can benefits from its
bugfixes and improvements

Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
(cherry picked from commit e3d27f9)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
* `-linkmode external` is required since Go 1.15 for static builds: golang/go#40711
* Clarify the meaning of "loading plugins"

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 43cbdf8)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This prevents packages with no Go included files due to build constraints
being included in the package list. These packages cause the test command
to fail with "can't load package build constraints exclude all Go files".

Signed-off-by: Derek McGowan <derek@mcg.dev>
(cherry picked from commit 3275a21)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Windows container stats were reporting incorrect metrics for cpu kernel runtime.

Signed-off-by: Prashant Bhutani <prbhutan@microsoft.com>
(cherry picked from commit 35b63c0)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
…xec_p_debug

[release/1.4 backport] shimv1: downgrade poroccess missing log to debug
…enat2_syscall

[release/1.4] seccomp: add `openat2` and `faccessat2` syscall
…atic_plugin

[release/1.4 backport] BUILDING.md: fix description about static builds
…le_test_tags

[release/1.4 backport] Update go list to respect build tags
dcantah and others added 16 commits December 1, 2020 13:45
Update fork/release/1.4 to upstream v1.4.3
Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
(cherry picked from commit 7d926d28178fc8b7158dd4363080aff9975d495e)
Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
(cherry picked from commit 4c4f3d1811657e41c39fe27396503ceea45a73b6)
Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
[fork/release/1.4] Backport signals parsing for LCOW
This changes the stack dumping code so that we split the giant string
containing every goroutine's stack and log each stack as a separate log
message. This makes the output easier to parse, and especially helps in
cases where the log output is sent through a system with a log message
size limit, such as Windows's ETW.

Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
(cherry picked from commit dbb117b)
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
[fork/release/1.4] Dump each goroutine stack as a separate log message
…n the final snapshot

For LCOW currently we copy (or create) the scratch.vhdx for every single snapshot
so there ends up being a sandbox.vhdx in every directory seemingly unnecessarily. With the default scratch
size of 20GB the size on disk is about 17MB so there's a 17MB overhead per layer plus the time to copy the
file with every snapshot. Only the final sandbox.vhdx is actually used so this would be a nice little
optimization.

For WCOW we essentially do the exact same except copy the blank vhdx from the base layer.

Signed-off-by: Daniel Canter <dcanter@microsoft.com>
(cherry picked from commit a91c298)
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Currently we would create a new disk and mount this into the LCOW UVM for every container but there
are certain scenarios where we'd rather just mount a single disk and then have every container share this one
storage space instead of every container having it's own xGB of space to play around with.

This is accomplished by just making a symlink to the disk that we'd like to share and then
using ref counting later on down the stack in hcsshim if we see that we've already mounted this
disk.

Signed-off-by: Daniel Canter <dcanter@microsoft.com>
(cherry picked from commit 3e5acb9)
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
…inID

Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
(cherry picked from commit 14df541)
[fork/release/1.4] Backport ctr image pull support to print image chain ID
We already have code in hcsshim that logs the contents of the panic.log file during shim
delete command. However, that doesn't handle all of the cases. This change should fix
that.

Signed-off-by: Amit Barve <ambarve@microsoft.com>
Log shim panic logs before shim cleanup
During snapshotter cleanup lcow/wcow snapshotters try to rename the snapshot directory
before cleaning it up. This rename operation sometimes fails if the sandbox.vhdx file
inside that directory is still open in some process (ideally it should not be open but can
stay open in case of a crash or exception) or if the directory is busy for some other
reason. However, if this rename operation fails we return that error and then the entire
snapshot garbage collection operation fails. Due to this we end up not cleaning other
snapshots which otherwise could have been cleaned up without any errors. If this goes on
for a long time we will end up filling the entire disk with stale snapshots. This change
fixes that issue by continuing the snapshot cleanup if the rename operation fails because
of such open handles.

Signed-off-by: Amit Barve <ambarve@microsoft.com>
Currently on windows we preserve only 2 containerd panic.log
files at a time, this means that in case of multiple containerd
restarts we may loses old panic.log information. This PR attempts
to preserve that info, by logging it to debug via logrus.

Logging individual goroutines can get too noisy, instead log the
file content in 16KB chunks. There seems to be a limit on the
message size and the maximum power of 2 that worked was 16KB.

This also enables for quicker containerd restart RCAs, since
the time of restart and the dumps become available to the event/log
collectors.

Signed-off-by: Maksim An <maksiman@microsoft.com>
@anmaxvl
Copy link
Collaborator Author

anmaxvl commented Dec 27, 2021

what's our approach in backporting fixes from upstream? do we backport from upstream into our fork/master and then cherry-pick from fork/master into fork/release-* or cherry-pick from upstream directly into fork/release-*?

@helsaawy
Copy link
Collaborator

I am unsure the backport approach, but lgtm regardless

@dcantah
Copy link
Collaborator

dcantah commented Dec 28, 2021

I was also a little curious if we'd ever backported a change from main into fork/release/1.4 but seems like we did here #14. But Maksim I think the approach you laid out rings a bell

@dcantah
Copy link
Collaborator

dcantah commented Dec 28, 2021

This is a shoddy example as the change made isn't actually in upstream but we first got this into master and then brought it back to 1.4

master: #10
1.4: #11

Truth be told the happy path is if this change got into the upstream 1.4 branch and then we can just merge in all of the changes to 1.4.x at once instead of cherry-picking, but doesn't seem like that change will be getting backported. There is an example of us pulling in a change that was only added to main and not backported to 1.4 here though #5

@anmaxvl
Copy link
Collaborator Author

anmaxvl commented Dec 28, 2021

Thanks, @dcantah . I agree that ideally the fix would make it into upstream 1.4 branch and we just merge it into our fork, but oh well. We probably should merge upstream 1.4 into our fork at some point. I'll draft this for now, until we get people back from vacation and can discuss this.

@anmaxvl anmaxvl marked this pull request as draft December 28, 2021 19:29
Currently, containerd doesn't restart pull when it encounters unexpected EOF of
blob strem withtout error codes.
There are cases where this lead to pull failure.
This commit tries to fix this issue.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
(cherry picked from commit 7bc5aa7)
@anmaxvl anmaxvl closed this Jan 3, 2022
@anmaxvl anmaxvl deleted the cherry-pick-fix-fail-precondition branch January 3, 2022 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.