
Partial Images Feature is slower than not using it #1928

Open
antheas opened this issue May 22, 2024 · 21 comments
Labels
area/zstd:chunked, kind/bug

Comments

@antheas

antheas commented May 22, 2024

Issue Description

Hi, since we're hearing a lot about the zstd:chunked encoding, I ran some tests today, and across the board it appears that the partial images feature is 2-3x slower, regardless of the cached amount (the tests below do not show partially cached layers, but that case was tested too).

When the partial images feature is on, with a registry and client that have more than ~100 Mbps of bandwidth, image pulling is 2-3x slower. It's closer to 2x slower compared to zstd:chunked with the feature turned off, and 3x slower compared to gzip'd layers.

Steps to reproduce the issue

I ran the following with a 1 Gbps connection (confirmed by speedtest.net at ~900 Mbps) from GitHub, with a ping of 19 ms, saving to a Gen 4 NVMe SSD, backed by a 7840U-class AMD CPU (8c/16t, 16 GB RAM).

# Without partial images
# enable_partial_images="false"
podman system prune -a
time podman run -it --rm ghcr.io/castrojo/bluefin:40-20240522 ls
# 54.45s user 36.50s system 193% cpu 47.015 total
time podman run -it --rm ghcr.io/castrojo/bluefin-dx:40-20240522 ls
# 32.62s user 18.76s system 69% cpu 1:14.38 total

# Downloading the large image on its own (!!!)
podman system prune -a
podman run -it --rm ghcr.io/castrojo/bluefin-dx:40-20240522 ls 
# 74.91s user 46.67s system 159% cpu 1:16.46 total

# With partial pulls
# enable_partial_images="true"
podman system prune -a
time podman run -it --rm ghcr.io/castrojo/bluefin:40-20240522 ls 
# 34.64s user 30.79s system 59% cpu 1:50.81 total
time podman run -it --rm ghcr.io/castrojo/bluefin-dx:40-20240522 ls 
# 19.51s user 17.06s system 22% cpu 2:40.20 total

For context, the image bluefin-dx extends bluefin with 2 layers, each 1GB (total 2GB).

REPOSITORY                   TAG          IMAGE ID      CREATED      SIZE
ghcr.io/castrojo/bluefin-dx  40-20240522  2ed1c6dd4200  4 hours ago  10.6 GB
ghcr.io/castrojo/bluefin     40-20240522  f7d19966c0e1  4 hours ago  7.09 GB

Describe the results you received

Across dozens of tests, downloading zstd:chunked images is 2-3x slower regardless of the cached amount. This is with the same image, compressed the same way; the only difference is the partial images feature being turned on or off.

Describe the results you expected

zstd:chunked images pulled with partial images set to true should reach 90% to 100% of the speed of pulls without the feature. If there is a cache, they should be 2-3x faster (depending on the cached files).

podman info output

host:
  arch: amd64
  buildahVersion: 1.35.4
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.1.12-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: e8896631295ccb0bfdda4284f1751be19b483264'
  cpuUtilization:
    idlePercent: 93.47
    systemPercent: 1.75
    userPercent: 4.78
  cpus: 16
  databaseBackend: sqlite
  distribution:
    distribution: ...
    version: unknown
  eventLogger: journald
  freeLocks: 2048
  hostname: antheas-go
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.6.30-...
  linkmode: dynamic
  logDriver: journald
  memFree: 1024684032
  memTotal: 15910768640
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: Unknown
    package: /usr/lib/podman/netavark is owned by netavark 1.10.3-1
    path: /usr/lib/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.15-1
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: /usr/bin/pasta is owned by passt 2024_05_10.7288448-1
    version: |
      pasta 2024_05_10.7288448
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.3.1-1
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 11512954880
  swapTotal: 18253606912
  uptime: 116h 41m 48.00s (Approximately 4.83 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
    ...
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/dev/.local/share/containers/storage
  graphRootAllocated: 383876333568
  graphRootUsed: 272384294912
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 2
    ...
version:
  APIVersion: 5.0.3
  Built: 1715595915
  BuiltTime: Mon May 13 12:25:15 2024
  GitCommit: d08315df35cb6e95f65bf3935f529295c6e54742-dirty
  GoVersion: go1.22.3
  Os: linux
  OsArch: linux/amd64
  Version: 5.0.3

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

It seems that the partial images feature is not able to saturate a fast internet connection, unlike the stock behavior of podman, which can. Here are some network graphs from downloading the same image without a cache (not the one above; the same image in both cases; the cache was reset between screenshots).

Stock podman (finishes within 40s and fits within graph)
[screenshot]

Partial Images Feature (took 108s so is truncated)
[screenshot]

@cgwalters
Contributor

Thanks for trying this out! It's a bit unfortunate no one else had tried to measure this yet, and we lack CI coverage for a lot of this.

Just glancing at the code offhand I'm pretty sure there's a lot more serialization happening in the main goroutine; e.g. it looks like the entire tar stream is copied to an O_TMPFILE and then checksummed? And there's a lot more processing going on than just "untar" - so I think we may be losing out on the natural parallelization from having a thread/goroutine reading from the network, another handling decompression, and another executing writes.
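As a rough illustration of the pipelining described here, a minimal sketch (assuming a hypothetical apply callback, not the actual containers/storage internals): one goroutine streams the body from the network into a pipe while the decompressor and tar reader consume it, so network reads, decompression, and filesystem writes overlap instead of being serialized through a temporary file.

package partialpull

import (
	"archive/tar"
	"io"
	"net/http"

	"github.com/klauspost/compress/zstd"
)

// pullAndExtract streams a zstd-compressed tar layer from url and hands each
// entry to apply as it arrives, overlapping download, decompression and writes.
// apply is a hypothetical callback that writes one entry to disk.
func pullAndExtract(url string, apply func(*tar.Header, io.Reader) error) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	pr, pw := io.Pipe()
	go func() {
		// Goroutine dedicated to reading from the network.
		_, err := io.Copy(pw, resp.Body)
		pw.CloseWithError(err)
	}()

	dec, err := zstd.NewReader(pr) // decompression happens as bytes arrive
	if err != nil {
		return err
	}
	defer dec.Close()

	tr := tar.NewReader(dec)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if err := apply(hdr, tr); err != nil {
			return err
		}
	}
}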

This one may be better filed against containers/storage where things can be more focused; we can transfer the issue if e.g. @giuseppe agrees.

@antheas
Author

antheas commented May 22, 2024

Indeed the CPU is not doing much. It is doing more when the feature is off.

Also worth considering is the cost of additional requests: e.g., if I have chunks A, B, and C and I only want A and C, and B is smaller than, say, 1 MB, on a fast connection it might be worth bridging the range and grabbing B too. I'm not sure of the performance characteristics of multi-range requests.

@rhatdan transferred this issue from containers/podman May 22, 2024
@rhatdan
Member

rhatdan commented May 22, 2024

@giuseppe PTAL

@giuseppe
Member

Just glancing at the code offhand I'm pretty sure there's a lot more serialization happening in the main goroutine; e.g. it looks like the entire tar stream is copied to an O_TMPFILE and then checksummed?

That code path is taken only when we are converting a "non-partial" image to zstd:chunked. This feature is disabled by default because it is really slow, but we added it to allow composefs to be used also with images that are not in the zstd:chunked format. In theory the conversion could be optimized and each file checksummed on the fly, but that would be a completely different implementation, so the easiest way to enable composefs with traditional images was to fetch the entire tarball, convert it to zstd:chunked, and then work on the converted file so that all the existing code works with it. The only benefit is that we get fs-verity at runtime for these images, but the conversion cost is quite high at the moment.
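For illustration, a sketch of the "checksum each file on the fly" idea mentioned above (not the actual conversion code; writeEntry is a hypothetical callback): each tar entry is hashed while it is written, so no second pass over a temporary file is needed.

package convertsketch

import (
	"archive/tar"
	"crypto/sha256"
	"encoding/hex"
	"io"
)

// checksumWhileExtracting walks a (decompressed) tar stream once, writing each
// entry through writeEntry and computing its sha256 digest in the same pass.
func checksumWhileExtracting(decompressed io.Reader, writeEntry func(*tar.Header) io.Writer) (map[string]string, error) {
	digests := make(map[string]string)
	tr := tar.NewReader(decompressed)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return digests, nil
		}
		if err != nil {
			return nil, err
		}
		h := sha256.New()
		// One pass: bytes go to the destination file and the hasher together.
		if _, err := io.Copy(io.MultiWriter(writeEntry(hdr), h), tr); err != nil {
			return nil, err
		}
		digests[hdr.Name] = hex.EncodeToString(h.Sum(nil))
	}
}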

@giuseppe
Member

Also worth considering is the cost of additional requests: e.g., if I have chunks A, B, and C and I only want A and C, and B is smaller than, say, 1 MB, on a fast connection it might be worth bridging the range and grabbing B too. I'm not sure of the performance characteristics of multi-range requests.

merging of adjacent chunks is already implemented here:

func mergeMissingChunks(missingParts []missingPart, target int) []missingPart {

@antheas
Author

antheas commented May 23, 2024

Indeed. Perhaps then it is not aggressive enough. If I understand the code correctly, you keep merging chunks while you have more than maxNumberMissingChunks (1024) or the gap between them is less than autoMergePartsThreshold (128 bytes (? kbytes)).

maxNumberMissingChunks = 1024
autoMergePartsThreshold = 128 // if the gap between two ranges is below this threshold, automatically merge them.

Provided either holds you will keep merging.

Both of these seem very lenient. Maybe it is worth defining a new parameter called transferOverheadMb which is user defined and describes the MB cost of marshaling a new connection/range. This would depend on device/connection but a sane default for most servers might be 4MB. For a gigabit connection, maybe it is around 10MB.

maxNumberMissingChunks = min(1024, image_size / transferOverheadMb)
autoMergePartsThreshold = transferOverheadMb

Therefore if a layer is say 50MB it will at most become 5 chunks, and more likely 1-2. So it would be downloaded immediately. Perhaps you do this already though.
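A rough Go rendering of the heuristic proposed above (transferOverheadMB is the hypothetical, user-configurable knob suggested in this comment; the real constants in containers/storage are named differently):

const mib = 1024 * 1024

// chunkBudget derives merge parameters from an assumed per-range overhead.
// transferOverheadMB is the estimated cost, in MiB, of requesting one more range.
func chunkBudget(layerSize, transferOverheadMB int64) (maxMissingChunks, mergeThreshold int64) {
	maxMissingChunks = layerSize / (transferOverheadMB * mib)
	if maxMissingChunks > 1024 {
		maxMissingChunks = 1024
	}
	if maxMissingChunks < 1 {
		maxMissingChunks = 1
	}
	// Any gap smaller than the per-range overhead is cheaper to download than to skip.
	mergeThreshold = transferOverheadMB * mib
	return maxMissingChunks, mergeThreshold
}

With transferOverheadMB = 10 (the gigabit guess above), a 50 MB layer is capped at 5 chunks, matching the example in the comment.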

@antheas
Author

antheas commented May 23, 2024

Also, the progress slider of partial pulls has an update rate of 0.5 Hz, whereas non-partial pulls are around 5 Hz. It makes a substantial difference in the DX/UX of partial pulls, but should be filed under a new issue.

@giuseppe
Member

Indeed. Perhaps then it is not aggressive enough. If I understand the code correctly, you keep merging chunks while you have more than maxNumberMissingChunks (1024) or the gap between them is less than autoMergePartsThreshold (128 bytes (? kbytes)).

maxNumberMissingChunks = 1024
autoMergePartsThreshold = 128 // if the gap between two ranges is below this threshold, automatically merge them.

Provided either holds you will keep merging.

Both of these seem very lenient. Maybe it is worth defining a new parameter called transferOverheadMb which is user defined and describes the MB cost of marshaling a new connection/range. This would depend on device/connection but a sane default for most servers might be 4MB. For a gigabit connection, maybe it is around 10MB.

maxNumberMissingChunks = min(1024, image_size / transferOverheadMb)
autoMergePartsThreshold = transferOverheadMb

Therefore if a layer is say 50MB it will at most become 5 chunks, and more likely 1-2. So it would be downloaded immediately. Perhaps you do this already though.

we keep merging until the server accepts the request, since when we request too many chunks the header can be too big.

I liked the idea @owtaylor had, which is not implemented yet: instead of trying to keep merging chunks, we simply do more HTTP range requests.
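A sketch of what that could look like (hypothetical chunk type; not the containers/storage code): rather than folding everything into one oversized multi-range header, the missing chunks are split into batches and each batch becomes the Range header of its own request.

package rangeplan

import (
	"fmt"
	"strings"
)

type chunk struct{ start, end int64 } // inclusive byte offsets within the blob

// batchRangeHeaders splits the missing chunks into groups of at most
// perRequest ranges; each returned string is the Range header of one request.
func batchRangeHeaders(chunks []chunk, perRequest int) []string {
	if perRequest < 1 {
		perRequest = 1
	}
	var headers []string
	for i := 0; i < len(chunks); i += perRequest {
		end := i + perRequest
		if end > len(chunks) {
			end = len(chunks)
		}
		specs := make([]string, 0, end-i)
		for _, c := range chunks[i:end] {
			specs = append(specs, fmt.Sprintf("%d-%d", c.start, c.end))
		}
		headers = append(headers, "bytes="+strings.Join(specs, ", "))
	}
	return headers
}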

@rhatdan
Member

rhatdan commented May 23, 2024

@antheas Great conversation, let's pick the best Goldilocks default. Most users will have no idea they can tweak the settings. If the tool can pick better defaults based on the connection, that would be better.

@antheas
Author

antheas commented May 23, 2024

we keep merging until the server accepts the request, since when we request too many chunks the header can be too big.

Seems so. I ran some tests under Wireshark and the GitHub servers do not seem very happy: 1M packets with a lot of errors vs. 300k clean packets during an image pull.

Also, the progress slider of partial pulls has an update rate of 0.5 Hz, whereas non-partial pulls are around 5 Hz. It makes a substantial difference in the DX/UX of partial pulls, but should be filed under a new issue.

To answer my own question, the progress slider freezing/unfreezing coincides with Wireshark logging failed requests.

I liked the idea @owtaylor had, which is not implemented yet: instead of trying to keep merging chunks, we simply do more HTTP range requests.

I do not see how this would improve things. Splitting a large layer (e.g., 1-2 GB) into multiple streams may be a way to speed up downloads as an end-game optimization, but since podman already pulls layers in parallel, I do not know if it would help in most cases.

Also, you could just download the layers ordered by size, to avoid large final layers being left downloading on their own like they are right now.

As for prodding the server to find the maximum request it will accept, I think following the spec for partial requests might be better and more respectful to registries. Note that AWS charges for failed requests as well.

Under it, you are supposed to send a range request, and depending on the response:

  • if it's 200: you receive and consume the whole layer
  • if it's 206: you receive and consume only the requested ranges

https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests#partial_request_responses

Therefore, a solid way to implement the pulls would be to formulate a sane range plan (maybe at most 100 chunks; example: #1928 (comment)), send it to the registry without checking Accept-Ranges: bytes since you already know the size of the manifest, and if you get a 200, consume the whole layer.
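A hedged sketch of that flow (the callback names are made up; a real registry client also needs auth and retries): send the Range header unconditionally, then branch on whether the server honored it.

package rangefetch

import (
	"fmt"
	"io"
	"mime"
	"mime/multipart"
	"net/http"
	"strings"
)

// fetchRanges issues a single GET with a Range header and dispatches on the
// response: 200 means the server ignored the ranges and sent the whole layer,
// 206 means it sent either one range or a multipart/byteranges body.
func fetchRanges(url, rangeHeader string,
	wholeLayer func(io.Reader) error,
	oneRange func(contentRange string, r io.Reader) error) error {

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", rangeHeader)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK: // 200: consume the whole layer
		return wholeLayer(resp.Body)
	case http.StatusPartialContent: // 206: one range or multipart/byteranges
		mediaType, params, err := mime.ParseMediaType(resp.Header.Get("Content-Type"))
		if err == nil && strings.HasPrefix(mediaType, "multipart/") {
			mr := multipart.NewReader(resp.Body, params["boundary"])
			for {
				part, err := mr.NextPart()
				if err == io.EOF {
					return nil
				}
				if err != nil {
					return err
				}
				if err := oneRange(part.Header.Get("Content-Range"), part); err != nil {
					return err
				}
			}
		}
		// Single range: the body is just that range.
		return oneRange(resp.Header.Get("Content-Range"), resp.Body)
	default:
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
}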

Note that I do not know the details of multirange requests. If there is no overhead per chunk, it might make sense to use a large number of chunks. But it seems there is.

If there is a small overhead, 5 MB might be too large a gap and maybe 1 MB would be better. I doubt 128 bytes is good though. Under Wireshark, packet sizes for partial pulls were on average 6 kB vs. 30 kB without them.

If you have another registry with a ~5 GB chunked image (e.g., Quay), I can also give it a go.

@giuseppe
Member

I do not see how this would improve things. Splitting a large layer (e.g., 1-2 GB) into multiple streams may be a way to speed up downloads as an end-game optimization, but since podman already pulls layers in parallel, I do not know if it would help in most cases.

It depends how you look at it. @owtaylor was looking at reducing network traffic, so splitting the chunks into multiple requests would help avoid merging chunks too aggressively. So I think we need to balance the two needs; should this be configurable?

Under it, you are supposed to send a range request, and depending on the response:

  • if it's 200: you receive and consume the whole layer
  • if it's 206: you receive and consume only the requested ranges

https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests#partial_request_responses

We already handle the 200 response code if the registry returns that

@antheas
Author

antheas commented May 23, 2024

We already handle the 200 response code if the registry returns that

So the registry returns a different error if your chunk ranges are too large, I would guess.

It depends how you look at it. @owtaylor was looking at reducing network traffic, so splitting the chunks into multiple requests would help avoid merging chunks too aggressively. So I think we need to balance the two needs; should this be configurable?

Indeed. There are 3 conflicting goals:

  1. Bandwidth
  2. Request number/registry CPU util
  3. Download times

I am not fronting the Quay hosting budget, so I focused on number 3 here.

I also focused on no. 3 because we were waiting on zstd:chunked to speed up bootc and rpm-ostree downloads, which is why I also looked into the rpm-ostree tree file builds. Perhaps there is an alternate partial images pull algorithm, but, as referenced in containers/bootc#215, bootc will be using this one.

It is understandable and a good justification for the download to take a bit longer if it means bandwidth savings for the registry. However, I think 2 and 3 need a bit of attention here.

should this be configurable?

Yeah, it should be configurable, but how exactly will take a bit of experimentation. Perhaps there's a turnkey parameter that works great, though, so it won't be needed.

giuseppe added a commit to giuseppe/storage that referenced this issue Jun 3, 2024
Increase the threshold for auto-merging parts from 128 to 1024. This change
aims to reduce the number of parts in an HTTP multi-range request, thus
increasing the likelihood that the server will accept the request.

The previous threshold of 128 often resulted in a large number of small
ranges, which could lead to HTTP multi-range requests being rejected by
servers due to the excessive number of parts.

It partially addresses the reported issue.

Reported-by: containers#1928

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe
Member

giuseppe commented Jun 3, 2024

I've seen some improvements with: #1937

@antheas
Author

antheas commented Jun 3, 2024

Something changed in 5.1.0 and I cannot reproduce this issue anymore when I do not have cached layers. It downloads properly now.

I can still reproduce the issue in 5.0.3. I will run some tests for the cached case with the new version + PR.

podman system prune -a
time podman run -it --rm ghcr.io/castrojo/bluefin@sha256:a79bfaa1554ef9d89c5ee46baef4b0464c6be74417be0c7fe726e800cd99f1dd ls
# Podman 5.0.3: 29.57s user 28.86s system 69% cpu 1:24.07 total
# Podman 5.1.0: 59.70s user 38.65s system 149% cpu 1:05.94 total
# Podman master + pull request: 59.89s user 38.83s system 148% cpu 1:06.45 total

@antheas
Author

antheas commented Jun 3, 2024

Since it's been two weeks, this repo now has a large number of images with small differences that we can test with.
https://github.com/castrojo/bluefin/pkgs/container/bluefin-dx

In the following tests, I download two images created one week apart, so that around half of the layers are identical and half are not. Here are the commands:

podman system prune -a
time podman run -it --rm ghcr.io/castrojo/bluefin-dx:40-20240526 ls
time podman run -it --rm ghcr.io/castrojo/bluefin-dx:40-20240602 ls
# 5.0.3
# 40.39s user 35.47s system 71% cpu 1:46.11 total
# 38.24s user 49.03s system 89% cpu 1:37.13 total
# 5.1.0
# 82.06s user 50.08s system 142% cpu 1:32.46 total
# 74.50s user 53.24s system 147% cpu 1:26.76 total
# 5.1.0 no partial images
# 74.26s user 42.45s system 196% cpu 59.306 total
# 68.12s user 48.18s system 176% cpu 1:05.83 total
# Pull Request with master
# 81.26s user 51.13s system 145% cpu 1:31.25 total
# 73.56s user 53.39s system 151% cpu 1:23.98 total

If we bump the numbers we get the following:

	maxNumberMissingChunks  = 128
	autoMergePartsThreshold = 1024*1024
#  81.18s user 50.09s system 147% cpu 1:28.86 total
# 72.23s user 53.70s system 153% cpu 1:21.96 total

It seems that the partial images feature broke in 5.1.0. I always have 0 bytes saved now (skipped: 0.0b = 0.00%). So unless that is fixed, we cannot test further.

@giuseppe
Member

giuseppe commented Jun 3, 2024

It seems that the partial images feature broke in 5.1.0. I always have 0 bytes saved now (skipped: 0.0b = 0.00%). So unless that is fixed, we cannot test further.

I think you might need #1936

@antheas
Author

antheas commented Jun 3, 2024

Did the zstd:chunked spec change? If so, we would need to start generating images with a new podman to test with.

Edit: Indeed, that PR fixed it.

@antheas
Author

antheas commented Jun 3, 2024

There are significant bandwidth savings here, indeed: upwards of 60% in the last layers of the second image in both cases below.

# Original
39.36s user 33.82s system 71% cpu 1:42.66 total
34.24s user 47.66s system 69% cpu 1:58.16 total
#	maxNumberMissingChunks  = 128
#	autoMergePartsThreshold = 1024*1024
39.31s user 33.79s system 70% cpu 1:44.29 total
38.05s user 42.86s system 85% cpu 1:34.42 total
#	maxNumberMissingChunks  = 64
#	autoMergePartsThreshold = 10*1024*1024
40.70s user 31.59s system 71% cpu 1:40.73 total
35.25s user 49.30s system 71% cpu 1:57.87 total

Original
[screenshot]

Last Example (10MB gap)
[screenshot]

However, I am getting a lot of the following while downloading the first image in all of the examples. Maybe skip the partial pull for a layer if the diff is less than 2%?
[screenshot]

And those layers are causing slowness in the overall download process without bandwidth savings.

@giuseppe
Member

giuseppe commented Jun 6, 2024

We can't compare the two code paths. The new code does more than just trying to reduce the amount of data requested from the registry.

When using partial pulls, we need to use the new code paths even for images that do not share any files with local storage; otherwise, we won't be able to reuse their files for deduplication later.

The old code path (the layers pulled without the skipped: info) was just pulling a tarball and extracting it locally. That is all: there is nothing more involved, and it doesn't know what files were pulled or even attempt to deduplicate them.

By contrast, the new code path used for partial images not only tries to reduce the amount of data requested from the server, but also calculates/validates the checksum for each file in the archive and attempts to deduplicate against files already present in the storage (using reflinks). This process is also necessary to know which files are in the local storage so that future partial pulls can reuse them.
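For illustration, a minimal sketch of the reflink side of this (using FICLONE via golang.org/x/sys/unix; this is not the containers/storage implementation, and the clone only succeeds on filesystems that support reflinks, such as btrfs or XFS):

package dedupsketch

import (
	"io"
	"os"

	"golang.org/x/sys/unix"
)

// reflinkOrCopy makes destPath share data blocks with existingPath when the
// filesystem supports reflinks, falling back to a byte-for-byte copy otherwise.
func reflinkOrCopy(existingPath, destPath string) error {
	src, err := os.Open(existingPath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.Create(destPath)
	if err != nil {
		return err
	}
	defer dst.Close()

	// FICLONE: the new file references the same extents, so no data is duplicated on disk.
	if err := unix.IoctlFileClone(int(dst.Fd()), int(src.Fd())); err == nil {
		return nil
	}
	_, err = io.Copy(dst, src)
	return err
}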

That said, it might be good to investigate avoiding a multi-range request when the client is about to request most of the data anyway, so we can skip the multipart reader and see what impact that has on performance.

@cgwalters
Contributor

The old code path (the layers pulled without the skipped: info) was just pulling a tarball and extracting it locally. That is all: there is nothing more involved, and it doesn't know what files were pulled or even attempt to deduplicate them.

But nothing stops us from doing this without zstd:chunked either. It's basically what the ostree-container storage is doing, just parsing classic tarballs and computing the ostree sha256 on them. (But that's the ostree-sha256, not the content-sha256, which is again different from the fsverity-sha256 that I'd argue we should be centralizing on, with content-sha256 as an alternative index perhaps)

@giuseppe
Member

giuseppe commented Jun 6, 2024

But nothing stops us from doing this without zstd:chunked either. It's basically what the ostree-container storage is doing, just parsing classic tarballs and computing the ostree sha256 on them.

That is what the convert-to-zstd:chunked code path does at the moment. It could of course be optimized so that no conversion is required, but that would be much more complicated than what we can do in a few lines of code now.

@cgwalters added the area/zstd:chunked label Aug 21, 2024