Sharing memory between containers #996

Closed · stac47 opened this issue Aug 16, 2021 · 11 comments

stac47 commented Aug 16, 2021

Hello,
I am trying to figure out whether it is possible to share loaded shared libraries between containers. The rationale is that if I build an image twice (without using the cache, i.e. with --no-cache), and provided I take care of creating reproducible layer FS diffs, then the instantiated containers should not use twice the amount of memory.
To make it more explicit and without talking about shared libraries, let's have a look at the following image:

% cat Dockerfile
FROM ubuntu
RUN apt update && \
    apt install --yes vmtouch
RUN touch /output.dat && \
    dd if=/dev/zero of=/output.dat  bs=1M  count=24 && \
    touch -d @0 /output.dat
CMD vmtouch -l /output.dat
% podman build --no-cache -t img1 .
% podman run --rm img1

Running a container from the described image, I can see the output.dat file is mapped once in memory. The Proportional Set Size matches the size of the file mapped in memory (see the Pss column).

% ps -eF | grep vmtouch | grep -v grep  | grep -v -e'/bin/sh'
ubuntu     61371   61369  0  6778 25116   3 12:50 ?        00:00:00 vmtouch -l /output.dat
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f6b9a31b000 r--s 00000000  00:39    184907 24576 24576 24576      24576         0        0              0             0              0               0    0       0  24576           0 output.dat

Now running the same image a second time:

% podman run --rm img1
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f6b9a31b000 r--s 00000000  00:39    184907 24576 24576 12288      24576         0        0              0             0              0               0    0       0  12288           0 output.dat
% ps -eF | grep vmtouch | grep -v grep | grep -v -e'/bin/sh'
ubuntu     61371   61369  0  6778 25116   3 12:50 ?        00:00:00 vmtouch -l /output.dat
ubuntu     61665   61663  0  6778 25224   5 13:50 ?        00:00:00 vmtouch -l /output.dat
% pmap -X 61665 | grep -e "output.dat" -e "Pss"
61665:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f1d99fe5000 r--s 00000000  00:48    184907 24576 24576 12288      24576         0        0              0             0              0               0    0       0  12288           0 output.dat

The PSS is now divided by two, meaning the file is mapped once and shared by the two containers. This is normal because I use the "overlay" storage driver (it would not have worked with "vfs" backed by "ext4", for instance, because of the lack of reflink support).
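
As a quick cross-check (a sketch, reusing the PIDs from the ps output above), the Pss of the output.dat mapping can also be read directly from /proc/<pid>/smaps; with one resident 24 MiB copy shared by two processes, each should report roughly 24576 / 2 = 12288 kB:

# print the Pss line of the output.dat mapping for each vmtouch process
% for pid in 61371 61665; do awk '/output.dat/{f=1} f && /^Pss:/{print FILENAME": "$0; exit}' /proc/$pid/smaps; done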

Now let's build another image without using the cache. The way output.dat is created (the final touch -d @0 resets its mtime, so the layer content is reproducible) results in the same diff for that layer, as we can see below:

% podman build --no-cache -t img2 .
% diff -u <(podman inspect --format="{{json .RootFS}}" img1 | jq .) <(podman inspect --format="{{json .RootFS}}" img2 | jq .)
--- /proc/self/fd/17    2021-08-16 13:47:40.588000000 +0000
+++ /proc/self/fd/20    2021-08-16 13:47:40.592000000 +0000
@@ -2,7 +2,7 @@
   "Type": "layers",
   "Layers": [
     "sha256:7555a8182c42c7737a384cfe03a3c7329f646a3bf389c4bcd75379fc85e6c144",
-    "sha256:31121a0cb3f5e2a4dd0d68d7d6b6de617d8d937b8b41e5ae5a13c5304c3dfe28",
+    "sha256:0a075e0d3129290f16273f6d6b7c56ae0b282cee8365d2aaa28b327fcc6825d0",
     "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f"
   ]
 }

The last layer sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f is the same.
If I run a container from this new image, I can see that the PSS of output.dat for the first container I ran does not decrease, because the file does not have the same device id/inode.

% podman run --rm img2
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f6b9a31b000 r--s 00000000  00:39    184907 24576 24576 12288      24576         0        0              0             0              0               0    0       0  12288           0 output.dat
% ps -eF | grep vmtouch | grep -v grep  | grep -v -e'/bin/sh'
ubuntu     61371   61369  0  6778 25116   3 12:50 ?        00:00:00 vmtouch -l /output.dat
ubuntu     61665   61663  0  6778 25224   5 13:50 ?        00:00:00 vmtouch -l /output.dat
ubuntu     61695   61693  0  6778 25260   5 13:51 ?        00:00:00 vmtouch -l /output.dat
% pmap -X 61695 | grep -e "output.dat" -e "Pss"
61695:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7ff800850000 r--s 00000000  00:56 134512622 24576 24576 24576      24576         0        0              0             0              0               0    0       0  24576           0 output.dat

This makes sense to me when the backing filesystem does not support reflinks, like ext4. So I tried this on XFS with reflink enabled, but I get the same result. I would have expected this to be improved, because the mmap'ed file is in fact the same data if one copy is a reflink of the other.
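
(For what it is worth, one way to check whether two on-disk copies really share extents via a reflink is to look for the "shared" flag in filefrag output; a sketch, with placeholder layer ids, assuming filefrag from e2fsprogs is available:)

% sudo filefrag -v overlay/<layer-of-img1>/diff/output.dat
% sudo filefrag -v overlay/<layer-of-img2>/diff/output.dat
# extents belonging to a reflinked copy are reported with the "shared" flag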
So my questions:

  • Do you think such an improvement is feasible?
  • Is there a limitation from the kernel, for instance?
  • In case it is feasible, is it worth working on such an optimisation?

System information:

% uname -a
Linux lstacul-vm 5.11.0-25-generic #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
% podman version
Version:      3.2.1
API Version:  3.2.1
Go Version:   go1.16.2
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64
% podman info
...
store:
  configFile: /home/ubuntu/.config/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 3
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /mnt/my-xfs/podman-user-root
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 7
  runRoot: /mnt/my-xfs/podman-user-root
  volumePath: /mnt/my-xfs/podman-user-root/volumes
...
% cat ${HOME}/.config/containers/storage.conf
[storage]
driver = "overlay"
graphroot = "/mnt/my-xfs/podman-user-root"
runroot = "/mnt/my-xfs/podman-user-root"
% xfs_info /mnt/my-xfs
meta-data=/dev/vdb               isize=512    agcount=4, agsize=13107200 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0
data     =                       bsize=4096   blocks=52428800, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=25600, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
stac47 changed the title from "Sharing memory between container" to "Sharing memory between containers" on Aug 16, 2021
rhatdan (Member) commented Aug 16, 2021

@nalind @giuseppe PTAL

stac47 (Author) commented Aug 19, 2021

I am trying to make progress on this topic on my side. I am now quite sure the limitation does not come from the kernel or the backing filesystem: a reflink copy (provided by some filesystems like XFS) has no chance of working with regard to mmap, since the two copies remain distinct inodes and therefore get separate page-cache entries.

So I have turned my hopes to overlayfs, which should be geared towards this kind of optimization. I am digging into this by testing with podman and reading the code of containers/storage.

Back to my use case:

% diff -u <(podman inspect --format="{{json .RootFS}}" test1 | jq .) <(podman inspect --format="{{json .RootFS}}" test2 | jq .)
--- /proc/self/fd/17    2021-08-19 08:27:08.080000000 +0000
+++ /proc/self/fd/20    2021-08-19 08:27:08.084000000 +0000
@@ -2,7 +2,7 @@
   "Type": "layers",
   "Layers": [
     "sha256:7555a8182c42c7737a384cfe03a3c7329f646a3bf389c4bcd75379fc85e6c144",
-    "sha256:bcd562f4e17b7218b73a5542f060ba2ccc5604890e718a186a88a34830eeb87e",
+    "sha256:9bef437b03aef0daa59dc7d898dc42b5d8e4497bf0d9725329d1d8c399663d0d",
     "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f"
   ]
 }

The FS diff containing the output.dat file, "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f", is shared between two layers. This is confirmed by looking in podman's graphroot.

% cat overlay-layers/layers.json | jq
[
...
  {
    "id": "14343f2d7eb16ac2dae74226742158006cc51debb2a488115b647f887ebea2c2",
    "parent": "4343d0e7cdfcc69f9651920565043dfb47074f767be5921dacd97eb429496834",
    "created": "2021-08-19T08:19:22.358250859Z",
    "compressed-diff-digest": "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f",
    "compressed-size": 25167360,
    "diff-digest": "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f",
    "diff-size": 25167360,
    "uidset": [
      0
    ],
    "gidset": [
      0
    ]
  },
...
  {
    "id": "7a5c93693a732ce8932f5ab24132cc565b96be854418274b82d803d7be515df6",
    "parent": "7fdd6648e254c56b9e895bb78ee76c1844eb3b69850dc55eb4fc8fbc40050019",
    "created": "2021-08-19T08:26:46.100199059Z",
    "compressed-diff-digest": "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f",
    "compressed-size": 25167360,
    "diff-digest": "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f",
    "diff-size": 25167360,
    "uidset": [
      0
    ],
    "gidset": [
      0
    ]
  },
...
]

The overlay folder contains the FS diff of each layer in its diff directory, plus a reference to the lower layers. The thing is that, although the two layers above reference the same FS diff, it was copied twice on disk. This prevents the running containers from really sharing the filesystem diff (and, in the case of mmap, from sharing the consumed memory).

Let's confirm this:

% stat overlay/14343f2d7eb16ac2dae74226742158006cc51debb2a488115b647f887ebea2c2/diff/output.dat | grep 'Inode'
Device: fc10h/64528d    Inode: 268828447   Links: 1
% stat overlay/7a5c93693a732ce8932f5ab24132cc565b96be854418274b82d803d7be515df6/diff/output.dat | grep 'Inode'
Device: fc10h/64528d    Inode: 402974335   Links: 1
% podman run --rm test1&
% podman run --rm test2&
% ps -eF | grep vmtouch | grep -v grep | grep -v -e '/bin/sh'
ubuntu     28359   28357  0  6778 25204   2 08:40 ?        00:00:00 vmtouch -l /output.dat
ubuntu     28424   28422  0  6778 25180   0 09:06 ?        00:00:00 vmtouch -l /output.dat
% pmap -X 28359 | grep -e 'Pss' -e 'output' | awk '{if(NR>1)printf("%16s %4s %6s %10s %10s %10s\n", $1, $2, $4, $5, $7, $8)}'
         Address Perm Device      Inode        Rss        Pss
    7fe1ad600000 r--s  00:39  268828447      24576      24576
% pmap -X 28424 | grep -e 'Pss' -e 'output' | awk '{if(NR>1)printf("%16s %4s %6s %10s %10s %10s\n", $1, $2, $4, $5, $7, $8)}'
         Address Perm Device      Inode        Rss        Pss
    7f07cadb3000 r--s  00:48  402974335      24576      24576

We can see the inodes correspond to each layer's own copy of the file, which is what I expected from overlay. But here we have another problem: the device ids are not the same, which is probably due to the different overlay mounts of the two containers (tell me if I am wrong).
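
One way to confirm that the differing device numbers simply come from each container having its own overlay mount is to compare the overlay entries in the two processes' mount namespaces (a sketch, using the PIDs from the ps output above):

% grep -w overlay /proc/28359/mountinfo
% grep -w overlay /proc/28424/mountinfo
# the third field of each mountinfo line is the major:minor device number,
# which should match the Device column reported by pmap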

Maybe we can think of a solution so that the layers share the FS diff, by somehow adding an indirection at the layer level.

What are your views on this?
Regards,

stac47 (Author) commented Aug 23, 2021

In order to make progress, I was thinking of the following small change.

We could add an overlay-fsdiff directory as a sibling of overlay, overlay-layers, overlay-images and overlay-containers. This directory would contain the diffs that currently live in the overlay/<layerid>/diff directories. Each diff would live in its own directory, whose name could be the sha256 of the diff.

Then, in each layer composing an image, the directory overlay/<layerid>/diff would become a symlink to overlay-fsdiff/<fs diff sha256>. The content of overlay/l and the link and lower files would require NO change.

In the case of the writable top layer of a container, overlay/<container layerid>/diff would remain a real directory containing the changes made during the lifetime of the container.

With this change, all the shareable filesystem diffs (basically, all the read-only layers composing an image) could be reused by completely different images, which could reduce the memory footprint of containers running on the same host.
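
To illustrate, the resulting layout would look roughly like this (just a sketch; overlay-fsdiff is the proposed new directory and the layer ids are placeholders):

overlay-fsdiff/
└── <fs diff sha256>/                # extracted diff, stored only once
    └── output.dat
overlay/
├── <layerid in image 1>/
│   ├── diff -> ../../overlay-fsdiff/<fs diff sha256>
│   ├── link
│   └── lower
└── <layerid in image 2>/
    ├── diff -> ../../overlay-fsdiff/<fs diff sha256>
    ├── link
    └── lower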

Do you see any drawbacks to this kind of modification?

giuseppe (Member) commented:

@stac47 I am doing some related work (although I've been focusing on sharing storage rather than memory) with #775.

Do you think that model could work for your use case?

Another feature I am working on is deduplication through hard links, which should help with mmap: #995

stac47 (Author) commented Aug 23, 2021

@giuseppe Thanks so much for your answer. Indeed, #995 looks promising. Locally, in my podman graphroot directory, I turned one 'output.dat' into a hardlink to the 'output.dat' of the other layer, and the memory is, as expected, shared between the containers.
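
Something along these lines reproduces the experiment (a sketch; the layer ids are placeholders, and hard links require both paths to live on the same graphroot filesystem, which they do here):

% cd /mnt/my-xfs/podman-user-root/overlay
% sudo ln -f <layer-of-img1>/diff/output.dat <layer-of-img2>/diff/output.dat
# both layers now point to the same inode, so the page cache (and therefore
# the mmap'ed pages) is shared between the two containers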

I saw that your PR is from a branch named "zstd-chunked-hard-links-dedup"; is it related to the compression algorithm of the layers?

stac47 (Author) commented Aug 23, 2021

@giuseppe Actually, the reason I proposed to have a separate directory containing all the filesystem diffs is that in the Docker documentation (https://docs.docker.com/storage/storagedriver/overlayfs-driver/#how-the-overlay-driver-works) we can read:

The use of hardlinks causes an excessive use of inodes, which is a known limitation of the legacy overlay storage driver, and may require additional configuration of the backing filesystem.

But to be honest, I don't really understand the "excessive use of inodes" when using hardlinks. What I proposed could, in any case, only reduce the number of inodes used.

giuseppe (Member) commented:

I saw that your PR is from a branch named "zstd-chunked-hard-links-dedup"; is it related to the compression algorithm of the layers?

that feature is related to the zstd:chunked feature I was working on.

I don't think there is an "excessive use of inodes" problem when using hardlinks; there are other issues though, so IMO using hardlinks must be a last resort.

In what timezone are you located? Do you think it could be helpful to have a call to discuss the issue you are having?

stac47 (Author) commented Aug 24, 2021

Thanks for your answer.
I sent you a mail to your RedHat address.

stac47 (Author) commented Aug 25, 2021

@giuseppe as discussed this morning, I manually did what I described in my proposal directly in the podman graphroot.
First, I determined where the file output.dat was stored:

% find overlay -name output.dat 2>/dev/null
overlay/bde06f8a7e49b244d5006bd503af3b360ff8176191712078b366998fdf7067ae/diff/output.dat
overlay/de3ea9ea62c53fe56fd46326986c091f0166b2b31efa916323039cf7775007f6/diff/output.dat

I moved one diff into a directory named overlay-fsdiff and removed the diff of the other layer:

% mkdir overlay-fsdiff
% sudo mv overlay/bde06f8a7e49b244d5006bd503af3b360ff8176191712078b366998fdf7067ae/diff overlay-fsdiff
% sudo rm -rf overlay/de3ea9ea62c53fe56fd46326986c091f0166b2b31efa916323039cf7775007f6/diff
% (cd overlay-fsdiff && mv diff 62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f)
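
(The symlink creation itself is not shown above; presumably something along these lines, matching the tree output below:)

% sudo ln -s ../../overlay-fsdiff/62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f overlay/bde06f8a7e49b244d5006bd503af3b360ff8176191712078b366998fdf7067ae/diff
% sudo ln -s ../../overlay-fsdiff/62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f overlay/de3ea9ea62c53fe56fd46326986c091f0166b2b31efa916323039cf7775007f6/diff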

Now the two layers look like this:

% tree overlay/de3ea9ea62c53fe56fd46326986c091f0166b2b31efa916323039cf7775007f6
overlay/de3ea9ea62c53fe56fd46326986c091f0166b2b31efa916323039cf7775007f6
├── diff -> ../../overlay-fsdiff/62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f
├── link
├── lower
├── merged
└── work
% tree overlay/bde06f8a7e49b244d5006bd503af3b360ff8176191712078b366998fdf7067ae
overlay/bde06f8a7e49b244d5006bd503af3b360ff8176191712078b366998fdf7067ae
├── diff -> ../../overlay-fsdiff/62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f
├── link
├── lower
├── merged
└── work

I spawned two containers from the different images:

% podman image tree vmtouch1
Image ID: 90c0e711516f
Tags:     [localhost/vmtouch1:latest]
Size:     130.9MB
Image Layers
├── ID: 7555a8182c42 Size: 75.16MB Top Layer of: [dockerhub.rnd.amadeus.net:5005/ubuntu:20.04]
├── ID: 41a85ac98340 Size: 30.53MB
└── ID: bde06f8a7e49 Size: 25.17MB Top Layer of: [localhost/vmtouch1:latest]
% podman image tree vmtouch2
Image ID: 49ae6859a0ed
Tags:     [localhost/vmtouch2:latest]
Size:     130.9MB
Image Layers
├── ID: 7555a8182c42 Size: 75.16MB Top Layer of: [dockerhub.rnd.amadeus.net:5005/ubuntu:20.04]
├── ID: 3faee1e92b48 Size: 30.53MB
└── ID: de3ea9ea62c5 Size: 25.17MB Top Layer of: [localhost/vmtouch2:latest]
% podman run --rm -d vmtouch1
d789d1018211bdb213b7cc043d418c8da1d84dd62781ff1e562ea9946ee7679e
% podman run --rm -d vmtouch2
b4e6b2d314ef42686de4b2838cca763ebaf04c3a71c1c4268ebc3fd5afba5f4c

As we can see here, now the memory is shared:

% ps -eF | grep -e'output.dat' -e'PPID' | grep -v -e 'grep' -e'/bin/sh'
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
ubuntu    173563  173561  0  6778 25220   4 12:57 ?        00:00:00 vmtouch -l /output.dat
ubuntu    173616  173614  0  6778 25200   1 12:57 ?        00:00:00 vmtouch -l /output.dat
% sudo pmap -X 173563 | grep -e 'Pss' -e 'output.dat' | awk 'BEGIN{pattern="%16s %4s %10s %6s %10s %6s %6s %6s %s\n";}{if(NR>1)printf(pattern, $1, $2, $3, $4, $5, $6, $7, $8, $20);}'
         Address Perm     Offset Device      Inode   Size    Rss    Pss Mapping
    7feb9cf36000 r--s   00000000  00:39  140082791  24576  24576  12288 output.dat
% sudo pmap -X 173616 | grep -e 'Pss' -e 'output.dat' | awk 'BEGIN{pattern="%16s %4s %10s %6s %10s %6s %6s %6s %s\n";}{if(NR>1)printf(pattern, $1, $2, $3, $4, $5, $6, $7, $8, $20);}'
         Address Perm     Offset Device      Inode   Size    Rss    Pss Mapping
    7efe49c2f000 r--s   00000000  00:48  140082791  24576  24576  12288 output.dat
% stat overlay-fsdiff/62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f/output.dat | grep Inode | awk '{print($3,$4)}'
Inode: 140082791

So with no change in podman, the change in the storage layout fixes my case.

giuseppe (Member) commented:

sorry for the delay.

In your example bde06f8a7e and de3ea9ea6 will both point to the same data. How do you pick what layers can be deduplicated with this mechanism?

giuseppe (Member) commented:

I think this problem is fixed upstream with the hard link deduplication feature we have with zstd:chunked and estargz images.
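
For completeness, a sketch of the storage.conf settings involved (option names as documented in containers-storage.conf(5); whether they are available depends on the containers/storage version in use):

# ~/.config/containers/storage.conf (sketch; assumes a containers/storage
# release with zstd:chunked / partial-pull support)
[storage.options]
pull_options = {enable_partial_images = "true", use_hard_links = "true"}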
