"Backend was closed" error during backup image transfer #23

Closed
barpavel opened this issue Jan 17, 2022 · 1 comment · Fixed by #22
barpavel commented Jan 17, 2022

When executing an image download as part of a backup (this happens both during a full backup and during a separate ./backup_vm.py ... download ... command), the following error is received:

pbar@engine1 ~/dev/git/python-ovirt-engine-sdk4/examples (main) $ ./backup_vm.py -c engine1 download --backup-dir /tmp/pavel --backup-uuid 8b7a17db-dcee-45eb-b5e6-0dc22c4e5bc3 9ecd1ff4-e045-4c11-91f7-cc6b3551b527
[   0.0 ] Downloading VM '9ecd1ff4-e045-4c11-91f7-cc6b3551b527' disks
[   0.4 ] Downloading full backup for disk 'f626c162-7ab6-4acf-9c3c-69eebab8b990'
[   1.6 ] Image transfer '891b75ac-f8d7-4850-815f-30371546b44c' is ready
[ 100.00% ] 1.00 GiB, 0.17 seconds, 5.79 GiB/s                                 
[   1.8 ] Finalizing image transfer
[   4.8 ] Download completed successfully
Traceback (most recent call last):
  File "./backup_vm.py", line 521, in <module>
    main()
  File "./backup_vm.py", line 144, in main
    args.command(args)
  File "./backup_vm.py", line 218, in cmd_download
    connection, args.backup_uuid, args, incremental=args.incremental)
  File "./backup_vm.py", line 403, in download_backup
    incremental=has_incremental)
  File "./backup_vm.py", line 448, in download_disk
    **extra_args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 186, in download
    name="download")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 193, in __exit__
    self.stop()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 170, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 235, in _run
    handler = self._handler_factory()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 264, in __init__
    self._dst = dst_factory()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/nbd.py", line 77, in clone
    self._client.address,
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/nbd.py", line 228, in __getattr__
    raise IOError("Backend was closed")
OSError: Backend was closed

Scenario: cold backup, either:

  1. Full backup (start + download + stop).
  2. Separate steps - start, then download.

The flow actually works, and the output image is created:
-rw-r--r--. 1 pbar pbar 196624 Jan 17 10:53 f626c162-7ab6-4acf-9c3c-69eebab8b990.202201171053.full.qcow2
It works fine for VMs with a single disk, but when a VM has several disks, only the first image is copied: the exception breaks the loop that should have continued with the remaining disks.
Logs are attached.
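
For context, a minimal self-contained sketch of the kind of per-disk loop that gets aborted. The names download_backup and download_disk appear in the traceback above, but the simplified signatures and the stub body here are purely illustrative, not the actual backup_vm.py code:

def download_disk(disk):
    # Stub: downloading an empty disk can end with "Backend was closed".
    raise OSError("Backend was closed")

def download_backup(disks):
    for disk in disks:
        # An unhandled exception here aborts the loop, so the remaining
        # disks are never downloaded.
        download_disk(disk)

download_backup(["disk-1", "disk-2", "disk-3"])  # stops after disk-1
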
After a Google Chat discussion in the RHV Storage room, here are a few conclusions from Vojtech & Nir:

Vojtech Juranek
It looks like we are calling a client which was already closed, so it is not an issue in your environment, and I don't think this is a known issue. Is it reproducible?
Nir Soffer
No need for daemon logs, this is an issue on the client side. For some reason we try to clone the nbd client at the end of the transfer, after it was closed. Cloning the connection is done when starting a transfer; I'm not sure why it happens at the end.
Nir Soffer
This is likely a result of https://github.com/oVirt/ovirt-imageio/commit/4df9f1c9feb9405ba7c389f28b65180db057f2da
It reveals an issue in the ovirt_imageio._internal.io module; previously it was hidden.
Vojtech Juranek
What is the size of the downloaded image? Is it something very small?
Pavel Bar
Yes, probably very small or even empty. Actually I just added new disks (that support incremental backup), attached them to the VM, and ran cold backup flows on the VM.
Nir Soffer
Based on the download rate (5.79 GiB/s), the disk is empty.
Vojtech Juranek
It looks like a concurrency issue: one worker finished the download before another worker was created. Right now I don't see any other way how the dst handler could be created after the download was finished and the client was closed.
Nir Soffer
Right, this is a known issue.
Nir Soffer
When the download is very small (e.g. there is nothing to copy if the image is completely empty), the dst connection used for checking the source may be closed before we clone it. The issue is known but was harmless before, please file a bug.
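
To make the race described above concrete, here is a toy sketch (illustrative only; the real logic lives in ovirt_imageio._internal.io and the nbd backend):

class ToyBackend:
    """Toy stand-in for the nbd backend shared by io.copy() workers."""

    def __init__(self):
        self.closed = False

    def clone(self):
        if self.closed:
            # Mirrors the nbd backend raising after it was closed.
            raise IOError("Backend was closed")
        return ToyBackend()

    def close(self):
        self.closed = True

dst = ToyBackend()

# Worker 1 is handed the original backend. With an empty image there is
# nothing to copy, so it finishes (and closes the shared backend) right away.
dst.close()

# Worker 2 is created afterwards; its dst_factory tries to clone the
# now-closed backend -- this is the OSError seen in the traceback.
dst.clone()
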

engine.log
engine-ovirt-imageio.log
example.log
vdsm.log
vdsm-ovirt-imageio.log

@barpavel barpavel changed the title from "Backend was closed" during backup image trnsfer phase. to "Backend was closed" error during backup image transfer on Jan 17, 2022
nirs added a commit to nirs/ovirt-imageio that referenced this issue Jan 17, 2022
When using a thread pool we try to reuse the connection backend by
passing the connection backend to the first worker instead of cloning
it. This introduces some issues:

- The backend is closed twice: once by the worker, and again by the
  connection. This is safe since closing a backend twice does nothing,
  but it is dirty, and it will break if we try to use the backend after
  a copy.

- With very small images, the first worker may close the original
  backend before another worker tries to clone it. This is unlikely to
  happen with real images but can break the tests.

Fix the issues by introducing a backend wrapper, overriding close() to
disconnect from the wrapped backend instead of closing it. This ensures
that dst is never closed before it is cloned, and that it is closed
exactly once.

Another approach to fix this is to close the initial backend used for
preparing a copy, pass the backend open arguments to io.copy(), and open
new backends in the workers without all the hacks. That requires a lot
of work, so I'm starting with this minimal change.

Fixes oVirt#23

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
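
A rough sketch of the wrapper idea from the commit message above; the class name and details are assumptions for illustration, not the actual ovirt-imageio code:

class BackendWrapper:
    """Delegates to a wrapped backend; close() only detaches from it.

    The wrapped backend stays open so other workers can still clone it,
    and it is closed exactly once by its real owner (the connection).
    """

    def __init__(self, backend):
        self._backend = backend

    def close(self):
        # Disconnect from the wrapped backend instead of closing it.
        self._backend = None

    def __getattr__(self, name):
        if self._backend is None:
            raise IOError("Backend was closed")
        return getattr(self._backend, name)
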
@nirs nirs self-assigned this Jan 17, 2022
@nirs nirs added the bug Issue is a bug or fix for a bug label Jan 17, 2022
nirs commented Jan 17, 2022

This is an old issue that was hidden by the nbd backend, which could be
closed silently multiple times without any side effect.

In 4df9f1c we changed the nbd backend to behave like the file backend:
you can close it multiple times, but accessing the backend after closing
raises IOError.
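
A minimal conceptual sketch of that contract (not the actual nbd backend code): closing is idempotent, while any delegated access after close() raises IOError:

class NbdLikeBackend:
    def __init__(self, client):
        self._client = client

    def close(self):
        # Closing more than once is allowed; only the first call disconnects.
        if self._client is not None:
            self._client.close()
            self._client = None

    def __getattr__(self, name):
        # Per 4df9f1c, accessing the backend after close raises IOError,
        # while closing twice stays a no-op.
        if self._client is None:
            raise IOError("Backend was closed")
        return getattr(self._client, name)
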

vjuranek pushed a commit that referenced this issue Jan 18, 2022