Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backport: prune bitmaps fix #360

Merged

Conversation

aesteve-rh
Copy link
Member

@aesteve-rh aesteve-rh commented Dec 13, 2022

Commits/PRs cherry picked for backporting (in order):

Signed-off-by: Albert Esteve aesteve@redhat.com

@aesteve-rh aesteve-rh added bug Issue is a bug or fix for a bug storage labels Dec 13, 2022
@aesteve-rh aesteve-rh added this to the ovirt-4.5.3 milestone Dec 13, 2022
@aesteve-rh aesteve-rh self-assigned this Dec 13, 2022
@ahadas
Copy link
Member

ahadas commented Dec 13, 2022

@aesteve-rh I see that the first part that is taken from #346 is related to a different bug, https://bugzilla.redhat.com/show_bug.cgi?id=2125290, and seems unrelated to the other changes - shall we clone bz 2125290 to make it visible to QE? where did the issue that it addressed came from (QE/users-list/out test)?

@aesteve-rh
Copy link
Member Author

@aesteve-rh I see that the first part that is taken from #346 is related to a different bug, https://bugzilla.redhat.com/show_bug.cgi?id=2125290, and seems unrelated to the other changes - shall we clone bz 2125290 to make it visible to QE? where did the issue that it addressed came from (QE/users-list/out test)?

Appeared in OST run logs, but the error was ignored, and the OST jobs succeeded anyway, so it went unnoticed until logs were inspected.

This should have low overall impact, as having a host with no lvm devices is rare. Nonetheless, as we are going to do another build, I thought it may be nice to include.

@ahadas
Copy link
Member

ahadas commented Dec 13, 2022

Appeared in OST run logs, but the error was ignored, and the OST jobs succeeded anyway, so it went unnoticed until logs were inspected.

ack, we had a similar case on ovirt-engine and we backported the fix without having a d/s bug
I'm not sure if that's the 'right' way to handle it but it looks like we need a d/s bug to raise awareness for these fixes so I filed https://bugzilla.redhat.com/show_bug.cgi?id=2152845

This should have low overall impact, as having a host with no lvm devices is rare. Nonetheless, as we are going to do another build, I thought it may be nice to include.

right, definitely makes sense to include it, the question is how :)
can you please separate that commit out to a different PR that will be attached to https://bugzilla.redhat.com/show_bug.cgi?id=2152845?

@aesteve-rh aesteve-rh force-pushed the aesteve/backport-prune-bitmaps-fix branch from a1b686e to 9c9f881 Compare December 13, 2022 09:30
@mz-pdm
Copy link
Member

mz-pdm commented Dec 13, 2022

OK, please add this PR to the bug.

@aesteve-rh
Copy link
Member Author

OK, please add this PR to the bug.

It is already tracked in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=2141371

@mz-pdm
Copy link
Member

mz-pdm commented Dec 13, 2022

I see, OK.

Add flag warn_stderr to Command class to
log to WARNING level instead of DEBUG
level if the command execution has data
in the stderr output, even if the command
finished succesfully.

Also, the logged message changes if the
new warn_stderr flag is set, printing
the command that originated the error
output.

Legacy debug logging remains unchanged.

ProgressCommand in qemuimg will enable
the flag so that qemuimg commands show
warning on errors.

Example:
09:49:34,851 WARNING (MainThread) [storage.operation] Command
   ['/usr/bin/qemu-img', 'commit', '-p', '-t', 'none', '-b'...]
   succeeded with warnings: bytearray(b"qcow2_free_clusters failed: No space left on
   device\nqemu-img: Lost persistent bitmaps during inactivation of node \'#block365\':
   Failed to write bitmap \'stale-bitmap5\' to file: No space left on
   device\nqemu-img: Failed to flush the refcount block cache: No space
   left on device\n") (operation:159)']

Signed-off-by: Albert Esteve <aesteve@redhat.com>
Add merge test in block volumes, to have
a realistic scenario with volumes and bitmaps.

When bitmaps are added to the base volume
after the top volume is create, measuring
the top volume will not consider the base
bitmaps size, even with backing enabled.

This will result in not performing an
extend and thus, failing the merge process
with `Failed to write bitmap 'xxx' to file:
No space left on device`.

This test has been used to reproduce the error
in https://bugzilla.redhat.com/2141371

Signed-off-by: Albert Esteve <aesteve@redhat.com>
Add prune_bitmaps function, to remove all stale
bitmaps from a base volume (that are missing from
the top volume).

Signed-off-by: Albert Esteve <aesteve@redhat.com>
Prune the stale base volume bitmaps during
the prepare step on a merge operation.

These stale bitmaps can cause the merge
operation to fail due to 'No space left on device'.
In this case, qemu does not end with error, so
the failure goes unnoticed.

As there is not a reliable way to measure
the size of these stale bitmaps, and they are
invalid and can never be used for incremental
backup, it is better to prune them to avoid
the error.

Related: oVirt#352
Signed-off-by: Albert Esteve <aesteve@redhat.com>
Fix ResourceAlreadyAcquired raise to use f-string
instead of %-format. Also, the %-format is misued,
with a comma in between where the '%' symbol should
be, causing a bogus output.

Before:
vdsm.storage.resourceManager.ResourceAlreadyAcquired: ('%s is already
acquired by %s', '00_storage.a2ad7a23-96c5-4c23-9007-c0d1892ea07d',
'fb7c6873-74b4-4895-876a-a52ab7bfa082')

After:
vdsm.storage.resourceManager.ResourceAlreadyAcquired:
'00_storage.a2ad7a23-96c5-4c23-9007-c0d1892ea07d is already
acquired by fb7c6873-74b4-4895-876a-a52ab7bfa082'

Signed-off-by: Albert Esteve <aesteve@redhat.com>
Add prune_bitmaps method to internal hsm api.
Also add locking to prune_bitmaps method:
- shared lock on the storage domain
- exclusive lock on the image

Since this should be called only from
virt code, and the vm owns the volumes,
no other storage code should access
them (engine should ensure this), but
it is safer to take locks so bad engine
request will fail at Vdsm instead of
corrupting the volume.

By explicitely taking the locks with
a guarded context, we do not need to
call the _produce_volume helper method,
as it already takes a shared lock on the
storage domain.
Instead, produce the volumes directly.

Signed-off-by: Albert Esteve <aesteve@redhat.com>
Prune stale bitmaps before a live merge to avoid
failing with ENOSPC.

Fixes: oVirt#352
Signed-off-by: Albert Esteve <aesteve@redhat.com>
Record and verify prune_bitmaps calls in the tests.

Signed-off-by: Albert Esteve <aesteve@redhat.com>
Add support to fake errors in prune_bitmaps calls
to the IRS fake class.

Signed-off-by: Albert Esteve <aesteve@redhat.com>
@aesteve-rh aesteve-rh force-pushed the aesteve/backport-prune-bitmaps-fix branch from 9c9f881 to 3879121 Compare December 13, 2022 12:29
@aesteve-rh aesteve-rh merged commit f1fc337 into oVirt:ovirt-4.5.3.z Dec 13, 2022
@aesteve-rh aesteve-rh deleted the aesteve/backport-prune-bitmaps-fix branch December 13, 2022 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is a bug or fix for a bug storage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants