
Attribute error during timeout cancelling transfer #208

Closed
jsattelb opened this issue Dec 27, 2023 · 12 comments · Fixed by #209
Labels
bug Issue is a bug or fix for a bug good first issue Good issue for a new contributor

Comments

@jsattelb
Contributor

Recently our backup appliance kept receiving a "status=500" from RemoveImageTicketVDS. I believe the actual call should be a status=409, conflict. Changing the reference to TicketCancelTimeout to TransferCancelTimeout in _internal/handlers/tickets.py seems to produce the desired event.
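A sketch of that one-word change in the ticket delete handler. The names below (`HTTPError`, `CONFLICT`, the raise inside `delete`) are minimal stand-ins for the real imageio modules, not the actual code, and the surrounding structure is approximate:

```python
# Sketch of the fix in _internal/handlers/tickets.py (approximate).
CONFLICT = 409

class TransferCancelTimeout(Exception):
    """Stand-in for ovirt_imageio._internal.errors.TransferCancelTimeout."""

class HTTPError(Exception):
    """Stand-in for imageio's http.Error."""
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code

def delete(ticket_id):
    try:
        # Stand-in for self.auth.remove(ticket_id), which raises
        # TransferCancelTimeout when ongoing operations don't finish in time.
        raise TransferCancelTimeout(f"Timeout cancelling transfer {ticket_id}")
    except TransferCancelTimeout as e:  # was: errors.TicketCancelTimeout
        raise HTTPError(CONFLICT, str(e)) from None
```

With the correct exception name the timeout is mapped to 409 Conflict; with the stale name, evaluating the `except` clause itself raised AttributeError, which surfaced as the 500.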

oVirt Event Log (before):

VDSM <FQDN> command RemoveImageTicketVDS failed: Image daemon request failed: 'status=500, reason=Internal Server Error, error=Server failed to perform the request, check logs\n'

oVirt Event Log (after):

VDSM <FQDN> command RemoveImageTicketVDS failed: Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185\n'

File: /var/log/ovirt-imageio/daemon.log (before)

2023-12-26 20:59:06,957 INFO    (Thread-19) [http] OPEN connection=19 client=local
2023-12-26 20:59:06,958 INFO    (Thread-19) [tickets] [local] REMOVE transfer=7ef54af5-bc50-4705-948e-81d7a606ec35
2023-12-26 20:59:06,958 DEBUG   (Thread-19) [auth] Cancelling transfer 7ef54af5-bc50-4705-948e-81d7a606ec35
2023-12-26 20:59:06,958 DEBUG   (Thread-19) [auth] Canceling transfer 7ef54af5-bc50-4705-948e-81d7a606ec35 ongoing operations
2023-12-26 20:59:06,958 DEBUG   (Thread-19) [ops] Cancelling operation <Read size=53687091200 offset=0 done=0 at 0x139773690712528>
2023-12-26 20:59:06,958 INFO    (Thread-19) [auth] Waiting until transfer 7ef54af5-bc50-4705-948e-81d7a606ec35 ongoing operations finish
2023-12-26 21:00:06,958 ERROR   (Thread-19) [http] Server error
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/tickets.py", line 99, in delete
    self.auth.remove(ticket_id)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 413, in remove
    if ticket.cancel(self._config.control.remove_timeout):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 324, in cancel
    raise errors.TransferCancelTimeout(self.transfer_id)
ovirt_imageio._internal.errors.TransferCancelTimeout: Timeout cancelling transfer 7ef54af5-bc50-4705-948e-81d7a606ec35

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 796, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 841, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/tickets.py", line 100, in delete
    except errors.TicketCancelTimeout as e:
AttributeError: module 'ovirt_imageio._internal.errors' has no attribute 'TicketCancelTimeout'
2023-12-26 21:00:06,961 DEBUG   (Thread-19) [http] Empty request line, client disconnected
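The "During handling of the above exception" chain in the log follows from how Python evaluates `except` clauses: the class named after `except` is resolved only while an exception is propagating, so the stale `TicketCancelTimeout` reference stays hidden until a cancel genuinely times out. A minimal standalone reproduction (the errors module is stubbed with a namespace):

```python
import types

# Stand-in for ovirt_imageio._internal.errors after the rename: only the
# new name exists; the old TicketCancelTimeout attribute is gone.
errors = types.SimpleNamespace()

class TransferCancelTimeout(Exception):
    pass

errors.TransferCancelTimeout = TransferCancelTimeout

def remove_ticket():
    try:
        raise errors.TransferCancelTimeout("Timeout cancelling transfer")
    except errors.TicketCancelTimeout:  # AttributeError is raised *here*,
        pass                            # at handling time, not at import time

try:
    remove_ticket()
except AttributeError as e:
    print(e)  # ...has no attribute 'TicketCancelTimeout'
```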

File: /var/log/ovirt-imageio/daemon.log (after)

2023-12-26 23:19:51,490 INFO    (Thread-60) [http] OPEN connection=60 client=local
2023-12-26 23:19:51,490 INFO    (Thread-60) [tickets] [local] REMOVE transfer=a6bd976e-85ed-431d-bd4d-f4fe26fd7185
2023-12-26 23:19:51,490 DEBUG   (Thread-60) [auth] Cancelling transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185
2023-12-26 23:19:51,490 DEBUG   (Thread-60) [auth] Canceling transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185 ongoing operations
2023-12-26 23:19:51,490 DEBUG   (Thread-60) [ops] Cancelling operation <Read size=25769803776 offset=0 done=0 at 0x140330263172432>
2023-12-26 23:19:51,490 INFO    (Thread-60) [auth] Waiting until transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185 ongoing operations finish
@nirs
Member

nirs commented Jan 11, 2024

@jsattelb Thanks for reporting! Do you want to try to send a simple PR? This looks like a trivial error.

Adding a test to cover this case would be nice, I think we already have tests in this area.

@nirs nirs added bug Issue is a bug or fix for a bug good first issue Good issue for a new contributor labels Jan 11, 2024
@nirs
Member

nirs commented Jan 11, 2024

The bug was introduced by 03e2f60 renaming TicketCancelTimeout to TransferCancelTimeout without replacing all usages of the old name.
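A rename like this can be checked for stale usages with a plain text search; in the real checkout that would be `git grep -n TicketCancelTimeout`. Simulated here on a throwaway file so the snippet is self-contained:

```shell
# Simulate sweeping a tree for leftover references to the old exception name.
tmpdir=$(mktemp -d)
printf 'except errors.TicketCancelTimeout as e:\n' > "$tmpdir/tickets.py"
grep -rn "TicketCancelTimeout" "$tmpdir"   # any hit is a stale usage
rm -rf "$tmpdir"
```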

@humnab

humnab commented Jan 12, 2024

After changing TicketCancelTimeout to TransferCancelTimeout I still get an error when backing up a VM (incremental), but now it's a conflict error:

command RemoveImageTicketVDS failed: Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97\n'

This is the /var/log/ovirt-imageio/daemon.log of the transfer on the node:

2024-01-12 16:47:56,175 INFO    (Thread-138) [http] OPEN connection=138 client=local
2024-01-12 16:47:56,176 INFO    (Thread-138) [tickets] [local] ADD transfer={'dirty': True, 'ops': ['read'], 'size': 21474836480, 'sparse': False, 'inactivity_timeout': 3600, 'transfer_id': 'b1d9c8fd-ed54-4935-968f-200d8cde3c97', 'uuid': '13e7674a-7171-4df2-ad2d-1a8cf633296a', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/b1d9c8fd-ed54-4935-968f-200d8cde3c97.sock'}
2024-01-12 16:47:56,176 INFO    (Thread-138) [http] CLOSE connection=138 client=local [connection 1 ops, 0.000856 s] [dispatch 1 ops, 0.000394 s]
2024-01-12 16:47:56,281 INFO    (Thread-139) [http] OPEN connection=139 client=::ffff:192.168.63.223
2024-01-12 16:47:56,307 INFO    (Thread-140) [http] OPEN connection=140 client=::ffff:192.168.63.223
2024-01-12 16:47:56,308 INFO    (Thread-140) [extents] [::ffff:192.168.63.223] EXTENTS transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97 context=dirty
2024-01-12 16:47:56,639 INFO    (Thread-141) [http] OPEN connection=141 client=local
2024-01-12 16:47:56,640 INFO    (Thread-141) [http] CLOSE connection=141 client=local [connection 1 ops, 0.000736 s] [dispatch 1 ops, 0.000203 s]
2024-01-12 16:47:57,005 INFO    (Thread-142) [http] OPEN connection=142 client=::ffff:192.168.63.223
2024-01-12 16:47:57,009 INFO    (Thread-143) [http] OPEN connection=143 client=::ffff:192.168.63.223
2024-01-12 16:47:57,013 INFO    (Thread-144) [http] OPEN connection=144 client=::ffff:192.168.63.223
2024-01-12 16:47:58,675 INFO    (Thread-145) [http] OPEN connection=145 client=local
2024-01-12 16:47:58,676 INFO    (Thread-145) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:47:58,676 INFO    (Thread-145) [auth] Waiting until transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 ongoing operations finish
2024-01-12 16:48:58,677 ERROR   (Thread-145) [http] Client error: Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:48:58,678 INFO    (Thread-145) [http] CLOSE connection=145 client=local [connection 1 ops, 60.001998 s] [dispatch 1 ops, 60.001119 s]
2024-01-12 16:49:02,796 INFO    (Thread-146) [http] OPEN connection=146 client=local
2024-01-12 16:49:02,797 INFO    (Thread-146) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:49:02,797 INFO    (Thread-146) [auth] Waiting until transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 ongoing operations finish
2024-01-12 16:49:11,292 INFO    (Thread-140) [http] CLOSE connection=140 client=::ffff:192.168.63.223 [connection 1 ops, 74.984123 s] [dispatch 10 ops, 0.035233 s] [extents 1 ops, 0.014602 s] [read 9 ops, 0.017504 s, 3.02 MiB, 172.42 MiB/s] [read.read 9 ops, 0.012131 s, 3.02 MiB, 248.78 MiB/s] [read.write 9 ops, 0.005051 s, 3.02 MiB, 597.50 MiB/s]
2024-01-12 16:49:11,294 INFO    (Thread-142) [http] CLOSE connection=142 client=::ffff:192.168.63.223 [connection 1 ops, 74.288057 s] [dispatch 2 ops, 0.011759 s] [read 2 ops, 0.010456 s, 2.00 MiB, 191.29 MiB/s] [read.read 2 ops, 0.007767 s, 2.00 MiB, 257.51 MiB/s] [read.write 2 ops, 0.002590 s, 2.00 MiB, 772.21 MiB/s]
2024-01-12 16:49:11,294 INFO    (Thread-144) [http] CLOSE connection=144 client=::ffff:192.168.63.223 [connection 1 ops, 74.280532 s] [dispatch 1 ops, 0.008919 s] [read 1 ops, 0.006612 s, 1.00 MiB, 151.24 MiB/s] [read.read 1 ops, 0.005209 s, 1.00 MiB, 191.98 MiB/s] [read.write 1 ops, 0.001357 s, 1.00 MiB, 736.89 MiB/s]
2024-01-12 16:49:11,295 INFO    (Thread-143) [http] CLOSE connection=143 client=::ffff:192.168.63.223 [connection 1 ops, 74.284946 s] [dispatch 2 ops, 0.014787 s] [read 2 ops, 0.013882 s, 2.00 MiB, 144.07 MiB/s] [read.read 2 ops, 0.009852 s, 2.00 MiB, 203.00 MiB/s] [read.write 2 ops, 0.003917 s, 2.00 MiB, 510.60 MiB/s]
2024-01-12 16:50:02,797 ERROR   (Thread-146) [http] Client error: Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:50:02,799 INFO    (Thread-146) [http] CLOSE connection=146 client=local [connection 1 ops, 60.001764 s] [dispatch 1 ops, 60.000980 s]
2024-01-12 16:50:10,829 INFO    (Thread-147) [http] OPEN connection=147 client=local
2024-01-12 16:50:10,829 INFO    (Thread-147) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:50:10,829 INFO    (Thread-147) [auth] Waiting until transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 ongoing operations finish
2024-01-12 16:50:16,605 ERROR   (Thread-139) [http] Request aborted after starting response
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 796, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 841, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 80, in wrapper
    return func(self, req, resp, *args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/images.py", line 146, in get
    raise http.Error(http.FORBIDDEN, str(e)) from None
ovirt_imageio._internal.http.Error: You are not allowed to access this resource: Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled
2024-01-12 16:50:16,607 INFO    (Thread-139) [http] CLOSE connection=139 client=::ffff:192.168.63.223 [connection 1 ops, 140.324541 s] [dispatch 1 ops, 140.320217 s] [read 0 ops, 140.318857 s] [read.read 1 ops, 0.013382 s, 8.00 MiB, 597.81 MiB/s] [read.write 0 ops, 140.305374 s]
2024-01-12 16:50:16,653 INFO    (Thread-147) [auth] Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled
2024-01-12 16:50:16,654 INFO    (Thread-147) [http] CLOSE connection=147 client=local [connection 1 ops, 5.825172 s] [dispatch 1 ops, 5.824329 s]

The problem started with the last update of our backup software (Veeam), and they say this error does not occur on RHV.
Can someone check if 03e2f60 is also in the official RHV packages?

@jsattelb
Contributor Author

I still receive multiple status=409 conflict messages throughout the backup process in the oVirt UI, but the backups seem to complete successfully. I haven't seen the backups stuck in a "finalizing" state for nearly three weeks. The response is used to inform the process that there is still an active connection (e.g., backup) in progress and to try again later. That is my understanding of it, at least from acf69ee.

I also receive the same backtrace referenced in the debug log when the backup request no longer exists. It's not reported in the oVirt UI, so I ignored it for now.

@nirs
Member

nirs commented Jan 12, 2024

What we see in the log (with the fix) is no error on the imageio side. If there is an issue, it is in the engine or in the client code.

  1. Ticket was added:
2024-01-12 16:47:56,176 INFO    (Thread-138) [tickets] [local] ADD transfer={'dirty': True, 'ops': ['read'], 'size': 21474836480, 'sparse': False, 'inactivity_timeout': 3600, 'transfer_id': 'b1d9c8fd-ed54-4935-968f-200d8cde3c97', 'uuid': '13e7674a-7171-4df2-ad2d-1a8cf633296a', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/b1d9c8fd-ed54-4935-968f-200d8cde3c97.sock'}
  2. Client gets the dirty extents (which extents changed since the last backup):
2024-01-12 16:47:56,308 INFO    (Thread-140) [extents] [::ffff:192.168.63.223] EXTENTS transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97 context=dirty
  3. Ticket removed 2 seconds later - meaning that on the engine side the client finished the transfer:
2024-01-12 16:47:58,676 INFO    (Thread-145) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
  4. Timeout while removing the ticket to end the transfer:
2024-01-12 16:48:58,677 ERROR   (Thread-145) [http] Client error: Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97

This means there are active client connections, probably blocked in recv() on the client socket, waiting for the next request. The client did not close all connections to the image before ending the transfer - this is a client error.

  5. Request aborted while sending the response:
2024-01-12 16:50:16,605 ERROR   (Thread-139) [http] Request aborted after starting response
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 796, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 841, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 80, in wrapper
    return func(self, req, resp, *args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/images.py", line 146, in get
    raise http.Error(http.FORBIDDEN, str(e)) from None
ovirt_imageio._internal.http.Error: You are not allowed to access this resource: Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled

This means that the client closed the connection without reading the entire response. I'm not sure how this happens if the backup is successful, but this is again a client error.

  6. Transfer finally removed:
2024-01-12 16:50:16,607 INFO    (Thread-139) [http] CLOSE connection=139 client=::ffff:192.168.63.223 [connection 1 ops, 140.324541 s] [dispatch 1 ops, 140.320217 s] [read 0 ops, 140.318857 s] [read.read 1 ops, 0.013382 s, 8.00 MiB, 597.81 MiB/s] [read.write 0 ops, 140.305374 s]
2024-01-12 16:50:16,653 INFO    (Thread-147) [auth] Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled

I don't know what the client code accessing the image does, but I guess it is this:

  1. Start transfer
  2. Get extents to find the dirty extents to backup
  3. Open multiple connections for reading the data
  4. Copy the interesting data
  5. End the transfer before closing all connections
  6. Wait until transfer ends
  7. Close all imageio connections

What the client should do is:

  1. Start transfer
  2. Get extents to find the dirty extents to backup
  3. Open multiple connections for reading the data
  4. Copy the interesting data
  5. Close all imageio connections
  6. End the transfer
  7. Wait until transfer ends

With this, the ticket will be removed immediately and the transfer will finish very quickly.
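The two sequences above differ only in where "close all connections" sits relative to "end the transfer". A toy sketch of the corrected ordering; the stub functions are illustrative only, standing in for real engine SDK and imageio HTTP calls:

```python
# Toy ordering sketch; each stub records when it runs.
events = []

def start_transfer():     events.append("start")
def get_dirty_extents():  events.append("extents")
def copy_dirty_data():    events.append("copy")
def close_connections():  events.append("close")
def end_transfer():       events.append("end")

start_transfer()
get_dirty_extents()
copy_dirty_data()
close_connections()  # no ongoing operations remain on the ticket...
end_transfer()       # ...so ticket removal returns immediately

# Closing the data connections before ending the transfer is the
# invariant the client must keep.
assert events.index("close") < events.index("end")
```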

The error in imageio should be fixed, but it is not the root cause.

Latest RHV contains the fix that introduced the AttributeError, but maybe Veeam did not test with an older RHV version?

@nirs
Member

nirs commented Jan 12, 2024

@jsattelb Regarding RHV - the issue was introduced in imageio 2.4.1 released in the last oVirt 4.4.z version.

oVirt 4.5.x is called RHV 4.4 SPy - I'm not sure about the exact numbers. However, the imageio version is exactly the same in oVirt and RHV. You can ask Veeam which imageio version was tested.

@humnab

humnab commented Jan 12, 2024

@nirs Thanks for the detailed analysis and explanation!
You wrote:

Latest RHV contains the fix that introduced the AttributeError, but maybe Veeam did not test with an older RHV version?

Does that mean that the latest official RHV (4.4 SP1 BU3?) has TransferCancelTimeout in tickets.py and the latest oVirt (4.5.5) tickets.py has TicketCancelTimeout?

Veeam Support told me that 4.4 SP1 is required, but 4.4 SP1 without a Batch Update is quite old, so if this was fixed/changed in 4.4 SP1 Batch Update 2 or 3, I think the Veeam system requirements are incorrect and they tested only with a recent version.

@jsattelb I had stuck and failed backups with the first backups after the update to 12.1, but after some days those were gone, even without the fix. They still produce these errors and take more time than with 12.0, perhaps a result of the incorrect sequence they use now.


@humnab

humnab commented Jan 13, 2024

I compiled and downgraded a node to:
https://ftp.redhat.com/redhat/linux/enterprise/8Base/en/RHEV/SRPMS/ovirt-imageio-2.4.8-1.el8ev.src.rpm
followed by a reboot.
I get the same error as before:
command RemoveImageTicketVDS failed: Image daemon request failed: 'status=500, reason=Internal Server Error, error=Server failed to perform the request, check logs\n'

@nirs
Member

nirs commented Jan 13, 2024

I compiled and downgraded a Node to ... ovirt-imageio-2.4.8-1 ... I get the same error as before: command RemoveImageTicketVDS failed: Image daemon request failed: 'status=500, reason=Internal Server Error.

Since this issue has existed since ovirt-imageio 2.4.1, getting "500 Internal Server Error" is not surprising.

@humnab

humnab commented Jan 13, 2024

Do you know if:
https://ftp.redhat.com/redhat/linux/enterprise/8Base/en/RHEV/SRPMS/ovirt-imageio-2.4.8-1.el8ev.src.rpm
is the same version as in the official RHV packages? If yes, Veeam must have had the same errors in their test environment. If not, do you have access to the official packages, and if yes, can you look for a difference?
Thanks; I found no way to get a subscription or trial of RHV myself.

@nirs
Member

nirs commented Jan 14, 2024

Do you know if: https://ftp.redhat.com/redhat/linux/enterprise/8Base/en/RHEV/SRPMS/ovirt-imageio-2.4.8-1.el8ev.src.rpm is the same version as in the official RHV packages?

I think what you find on ftp.redhat.com is the official package. More importantly, ovirt-imageio from GitHub contains exactly the same code you get in oVirt RPMs or RHV RPMs. There is no special code added in RHV.
