Send messages getting stuck in mediator and not-redilivered #2111

Closed
Tracked by #842
jleach opened this issue Feb 3, 2023 · 5 comments · Fixed by #2147

@jleach

jleach commented Feb 3, 2023

I'm documenting this issue with video and sample code here. The TL;DR version of the issue is that messages can become stuck in an ACA-py mediator under certain circumstances. The mediator does not retry sending the messages, and they are only picked up by other agents if those agents specifically request message delivery. For example, in AFJ the API initiateMessagePickup() must be called when a connection enters the "completed" state to see if any messages are queued up.
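
A minimal sketch of that workaround, written in Python for illustration rather than against the real AFJ API. The handler name and the `initiate_message_pickup` callback are hypothetical stand-ins for an AFJ connection-state listener and `initiateMessagePickup()`; the actual event wiring in AFJ will differ.

```python
from typing import Callable


def make_pickup_on_completed(
    initiate_message_pickup: Callable[[], None],
) -> Callable[[str, str], None]:
    """Return a handler that requests queued messages once a connection completes.

    `initiate_message_pickup` is a hypothetical stand-in for AFJ's
    initiateMessagePickup(); it is not the real signature.
    """

    def on_connection_state_changed(previous_state: str, new_state: str) -> None:
        # Only react to the transition *into* "completed", not to repeats.
        if new_state == "completed" and previous_state != "completed":
            initiate_message_pickup()

    return on_connection_state_changed


# Usage: record pickup requests instead of talking to a real mediator.
pickups = []
handler = make_pickup_on_completed(lambda: pickups.append("pickup"))
handler("request-sent", "response-received")  # connection not finished, no pickup
handler("response-received", "completed")     # triggers exactly one pickup
```

Keying on the transition rather than on the state itself avoids asking the mediator again every time a "completed" event is repeated.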

This is not a bug per se in ACA-py, as it appears to be a shortcoming in the V1 protocol, whereby a connection is considered valid in the "request-sent" state. Given this, a good mitigation strategy would be to have queued messages redelivered automatically when a connection transitions to "completed", or on some other trigger.

The following is from the repo I used to document the issue and provide sample code to trigger in AFJ:

Trigger the Issue

  1. Using a fresh install of the BC Wallet, scan Faber's invitation QR code;
  2. As soon as the Faber UI displays the first connection message, shown below,
    press 1 to offer a credential.
{
    "their_label": "BC Wallet",
    "connection_protocol": "connections/1.0",
    "updated_at": "2023-02-02T20:17:34.320044Z",
    "my_did": "KPZpDbMPDXZYVZEKRttanB",
    "connection_id": "b58b186b-d4ac-439c-9eb4-93ed38ba6eef",
    "rfc23_state": "response-sent",
    "invitation_key": "EAUsTQpExyNa7LpwQtDgdxbbWASA4ubMBhAGeGbV1uzX",
    "routing_state": "none",
    "invitation_mode": "once",
    "accept": "auto",
    "their_role": "invitee",
    "created_at": "2023-02-02T20:13:32.734545Z",
    "their_did": "4AM74FNHKm3RCHq327rMnt",
    "state": "response"
}
  3. The offer will be sent over the connection before it is fully set up, resulting
    in the message being queued on the mediator.
  4. The following message will then be displayed in Faber, indicating the connection
    is fully set up:
{
    "their_label": "BC Wallet",
    "connection_protocol": "connections/1.0",
    "updated_at": "2023-02-02T20:17:37.660641Z",
    "my_did": "KPZpDbMPDXZYVZEKRttanB",
    "connection_id": "b58b186b-d4ac-439c-9eb4-93ed38ba6eef",
    "rfc23_state": "completed",
    "invitation_key": "EAUsTQpExyNa7LpwQtDgdxbbWASA4ubMBhAGeGbV1uzX",
    "routing_state": "none",
    "invitation_mode": "once",
    "accept": "auto",
    "their_role": "invitee",
    "created_at": "2023-02-02T20:13:32.734545Z",
    "their_did": "4AM74FNHKm3RCHq327rMnt",
    "state": "response"
}
  5. At this point nothing will show up in the wallet. Returning to the home screen
    will display no notifications (offers).

Same Scenario, No Issue

  1. Using a fresh install of the BC Wallet, scan Faber's invitation QR code;
  2. You will see a message similar to the following in the Faber UI; do nothing:
{
    "their_label": "BC Wallet",
    "connection_protocol": "connections/1.0",
    "updated_at": "2023-02-02T20:17:34.320044Z",
    "my_did": "KPZpDbMPDXZYVZEKRttanB",
    "connection_id": "b58b186b-d4ac-439c-9eb4-93ed38ba6eef",
    "rfc23_state": "response-sent",
    "invitation_key": "EAUsTQpExyNa7LpwQtDgdxbbWASA4ubMBhAGeGbV1uzX",
    "routing_state": "none",
    "invitation_mode": "once",
    "accept": "auto",
    "their_role": "invitee",
    "created_at": "2023-02-02T20:13:32.734545Z",
    "their_did": "4AM74FNHKm3RCHq327rMnt",
    "state": "response"
}
  3. Wait for the following message to be displayed in the Faber UI:
{
    "their_label": "BC Wallet",
    "connection_protocol": "connections/1.0",
    "updated_at": "2023-02-02T20:17:37.660641Z",
    "my_did": "KPZpDbMPDXZYVZEKRttanB",
    "connection_id": "b58b186b-d4ac-439c-9eb4-93ed38ba6eef",
    "rfc23_state": "completed",
    "invitation_key": "EAUsTQpExyNa7LpwQtDgdxbbWASA4ubMBhAGeGbV1uzX",
    "routing_state": "none",
    "invitation_mode": "once",
    "accept": "auto",
    "their_role": "invitee",
    "created_at": "2023-02-02T20:13:32.734545Z",
    "their_did": "4AM74FNHKm3RCHq327rMnt",
    "state": "response"
}
  4. Once this message is displayed, press 1 to offer a credential.
  5. At this point the offer will show up in the wallet. Returning to the home screen
    will display the notification (offer).
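
The difference between the two scenarios comes down to which `rfc23_state` the controller has seen before it sends the offer. A minimal sketch of that check, assuming the controller receives connection messages shaped like the ones above; `is_safe_to_send` is a hypothetical helper, not part of ACA-py.

```python
import json


def is_safe_to_send(connection_event: dict) -> bool:
    """Return True only when the connection message reports rfc23_state "completed"."""
    return connection_event.get("rfc23_state") == "completed"


# Trimmed copies of the two Faber messages shown above.
early = json.loads(
    '{"connection_id": "b58b186b-d4ac-439c-9eb4-93ed38ba6eef", "rfc23_state": "response-sent"}'
)
ready = json.loads(
    '{"connection_id": "b58b186b-d4ac-439c-9eb4-93ed38ba6eef", "rfc23_state": "completed"}'
)

for event in (early, ready):
    action = "send offer" if is_safe_to_send(event) else "wait"
    print(event["rfc23_state"], "->", action)
# prints:
# response-sent -> wait
# completed -> send offer
```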

Expected Behaviour

RFC 0160 does not require an acknowledgement that a connection is completed before messages can be sent over it. This is addressed in V2.

An ACA-py mediator should attempt delivery of any queued messages when the related connection becomes "completed" to remediate this issue.
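
As a rough illustration of that behaviour, the toy model below holds messages that arrive early and flushes them when the connection state changes to "completed". It is an in-memory sketch under stated assumptions, not ACA-py's implementation: the `QueueingMediator` class, its method names, and the `deliver` callback are all hypothetical, and this issue does not show how ACA-py actually tracks queued messages.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class QueueingMediator:
    """Toy in-memory model of a mediator; not ACA-py's actual implementation."""

    def __init__(self, deliver: Callable[[str, str], None]) -> None:
        self._deliver = deliver                  # hypothetical transport callback
        self._state: Dict[str, str] = {}         # connection_id -> rfc23_state
        self._queued: Dict[str, Deque[str]] = defaultdict(deque)

    def forward(self, connection_id: str, message: str) -> None:
        # Deliver immediately on a completed connection, otherwise hold the message.
        if self._state.get(connection_id) == "completed":
            self._deliver(connection_id, message)
        else:
            self._queued[connection_id].append(message)

    def set_state(self, connection_id: str, state: str) -> None:
        self._state[connection_id] = state
        if state == "completed":
            self._flush(connection_id)

    def _flush(self, connection_id: str) -> None:
        # Redeliver in arrival order, removing each message as it is sent.
        queue = self._queued[connection_id]
        while queue:
            self._deliver(connection_id, queue.popleft())


delivered = []
mediator = QueueingMediator(lambda cid, msg: delivered.append((cid, msg)))
mediator.set_state("conn-1", "response-sent")
mediator.forward("conn-1", "credential-offer")  # arrives early, so it is queued
mediator.set_state("conn-1", "completed")       # the transition flushes the queue
```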

Q & A

  1. How do you know the message is queued in the mediator?

In AFJ, the function initiateMessagePickup can be called to trigger the delivery of messages. When it is called, the outstanding offer is delivered.

  2. Is this infrastructure (OpenShift, Cloud, Kubernetes) related?

This problem exists on two mediators running similar versions of ACA-py, hosted on different infrastructure by two different companies. It can also be reproduced locally using Docker.

  3. Is this specific to a version of ACA-py?

It can be reproduced locally in Docker using ACA-py 0.7.3 and 1.0.0-rc1. The cloud-hosted agents both run ACA-py 0.7.x versions.

  4. Why do you think it's a race condition?

In one test we used an ACA-py 0.7.3 mediator on a cloud platform which had been running for 1 day under light load. The problem was evident. On the same cloud platform, an ACA-py mediator running 0.7.4-rc2 had been running for <10 min, and the problem did not present. This leads us to believe that as a mediator is used, performance degrades enough for the problem to present.

  5. Could this be the issuer rather than the mediator?

Unlikely. The situation can be reproduced using the BC Showcase demo. With the older mediator mentioned in #4 above, the automated showcase demo fails. With the fresh mediator from #4 above, the demo succeeds.

  6. What connection protocol is being used?

V1.

  7. Is this a bug?

Maybe. RFC 0160 does not require acknowledgement when a connection enters the "completed" state. It is by convention that a controller should confirm the state before sending a message (offer) over the connection. This is a recognized shortcoming that should be addressed in V2.

However, it may be a bug in that an ACA-py mediator does not attempt redelivery of a queued message if that message comes in before the connection is "completed".

@jleach
Author

jleach commented Feb 3, 2023

@ianco FYI.

@ianco
Contributor

ianco commented Feb 6, 2023

Thanks @jleach

Are 2.0 connections supported in AFJ/BiFold? I thought @amanji said they were but want to confirm.

Per the discussion last week, if we:

  • switch over to use the connections 2.0 protocol
  • update aca-py to reject any requests on a 2.0 connection unless it is completed
  • ensure that AFJ/BiFold doesn't allow the connection to be completed until the mediation setup is completed

then this will help prevent the mediator from receiving undeliverable messages.

@swcurran your thoughts?

The mediator message re-delivery issue is a separate issue from the above. I think the mediator solution should involve Redis (Redis can hold the message if it's undeliverable for whatever reason and the message will survive an agent re-start).
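
To illustrate only the durability property described here (a held message outliving an agent restart), the sketch below swaps in SQLite from the Python standard library as a stand-in for Redis so that it stays self-contained. The `DurableQueue` class and its methods are hypothetical; a Redis-backed version would use a Redis client and a different storage layout.

```python
import os
import sqlite3
import tempfile


class DurableQueue:
    """Undelivered-message store that survives a process restart.

    SQLite stands in for Redis here purely to keep the sketch self-contained.
    """

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS queued ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, recipient TEXT, body TEXT)"
        )
        self._db.commit()

    def hold(self, recipient: str, body: str) -> None:
        self._db.execute(
            "INSERT INTO queued (recipient, body) VALUES (?, ?)", (recipient, body)
        )
        self._db.commit()

    def take_all(self, recipient: str) -> list:
        # Return queued bodies in arrival order and remove them from the store.
        rows = self._db.execute(
            "SELECT id, body FROM queued WHERE recipient = ? ORDER BY id", (recipient,)
        ).fetchall()
        self._db.execute("DELETE FROM queued WHERE recipient = ?", (recipient,))
        self._db.commit()
        return [body for _, body in rows]

    def close(self) -> None:
        self._db.close()


path = os.path.join(tempfile.mkdtemp(), "mediator-queue.db")
before_restart = DurableQueue(path)
before_restart.hold("wallet-1", "credential-offer")
before_restart.close()                  # simulate the agent stopping

after_restart = DurableQueue(path)      # simulate the agent starting again
recovered = after_restart.take_all("wallet-1")
```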

@jleach
Author

jleach commented Feb 6, 2023

If by v2 you mean OOB, then yes, we do OOB. I'm a little unclear about 2.0 beyond that. Even if AFJ did 2.0, I think it would still leave a hole for all other frameworks that don't do it, including older agents that don't regularly update ACA-py. I don't think ACA-py supports v2 connections - does it? I seem to remember an issue when using AFJ where I had to dig for the legacy DID ADB123 rather than the new DID format did:peer:xxx123abc, which I thought was part of oob/v2 connections.

@swcurran
Contributor

swcurran commented Feb 6, 2023

AIP 1.0 had RFC 0160 Connections as its way to establish connections. And it sucked because of the issue with marking a connection as “complete”: the very thing you are running into, which we’ve known about since all the participants first implemented it and started to use it in the wild. It also sucked because it doesn’t support connection reuse (below).

AIP 2.0 replaces RFC 0160 with OOB and RFC 0023 (lower number — what???). The changes (quick summary):

  • OOB can be used for establishing connections, doing an interaction without establishing a connection or both (establish a connection and immediately jump into something else).
  • OOB enabled “connection reuse”: being able to receive an invitation, realize that you already had a connection, and reuse it rather than create a new one. Ridiculous that this and “finalize the connection” were missed in RFC 0160. We had no practical experience, and once we did, we saw the weaknesses.
  • DID Exchange added the proper finishing of the connection.

The challenge has been that the Mobile Wallets (Trinsic and the like) have been very slow to move to AIP 2.0 and without Wallet support, it’s hard to deploy issuers and verifiers that use AIP 2.0. Hence why we’ve been encouraging for a while getting AFJ to AIP 2.0, getting Bifold to AIP 2.0 and then phasing out (as quickly as possible) the use of AIP 1.0 and especially the use of RFC 0160 Connections.

So the question is — does Bifold support both OOB and DID Exchange, and if so, can we phase out using RFC 0160 on the server side?

@jleach
Author

jleach commented Feb 7, 2023

@swcurran Yup, AFJ does both OOB (RFC 0434) and DID Exchange Protocol (RFC 0023).
