[EPIC] BC Wallet Mediator #6

Open
4 tasks
krobinsonca opened this issue Jan 25, 2023 · 1 comment

@krobinsonca
Collaborator

krobinsonca commented Jan 25, 2023

Tasks

Acceptance Criteria

  • Required - Mediator Service deployed in HA fashion, running multiple replicas (3+)
    • This includes the Agent (auto-scaled), wallet (HA instance of Postgres), Proxy (auto-scaled), External Queue (HA instance of Redis or Kafka), and Message Workers (auto-scaled).
    • Note regarding the Proxy: it would be nice to eliminate this layer, though it is useful for traffic control and rate limiting. However, relying on it to route messages to the separate http and ws ports of the aca-py agents is undesirable.
    • With the use of web sockets, auto-scaling is best performed on the basis of the web socket connections themselves. This helps ensure that pods with active web socket connections are not terminated prematurely when the system scales down. There are several ways to accomplish this; a metrics sketch follows this list.
  • Highly Desired - No session/cookie affinity: an agent may seamlessly connect and be served by any of the replicas
    • The use of web sockets is what currently drives the need for session affinity. The socket is opened between a client and an agent/worker instance; once established, all traffic must be routed between that same client and agent/worker instance. This is a challenge in K8s/OCP, especially when it comes to HPAs.
    • Are there any alternatives to using web sockets?
  • Required - Uptime/performance monitoring at the Aries protocol level (not just http/s)
    • For example, the k8s-compatible status endpoints on the agents do not provide sufficient information regarding the state and health of the websocket connections, nor do they provide any metrics on the durability and longevity of those connections. A probe sketch follows this list.
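
To make the "scale on web socket connections" idea concrete, here is a minimal sketch (not taken from any of the linked PRs) of a replica exporting a Prometheus gauge of its open websocket connections plus a histogram of connection lifetimes; an HPA driven by custom/external metrics could then scale on the gauge instead of CPU. The metric names, ports, and the `websockets`/`prometheus_client` dependencies are illustrative assumptions, not what the mediator service currently does.

```python
# Sketch only: count and time websocket connections so Prometheus can scrape them
# and an HPA driven by custom metrics can scale on connection count.
import asyncio
import time

import websockets                      # assumed dependency
from prometheus_client import Gauge, Histogram, start_http_server

WS_CONNECTIONS = Gauge(
    "mediator_active_websocket_connections",
    "Open websocket connections on this replica",
)
WS_LIFETIME = Histogram(
    "mediator_websocket_connection_seconds",
    "How long websocket connections stay open (durability/longevity)",
)

async def handler(websocket):
    """Wrap the real message handling so every connection is counted and timed."""
    WS_CONNECTIONS.inc()
    started = time.monotonic()
    try:
        async for message in websocket:
            ...  # hand the inbound (packed) DIDComm message to the agent/worker
    finally:
        WS_CONNECTIONS.dec()
        WS_LIFETIME.observe(time.monotonic() - started)

async def main():
    start_http_server(9100)            # /metrics endpoint for Prometheus
    async with websockets.serve(handler, "0.0.0.0", 8023):
        await asyncio.Future()         # run forever

if __name__ == "__main__":
    asyncio.run(main())
```

Scaling down on such a gauge still needs care (e.g. draining connections before pod termination), but it at least ties the autoscaling signal to the thing we actually care about.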
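For the monitoring item, a hedged sketch of an external probe: it opens a websocket to the mediator endpoint and measures connect time and ping/pong round trip. This only verifies the transport layer; a genuine Aries-protocol check would still need to pack and send a trust ping over an established mediation connection, which is not shown here. The URL is a placeholder.

```python
# Sketch only: transport-level websocket probe (connect + ping/pong latency).
import asyncio
import time

import websockets                      # assumed dependency

MEDIATOR_WS_URL = "wss://mediator.example.com/ws"   # placeholder endpoint

async def probe(url: str) -> dict:
    t0 = time.monotonic()
    async with websockets.connect(url) as ws:
        connected = time.monotonic()
        pong_waiter = await ws.ping()                # RFC 6455 ping frame
        await asyncio.wait_for(pong_waiter, timeout=5)
        return {
            "connect_seconds": connected - t0,
            "ping_pong_seconds": time.monotonic() - connected,
        }

if __name__ == "__main__":
    print(asyncio.run(probe(MEDIATOR_WS_URL)))
```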

Blocked By:

Additional Resources:

@esune
Member

esune commented Feb 16, 2023

I did a bit of catching up and, while there are still some items that will require review and planning, I think we have a couple of options to focus on for the short/medium term.

  • The PR linked in the issue description will allow the mediator agent to scale vertically and manage a throughput of 2400+ connections: this should be enough to handle the user volumes we expect in the immediate future. This does NOT help with scenarios involving pod rollout, as the in-progress queue would be lost.
  • This PR (https://github.com/bcgov/openshift-aries-mediator-service/pull/18/files) includes changes that, in theory, should help prevent websockets from being dropped due to scaling, by using sticky sessions and affinity. The changes are already deployed in our dev environment; however, testing appears to have been interrupted by a shift in priorities before it could confirm whether the change was helpful. It would be a good idea to wrap up the testing and confirm whether this approach resolves, or at least mitigates, the horizontal scaling issues.

Both the above approaches should be accompanied by running a persistent queue to handle messages, so that they will not be lost in case of rollouts/re-deployments/failures.

I would suggest we focus on these three items in the short term, and in the meantime complete the investigation of potential next steps/long term strategies to manage mediation.
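
To illustrate the persistent-queue point above, a minimal sketch assuming Redis as the backing store: undelivered messages are pushed to an HA Redis instance rather than held in process memory, so a rollout or crash does not lose the in-progress queue. Key names and the Redis URL are made up; the real implementation would be whatever ACA-Py queue/persistence plugin we settle on.

```python
# Sketch only: park undelivered (packed) messages in Redis so they survive pod restarts.
import redis   # assumed dependency

r = redis.Redis.from_url("redis://redis-ha:6379/0")   # placeholder URL for the HA Redis

def enqueue(connection_id: str, packed_message: bytes) -> None:
    """Persist an outbound message for a recipient that is not currently connected."""
    r.rpush(f"outbound:{connection_id}", packed_message)

def drain(connection_id: str):
    """Yield queued messages once the recipient reconnects, removing each as it is sent."""
    key = f"outbound:{connection_id}"
    while True:
        item = r.lpop(key)
        if item is None:
            break
        yield item
```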

esune added the Epic label Apr 26, 2023