graphql: subscriptions at the scheduler #3731

Open

oliver-sanders opened this issue Aug 4, 2020 · 5 comments
oliver-sanders commented Aug 4, 2020

Prerequisite to #3527

Implement the following component of the interfaces and protocols diagram:

[image: the "rect4117" component of the interfaces and protocols diagram]

Running GraphQL subscriptions over ZMQ/TCP is non-standard so this will require a bit of work at our end.

This may just involve patching our pub-sub / req-rep endpoints into the existing websockets functionality, as conceptually only the connections and transport differ. We should be able to reuse the same GraphQL server infrastructure.
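
For illustration, a rough sketch of what that could look like, assuming graphql-core 3.1 (where subscribe is a coroutine) and pyzmq; schema, topic and pub_socket are placeholders here, not existing Cylc objects:

import json

from graphql import parse, subscribe  # graphql-core 3.1


async def run_subscription_over_zmq(schema, document_str, topic, pub_socket):
    """Execute a GraphQL subscription and publish each result on a ZMQ topic.

    `schema`, `topic` and `pub_socket` (a zmq.asyncio PUB socket) are assumed
    to be provided by the surrounding server code; only the transport differs
    from the websockets case.
    """
    result = await subscribe(schema, parse(document_str))
    if hasattr(result, "errors") and result.errors:
        # The subscription could not be started (e.g. invalid document).
        raise ValueError(f"subscription failed: {result.errors}")
    async for item in result:
        # Frame each result as a ZMQ multipart message: [topic, JSON body].
        await pub_socket.send_multipart(
            [topic.encode(), json.dumps(item.data).encode()]
        )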

Perhaps we should take this opportunity to consider whether there is a nice simplification of the existing req-rep / pub-sub duo; could we merge them with xpub-xsub or something else? Ping @dwsutherland.

Pull requests welcome!

oliver-sanders added this to the cylc-8.0.0 milestone Aug 4, 2020

dwsutherland commented Mar 11, 2021

From:
https://zguide.zeromq.org/docs/chapter5/

Killing back-chatter is essential to real scalability. With pub-sub, it’s how the pattern can map cleanly to the PGM multicast protocol, which is handled by the network switch. In other words, subscribers don’t connect to the publisher at all, they connect to a multicast group on the switch, to which the publisher sends its messages.

When we remove back-chatter, our overall message flow becomes much simpler, which lets us make simpler APIs, simpler protocols, and in general reach many more people. But we also remove any possibility to coordinate senders and receivers. What this means is:

  • Publishers can’t tell when subscribers are successfully connected, both on initial connections, and on reconnections after network failures.

  • Subscribers can’t tell publishers anything that would allow publishers to control the rate of messages they send. Publishers only have one setting, which is full-speed, and subscribers must either keep up or lose messages.

  • Publishers can’t tell when subscribers have disappeared due to processes crashing, networks breaking, and so on.

The downside is that we actually need all of these if we want to do reliable multicast. The ZeroMQ pub-sub pattern will lose messages arbitrarily when a subscriber is connecting, when a network failure occurs, or just if the subscriber or network can’t keep up with the publisher.

The upside is that there are many use cases where almost reliable multicast is just fine. When we need this back-chatter, we can either switch to using ROUTER-DEALER (which I tend to do for most normal volume cases), or we can add a separate channel for synchronization (we’ll see an example of this later in this chapter).

This means all the filtering is done by the subscriber (at the moment): if you subscribe to a topic, all it means is that you receive everything and throw away the other topics. I.e. if we set up a GraphQL publish for a subscription, it will be sent to all subscribers, and the uninterested ones will just ignore it.
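
For example, with raw pyzmq (not Cylc code) the subscription is just a prefix filter set on the SUB socket, and the publisher sends everything regardless:

import zmq

ctx = zmq.Context.instance()

# Publisher side (the scheduler; normally a separate process): every message
# goes out regardless of who, if anyone, wants the topic.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")
pub.send_multipart([b"workflow", b"...delta payload..."])
pub.send_multipart([b"some-hash-of-graphql-document1", b"...result..."])

# Subscriber side (e.g. the UIS): only messages whose first frame matches a
# subscribed prefix reach the application; everything else is discarded.
# Note the "slow joiner" problem from the guide: anything published before
# the subscription has propagated is silently lost.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"workflow")
topic, payload = sub.recv_multipart()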

More obviously, it means the PUB can't tell if a SUB has been killed, in order to stop a subscription (and I don't know if you can rely on a last REQ on death of the client). Otherwise, the only other solution alongside PUB/SUB would be to send a REQ with info about the subscriber for polling, or something.

So probably not what we want (we also don't care about scalability on the Scheduler (PUB) end, as it's just a UIS and a few TUIs... I'm not sure how heavy the CLI will be).

So we'll keep looking for something that allows some "back-chatter".

@dwsutherland

A bit on the need to know whether a client/server is alive (heartbeating):
https://zguide.zeromq.org/docs/chapter4/#Heartbeating


dwsutherland commented Mar 29, 2021

I have an idea. It's based on the REQ/REP and PUB/SUB patterns, but incorporates heartbeating (above), combined into a service that will work for both GraphQL and real command responses.

The idea is for the client (UIS/CLI) to:

  • Initiate its request/REQ for published content with a message containing a client ID, subscription topic, and instructions/graphql-document.
  • Before sending the request, subscribe/SUB to the corresponding topic.
  • Hold a record of current subscriptions (in the form of IDs or subscription hashes/topics), and at some arbitrary interval HB_X (say 2min?) send a heartbeat REQ to the scheduler containing this content in some form (see the client sketch after this list).
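
A rough sketch of the client side under these assumptions (the class, method names and message/reply shapes are hypothetical, not an existing Cylc API):

import asyncio
import hashlib
import uuid

import zmq

HB_X = 120  # heartbeat interval in seconds (the "say 2min?" above)


class SubscribingClient:
    """Hypothetical wrapper around already-connected REQ and SUB sockets."""

    def __init__(self, req_socket, sub_socket):
        self.client_id = str(uuid.uuid4())
        self.req = req_socket  # zmq.asyncio REQ socket
        self.sub = sub_socket  # zmq.asyncio SUB socket
        self.topics = {}       # topic -> request type ("graphql"/"protobuf")

    async def subscribe(self, type_, document=None, topic=None):
        """Request published content, subscribing to the topic first."""
        if topic is None:
            topic = hashlib.sha256(document.encode()).hexdigest()
        # SUB to the topic *before* the REQ so no published results are missed.
        self.sub.setsockopt(zmq.SUBSCRIBE, topic.encode())
        await self.req.send_json({
            "client_id": self.client_id,
            "type": type_,
            "topic": topic,
            "document": document,
        })
        reply = await self.req.recv_json()  # confirmation or rejection
        if reply.get("accepted"):  # reply format is an assumption
            self.topics[topic] = type_
        return reply

    async def heartbeat_loop(self):
        """Every HB_X seconds, tell the scheduler which topics we still want.

        (In practice access to the single REQ socket would need to be
        serialised with the subscribe calls above.)
        """
        while True:
            await asyncio.sleep(HB_X)
            by_type = {}
            for topic, type_ in self.topics.items():
                by_type.setdefault(type_, []).append(topic)
            await self.req.send_json({
                "client_id": self.client_id,
                "heartbeat": by_type,
            })
            await self.req.recv_json()  # acknowledgement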

The server/Scheduler will:

  • Receive a REQ/request for some type/topic of published content (i.e. protobuf, graphql-sub/mut/query), and register these topics (with instructions/body/document) against the client.
  • REP to the client with confirmation (or rejection) of the requested subscription topic(s).
  • Publish the result of the requested topic.
  • For ongoing subscription types (i.e. non-command types) continue publishing content.
    • If a heart-beat does not contain the published topic, remove the respective client's association with that topic and, if no other clients are subscribed to that topic, cease the generation & publication of that topic.
    • If no heart-beat is received within 2.5*HB_X, drop the client and its topic associations/references (see the registry sketch after this list).
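
And a corresponding sketch of the scheduler-side bookkeeping described above (again, hypothetical names, just to illustrate the register / heartbeat / expire logic):

import time

HB_X = 120               # expected client heartbeat interval (seconds)
HB_TIMEOUT = 2.5 * HB_X  # drop clients silent for longer than this


class SubscriptionRegistry:
    """Hypothetical scheduler-side record of who wants which topics."""

    def __init__(self):
        # client_id -> {"last_seen": timestamp, "topics": {type: set(topics)}}
        self.clients = {}

    def register(self, client_id, type_, topic):
        """Handle a subscription REQ: record the topic against the client."""
        client = self.clients.setdefault(
            client_id, {"last_seen": time.time(), "topics": {}}
        )
        client["topics"].setdefault(type_, set()).add(topic)
        client["last_seen"] = time.time()

    def heartbeat(self, client_id, topics_by_type):
        """Handle a heartbeat REQ: keep only the topics the client listed."""
        client = self.clients.setdefault(
            client_id, {"last_seen": time.time(), "topics": {}}
        )
        client["topics"] = {
            type_: set(topics) for type_, topics in topics_by_type.items()
        }
        client["last_seen"] = time.time()

    def expire(self):
        """Drop any client that has not sent a heartbeat within 2.5 * HB_X."""
        now = time.time()
        for client_id, client in list(self.clients.items()):
            if now - client["last_seen"] > HB_TIMEOUT:
                del self.clients[client_id]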

The content of the client request to a new publisher endpoint/command may look like:

{
  "client_id": "625d1e47-a828-4b70-851b-aefa299fe6ae",
  "type": "graphql",
  "topic": "some-hash-of-graphql-document1",
  "document": "subscription { workflows { id isHeldTotal isQueuedTotal latestStateTasks stateTotals } }"
}

Or for protobuf/data-store deltas:

{
  "client_id": "625d1e47-a828-4b70-851b-aefa299fe6ae",
  "type": "protobuf",
  "topic": "workflow",
  "document": null
}

Or a command:

{
  "client_id": "625d1e47-a828-4b70-851b-aefa299fe6ae",
  "type": "graphql",
  "topic": "some-hash-of-graphql-document2",
  "document": "mutation { trigger(workflows: [\"me/foo\"], tasks: [\"*.2020:failed\"]) { result } }"
}

The GraphQL handler will need to manage the different cases (whether subscription or mutation/query) and the machinery (async generator, etc.). Using a hash of the document as the topic avoids producing the exact same content twice.
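
For instance, the topic could just be a hash of the (normalised) document, so two clients submitting the same subscription share one publication stream (the exact hashing scheme here is an assumption):

import hashlib

def graphql_topic(document: str) -> str:
    """Derive a stable topic name from a GraphQL document.

    Whitespace is collapsed first so trivially different formattings of
    the same document map onto the same topic (and hence the same
    publication stream).
    """
    normalised = " ".join(document.split())
    return hashlib.sha256(normalised.encode()).hexdigest()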

The heartbeat would contain the client_id and a list of the topics it's currently listening for / subscribed to.

The Scheduler/publisher would register the ongoing subscriptions in a mapping to the client ID:

client_ids = {
    '625d1e47-a828-4b70-851b-aefa299fe6ae': {
        'graphql': {
            'some-hash-of-graphql-document1',
            'some-hash-of-graphql-document3',
        },
        'protobuf': {
            'workflow',
        }
    },
    '2de02b0e-376f-4b75-beb5-53269c66c563': {
        'graphql': {
            'some-hash-of-graphql-document3',
            'some-hash-of-graphql-document4',
        },
        'protobuf': {
            'workflow',
            'task_proxies',
            'jobs',
        }
    },
}

With the resulting set of ongoing published topics being:

topics = {
    'graphql': {
        'some-hash-of-graphql-document1',
        'some-hash-of-graphql-document3',
        'some-hash-of-graphql-document4',
    },
    'protobuf': {
        'workflow',
        'task_proxies',
        'jobs',
    }
}

This can be recreated/updated very quickly on each deletion/addition using set operations (via a method in the publisher object).
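
A minimal sketch of that recomputation, assuming the plain dict-of-sets mapping above:

def recompute_topics(client_ids):
    """Union every client's subscriptions into the set of topics to publish.

    `client_ids` is the mapping shown above; plain set operations make
    this cheap to redo on every client addition or removal.
    """
    topics = {}
    for subscriptions in client_ids.values():
        for type_, client_topics in subscriptions.items():
            topics.setdefault(type_, set()).update(client_topics)
    return topics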

Some of the main benefits of this approach are:

  • Less load on the Scheduler and Network, as it only publishes when there's a subscriber (unlike now).
  • Client instructed publishing & GraphQL subscriptions at the scheduler.
  • Ability to give real results of commands to CLI, instead of "command queued".
  • Most work can be done under the hood, and then packaged into our runtime client and/or subscriber as a new capability.

Thoughts?

@hjoliver

Thoughts?

At first glance, it sounds pretty good to me. However, I'm not exactly au fait with comms patterns etc., so I'll need to get a better understanding (after the beta release is out).

@dwsutherland

I have put up a draft PR (#4274) for this one; it lays out a clear road map of how I want to achieve this. Open to change, but nice to finally get something out there.

Step 1 is done, which also has the benefit of moving most of the network-related code out of scheduler.py.
