graphql: subscriptions at the scheduler #3731

Open

oliver-sanders opened this issue Aug 4, 2020 · 5 comments
oliver-sanders commented Aug 4, 2020

Prerequisite to #3527

Implement the following component of the interfaces and protocols diagram:

[image: the "rect4117" component of the interfaces and protocols diagram]

Running GraphQL subscriptions over ZMQ/TCP is non-standard so this will require a bit of work at our end.

This may just involve patching our pub-sub / req-rep endpoints into the existing websockets functionality, as conceptually only the connections and transport differ. We should be able to reuse the same GraphQL server infrastructure.
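
For illustration, a rough sketch of what that could look like, assuming graphql-core 3.1 (where subscribe is a coroutine) and pyzmq; schema, topic and pub_socket are placeholders here, not existing Cylc objects:

import json

from graphql import parse, subscribe  # graphql-core 3.1


async def run_subscription_over_zmq(schema, document_str, topic, pub_socket):
    """Execute a GraphQL subscription and publish each result on a ZMQ topic.

    `schema`, `topic` and `pub_socket` (a zmq.asyncio PUB socket) are assumed
    to be provided by the surrounding server code; only the transport differs
    from the websockets case.
    """
    result = await subscribe(schema, parse(document_str))
    if hasattr(result, "errors") and result.errors:
        # The subscription could not be started (e.g. invalid document).
        raise ValueError(f"subscription failed: {result.errors}")
    async for item in result:
        # Frame each result as a ZMQ multipart message: [topic, JSON body].
        await pub_socket.send_multipart(
            [topic.encode(), json.dumps(item.data).encode()]
        )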

Perhaps we should take this opportunity to consider whether there is a nice simplification of the existing req-rep / pub-sub duo; could we merge them with xpub-xsub or something else? Ping @dwsutherland.

Pull requests welcome!

oliver-sanders added this to the cylc-8.0.0 milestone Aug 4, 2020

dwsutherland commented Mar 11, 2021

From:
https://zguide.zeromq.org/docs/chapter5/

Killing back-chatter is essential to real scalability. With pub-sub, it’s how the pattern can map cleanly to the PGM multicast protocol, which is handled by the network switch. In other words, subscribers don’t connect to the publisher at all, they connect to a multicast group on the switch, to which the publisher sends its messages.

When we remove back-chatter, our overall message flow becomes much simpler, which lets us make simpler APIs, simpler protocols, and in general reach many more people. But we also remove any possibility to coordinate senders and receivers. What this means is:

  • Publishers can’t tell when subscribers are successfully connected, both on initial connections, and on reconnections after network failures.

  • Subscribers can’t tell publishers anything that would allow publishers to control the rate of messages they send. Publishers only have one setting, which is full-speed, and subscribers must either keep up or lose messages.

  • Publishers can’t tell when subscribers have disappeared due to processes crashing, networks breaking, and so on.

The downside is that we actually need all of these if we want to do reliable multicast. The ZeroMQ pub-sub pattern will lose messages arbitrarily when a subscriber is connecting, when a network failure occurs, or just if the subscriber or network can’t keep up with the publisher.

The upside is that there are many use cases where almost reliable multicast is just fine. When we need this back-chatter, we can either switch to using ROUTER-DEALER (which I tend to do for most normal volume cases), or we can add a separate channel for synchronization (we’ll see an example of this later in this chapter).

This means all the filtering is done by the subscriber (at the moment): if you subscribe to a topic, all it means is that you receive everything and throw away the other topics. I.e. if we set up a GraphQL publish for a subscription, it will be sent to all subscribers, and the uninterested ones will just ignore it.
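
For example, with raw pyzmq (not Cylc code) the subscription is just a prefix filter set on the SUB socket, and the publisher sends everything regardless:

import zmq

ctx = zmq.Context.instance()

# Publisher side (the scheduler; normally a separate process): every message
# goes out regardless of who, if anyone, wants the topic.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")
pub.send_multipart([b"workflow", b"...delta payload..."])
pub.send_multipart([b"some-hash-of-graphql-document1", b"...result..."])

# Subscriber side (e.g. the UIS): only messages whose first frame matches a
# subscribed prefix reach the application; everything else is discarded.
# Note the "slow joiner" problem from the guide: anything published before
# the subscription has propagated is silently lost.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"workflow")
topic, payload = sub.recv_multipart()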

More obviously, it means the PUB can't tell if a SUB has been killed, in order to stop a subscription (and I don't know if you can rely on a last REQ on death of the client). Otherwise, the only other solution alongside PUB/SUB would be to send a REQ with info about the subscriber for polling, or something.

So probably not what we want (we also don't care about scalability on the Scheduler (PUB) end, as it's just a UIS and a few TUIs... I'm not sure how heavy the CLI will be).

So we'll keep looking for something that allows some "back-chatter".

@dwsutherland

A bit on the need to know whether a client/server is alive (heartbeating):
https://zguide.zeromq.org/docs/chapter4/#Heartbeating


dwsutherland commented Mar 29, 2021

I have an idea. It's based on the REQ/REP and PUB/SUB patterns, but incorporates heartbeating (above), combined into a service that will work for both GraphQL and real command responses.

The idea is for the client (UIS/CLI) to:

  • Initiate its request/REQ for published content with a message containing a client ID, subscription topic, and instructions/graphql-document.
  • Before sending the request, subscribe/SUB to the corresponding topic.
  • Hold a record of current subscriptions (in the form of IDs or subscription hashes/topics), and at some arbitrary interval HB_X (say 2min?) send a heartbeat REQ to the scheduler containing this content in some form (see the client sketch after this list).
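
A rough sketch of the client side under these assumptions (the class, method names and message/reply shapes are hypothetical, not an existing Cylc API):

import asyncio
import hashlib
import uuid

import zmq

HB_X = 120  # heartbeat interval in seconds (the "say 2min?" above)


class SubscribingClient:
    """Hypothetical wrapper around already-connected REQ and SUB sockets."""

    def __init__(self, req_socket, sub_socket):
        self.client_id = str(uuid.uuid4())
        self.req = req_socket  # zmq.asyncio REQ socket
        self.sub = sub_socket  # zmq.asyncio SUB socket
        self.topics = {}       # topic -> request type ("graphql"/"protobuf")

    async def subscribe(self, type_, document=None, topic=None):
        """Request published content, subscribing to the topic first."""
        if topic is None:
            topic = hashlib.sha256(document.encode()).hexdigest()
        # SUB to the topic *before* the REQ so no published results are missed.
        self.sub.setsockopt(zmq.SUBSCRIBE, topic.encode())
        await self.req.send_json({
            "client_id": self.client_id,
            "type": type_,
            "topic": topic,
            "document": document,
        })
        reply = await self.req.recv_json()  # confirmation or rejection
        if reply.get("accepted"):  # reply format is an assumption
            self.topics[topic] = type_
        return reply

    async def heartbeat_loop(self):
        """Every HB_X seconds, tell the scheduler which topics we still want.

        (In practice access to the single REQ socket would need to be
        serialised with the subscribe calls above.)
        """
        while True:
            await asyncio.sleep(HB_X)
            by_type = {}
            for topic, type_ in self.topics.items():
                by_type.setdefault(type_, []).append(topic)
            await self.req.send_json({
                "client_id": self.client_id,
                "heartbeat": by_type,
            })
            await self.req.recv_json()  # acknowledgement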

The server/Scheduler will:

  • Receive a REQ/request for some type/topic of published content (i.e. protobuf, graphql-sub/mut/query), and register these topics (with instructions/body/document) against the client.
  • REP to the client with confirmation (or rejection) of the requested subscription topic(s).
  • Publish the result of the requested topic.
  • For ongoing subscription types (i.e. non-command types) continue publishing content.
    • If a heart-beat does not contain the published topic, remove the respective client's association with that topic and, if no other clients are subscribed to that topic, cease the generation & publication of that topic.
    • If no heart-beat is received within 2.5*HB_X, drop the client and its topic associations/references (see the registry sketch after this list).
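
And a corresponding sketch of the scheduler-side bookkeeping described above (again, hypothetical names, just to illustrate the register / heartbeat / expire logic):

import time

HB_X = 120               # expected client heartbeat interval (seconds)
HB_TIMEOUT = 2.5 * HB_X  # drop clients silent for longer than this


class SubscriptionRegistry:
    """Hypothetical scheduler-side record of who wants which topics."""

    def __init__(self):
        # client_id -> {"last_seen": timestamp, "topics": {type: set(topics)}}
        self.clients = {}

    def register(self, client_id, type_, topic):
        """Handle a subscription REQ: record the topic against the client."""
        client = self.clients.setdefault(
            client_id, {"last_seen": time.time(), "topics": {}}
        )
        client["topics"].setdefault(type_, set()).add(topic)
        client["last_seen"] = time.time()

    def heartbeat(self, client_id, topics_by_type):
        """Handle a heartbeat REQ: keep only the topics the client listed."""
        client = self.clients.setdefault(
            client_id, {"last_seen": time.time(), "topics": {}}
        )
        client["topics"] = {
            type_: set(topics) for type_, topics in topics_by_type.items()
        }
        client["last_seen"] = time.time()

    def expire(self):
        """Drop any client that has not sent a heartbeat within 2.5 * HB_X."""
        now = time.time()
        for client_id, client in list(self.clients.items()):
            if now - client["last_seen"] > HB_TIMEOUT:
                del self.clients[client_id]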

The content of the client request to a new publisher endpoint/command may look like:

{
  "client_id": "625d1e47-a828-4b70-851b-aefa299fe6ae",
  "type": "graphql",
  "topic": "some-hash-of-graphql-document1",
  "document": "subscription { workflows { id isHeldTotal isQueuedTotal latestStateTasks stateTotals } }"
}

Or for protobuf/data-store deltas:

{
  "client_id": "625d1e47-a828-4b70-851b-aefa299fe6ae",
  "type": "protobuf",
  "topic": "workflow",
  "document": null
}

Or a command:

{
  "client_id": "625d1e47-a828-4b70-851b-aefa299fe6ae",
  "type": "graphql",
  "topic": "some-hash-of-graphql-document2",
  "document": "mutation { trigger(workflows: [\"me/foo\"], tasks: [\"*.2020:failed\"]) { result } }"
}

The GraphQL handler will need to manage the different cases (whether subscription or mutation/query) and the machinery (async generator, etc.). Using a hash of the document as the topic avoids producing the exact same content twice.
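
For instance, the topic could just be a hash of the (normalised) document, so two clients submitting the same subscription share one publication stream (the exact hashing scheme here is an assumption):

import hashlib

def graphql_topic(document: str) -> str:
    """Derive a stable topic name from a GraphQL document.

    Whitespace is collapsed first so trivially different formattings of
    the same document map onto the same topic (and hence the same
    publication stream).
    """
    normalised = " ".join(document.split())
    return hashlib.sha256(normalised.encode()).hexdigest()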

The heartbeat would contain the client_id and a list of the topics it's currently listening for / subscribed to.

The Scheduler/publisher would register the ongoing subscriptions in a mapping to the client ID:

client_ids = {
    '625d1e47-a828-4b70-851b-aefa299fe6ae': {
        'graphql': {
            'some-hash-of-graphql-document1',
            'some-hash-of-graphql-document3',
        },
        'protobuf': {
            'workflow',
        }
    },
    '2de02b0e-376f-4b75-beb5-53269c66c563': {
        'graphql': {
            'some-hash-of-graphql-document3',
            'some-hash-of-graphql-document4',
        },
        'protobuf': {
            'workflow',
            'task_proxies',
            'jobs',
        }
    },
}

With the resulting set of ongoing published topics being:

topics = {
    'graphql': {
        'some-hash-of-graphql-document1',
        'some-hash-of-graphql-document3',
        'some-hash-of-graphql-document4',
    },
    'protobuf': {
        'workflow',
        'task_proxies',
        'jobs',
    }
}

This can be recreated/updated very quickly on each deletion/addition using set operations (via a method in the publisher object).
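
A minimal sketch of that recomputation, assuming the plain dict-of-sets mapping above:

def recompute_topics(client_ids):
    """Union every client's subscriptions into the set of topics to publish.

    `client_ids` is the mapping shown above; plain set operations make
    this cheap to redo on every client addition or removal.
    """
    topics = {}
    for subscriptions in client_ids.values():
        for type_, client_topics in subscriptions.items():
            topics.setdefault(type_, set()).update(client_topics)
    return topics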

Some of the main benefits of this approach are:

  • Less load on the Scheduler and Network, as it only publishes when there's a subscriber (unlike now).
  • Client instructed publishing & GraphQL subscriptions at the scheduler.
  • Ability to give real results of commands to CLI, instead of "command queued".
  • Most work can be done under the hood, and then packaged into our runtime client and/or subscriber as a new capability.

Thoughts?

@hjoliver

Thoughts?

At first glance, it sounds pretty good to me. However, I'm not exactly au fait with comms patterns etc., so I'll need to get a better understanding (after the beta release is out).

@dwsutherland

I have put up a draft PR (#4274) for this one; it lays out a clear road map of how I want to achieve this. Open to change, but nice to finally get something out there.

Step 1 is done, which also has the benefit of moving most of the network-related code out of scheduler.py.
