
Kapex and other production node endpoint links commented without justification. #9619

Open
chrisdcosta opened this issue Jun 22, 2023 · 10 comments

Comments

@chrisdcosta
Contributor

I'm not sure why this is happening, but our production endpoint wss://k-ui.kapex.network was recently removed from the selectable endpoints in the Apps UI.

  • What is the current behavior and expected behavior?

This appears to be a recent change that automatically removes an endpoint from the list if a script is not able to reach it. Our logs show no downtime.

This may or may not be a bug, but in principle I think production RPC endpoints should not be removed unless the teams request it, or at least not before attempting to contact the teams directly when an endpoint appears to be unresponsive.

We have third party wallet services that have not reported issues connecting to our endpoint.

@jacogr
Member

jacogr commented Jun 22, 2023

See

// Totem: 'wss://k-ui.kapex.network' // https://github.com/polkadot-js/apps/issues/9616
(it is commented out)

It has been tested as unreachable for 3 runs in a row (the check runs twice daily).

So clicking through the link -

TL;DR if an endpoint is detected as unreachable in 3 consecutive runs, it gets commented out, with a link to the actual (final) detection run. This is not specific to this endpoint; there are a number that get marked as such once detected offline - https://github.com/polkadot-js/apps/pulls?q=is%3Apr+is%3Aclosed+unreachable
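In pseudo-code terms, the rule is simply "3 failed runs in a row, with any success resetting the count". A minimal TypeScript sketch of that decision (the tracking itself is done manually via issues and PRs, so the RunResult shape and shouldCommentOut helper below are purely illustrative, not code from this repo):

// Illustrative only - not actual apps-config code.
interface RunResult {
  timestamp: string;   // when the twice-daily check ran
  reachable: boolean;  // whether the endpoint answered within the timeout
}

const MAX_CONSECUTIVE_FAILURES = 3;

// FAIL, FAIL, OK, FAIL is fine; FAIL, FAIL, FAIL triggers the comment-out PR.
function shouldCommentOut (history: RunResult[]): boolean {
  let consecutive = 0;

  for (const run of history) {
    consecutive = run.reachable ? 0 : consecutive + 1;
  }

  return consecutive >= MAX_CONSECUTIVE_FAILURES;
}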

@chrisdcosta
Contributor Author

There are many reasons why a node could appear unreachable from CI tests. For example, Let's Encrypt automatic certificate renewal from within GitHub and GitLab sometimes fails, and not because those services are "unavailable".

What purpose does it serve to comment out the endpoint? Does it get uncommented when it is reachable again? If not, why not?

@jacogr
Member

jacogr commented Jun 22, 2023

It is not "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it only "sometimes" fails, it is all fine. Hence all the issues listing 1st, 2nd and 3rd - which refer to consecutive reports.

Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.

Sadly, my position on this is fixed. It is what it is.

@chrisdcosta
Contributor Author

I should also add that I have looked at the logs; there might have been some heavy traffic, but it does not look like the node was down. Perhaps response times were slower, but that should not mean the node endpoint gets deprecated or commented out.

#9616 (screenshot)

#9614 (screenshot)

@chrisdcosta
Contributor Author

> It is not "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it only "sometimes" fails, it is all fine. Hence all the issues listing 1st, 2nd and 3rd - which refer to consecutive reports.
>
> Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.
>
> Sadly, my position on this is fixed. It is what it is.

This could just as easily be a problem with the CI not having enough bandwidth as anything else - the server was clearly not down!

My point is that you have not answered what purpose this serves in production.

Also, having to manually check whether this endpoint has been deprecated or not and then open a pull request is not maintainable for us.

@jacogr
Member

jacogr commented Jun 22, 2023

This is the reality -

  1. Teams stop endpoints and they never remove them here
  2. Some teams never monitor; they release and throw it out there (it may or may not work)
  3. The next we hear about it is that "this UI doesn't work" (cannot connect) and issues get logged

I'm doing the PRs manually (and sorting the issues manually, i.e. tracking consecutive failures), since NOT doing so just leads to a stream of user support.

Things don't "just stop working". And they don't do so over 3 consecutive issues either.

I appreciate that you don't like this - but nothing about this is going to change, as mentioned above.

@chrisdcosta
Contributor Author

OK, I understand you want to keep issues related specifically to the apps UI and not to connectivity, but I think there must be a better solution. This feels wrong for the community, not just the teams.

For example, if we hadn't just made a partnership with Dwellir, our parachain on the flagship Polkadot network would have simply disappeared - that's not acceptable for any team.

Can you point me to your script and I will try to look at alternatives that address your concerns and ours?

@jacogr
Member

jacogr commented Jun 22, 2023

That is fine - I hear your concerns, and I understand you think you are speaking to a brick wall here, but the auto-check won't go away.

It has always been proven correct, due to one of -

  1. Endpoints just being switched off
  2. DNS just being dropped
  3. Connection issues caused by load (there is a connection timeout in the tests)
  4. Connection issues caused by configuration (i.e. if you only allow apps.polkadot.js.org it will show up, since the check comes from a different IP)
  5. (Sadly) Some issues caused by upgrades where RPC has not behaved correctly
  6. There are always transient errors, hence the manual check for 3 consecutive issues, i.e. FAIL, FAIL, OK, FAIL (not 3 in a row) won't be marked, but FAIL, FAIL, FAIL (3 in a row) would be

With all that, I am always very happy to look at actual concrete PRs to tighten things up, extra eyes are always helpful.

Here are the actual tests that are being executed -

https://github.com/polkadot-js/apps/blob/master/packages/apps-config/src/ci/chainEndpoints.spec.ts

CI executes this via yarn ci:chainEndpoints which is just an alias to yarn polkadot-dev-run-test --env node packages/apps-config/src/ci/chainEndpoints with logging to file enabled.
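For reference, the check essentially boils down to opening a WebSocket connection to each configured endpoint and failing if it cannot connect within a timeout. A simplified sketch of that idea (this is not the actual spec file; the 'ws' dependency, helper name and 30s timeout are assumptions for illustration):

import WebSocket from 'ws';

// Resolves true if a WebSocket connection to the endpoint opens before the timeout.
function isReachable (url: string, timeoutMs = 30_000): Promise<boolean> {
  return new Promise((resolve) => {
    const ws = new WebSocket(url);
    const timer = setTimeout(() => {
      ws.terminate();
      resolve(false);
    }, timeoutMs);

    ws.once('open', () => {
      clearTimeout(timer);
      ws.close();
      resolve(true);
    });

    ws.once('error', () => {
      clearTimeout(timer);
      resolve(false);
    });
  });
}

// e.g. probe the endpoint discussed in this issue
isReachable('wss://k-ui.kapex.network').then((ok) => console.log(ok ? 'reachable' : 'unreachable'));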

@jamesbayly
Contributor

@jacogr is there a way that we can give you a contact to notify? e.g. if you get three consecutive failures, then you send an email notification to the contact (or perhaps assign a GitHub issue directly to the person that added the endpoint), and if you don't hear back in 24 hours you take the endpoint out.

The only problem with the current process is that it fails somewhat silently from our perspective, and unless we manually review the endpoints each month, we don't know what has been removed. A tight, time-based notification flow would allow us to rectify issues promptly.

@jacogr
Member

jacogr commented Jun 23, 2023

If we add a config for DNS, we can match on those URLs and then add the contact on a per-issue basis.

So the config somewhere like -

// could also make this an array, which is probably preferable for larger providers
'*.api.onfinality.io': 'jamesbayly'

So the outputs can look something like

 x packages/apps-config/src/ci/chainEndpoints.spec.ts
   Bifrost @ wss://bifrost-polkadot.api.onfinality.io/public-ws

   testCodeFailure / ERR_TEST_FAILURE

   No DNS entry
   cc @jamesbayly

(Without the quotes in that case, so the GH CC logic takes effect)

It is slightly noisier than "only on issue 3", but the consecutive-failure issues are not created automatically as of yet (there is a whole bunch of logic needed for that, and it obviously needs to track consecutive runs), but at least the above would make a start and alert earlier?
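A minimal sketch of how such a wildcard-to-contact lookup could work (the contacts map and contactFor helper are hypothetical; nothing like this exists in apps-config yet):

// Hypothetical contact lookup for failing endpoints; not existing apps-config code.
// Keys are wildcard host patterns, values are GitHub handles (could be arrays for larger providers).
const contacts: Record<string, string> = {
  '*.api.onfinality.io': 'jamesbayly'
};

// Returns the handle to cc when a failing endpoint's host matches a configured pattern.
function contactFor (endpoint: string): string | undefined {
  const { host } = new URL(endpoint);

  for (const [pattern, handle] of Object.entries(contacts)) {
    const suffix = pattern.replace(/^\*\./, '');

    if (host === suffix || host.endsWith(`.${suffix}`)) {
      return handle;
    }
  }

  return undefined;
}

// The failure output above would then get "cc @jamesbayly" appended:
const handle = contactFor('wss://bifrost-polkadot.api.onfinality.io/public-ws');

console.log(handle ? `cc @${handle}` : 'no contact configured');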
