
Kapex and other production node endpoint links commented without justification. #9619

Open
chrisdcosta opened this issue Jun 22, 2023 · 10 comments

Comments

@chrisdcosta
Contributor

I'm not sure why this is happening, but our production endpoint wss://k-ui.kapex.network was recently removed from the selectable endpoints in the Apps UI.

  • What is the current behavior and expected behavior?

This appears to be a recent change that automatically removes an endpoint from the list if a script is not able to reach it. Our logs show no downtime.

This may or may not be a bug, but in principle I think production RPC endpoints should not be removed unless the teams request it, or at least not before attempting to contact the teams directly when an endpoint appears to be unresponsive.

We have third party wallet services that have not reported issues connecting to our endpoint.

@jacogr
Member

jacogr commented Jun 22, 2023

See

// Totem: 'wss://k-ui.kapex.network' // https://github.com/polkadot-js/apps/issues/9616
(it is commented out)

It has been tested as unreachable for 3 runs in a row (the check runs twice daily).

So clicking through the link -

TL;DR if an endpoint is detected as unreachable in 3 consecutive runs, it gets commented out, with a link to the actual (final) detection run. This is not specific to this endpoint; there are a number that get marked as such once detected offline - https://github.com/polkadot-js/apps/pulls?q=is%3Apr+is%3Aclosed+unreachable
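In pseudo-code terms, the rule is simply "3 failed runs in a row, with any success resetting the count". A minimal TypeScript sketch of that decision (the tracking itself is done manually via issues and PRs, so the RunResult shape and shouldCommentOut helper below are purely illustrative, not code from this repo):

// Illustrative only - not actual apps-config code.
interface RunResult {
  timestamp: string;   // when the twice-daily check ran
  reachable: boolean;  // whether the endpoint answered within the timeout
}

const MAX_CONSECUTIVE_FAILURES = 3;

// FAIL, FAIL, OK, FAIL is fine; FAIL, FAIL, FAIL triggers the comment-out PR.
function shouldCommentOut (history: RunResult[]): boolean {
  let consecutive = 0;

  for (const run of history) {
    consecutive = run.reachable ? 0 : consecutive + 1;
  }

  return consecutive >= MAX_CONSECUTIVE_FAILURES;
}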

@chrisdcosta
Contributor Author

There are many reasons why a node could appear unreachable from CI tests. For example, Let's Encrypt automatic certificate renewal from within GitHub and GitLab sometimes fails, and not because those services are "unavailable".

What purpose does it serve to comment out the endpoint? Does it get uncommented when it is reachable again? If not, why not?

@jacogr
Member

jacogr commented Jun 22, 2023

It is not "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it only "sometimes" fails, it is all fine. Hence all the issues listing 1st, 2nd and 3rd - which refer to consecutive reports.

Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.

Sadly, my position on this is fixed. It is what it is.

@chrisdcosta
Contributor Author

I should also add that I have looked at the logs; there might have been some heavy traffic, but it does not look like the node was down. Perhaps response times were slower, but that should not mean the node endpoint gets deprecated or commented out.

#9616 (screenshot)

#9614 (screenshot)

@chrisdcosta
Contributor Author

> It is not "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it only "sometimes" fails, it is all fine. Hence all the issues listing 1st, 2nd and 3rd - which refer to consecutive reports.
>
> Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.
>
> Sadly, my position on this is fixed. It is what it is.

This could just as easily be a problem with the CI not having enough bandwidth as anything else - the server was clearly not down!

My point is that you have not answered what purpose this serves in production.

Also, having to manually check whether this endpoint has been deprecated or not and then open a pull request is not maintainable for us.

@jacogr
Member

jacogr commented Jun 22, 2023

This is the reality -

  1. Teams stop endpoints and they never remove them here
  2. Some teams never monitor; they release and throw it out there (it may or may not work)
  3. The next we hear about it is that "this UI doesn't work" (cannot connect) and issues get logged

I'm doing the PRs manually (and sorting the issues manually, i.e. tracking consecutive failures), since NOT doing so just leads to a stream of user support.

Things don't "just stop working". And they don't do so over 3 consecutive issues either.

I appreciate that you don't like this - but nothing about this is going to change, as mentioned above.

@chrisdcosta
Contributor Author

OK, I understand you want to keep issues related specifically to the apps UI and not to connectivity, but I think there must be a better solution. This feels wrong for the community, not just the teams.

For example, if we hadn't just made a partnership with Dwellir, our parachain on the flagship Polkadot network would have simply disappeared - that's not acceptable for any team.

Can you point me to your script and I will try to look at alternatives that address your concerns and ours?

@jacogr
Member

jacogr commented Jun 22, 2023

That is fine - I hear your concerns, and I understand you think you are speaking to a brick wall here, but the auto-check won't go away.

It has always been proven correct, due to one of -

  1. Endpoints just being switched off
  2. DNS just being dropped
  3. Connection issues caused by load (there is a connection timeout in the tests)
  4. Connection issues caused by configuration (i.e. if you only allow apps.polkadot.js.org it will show up, since the check comes from a different IP)
  5. (Sadly) Some issues caused by upgrades where RPC has not behaved correctly
  6. There are always transient errors, hence the manual check for 3 consecutive issues, i.e. FAIL, FAIL, OK, FAIL (not 3 in a row) won't be marked, but FAIL, FAIL, FAIL (3 in a row) would be

With all that, I am always very happy to look at actual concrete PRs to tighten things up, extra eyes are always helpful.

Here are the actual tests that are being executed -

https://github.com/polkadot-js/apps/blob/master/packages/apps-config/src/ci/chainEndpoints.spec.ts

CI executes this via yarn ci:chainEndpoints which is just an alias to yarn polkadot-dev-run-test --env node packages/apps-config/src/ci/chainEndpoints with logging to file enabled.
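For reference, the check essentially boils down to opening a WebSocket connection to each configured endpoint and failing if it cannot connect within a timeout. A simplified sketch of that idea (this is not the actual spec file; the 'ws' dependency, helper name and 30s timeout are assumptions for illustration):

import WebSocket from 'ws';

// Resolves true if a WebSocket connection to the endpoint opens before the timeout.
function isReachable (url: string, timeoutMs = 30_000): Promise<boolean> {
  return new Promise((resolve) => {
    const ws = new WebSocket(url);
    const timer = setTimeout(() => {
      ws.terminate();
      resolve(false);
    }, timeoutMs);

    ws.once('open', () => {
      clearTimeout(timer);
      ws.close();
      resolve(true);
    });

    ws.once('error', () => {
      clearTimeout(timer);
      resolve(false);
    });
  });
}

// e.g. probe the endpoint discussed in this issue
isReachable('wss://k-ui.kapex.network').then((ok) => console.log(ok ? 'reachable' : 'unreachable'));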

@jamesbayly
Contributor

@jacogr is there a way that we can give you a contact to notify? e.g. if you get three consecutive failures, then you send an email notification to the contact (or perhaps assign a GitHub issue directly to the person that added the endpoint), and if you don't hear back in 24 hours you take the endpoint out.

The only problem with the current process is that it fails somewhat silently from our perspective, and unless we manually review the endpoints each month, we don't know what has been removed. A tight, time-based notification flow would allow us to rectify issues promptly.

@jacogr
Member

jacogr commented Jun 23, 2023

If we add a config for DNS, we can match on those URLs and then add the contact on a per-issue basis.

So the config somewhere like -

// could also make this an array, which is probably preferable for larger providers
'*.api.onfinality.io': 'jamesbayly'

So the outputs can look something like

 x packages/apps-config/src/ci/chainEndpoints.spec.ts
   Bifrost @ wss://bifrost-polkadot.api.onfinality.io/public-ws

   testCodeFailure / ERR_TEST_FAILURE

   No DNS entry
   cc @jamesbayly

(Without the quotes in that case, so the GH CC logic takes effect)

It is slightly noisier than "only on issue 3", but the consecutive-failure issues are not created automatically as of yet (there is a whole bunch of logic needed for that, and it obviously needs to track consecutive runs), but at least the above would make a start and alert earlier?
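A minimal sketch of how such a wildcard-to-contact lookup could work (the contacts map and contactFor helper are hypothetical; nothing like this exists in apps-config yet):

// Hypothetical contact lookup for failing endpoints; not existing apps-config code.
// Keys are wildcard host patterns, values are GitHub handles (could be arrays for larger providers).
const contacts: Record<string, string> = {
  '*.api.onfinality.io': 'jamesbayly'
};

// Returns the handle to cc when a failing endpoint's host matches a configured pattern.
function contactFor (endpoint: string): string | undefined {
  const { host } = new URL(endpoint);

  for (const [pattern, handle] of Object.entries(contacts)) {
    const suffix = pattern.replace(/^\*\./, '');

    if (host === suffix || host.endsWith(`.${suffix}`)) {
      return handle;
    }
  }

  return undefined;
}

// The failure output above would then get "cc @jamesbayly" appended:
const handle = contactFor('wss://bifrost-polkadot.api.onfinality.io/public-ws');

console.log(handle ? `cc @${handle}` : 'no contact configured');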
