Zero time restart #2214

Merged
merged 4 commits into from
Mar 19, 2021

sanderegg
Member

@sanderegg sanderegg commented Mar 17, 2021

What do these changes do?

  • add a retry middleware to the webserver: on a 500, the request is retried (3x) before the 500 is returned to the frontend
  • add load-balancer health checking of the webserver as a traefik label, so that traefik recognizes when the webserver is no longer available. NOTE: the webserver still needs to issue a 503 when shutting down, with the Retry-After header field set to something reasonable. The webclient shall read this header and retry after that delay before failing.
  • set the healthcheck for traefik
    (⚠️ devops) = simcore_traefik commands must be adapted @Surfict , @pcrespov , @GitHK :
# this shall be added to the command field
- "--ping=true"
- "--entryPoints.ping.address=:9082"
- "--ping.entryPoint=ping"
- - "--providers.docker.swarmModeRefreshSeconds=1"
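
As a rough illustration of the retry behaviour described in the first bullet, the sketch below wraps a request handler so that a 500 answer is retried before it reaches the caller. It is not the middleware from this PR: `Response`, `with_retry`, `make_flaky_handler` and `demo_retry` are hypothetical names, plain asyncio stands in for the web framework, and reading "3x" as three retries after the first attempt is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Response:
    """Hypothetical stand-in for a web framework response."""
    status: int
    body: str = ""


Handler = Callable[[str], Awaitable[Response]]


def with_retry(handler: Handler, max_retries: int = 3, delay_s: float = 0.0) -> Handler:
    """Wrap a handler so that a 500 answer is retried before being returned."""

    async def wrapped(request: str) -> Response:
        response = await handler(request)
        for _ in range(max_retries):
            if response.status != 500:
                break
            # Assumption: a fixed pause between attempts (0 s by default).
            await asyncio.sleep(delay_s)
            response = await handler(request)
        return response

    return wrapped


def make_flaky_handler(failures: int) -> Handler:
    """Hypothetical handler that answers 500 `failures` times, then 200."""
    remaining = {"n": failures}

    async def handler(request: str) -> Response:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            return Response(500, "transient failure")
        return Response(200, "ok")

    return handler


def demo_retry(failures: int) -> int:
    """Return the status the caller finally sees for a given number of failures."""
    wrapped = with_retry(make_flaky_handler(failures))
    return asyncio.run(wrapped("/v0/projects")).status
```

With these assumptions, a handler that fails up to three times in a row still yields a 200 to the caller, while a fourth consecutive failure lets the 500 through.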

Related issue/s

Zero downtime: Connected users should be able to continue working with osparc when osparc micro-services are restarted #2212

How to test

make build-x up-prod
  1. go to http://127.0.0.1:9081
  2. log in
  3. go to http://127.0.0.1:9081/dev/doc#/project/list_projects
  4. try the call --> it should return a 200
  5. in the Google Chrome DevTools, identify the call to /projects?type=all&state=active, right-click it, then click Copy, then Copy as cURL
  6. in the following bash script, replace everything from the 2nd line through --compressed with the copied command
  7. the result should look similar to the script below (let's call it tester.bash).
#!/bin/bash
doit() {
  curl -s -o /dev/null -w "%{http_code} - %{time_total}s\n" 'http://127.0.0.1:9081/v0/projects?type=all&state=active' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"' \
  -H 'accept: application/json' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Referer: http://127.0.0.1:9081/dev/doc' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cookie: adminer_key=f70327f90dbb00a8c697c933a32bbd24; _ga=GA1.1.1164697573.1596089800; _pk_id.1.dc78=3bc4017fe41ce1b3.1598967593.; _xsrf=2|c9c1015c|4cc66909cafefb2882de969e58f09aa4|1613982430; osparc.WEBAPI_SESSION="gAAAAABgTyx35utztzbL4WHmBhDAhjnMOz4FbTBKDiWHcCB97Q6ScucrxGveO39zaoXq4Av4i6iaDKnw1lIXtMTF6gStAZ7_cBTB7pGebXYKfmk7QaeBkEhU1QZjwnQr_kDldrLlROEC_0Ix2ntS2jQK3oHg3qZz_QMHNm5Ri_nwIR-Lwwm7eNY="; user=anderegg@itis.swiss; adminer_sid=f091b10a4383a59dc8593c5de85ecdaa; adminer_permanent=cGdzcWw%3D-cG9zdGdyZXM%3D-YWRtaW4%3D-dGVzdA%3D%3D%3Aa7x8OoLd3eqJCAKu%2BcGdzcWw%3D-cG9zdGdyZXM%3D-c2N1-c2ltY29yZWRi%3AwLJfeCwqtZqCaXHsyuXFxQ%3D%3D%2BcGdzcWw%3D-cG9zdGdyZXM%3D-c2N1-c2ltY29yZWRi%3AwLJfeCwqtZqCaXHsyuXFxQ%3D%3D%2BcGdzcWw%3D-cG9zdGdyZXM%3D-c2N1-c2ltY29yZWRi%3AwLJfeCwqtZqCaXHsyuXFxQ%3D%3D+cGdzcWw%3D-cG9zdGdyZXM%3D-c2N1-c2ltY29yZWRi%3AMPBAMB4Ev%2F998eSTpbMe3A%3D%3D; _pk_ses.1.dc78=1; io=72b46316ea2a4bd2b2b06af5fdccb380' \
  --compressed
}
export -f doit

while true; do
echo $(date +"%H:%M:%S.%N") $(doit)
# seq 10 | parallel --progress -n0 doit
done
  8. run ./tester.bash; it continuously calls the /projects endpoint and should output 200s
  9. issue docker service update --force master-simcore_webserver to simulate an update of the webserver
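
To judge the outcome of the last two steps without reading every line, the output of tester.bash could be summarized with a small script. The sketch below assumes the line format produced by the script above (timestamp, HTTP code, " - ", total time followed by "s"); `LINE_RE` and `summarize` are hypothetical names, not part of the PR.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches lines such as "10:00:00.123456789 200 - 0.012345s".
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<code>\d{3}) - (?P<secs>\d+(?:[.,]\d+)?)s$"
)


def summarize(lines: Iterable[str]) -> Tuple[Counter, float]:
    """Count HTTP status codes and track the slowest request in seconds."""
    codes: Counter = Counter()
    slowest = 0.0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a tester.bash result line
        codes[int(match["code"])] += 1
        # Some locales print a decimal comma; normalize it before parsing.
        slowest = max(slowest, float(match["secs"].replace(",", ".")))
    return codes, slowest


sample = [
    "10:00:00.123456789 200 - 0.012345s",
    "10:00:00.223456789 502 - 0.001000s",
    "not a result line",
]
codes, slowest = summarize(sample)
```

Any count for a code other than 200 during the forced service update would indicate that a request was dropped or failed while the webserver restarted.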

Checklist

@sanderegg sanderegg added the a:webserver issue related to the webserver service label Mar 17, 2021
@sanderegg sanderegg added this to the The Red Panda milestone Mar 17, 2021
@sanderegg sanderegg requested a review from pcrespov March 17, 2021 22:29
@sanderegg sanderegg self-assigned this Mar 17, 2021
@codecov

codecov bot commented Mar 17, 2021

Codecov Report

Merging #2214 (19a3c0d) into master (ab36b42) will decrease coverage by 10.9%.
The diff coverage is n/a.


@@            Coverage Diff            @@
##           master   #2214      +/-   ##
=========================================
- Coverage    66.6%   55.6%   -11.0%     
=========================================
  Files         465     336     -129     
  Lines       17949   12672    -5277     
  Branches     1769    1191     -578     
=========================================
- Hits        11965    7057    -4908     
+ Misses       5579    5425     -154     
+ Partials      405     190     -215     
Flag Coverage Δ
integrationtests ?
unittests 55.6% <ø> (-7.0%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ackages/simcore-sdk/src/simcore_sdk/models/base.py 0.0% <0.0%> (-100.0%) ⬇️
...ges/simcore-sdk/src/simcore_sdk/models/__init__.py 0.0% <0.0%> (-100.0%) ⬇️
...erver/src/simcore_service_webserver/rest_models.py 0.0% <0.0%> (-91.2%) ⬇️
...es/web/server/src/simcore_service_webserver/cli.py 0.0% <0.0%> (-73.9%) ⬇️
...es/sidecar/src/simcore_service_sidecar/mpi_lock.py 29.0% <0.0%> (-71.0%) ⬇️
...rc/simcore_service_webserver/exporter/archiving.py 29.4% <0.0%> (-66.7%) ⬇️
...erver/src/simcore_service_webserver/diagnostics.py 34.0% <0.0%> (-66.0%) ⬇️
...core-sdk/src/simcore_sdk/models/pipeline_models.py 0.0% <0.0%> (-64.6%) ⬇️
...src/simcore_service_webserver/activity/handlers.py 27.1% <0.0%> (-62.9%) ⬇️
...ore_service_director_v2/api/routes/computations.py 33.9% <0.0%> (-57.2%) ⬇️
... and 214 more

Member

@pcrespov pcrespov left a comment

So the traefik service has a healthcheck entrypoint on port 8092 and monitors the webserver healthcheck to balance the traffic between new and old webserver containers.

Q: does this mean that the webserver is queried every second by traefik and every now and then by the swarm? Or, if there is a single replica of the webserver, does traefik not need to "balance" and therefore skip the health-check call?

@sanderegg
Member Author

sanderegg commented Mar 19, 2021

@pcrespov I will also add the following: traefik issue1 and traefik issue2

@sanderegg
Member Author

port 9082 for monitoring traefik itself.

Q: does this mean that the webserver is queried every second by traefik and every now and then by the swarm? Or, if there is a single replica of the webserver, does traefik not need to "balance" and therefore skip the health-check call?
No, the load balancing happens all the time; otherwise traefik would not be able to detect when the webserver goes down.
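
The point of the answer can be pictured with a toy model: a proxy that probes every backend on each round, even when there is only one, and routes only to the backends whose last probe succeeded. This is a conceptual sketch, not traefik's implementation; `HealthCheckedPool`, `probe` and the backend name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthCheckedPool:
    """Toy model of a proxy that only routes to backends passing a probe."""
    probe: Callable[[str], bool]
    backends: List[str]
    healthy: Dict[str, bool] = field(default_factory=dict)

    def run_checks(self) -> None:
        # Every backend is probed on every round, regardless of replica count.
        for backend in self.backends:
            self.healthy[backend] = self.probe(backend)

    def routable(self) -> List[str]:
        # Backends never probed, or failing their last probe, get no traffic.
        return [b for b in self.backends if self.healthy.get(b, False)]


# A single hypothetical replica whose health flips while it restarts.
state = {"webserver-1": True}
pool = HealthCheckedPool(probe=lambda name: state[name], backends=["webserver-1"])

pool.run_checks()
before_restart = pool.routable()

state["webserver-1"] = False  # the replica is shutting down
pool.run_checks()
during_restart = pool.routable()
```

Without the periodic probe, the single replica would stay in the routable set while it is down, which is the situation the answer describes.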

Member

@mguidon mguidon left a comment

Let's see how this performs on AWS staging.

Member

@odeimaiz odeimaiz left a comment

👌

@sanderegg sanderegg merged commit b9ba559 into ITISFoundation:master Mar 19, 2021
@sanderegg sanderegg deleted the zero_time_restart branch March 19, 2021 09:40
@sanderegg sanderegg mentioned this pull request Mar 24, 2021
mrnicegyu11 pushed a commit to mrnicegyu11/osparc-ops-environments that referenced this pull request Mar 13, 2023