Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[BACKPORT 2.20][#18948] YSQL: Explicitly destruct YSQL webserver on S…
…IGTERM Summary: Original commit: 5862233 / D35116 The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: ybase, smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35170
- Loading branch information