-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] PITR test failing with Segmentation Fault with core dump #18948
Comments
This issue is observed on build 57, and build 71 on master as well |
Core dump was repeated at 2.20.1.1-b3 version |
Core dump seen on another run: http://stress.dev.yugabyte.com/stress_test/85a078b8-4d00-471b-b49e-52e8261caedf 2.20.2.0-b67 However this is without any PITR, its a CDC case with nemesis running. |
@timothy-e @m-iancu |
Observing this core dump in test_ysql_bank_operations_pessimistic_lock stress test on DB version |
Summary: The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: ybase, smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35116
Summary: The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: ybase, smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35116
… SIGTERM Summary: Original commit: 5862233 / D35116 The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: yql, smishra, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35169
…IGTERM Summary: Original commit: 5862233 / D35116 The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: yql, smishra, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35171
…IGTERM Summary: Original commit: 5862233 / D35116 The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: ybase, smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35170
Jira Link: DB-7796
Description
Segmentation Fault with core dump is observed in
test_ysql_snapshots_daily_test
. Observed this failure in on2.19.3.0-b29
but unable to reproduce same issue in further runsSteps
RESTART_VM
Environment setup
Stack trace
These core dumps were not observed before
Error:
Logs attached: https://drive.google.com/file/d/1p76ewpZf0W8zBDfvZDUhOC6l9PJ8dRep/view?usp=sharing
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: