[DocDB] PITR test failing with Segmentation Fault with core dump #18948

agsh-yb · 2023-09-01T08:59:19Z

Jira Link: DB-7796

Description

Segmentation Fault with core dump is observed in test_ysql_snapshots_daily_test. Observed this failure in on 2.19.3.0-b29 but unable to reproduce same issue in further runs

Steps

Create 20 YSQL tables
Delete existing schedules if any
Load data - using a large key multiplier-500 and Create a snapshot schedule
Execute Nemesis - Failure moment- After Schedule-RESTART_VM, ADD_DROP_COLUMN
Capture time for restore point
Execute 20 restores in a row
Evaluate nemesis after restore- RESTART_TSERVER, RESTART_MASTER, PARTITION_NETWORK, SLOW_NETWORK,
RESTART_VM
Assert no. of rows of data
Check for FATALs and CORRUPT files

Environment setup

instance_type="c5.2xlarge",
        master_gflags={
            "ysql_num_shards_per_tserver": "10",
            "yb_num_shards_per_tserver": "10",
        },
        tserver_gflags={
            "ysql_num_shards_per_tserver": "10",
            "yb_num_shards_per_tserver": "10",
            'client_read_write_timeout_ms': str(30 * 60 * 1000),
            'yb_client_admin_operation_timeout_sec': str(30 * 60),
        }

Stack trace

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.19.3.0-b29-centos-x86_64/postgres/bin/postgres" --core "/home/yugabyte/cores/core_1603_1693262229_!home!yugabyte!yb-software!yugabyte-2.19.3.0-b29-centos-x86_64!postgres!bin!postgres"
Core file '/home/yugabyte/cores/core_1603_1693262229_!home!yugabyte!yb-software!yugabyte-2.19.3.0-b29-centos-x86_64!postgres!bin!postgres' (x86_64) was loaded.
(lldb) bt all
warning: This version of LLDB has no plugin for the language "assembler". Inspection of frame variables will be limited.
* thread #1, name = 'postgres', stop reason = signal SIGSEGV
  * frame #0: 0x00007f31012f7e85 libyb_util.so`std::__1::__hash_const_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, void*>*> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::__unordered_map_hasher<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::__unordered_map_equal<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>(this=<unavailable>, __k=<unavailable>) const at __hash_table:2338:31
    frame #1: 0x00007f31012f7b16 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, long, yb::AggregationFunction, char const*, char const*) [inlined] std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find[abi:v15007](this=0x00007f31005e34f0, __k="table_id") const at unordered_map:1445:69
    frame #2: 0x00007f31012f7b00 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(this=0x00007f30f5baa818, attr=0x00007f31005e34f0, name="handler_latency_yb_ysqlserver_SQLProcessor_SelectStmt_sum", value=641, aggregation_function=kSum, type="", description="") at metrics_writer.cc:125:18
    frame #3: 0x00007f31005de4ba libyb_pggate_webserver.so`yb::pggate::PgPrometheusMetricsHandler(req=<unavailable>, resp=<unavailable>) at pgsql_webserver_wrapper.cc:428:5
    frame #4: 0x00007f3102bac770 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::__function::__value_func<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator(this=0x0000259abfdd8800, __args=0x00007f30f5bacb80, __args=0x00007f30f5baca50)[abi:v15007](yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*&&) const at function.h:512:16
    frame #5: 0x00007f3102bac757 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::function<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator(this= Function = yb::pggate::PgPrometheusMetricsHandler(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) , __arg=0x00007f30f5bacb80, __arg=0x00007f30f5baaa50)(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) const at function.h:1197:12
    frame #6: 0x00007f3102bac757 libserver_process.so`yb::Webserver::Impl::RunPathHandler(this=0x0000259abfcfec80, handler=<unavailable>, connection=0x0000259abfddf000, request_info=<unavailable>) at webserver.cc:625:5
    frame #7: 0x00007f3102bac1ed libserver_process.so`yb::Webserver::Impl::BeginRequestCallback(this=0x0000259abfcfec80, connection=0x0000259abfddf000, request_info=0x0000259abfddf000) at webserver.cc:557:10
    frame #8: 0x00007f3102bb8da8 libserver_process.so`process_new_connection + 4184
    frame #9: 0x00007f3102bb7a58 libserver_process.so`worker_thread + 232
    frame #10: 0x00007f3105b10694 libpthread.so.0`start_thread(arg=0x00007f30f5bbc700) at pthread_create.c:333
    frame #11: 0x00007f310524d41d libc.so.6`__clone at clone.S:109
  thread #2, stop reason = signal 0
    frame #0: 0x00007f3105b1500f libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 at pthread_cond_wait.S:185
    frame #1: 0x00007f3102bb78d9 libserver_process.so`master_thread + 1305
    frame #2: 0x00007f3105b10694 libpthread.so.0`start_thread(arg=0x00007f30f63bd700) at pthread_create.c:333
    frame #3: 0x00007f310524d41d libc.so.6`__clone at clone.S:109
  thread #3, stop reason = signal 0
    frame #0: 0x00007f31071a625e ld.so`_dl_fini at dl-fini.c:180
    frame #1: 0x00007f310519c969 libc.so.6`__run_exit_handlers(status=1, listp=0x00007f31054fd5c0, run_list_atexit=true) at exit.c:82
    frame #2: 0x00007f310519c9b5 libc.so.6`__GI_exit(status=<unavailable>) at exit.c:104
    frame #3: 0x000055eba6f9d552 postgres`proc_exit(code=1) at ipc.c:157:2
    frame #4: 0x000055eba7182d9f postgres`errfinish(dummy=<unavailable>) at elog.c:801:3
    frame #5: 0x000055eba6efd4dd postgres`bgworker_die(postgres_signal_arg=<unavailable>) at bgworker.c:672:2
    frame #6: 0x00007f3105b18ba0 libpthread.so.0`__restore_rt
    frame #7: 0x00007f310524d9f3 libc.so.6`epoll_wait at syscall-template.S:84
    frame #8: 0x000055eba6f9f575 postgres`WaitEventSetWait [inlined] WaitEventSetWaitBlock(set=0x0000259abfda9e88, cur_timeout=-1, occurred_events=0x00007ffcb1a1b0f8, nevents=1) at latch.c:1062:7
    frame #9: 0x000055eba6f9f567 postgres`WaitEventSetWait(set=0x0000259abfda9e88, timeout=-1, occurred_events=<unavailable>, nevents=1, wait_event_info=<unavailable>) at latch.c:1014:8
    frame #10: 0x000055eba6f9eff3 postgres`WaitLatchOrSocket(latch=0x00007f30ff74d88c, wakeEvents=<unavailable>, sock=-1, timeout=<unavailable>, wait_event_info=117440512) at latch.c:399:7
    frame #11: 0x000055eba6f9eef0 postgres`WaitLatch(latch=<unavailable>, wakeEvents=<unavailable>, timeout=<unavailable>, wait_event_info=<unavailable>) at latch.c:353:9 [artificial]
    frame #12: 0x00007f31005c0640 yb_pg_metrics.so`webserver_worker_main(unused=<unavailable>) at yb_pg_metrics.c:375:3
    frame #13: 0x000055eba6efd36c postgres`StartBackgroundWorker at bgworker.c:841:2
    frame #14: 0x000055eba6f1619c postgres`maybe_start_bgworkers [inlined] do_start_bgworker(rw=0x0000259abfcfe000) at postmaster.c:6005:4
    frame #15: 0x000055eba6f1613f postgres`maybe_start_bgworkers at postmaster.c:6231:9
    frame #16: 0x000055eba6f12940 postgres`PostmasterMain(argc=25, argv=0x0000259abfd021a0) at postmaster.c:1429:2
    frame #17: 0x000055eba6e1388f postgres`PostgresServerProcessMain(argc=25, argv=0x0000259abfd021a0) at main.c:234:3
    frame #18: 0x000055eba6ad6f12 postgres`main + 34
    frame #19: 0x00007f3105187825 libc.so.6`__libc_start_main(main=(postgres`main), argc=25, argv=0x00007ffcb1a1ba48, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffcb1a1ba38) at libc-start.c:289
    frame #20: 0x000055eba6ad6e29 postgres`_start at start.S:108

These core dumps were not observed before
Error:

AssertionError: FATALS: 1, CORE DUMPS: {"172.151.30.207": ["core_1603_1693262229_!home!yugabyte!yb-software!yugabyte-2.19.3.0-b29-centos-x86_64!postgres!bin!postgres"]}, TABLETS: 10, LOGGED_NEW_TABLETS: 6

Logs attached: https://drive.google.com/file/d/1p76ewpZf0W8zBDfvZDUhOC6l9PJ8dRep/view?usp=sharing

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

The text was updated successfully, but these errors were encountered:

agsh-yb · 2023-09-08T05:10:10Z

This issue is observed on build 57, and build 71 on master as well

sushantrmishra · 2023-10-11T20:31:02Z

Possible other duplicate of this issue:

[YSQL] Segmentation Fault in PrometheusWriter::WriteSingleEntry within PostgreSQL Process #19426
[YCQL Packed Columns][Stress Tests] [2.19.3.0-b71] Segmentation Fault with libyb_util.so during Metric Writing #19046
[YCQL Packed Rows] [Stress Tests] Core dump with SIGSEGV on libyb_util.so`yb::PrometheusWriter::WriteSingleEntry - handler_latency_yb_ysqlserver_SQLProcessor_SelectStmt_count after Tserver Restart Nemesis #19456
[YSQL] Postgres crash seen when running CDC with nemesis testcase #16639
[YSQL] Postgres process get rashed during sequential PITR restore test #17847

pilshchikov · 2024-01-05T20:21:24Z

Core dump was repeated at 2.20.1.1-b3 version

shamanthchandra-yb · 2024-01-08T05:29:16Z

@timothy-e @m-iancu

Core dump seen on another run: http://stress.dev.yugabyte.com/stress_test/85a078b8-4d00-471b-b49e-52e8261caedf

2.20.2.0-b67

However this is without any PITR, its a CDC case with nemesis running.

agsh-yb · 2024-01-12T09:11:25Z

@timothy-e @m-iancu
This dump seen on build2.21.0.0-b444
http://stress.dev.yugabyte.com/stress_test/890af47c-94cd-44c4-9b72-20f2097018ba

Karvy-yb · 2024-01-30T09:23:13Z

Observing this core dump in test_ysql_bank_operations_pessimistic_lock stress test on DB version 2.21.0.0-b461

Summary: The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: ybase, smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35116

… SIGTERM Summary: Original commit: 5862233 / D35116 The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: yql, smishra, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35169

…IGTERM Summary: Original commit: 5862233 / D35116 The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: yql, smishra, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35171

…IGTERM Summary: Original commit: 5862233 / D35116 The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster. ``` #0 0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip> #1 0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so) #2 0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip> #3 0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so) .... .... ``` The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received. The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`. According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit(): - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped. - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]]. - Objects with static storage duration are destroyed. - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed. This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor. It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated. Jira: DB-7796 Test Plan: To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`. On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue. The following bash script can be used to detect the coredumps: ``` #/bin/bash ITERATIONS=50 YBDB_PATH=/path/to/code/yugabyte-db # Count the number of dump files to avoid having to use `sudo coredumpctl` idumps=$(ls /var/lib/systemd/coredump/ | wc -l) for ((i = 0 ; i < $ITERATIONS ; i++ )) do echo "Iteration: $(($i + 1))"; $YBDB_PATH/bin/yb-ctl restart > /dev/null nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l) if (( nservers != 1)); then echo "Web server has not come up. Exiting" exit 1; fi sleep 5s # Kill the webserver pkill -TERM -f 'YSQL webserver' # Count the number of coredumps # Please validate that the coredump produced is that of postgres/webserver ndumps=$(ls /var/lib/systemd/coredump/ | wc -l) if (( ndumps > idumps )); then echo "Core dumps: $(($ndumps - $idumps))" else echo "No new core dumps found" fi done ``` Run the script with the load generation tool running against the webserver in the background. - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations. - With the fix, no coredumps were observed. Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: ybase, smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35170

agsh-yb added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Sep 1, 2023

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Sep 1, 2023

agsh-yb assigned lingamsandeep Sep 1, 2023

yugabyte-ci added area/ysql Yugabyte SQL (YSQL) and removed area/docdb YugabyteDB core features labels Sep 5, 2023

yugabyte-ci assigned tverona1 and unassigned lingamsandeep Sep 5, 2023

sushantrmishra assigned timothy-e and unassigned tverona1 Oct 11, 2023

yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Oct 11, 2023

yugabyte-ci added kind/bug This issue is a bug and removed kind/enhancement This is an enhancement of an existing feature labels Oct 31, 2023

agsh-yb added the qa_stress Bugs identified via Stress automation label Dec 11, 2023

yugabyte-ci unassigned timothy-e Jan 22, 2024

yugabyte-ci added the QA_Blocker label May 29, 2024

pilshchikov added the qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures label Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DocDB] PITR test failing with Segmentation Fault with core dump #18948

[DocDB] PITR test failing with Segmentation Fault with core dump #18948

agsh-yb commented Sep 1, 2023 •

edited

Loading

agsh-yb commented Sep 8, 2023

sushantrmishra commented Oct 11, 2023

pilshchikov commented Jan 5, 2024

shamanthchandra-yb commented Jan 8, 2024

agsh-yb commented Jan 12, 2024

Karvy-yb commented Jan 30, 2024

[DocDB] PITR test failing with Segmentation Fault with core dump #18948

[DocDB] PITR test failing with Segmentation Fault with core dump #18948

Comments

agsh-yb commented Sep 1, 2023 • edited Loading

Description

Steps

Environment setup

Stack trace

Warning: Please confirm that this issue does not contain any sensitive information

agsh-yb commented Sep 8, 2023

sushantrmishra commented Oct 11, 2023

pilshchikov commented Jan 5, 2024

shamanthchandra-yb commented Jan 8, 2024

agsh-yb commented Jan 12, 2024

Karvy-yb commented Jan 30, 2024

agsh-yb commented Sep 1, 2023 •

edited

Loading