Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] PITR test failing with Segmentation Fault with core dump #18948

Open
1 task done
agsh-yb opened this issue Sep 1, 2023 · 6 comments
Open
1 task done

[DocDB] PITR test failing with Segmentation Fault with core dump #18948

agsh-yb opened this issue Sep 1, 2023 · 6 comments
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures QA_Blocker qa_stress Bugs identified via Stress automation

Comments

@agsh-yb
Copy link
Contributor

agsh-yb commented Sep 1, 2023

Jira Link: DB-7796

Description

Segmentation Fault with core dump is observed in test_ysql_snapshots_daily_test. Observed this failure in on 2.19.3.0-b29 but unable to reproduce same issue in further runs

Steps

  1. Create 20 YSQL tables
  2. Delete existing schedules if any
  3. Load data - using a large key multiplier-500 and Create a snapshot schedule
  4. Execute Nemesis - Failure moment- After Schedule-RESTART_VM, ADD_DROP_COLUMN
  5. Capture time for restore point
  6. Execute 20 restores in a row
  7. Evaluate nemesis after restore- RESTART_TSERVER, RESTART_MASTER, PARTITION_NETWORK, SLOW_NETWORK,
    RESTART_VM
  8. Assert no. of rows of data
  9. Check for FATALs and CORRUPT files

Environment setup

instance_type="c5.2xlarge",
        master_gflags={
            "ysql_num_shards_per_tserver": "10",
            "yb_num_shards_per_tserver": "10",
        },
        tserver_gflags={
            "ysql_num_shards_per_tserver": "10",
            "yb_num_shards_per_tserver": "10",
            'client_read_write_timeout_ms': str(30 * 60 * 1000),
            'yb_client_admin_operation_timeout_sec': str(30 * 60),
        }

Stack trace

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.19.3.0-b29-centos-x86_64/postgres/bin/postgres" --core "/home/yugabyte/cores/core_1603_1693262229_!home!yugabyte!yb-software!yugabyte-2.19.3.0-b29-centos-x86_64!postgres!bin!postgres"
Core file '/home/yugabyte/cores/core_1603_1693262229_!home!yugabyte!yb-software!yugabyte-2.19.3.0-b29-centos-x86_64!postgres!bin!postgres' (x86_64) was loaded.
(lldb) bt all
warning: This version of LLDB has no plugin for the language "assembler". Inspection of frame variables will be limited.
* thread #1, name = 'postgres', stop reason = signal SIGSEGV
  * frame #0: 0x00007f31012f7e85 libyb_util.so`std::__1::__hash_const_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, void*>*> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::__unordered_map_hasher<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::__unordered_map_equal<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>(this=<unavailable>, __k=<unavailable>) const at __hash_table:2338:31
    frame #1: 0x00007f31012f7b16 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, long, yb::AggregationFunction, char const*, char const*) [inlined] std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find[abi:v15007](this=0x00007f31005e34f0, __k="table_id") const at unordered_map:1445:69
    frame #2: 0x00007f31012f7b00 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(this=0x00007f30f5baa818, attr=0x00007f31005e34f0, name="handler_latency_yb_ysqlserver_SQLProcessor_SelectStmt_sum", value=641, aggregation_function=kSum, type="", description="") at metrics_writer.cc:125:18
    frame #3: 0x00007f31005de4ba libyb_pggate_webserver.so`yb::pggate::PgPrometheusMetricsHandler(req=<unavailable>, resp=<unavailable>) at pgsql_webserver_wrapper.cc:428:5
    frame #4: 0x00007f3102bac770 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::__function::__value_func<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator(this=0x0000259abfdd8800, __args=0x00007f30f5bacb80, __args=0x00007f30f5baca50)[abi:v15007](yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*&&) const at function.h:512:16
    frame #5: 0x00007f3102bac757 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::function<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator(this= Function = yb::pggate::PgPrometheusMetricsHandler(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) , __arg=0x00007f30f5bacb80, __arg=0x00007f30f5baaa50)(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) const at function.h:1197:12
    frame #6: 0x00007f3102bac757 libserver_process.so`yb::Webserver::Impl::RunPathHandler(this=0x0000259abfcfec80, handler=<unavailable>, connection=0x0000259abfddf000, request_info=<unavailable>) at webserver.cc:625:5
    frame #7: 0x00007f3102bac1ed libserver_process.so`yb::Webserver::Impl::BeginRequestCallback(this=0x0000259abfcfec80, connection=0x0000259abfddf000, request_info=0x0000259abfddf000) at webserver.cc:557:10
    frame #8: 0x00007f3102bb8da8 libserver_process.so`process_new_connection + 4184
    frame #9: 0x00007f3102bb7a58 libserver_process.so`worker_thread + 232
    frame #10: 0x00007f3105b10694 libpthread.so.0`start_thread(arg=0x00007f30f5bbc700) at pthread_create.c:333
    frame #11: 0x00007f310524d41d libc.so.6`__clone at clone.S:109
  thread #2, stop reason = signal 0
    frame #0: 0x00007f3105b1500f libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 at pthread_cond_wait.S:185
    frame #1: 0x00007f3102bb78d9 libserver_process.so`master_thread + 1305
    frame #2: 0x00007f3105b10694 libpthread.so.0`start_thread(arg=0x00007f30f63bd700) at pthread_create.c:333
    frame #3: 0x00007f310524d41d libc.so.6`__clone at clone.S:109
  thread #3, stop reason = signal 0
    frame #0: 0x00007f31071a625e ld.so`_dl_fini at dl-fini.c:180
    frame #1: 0x00007f310519c969 libc.so.6`__run_exit_handlers(status=1, listp=0x00007f31054fd5c0, run_list_atexit=true) at exit.c:82
    frame #2: 0x00007f310519c9b5 libc.so.6`__GI_exit(status=<unavailable>) at exit.c:104
    frame #3: 0x000055eba6f9d552 postgres`proc_exit(code=1) at ipc.c:157:2
    frame #4: 0x000055eba7182d9f postgres`errfinish(dummy=<unavailable>) at elog.c:801:3
    frame #5: 0x000055eba6efd4dd postgres`bgworker_die(postgres_signal_arg=<unavailable>) at bgworker.c:672:2
    frame #6: 0x00007f3105b18ba0 libpthread.so.0`__restore_rt
    frame #7: 0x00007f310524d9f3 libc.so.6`epoll_wait at syscall-template.S:84
    frame #8: 0x000055eba6f9f575 postgres`WaitEventSetWait [inlined] WaitEventSetWaitBlock(set=0x0000259abfda9e88, cur_timeout=-1, occurred_events=0x00007ffcb1a1b0f8, nevents=1) at latch.c:1062:7
    frame #9: 0x000055eba6f9f567 postgres`WaitEventSetWait(set=0x0000259abfda9e88, timeout=-1, occurred_events=<unavailable>, nevents=1, wait_event_info=<unavailable>) at latch.c:1014:8
    frame #10: 0x000055eba6f9eff3 postgres`WaitLatchOrSocket(latch=0x00007f30ff74d88c, wakeEvents=<unavailable>, sock=-1, timeout=<unavailable>, wait_event_info=117440512) at latch.c:399:7
    frame #11: 0x000055eba6f9eef0 postgres`WaitLatch(latch=<unavailable>, wakeEvents=<unavailable>, timeout=<unavailable>, wait_event_info=<unavailable>) at latch.c:353:9 [artificial]
    frame #12: 0x00007f31005c0640 yb_pg_metrics.so`webserver_worker_main(unused=<unavailable>) at yb_pg_metrics.c:375:3
    frame #13: 0x000055eba6efd36c postgres`StartBackgroundWorker at bgworker.c:841:2
    frame #14: 0x000055eba6f1619c postgres`maybe_start_bgworkers [inlined] do_start_bgworker(rw=0x0000259abfcfe000) at postmaster.c:6005:4
    frame #15: 0x000055eba6f1613f postgres`maybe_start_bgworkers at postmaster.c:6231:9
    frame #16: 0x000055eba6f12940 postgres`PostmasterMain(argc=25, argv=0x0000259abfd021a0) at postmaster.c:1429:2
    frame #17: 0x000055eba6e1388f postgres`PostgresServerProcessMain(argc=25, argv=0x0000259abfd021a0) at main.c:234:3
    frame #18: 0x000055eba6ad6f12 postgres`main + 34
    frame #19: 0x00007f3105187825 libc.so.6`__libc_start_main(main=(postgres`main), argc=25, argv=0x00007ffcb1a1ba48, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffcb1a1ba38) at libc-start.c:289
    frame #20: 0x000055eba6ad6e29 postgres`_start at start.S:108

These core dumps were not observed before
Error:

AssertionError: FATALS: 1, CORE DUMPS: {"172.151.30.207": ["core_1603_1693262229_!home!yugabyte!yb-software!yugabyte-2.19.3.0-b29-centos-x86_64!postgres!bin!postgres"]}, TABLETS: 10, LOGGED_NEW_TABLETS: 6

Logs attached: https://drive.google.com/file/d/1p76ewpZf0W8zBDfvZDUhOC6l9PJ8dRep/view?usp=sharing

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@agsh-yb agsh-yb added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Sep 1, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Sep 1, 2023
@yugabyte-ci yugabyte-ci added area/ysql Yugabyte SQL (YSQL) and removed area/docdb YugabyteDB core features labels Sep 5, 2023
@yugabyte-ci yugabyte-ci assigned tverona1 and unassigned lingamsandeep Sep 5, 2023
@agsh-yb
Copy link
Contributor Author

agsh-yb commented Sep 8, 2023

This issue is observed on build 57, and build 71 on master as well

@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug and removed kind/enhancement This is an enhancement of an existing feature labels Oct 31, 2023
@agsh-yb agsh-yb added the qa_stress Bugs identified via Stress automation label Dec 11, 2023
@pilshchikov
Copy link
Contributor

Core dump was repeated at 2.20.1.1-b3 version

@shamanthchandra-yb
Copy link

@timothy-e @m-iancu

Core dump seen on another run: http://stress.dev.yugabyte.com/stress_test/85a078b8-4d00-471b-b49e-52e8261caedf

2.20.2.0-b67

However this is without any PITR, its a CDC case with nemesis running.

@agsh-yb
Copy link
Contributor Author

agsh-yb commented Jan 12, 2024

@Karvy-yb
Copy link

Observing this core dump in test_ysql_bank_operations_pessimistic_lock stress test on DB version 2.21.0.0-b461

karthik-ramanathan-3006 added a commit that referenced this issue May 17, 2024
Summary:
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: ybase, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35116
svarnau pushed a commit that referenced this issue May 25, 2024
Summary:
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: ybase, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35116
svarnau pushed a commit that referenced this issue May 29, 2024
… SIGTERM

Summary:
Original commit: 5862233 / D35116
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: yql, smishra, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35169
karthik-ramanathan-3006 added a commit that referenced this issue Jun 6, 2024
…IGTERM

Summary:
Original commit: 5862233 / D35116
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: yql, smishra, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35171
karthik-ramanathan-3006 added a commit that referenced this issue Jun 6, 2024
…IGTERM

Summary:
Original commit: 5862233 / D35116
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: ybase, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35170
@pilshchikov pilshchikov added the qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures label Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures QA_Blocker qa_stress Bugs identified via Stress automation
Projects
None yet
Development

No branches or pull requests

9 participants