Assert when executing SELECT citus_set_coordinator_host('localhost'); #7646

Closed
saygoodbyye opened this issue Jul 8, 2024 · 0 comments · Fixed by #7682

saygoodbyye commented Jul 8, 2024

Postgres: REL_16_STABLE
Citus: main
My Postgres configuration:

CFLAGS="-Og" ./configure --enable-debug --enable-tap-tests --with-libxml --with-openssl --enable-cassert --with-lz4 --quiet --prefix="$PGPREFIX"

Way to reproduce the problem:
Leave only this line in multi_schedule, so that the same test runs in parallel with itself:

test: multi_cluster_management multi_cluster_management multi_cluster_management

Run tests:

make check-multi

regression.out:

# using postmaster on localhost, port 57636
# parallel group (3 tests):  multi_cluster_management multi_cluster_management multi_cluster_management
not ok 1     + multi_cluster_management                  776 ms
# (test process exited with exit code 2)
not ok 2     + multi_cluster_management                  780 ms
# (test process exited with exit code 2)
not ok 3     + multi_cluster_management                  776 ms
# (test process exited with exit code 2)
1..3
# 3 of 3 tests failed.

backtrace:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000076e954e4526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000076e954e288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000057a06af88d6a in ExceptionalCondition (conditionName=conditionName@entry=0x76e955573495 "!nodeAlreadyExists", fileName=fileName@entry=0x76e95557339f "metadata/node_metadata.c", lineNumber=lineNumber@entry=233) at assert.c:66
#6  0x000076e9554fb93c in citus_set_coordinator_host (fcinfo=0x57a06c20f8a0) at metadata/node_metadata.c:233
#7  0x000057a06ac746c9 in ExecInterpExpr (state=0x57a06c20f7c8, econtext=0x57a06c20f4f0, isnull=0x7ffe27c929d7) at execExprInterp.c:758
#8  0x000057a06ac7119e in ExecInterpExprStillValid (state=0x57a06c20f7c8, econtext=0x57a06c20f4f0, isNull=0x7ffe27c929d7) at execExprInterp.c:1870
#9  0x000057a06acaeb76 in ExecEvalExprSwitchContext (state=<optimized out>, econtext=<optimized out>, isNull=<optimized out>) at ../../../src/include/executor/executor.h:355
#10 0x000057a06acaebbd in ExecProject (projInfo=0x57a06c20f7c0) at ../../../src/include/executor/executor.h:389
#11 0x000057a06acaed5d in ExecResult (pstate=<optimized out>) at nodeResult.c:136
#12 0x000057a06ac80cde in ExecProcNodeFirst (node=0x57a06c20f3e0) at execProcnode.c:464
#13 0x000057a06ac78b8d in ExecProcNode (node=node@entry=0x57a06c20f3e0) at ../../../src/include/executor/executor.h:273
#14 0x000057a06ac78c1d in ExecutePlan (estate=estate@entry=0x57a06c20f1b8, planstate=0x57a06c20f3e0, use_parallel_mode=<optimized out>, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0,
    direction=ForwardScanDirection, dest=0x57a06c22f3c0, execute_once=true) at execMain.c:1670
#15 0x000057a06ac7994f in standard_ExecutorRun (queryDesc=queryDesc@entry=0x57a06c103de8, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=execute_once@entry=true) at execMain.c:365
#16 0x000076e9554de7ea in CitusExecutorRun (queryDesc=0x57a06c103de8, direction=ForwardScanDirection, count=0, execute_once=true) at executor/multi_executor.c:238
#17 0x000076e954c86e19 in pgss_ExecutorRun (queryDesc=0x57a06c103de8, direction=ForwardScanDirection, count=0, execute_once=true) at pg_stat_statements.c:1008
#18 0x000057a06ac79a6b in ExecutorRun (queryDesc=queryDesc@entry=0x57a06c103de8, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>) at execMain.c:307
#19 0x000057a06ae4e501 in PortalRunSelect (portal=portal@entry=0x57a06c185828, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x57a06c22f3c0) at pquery.c:924
#20 0x000057a06ae4fd51 in PortalRun (portal=portal@entry=0x57a06c185828, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x57a06c22f3c0, altdest=altdest@entry=0x57a06c22f3c0,
    qc=0x7ffe27c92f50) at pquery.c:768
#21 0x000057a06ae4bf4b in exec_simple_query (query_string=query_string@entry=0x57a06c06a498 "SELECT citus_set_coordinator_host('localhost');") at postgres.c:1274
#22 0x000057a06ae4dd9d in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4637
#23 0x000057a06adafe4c in BackendRun (port=port@entry=0x57a06c11f230) at postmaster.c:4464
#24 0x000057a06adb1e64 in BackendStartup (port=port@entry=0x57a06c11f230) at postmaster.c:4192
#25 0x000057a06adb2002 in ServerLoop () at postmaster.c:1782
#26 0x000057a06adb353e in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x57a06c024240) at postmaster.c:1466
#27 0x000057a06acd8173 in main (argc=5, argv=0x57a06c024240) at main.c:198

I also tried building Postgres without asserts enabled and running this test; the server didn't crash, the tests just failed.

Best regards,
Egor Chindyaskin
Postgres Professional: https://postgrespro.com/

m3hm3t self-assigned this Sep 4, 2024
m3hm3t added a commit that referenced this issue Sep 9, 2024
… coordinator nodes concurrently (#7682)

When multiple sessions concurrently attempt to add the same coordinator
node using `citus_set_coordinator_host`, there is a potential race
condition. Both sessions may pass the initial metadata check
(`isCoordinatorInMetadata`), but only one will succeed in adding the
node. The other session will fail with an assertion error
(`Assert(!nodeAlreadyExists)`), causing the server to crash. Even though
the `AddNodeMetadata` function takes an exclusive lock, it appears that
the lock is acquired too late to prevent the race at the initial
metadata check.

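To make the shape of the bug visible without reading node_metadata.c, here is a minimal, self-contained sketch of the check-then-act race in plain C with pthreads. It is only an illustration: `AddCoordinatorRow`, `SetCoordinatorHost` and the shared boolean are stand-ins for `AddNodeMetadata`, `citus_set_coordinator_host` and the coordinator's metadata entry, not the real Citus code.

```c
/*
 * Toy model of the check-then-act race: two "sessions" check for the
 * coordinator before either has inserted it, then both try to insert.
 * Not Citus code; names are illustrative only.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static bool coordinatorInMetadata = false;              /* stands in for the metadata row */
static pthread_mutex_t metadataLock = PTHREAD_MUTEX_INITIALIZER;

/* Models AddNodeMetadata(): takes the lock, reports whether the row already existed. */
static void
AddCoordinatorRow(bool *nodeAlreadyExists)
{
	pthread_mutex_lock(&metadataLock);
	*nodeAlreadyExists = coordinatorInMetadata;
	coordinatorInMetadata = true;
	pthread_mutex_unlock(&metadataLock);
}

/* Models one session running SELECT citus_set_coordinator_host(...):
 * the existence check happens BEFORE the lock is taken, which is the bug
 * being modeled here. */
static void *
SetCoordinatorHost(void *unused)
{
	(void) unused;

	if (!coordinatorInMetadata)                          /* both sessions can pass this check */
	{
		bool nodeAlreadyExists = false;

		AddCoordinatorRow(&nodeAlreadyExists);
		assert(!nodeAlreadyExists);                      /* the losing session trips this */
	}
	return NULL;
}

int
main(void)
{
	pthread_t s1, s2;

	pthread_create(&s1, NULL, SetCoordinatorHost, NULL);
	pthread_create(&s2, NULL, SetCoordinatorHost, NULL);
	pthread_join(s1, NULL);
	pthread_join(s2, NULL);

	puts("race not hit this run; the timing window is narrow");
	return 0;
}
```
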
- **Issue**: The current logic allows concurrent sessions to pass the
check for existing coordinators, leading to an attempt to insert
duplicate nodes, which triggers the assertion failure.

- **Impact**: This race condition leads to crashes during operations
that involve concurrent coordinator additions, as seen in
#7646.

**Test Plan:**

- Isolation Test Limitation: An isolation test was added to simulate
concurrent additions of the same coordinator node, but due to the
behavior of PostgreSQL locking mechanisms, the test does not trigger the
edge case. The lock applied within the function serializes the
operations, preventing the race condition from occurring in the
isolation test environment.
While the edge case is difficult to reproduce in an isolation test, the
fix addresses the core issue by ensuring concurrency control through
proper locking.

- Existing Tests: All existing tests related to node metadata and
coordinator management have been run to ensure that no regressions were
introduced.

**After the Fix:**

- Concurrent attempts to add the same coordinator node will be
serialized. One session will succeed in adding the node, while the
others will skip the operation without crashing the server.

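Continuing the same toy model from the sketch above, the serialized behaviour described here can be pictured as checking only while the lock is held. This is just the check-under-lock idea, not the actual patch in #7682.

```c
/*
 * Illustration of the serialized flow, reusing metadataLock and
 * coordinatorInMetadata from the earlier sketch. Not the real fix.
 */
static void *
SetCoordinatorHostSerialized(void *unused)
{
	(void) unused;

	pthread_mutex_lock(&metadataLock);

	if (!coordinatorInMetadata)          /* check performed while holding the lock */
	{
		coordinatorInMetadata = true;    /* only one session performs the insert */
	}
	/* any other session sees the existing entry and skips the insert,
	 * instead of reaching Assert(!nodeAlreadyExists) */

	pthread_mutex_unlock(&metadataLock);
	return NULL;
}
```
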
Co-authored-by: Mehmet YILMAZ <mehmet.yilmaz@microsoft.com>
winter-loo pushed a commit to winter-loo/citus that referenced this issue Jan 7, 2025
… coordinator nodes concurrently (citusdata#7682)

naisila pushed a commit that referenced this issue Jan 13, 2025
… coordinator nodes concurrently (#7682)

(cherry picked from commit 4775715)