Assert when executing SELECT citus_set_coordinator_host('localhost'); #7646

Closed
saygoodbyye opened this issue Jul 8, 2024 · 0 comments · Fixed by #7682

saygoodbyye commented Jul 8, 2024

Postgres: REL_16_STABLE
Citus: main
My Postgres configuration:

CFLAGS="-Og" ./configure --enable-debug --enable-tap-tests --with-libxml --with-openssl --enable-cassert --with-lz4 --quiet --prefix="$PGPREFIX"

Way to reproduce the problem:
Leave only this line in multi_schedule, so that the same test runs in parallel with itself:

test: multi_cluster_management multi_cluster_management multi_cluster_management

Run tests:

make check-multi

regression.out:

# using postmaster on localhost, port 57636
# parallel group (3 tests):  multi_cluster_management multi_cluster_management multi_cluster_management
not ok 1     + multi_cluster_management                  776 ms
# (test process exited with exit code 2)
not ok 2     + multi_cluster_management                  780 ms
# (test process exited with exit code 2)
not ok 3     + multi_cluster_management                  776 ms
# (test process exited with exit code 2)
1..3
# 3 of 3 tests failed.

backtrace:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000076e954e4526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000076e954e288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000057a06af88d6a in ExceptionalCondition (conditionName=conditionName@entry=0x76e955573495 "!nodeAlreadyExists", fileName=fileName@entry=0x76e95557339f "metadata/node_metadata.c", lineNumber=lineNumber@entry=233) at assert.c:66
#6  0x000076e9554fb93c in citus_set_coordinator_host (fcinfo=0x57a06c20f8a0) at metadata/node_metadata.c:233
#7  0x000057a06ac746c9 in ExecInterpExpr (state=0x57a06c20f7c8, econtext=0x57a06c20f4f0, isnull=0x7ffe27c929d7) at execExprInterp.c:758
#8  0x000057a06ac7119e in ExecInterpExprStillValid (state=0x57a06c20f7c8, econtext=0x57a06c20f4f0, isNull=0x7ffe27c929d7) at execExprInterp.c:1870
#9  0x000057a06acaeb76 in ExecEvalExprSwitchContext (state=<optimized out>, econtext=<optimized out>, isNull=<optimized out>) at ../../../src/include/executor/executor.h:355
#10 0x000057a06acaebbd in ExecProject (projInfo=0x57a06c20f7c0) at ../../../src/include/executor/executor.h:389
#11 0x000057a06acaed5d in ExecResult (pstate=<optimized out>) at nodeResult.c:136
#12 0x000057a06ac80cde in ExecProcNodeFirst (node=0x57a06c20f3e0) at execProcnode.c:464
#13 0x000057a06ac78b8d in ExecProcNode (node=node@entry=0x57a06c20f3e0) at ../../../src/include/executor/executor.h:273
#14 0x000057a06ac78c1d in ExecutePlan (estate=estate@entry=0x57a06c20f1b8, planstate=0x57a06c20f3e0, use_parallel_mode=<optimized out>, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0,
    direction=ForwardScanDirection, dest=0x57a06c22f3c0, execute_once=true) at execMain.c:1670
#15 0x000057a06ac7994f in standard_ExecutorRun (queryDesc=queryDesc@entry=0x57a06c103de8, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=execute_once@entry=true) at execMain.c:365
#16 0x000076e9554de7ea in CitusExecutorRun (queryDesc=0x57a06c103de8, direction=ForwardScanDirection, count=0, execute_once=true) at executor/multi_executor.c:238
#17 0x000076e954c86e19 in pgss_ExecutorRun (queryDesc=0x57a06c103de8, direction=ForwardScanDirection, count=0, execute_once=true) at pg_stat_statements.c:1008
#18 0x000057a06ac79a6b in ExecutorRun (queryDesc=queryDesc@entry=0x57a06c103de8, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>) at execMain.c:307
#19 0x000057a06ae4e501 in PortalRunSelect (portal=portal@entry=0x57a06c185828, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x57a06c22f3c0) at pquery.c:924
#20 0x000057a06ae4fd51 in PortalRun (portal=portal@entry=0x57a06c185828, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x57a06c22f3c0, altdest=altdest@entry=0x57a06c22f3c0,
    qc=0x7ffe27c92f50) at pquery.c:768
#21 0x000057a06ae4bf4b in exec_simple_query (query_string=query_string@entry=0x57a06c06a498 "SELECT citus_set_coordinator_host('localhost');") at postgres.c:1274
#22 0x000057a06ae4dd9d in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4637
#23 0x000057a06adafe4c in BackendRun (port=port@entry=0x57a06c11f230) at postmaster.c:4464
#24 0x000057a06adb1e64 in BackendStartup (port=port@entry=0x57a06c11f230) at postmaster.c:4192
#25 0x000057a06adb2002 in ServerLoop () at postmaster.c:1782
#26 0x000057a06adb353e in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x57a06c024240) at postmaster.c:1466
#27 0x000057a06acd8173 in main (argc=5, argv=0x57a06c024240) at main.c:198

I also tried building Postgres without asserts enabled and running this test; the server didn't crash, the tests just failed.

Best regards,
Egor Chindyaskin
Postgres Professional: https://postgrespro.com/

m3hm3t self-assigned this Sep 4, 2024
m3hm3t added a commit that referenced this issue Sep 9, 2024
… coordinator nodes concurrently (#7682)

When multiple sessions concurrently attempt to add the same coordinator
node using `citus_set_coordinator_host`, there is a potential race
condition. Both sessions may pass the initial metadata check
(`isCoordinatorInMetadata`), but only one will succeed in adding the
node. The other session will fail with an assertion error
(`Assert(!nodeAlreadyExists)`), causing the server to crash. Even though
the `AddNodeMetadata` function takes an exclusive lock, it appears that
the lock is acquired too late to prevent the race at the initial
metadata check.

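To make the shape of the bug visible without reading node_metadata.c, here is a minimal, self-contained sketch of the check-then-act race in plain C with pthreads. It is only an illustration: `AddCoordinatorRow`, `SetCoordinatorHost` and the shared boolean are stand-ins for `AddNodeMetadata`, `citus_set_coordinator_host` and the coordinator's metadata entry, not the real Citus code.

```c
/*
 * Toy model of the check-then-act race: two "sessions" check for the
 * coordinator before either has inserted it, then both try to insert.
 * Not Citus code; names are illustrative only.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static bool coordinatorInMetadata = false;              /* stands in for the metadata row */
static pthread_mutex_t metadataLock = PTHREAD_MUTEX_INITIALIZER;

/* Models AddNodeMetadata(): takes the lock, reports whether the row already existed. */
static void
AddCoordinatorRow(bool *nodeAlreadyExists)
{
	pthread_mutex_lock(&metadataLock);
	*nodeAlreadyExists = coordinatorInMetadata;
	coordinatorInMetadata = true;
	pthread_mutex_unlock(&metadataLock);
}

/* Models one session running SELECT citus_set_coordinator_host(...):
 * the existence check happens BEFORE the lock is taken, which is the bug
 * being modeled here. */
static void *
SetCoordinatorHost(void *unused)
{
	(void) unused;

	if (!coordinatorInMetadata)                          /* both sessions can pass this check */
	{
		bool nodeAlreadyExists = false;

		AddCoordinatorRow(&nodeAlreadyExists);
		assert(!nodeAlreadyExists);                      /* the losing session trips this */
	}
	return NULL;
}

int
main(void)
{
	pthread_t s1, s2;

	pthread_create(&s1, NULL, SetCoordinatorHost, NULL);
	pthread_create(&s2, NULL, SetCoordinatorHost, NULL);
	pthread_join(s1, NULL);
	pthread_join(s2, NULL);

	puts("race not hit this run; the timing window is narrow");
	return 0;
}
```
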
- **Issue**: The current logic allows concurrent sessions to pass the
check for existing coordinators, leading to an attempt to insert
duplicate nodes, which triggers the assertion failure.

- **Impact**: This race condition leads to crashes during operations
that involve concurrent coordinator additions, as seen in
#7646.

**Test Plan:**

- Isolation Test Limitation: An isolation test was added to simulate
concurrent additions of the same coordinator node, but due to the
behavior of PostgreSQL locking mechanisms, the test does not trigger the
edge case. The lock applied within the function serializes the
operations, preventing the race condition from occurring in the
isolation test environment.
While the edge case is difficult to reproduce in an isolation test, the
fix addresses the core issue by ensuring concurrency control through
proper locking.

- Existing Tests: All existing tests related to node metadata and
coordinator management have been run to ensure that no regressions were
introduced.

**After the Fix:**

- Concurrent attempts to add the same coordinator node will be
serialized. One session will succeed in adding the node, while the
others will skip the operation without crashing the server.

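Continuing the same toy model from the sketch above, the serialized behaviour described here can be pictured as checking only while the lock is held. This is just the check-under-lock idea, not the actual patch in #7682.

```c
/*
 * Illustration of the serialized flow, reusing metadataLock and
 * coordinatorInMetadata from the earlier sketch. Not the real fix.
 */
static void *
SetCoordinatorHostSerialized(void *unused)
{
	(void) unused;

	pthread_mutex_lock(&metadataLock);

	if (!coordinatorInMetadata)          /* check performed while holding the lock */
	{
		coordinatorInMetadata = true;    /* only one session performs the insert */
	}
	/* any other session sees the existing entry and skips the insert,
	 * instead of reaching Assert(!nodeAlreadyExists) */

	pthread_mutex_unlock(&metadataLock);
	return NULL;
}
```
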
Co-authored-by: Mehmet YILMAZ <mehmet.yilmaz@microsoft.com>
winter-loo pushed a commit to winter-loo/citus that referenced this issue Jan 7, 2025
… coordinator nodes concurrently (citusdata#7682)

naisila pushed a commit that referenced this issue Jan 13, 2025
… coordinator nodes concurrently (#7682)

(cherry picked from commit 4775715)