Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry-pick from main to release-13.0 #7841

Merged
merged 14 commits into from
Jan 13, 2025
Merged

Conversation

naisila
Copy link
Member

@naisila naisila commented Jan 12, 2025

need PR to trigger tests

crabhi and others added 6 commits January 12, 2025 22:54
DESCRIPTION: citus_move_shard_placement now fails early when shard
cannot be safely moved

The implementation is quite simplistic -
`citus_move_shard_placement(...)` will fail with an error if there's any
new node in the cluster that doesn't have reference tables yet.

It could have been finer-grained, i.e. erroring only when trying to move
a shard to an unitialized node. Looking at the related functions -
`replicate_reference_tables()` or `citus_rebalance_start()`, I think
it's acceptable behaviour. These other functions also treat "any"
unitialized node as a temporary anomaly.

Fixes #7426

---------

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Store the previous shard cost so that the invariant checking performs as
expected.
… single-node cluster (#7552)

This fixes #7551 reported by Egor Chindyaskin

Function activate_node_snapshot() is not meant to be called on a cluster
without worker nodes. This commit adds ERROR report for such case to
prevent server crash.
…ion (#7534)

Fixes #7533.

DESCRIPTION: Fixes incorrect `VALID UNTIL` setting assumption made for
roles when syncing them to new nodes
#7607)

DESCRIPTION: Use macro wrapper to access PGPROC data, to improve compatibility with PostgreSQL forks.
DESCRIPTION: Add a check to see if the given limit is null. 

Fixes a bug by checking if the limit given in the query is null when the
actual limit is computed with respect to the given offset.
Prior to this change, null is interpreted as 0 during the limit
calculation when both limit and offset are given.

Fixes #7663
Copy link

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 86.99187% with 16 lines in your changes missing coverage. Please review.

Project coverage is 89.48%. Comparing base (ddef972) to head (c55bc8c).
Report is 15 commits behind head on release-13.0.

Additional details and impacted files
@@              Coverage Diff              @@
##           release-13.0    #7841   +/-   ##
=============================================
  Coverage         89.47%   89.48%           
=============================================
  Files               274      276    +2     
  Lines             59967    60063   +96     
  Branches           7506     7524   +18     
=============================================
+ Hits              53656    53747   +91     
+ Misses             4167     4166    -1     
- Partials           2144     2150    +6     

m3hm3t and others added 7 commits January 13, 2025 15:20
… coordinator nodes concurrently (#7682)

When multiple sessions concurrently attempt to add the same coordinator
node using `citus_set_coordinator_host`, there is a potential race
condition. Both sessions may pass the initial metadata check
(`isCoordinatorInMetadata`), but only one will succeed in adding the
node. The other session will fail with an assertion error
(`Assert(!nodeAlreadyExists)`), causing the server to crash. Even though
the `AddNodeMetadata` function takes an exclusive lock, it appears that
the lock is not preventing the race condition before the initial
metadata check.

- **Issue**: The current logic allows concurrent sessions to pass the
check for existing coordinators, leading to an attempt to insert
duplicate nodes, which triggers the assertion failure.

- **Impact**: This race condition leads to crashes during operations
that involve concurrent coordinator additions, as seen in
#7646.

**Test Plan:**

- Isolation Test Limitation: An isolation test was added to simulate
concurrent additions of the same coordinator node, but due to the
behavior of PostgreSQL locking mechanisms, the test does not trigger the
edge case. The lock applied within the function serializes the
operations, preventing the race condition from occurring in the
isolation test environment.
While the edge case is difficult to reproduce in an isolation test, the
fix addresses the core issue by ensuring concurrency control through
proper locking.

- Existing Tests: All existing tests related to node metadata and
coordinator management have been run to ensure that no regressions were
introduced.

**After the Fix:**

- Concurrent attempts to add the same coordinator node will be
serialized. One session will succeed in adding the node, while the
others will skip the operation without crashing the server.

Co-authored-by: Mehmet YILMAZ <mehmet.yilmaz@microsoft.com>
(cherry picked from commit 4775715)
#7659)

We were writing incorrect data to target collection in some cases of merge command. In case of repartition when source query is RELATION. We were referring to incorrect attribute number that was resulting into
this incorrect behavior.

Example :

![image](https://github.com/user-attachments/assets/a101cb36-7976-459c-befb-96a55a5b3dc1)

![image](https://github.com/user-attachments/assets/e5c83b7b-5b8e-4d79-a927-95684dc9ba49)

I have added fixed tests as part of this PR , Thanks.

(cherry picked from commit 5bad6c6)
…n may cause segfault #7705

In function MasterAggregateMutator(), when the original Node is a Var node use makeVar() instead
of copyObject() when constructing the Var node for the target list of the combine query.
The varnullingrels field of the original Var node is ignored because it is not relevant for the
combine query; copying this cause the problem in issue 7705, where a coordinator query had
a Var with a reference to a non-existent join relation.

(cherry picked from commit c52f360)
Co-authored-by: Pavel Seleznev <PNSeleznev@sberbank.ru>
(cherry picked from commit fe6d198)
…when the application_name changes (#7791)

DESCRIPTION: Fixes a crash that happens because of unsafe catalog access
when re-assigning the global pid after application_name changes.

When application_name changes, we don't actually need to
try re-assigning the global pid for external client backends because
application_name doesn't affect the global pid for such backends. Plus,
trying to re-assign the global pid for external client backends would
unnecessarily cause performing a catalog access when the cached local
node id is invalidated. However, accessing to the catalog tables is
dangerous in certain situations like when we're not in a transaction
block. And for the other types of backends, i.e., the Citus internal
backends, we need to re-assign the global pid when the application_name
changes because for such backends we simply extract the global pid
inherited from the originating backend from the application_name -that's
specified by originating backend when openning that connection- and this
doesn't require catalog access.

(cherry picked from commit 7341191)
…stributed transaction: Raise an ERROR instead of a crash

(cherry picked from commit ab7c13b)
DESCRIPTION: Shard moves/isolate report LSN's in lsn format

While investigating an issue with our catchup mechanism on certain
postgres versions we noticed we print LSN's in the format of the native
long type. This is an uncommon representation for LSN's in postgres
logs.

This patch changes the output of our log message to go from the long
type representation to the native LSN type representation. Making it
easier for postgres users to recognize and compare LSN's with other
related reports.

example of new output:
```
2023-09-25 17:28:47.544 CEST [11345] LOG:  The LSN of the target subscriptions on node localhost:9701 have increased from 0/0 to 0/E1ED20F8 at 2023-09-25 17:28:47.544165+02 where the source LSN is 1/415DCAD0
```

(cherry picked from commit b87fbcb)
@naisila naisila force-pushed the release-13.0-naisila branch 2 times, most recently from 520d9d7 to 1468ae3 Compare January 13, 2025 15:07
Propagates SECURITY LABEL ON ROLE stmt (#7304)
We propagate `SECURITY LABEL [for provider] ON ROLE rolename IS
labelname` to the worker nodes.
We also make sure to run the relevant `SecLabelStmt` commands on a
newly added node by looking at roles found in `pg_shseclabel`.

See official docs for explanation on how this command works:
https://www.postgresql.org/docs/current/sql-security-label.html
This command stores the role label in the `pg_shseclabel` catalog table.

This commit also fixes the regex string in
`check_gucs_are_alphabetically_sorted.sh` script such that it escapes
the dot. Previously it was looking for all strings starting with "citus"
instead of "citus." as it should.

To test this feature, I currently make use of a special GUC to control
label provider registration in PG_init when creating the Citus extension.

(cherry picked from commit 0d1f188)

Co-authored-by: Naisila Puka <37271756+naisila@users.noreply.github.com>
(cherry picked from commit 686d2b4)
@naisila naisila force-pushed the release-13.0-naisila branch from 1468ae3 to c55bc8c Compare January 13, 2025 16:56
@naisila naisila merged commit c55bc8c into release-13.0 Jan 13, 2025
153 of 155 checks passed
@naisila naisila deleted the release-13.0-naisila branch January 13, 2025 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.