Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: cherry-pick better local dev for ssh from gpdb close #746 #747

Merged
merged 3 commits into from
Dec 9, 2024

Conversation

yihong0618
Copy link
Contributor

@yihong0618 yihong0618 commented Dec 3, 2024

Fixes #746

What does this PR do?

This PR cherry-pick three commits from gpdb for better local dev.

commit1: add start_new_session
commit2: do not need to ssh if developing cloudberry/gpdb in local env

commit3: base from coomit1

for commit2 benefit: as the test before, make test local can also speed up.

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context


This comment should now re-trigger a CI run. The previous skip ci has been removed.

@avamingli avamingli added the cherry-pick cherry-pick upstream commts label Dec 3, 2024
@yihong0618 yihong0618 changed the title Hy/cherry pick ssh fix: cherry-pick better local dev for ssh from gpdb Dec 3, 2024
@yihong0618 yihong0618 changed the title fix: cherry-pick better local dev for ssh from gpdb fix: cherry-pick better local dev for ssh from gpdb close #746 Dec 3, 2024
@avamingli
Copy link
Contributor

@yihong0618 Thanks for doing this.

@edespino CI jobs succeed immediately?

image

image

@edespino
Copy link
Contributor

edespino commented Dec 3, 2024

@avamingli - Woah .. that is too fast!

@edespino
Copy link
Contributor

edespino commented Dec 3, 2024

It is probably due to the PR template. It has:

⚠️ To skip CI: Add [skip ci] to your PR title. Only use when necessary! ⚠️

@yihong0618
Copy link
Contributor Author

It is probably due to the PR template. It has:

⚠️ To skip CI: Add [skip ci] to your PR title. Only use when necessary! ⚠️

copy, will remember that next time

@edespino
Copy link
Contributor

edespino commented Dec 3, 2024

Sorry about that ... I will need to submit a workflow PR tomorrow to only allow a skip in the title.

@leborchuk
Copy link
Contributor

Sorry, could you explain how start_new_session is used? I see the new parameter was added in 0aad1ca but it has the False value everywhere, how it is used?

@yihong0618
Copy link
Contributor Author

Sorry, could you explain how start_new_session is used? I see the new parameter was added in 0aad1ca but it has the False value everywhere, how it is used?

seems its not a necessary arg, but cherry-pick this commit to sync with upstream code is better

@avamingli
Copy link
Contributor

Sorry, could you explain how start_new_session is used? I see the new parameter was added in 0aad1ca but it has the False value everywhere, how it is used?

seems its not a necessary arg, but cherry-pick this commit to sync with upstream code is better

Do we miss some commits around that param?

@avamingli
Copy link
Contributor

Sorry, could you explain how start_new_session is used? I see the new parameter was added in 0aad1ca but it has the False value everywhere, how it is used?

seems its not a necessary arg, but cherry-pick this commit to sync with upstream code is better

Do we miss some commits around that param?

Have a search, the param is true in commit avamingli/gpdb@01bbcb6
which has not been cherry-picked yet.

@yihong0618
Copy link
Contributor Author

Sorry, could you explain how start_new_session is used? I see the new parameter was added in 0aad1ca but it has the False value everywhere, how it is used?

seems its not a necessary arg, but cherry-pick this commit to sync with upstream code is better

Do we miss some commits around that param?

Have a search, the param is true in commit avamingli/gpdb@01bbcb6 which has not been cherry-picked yet.

Thanks will cherry-pick it too

@avamingli
Copy link
Contributor

Hi @yihong0618 , thanks for your help again.

I see commit: 0aad1ca has the same commit title and message with 12d8566 but different codes.

I guess GPDB has only one commit with that title and message, is there sth wrong?

image

@yihong0618
Copy link
Contributor Author

Hi @yihong0618 , thanks for your help again.

I see commit: 0aad1ca has the same commit title and message with 12d8566 but different codes.

I guess GPDB has only one commit with that title and message, is there sth wrong?

image

will try to figure maybe this weekend thanks~

jnihal and others added 3 commits December 8, 2024 14:59
Currently, gprecoverseg lacks the ability to handle interrupts. When the main gprecoverseg process is terminated, it leaves behind orphaned `gpsegrecovery, pg_rewind, or pg_basebackup` processes that the user would have to manually terminate.

This commit aims to add signal handling capabilities to the gprecoverseg code so that we can gracefully terminate the gprecoverseg process by using appropriate signal handlers and performing necessary cleanup actions, instead of abruptly terminating only the main process.

The overall process of termination using signal handlers is described below:

- Signals are caught by the signal handlers in the `gprecoverseg` code. This signal handler then gets the PIDs of the `gpsegsetuprecovery/gpsegrecovery` process running on each segment host and sends a `SIGTERM` signal to them.
- The `gpsegrecovery` process also has a signal handler that will be used to handle the `SIGTERM` signal sent by the main process. We do not need any signal handler for the `gpsegsetuprecovery` code as it does not spawn any new process and we can directly terminate it.
- The signal handler in the `gpsegrecovery` code would then call `terminate_proc_tree` which would terminate the entire process tree of `gpsegrecovery` process, since it is the parent PID of all the child processes (either `pg_rewind/pg_basebackup/rsync`). We do not terminate the `gpsegrecovery` process here since we would need it to perform proper error handling and cleanup actions.
- The above steps are repeated unitl the `WorkerPool` is done with its work. This is to make sure that we are not leaving behind any unfinished commands that might be waiting in the `WorkerPool` queue.
- Termination of the `pg_rewind/pg_basebackup/rsync` process will raise an error, which will be propagated back to the user, and it would also trigger the necessary cleanup actions that are already in place whenever a failure occurs.

The behavior of gprecoverseg on different termination scenarios will be as follows:

1. **Using CTRL-C**: This sends a `SIGINT` signal, and upon receiving this, the user would be given a prompt if they want to proceed further with the termination.
```
It is not recommended to terminate a recovery procedure midway. However, if you choose to proceed, you will need to run either gprecoverseg --differential or gprecoverseg -F to start a new recovery process the next time.

Continue terminating gprecoverseg Yy|Nn (default=N):
>
```
2. **Using kill from the terminal**: This, by default, sends a `SIGTERM` signal, and upon receiving this signal, gprecoverseg would directly terminate without any confirmation from the user.
3. **SSH disconnections**: This sends a `SIGHUP` signal, and currently, this signal will be ignored as we do not want to terminate the recovery process due to such issues.

The signal handlers are placed just before the recovery process is started so that it does not affect any other functionality of gprecoverseg (like rebalancing).
This commit adds a new parameter `start_new_session` to the Command class. When set to `True`, it spawns the processes created by the Popen object in a separate session from the parent process. This can be useful in cases where you want to affect only the parent process without affecting the child processes, such as sending a termination signal.

By default the parameter is set to `False`.
…812)

We can skip ssh connections if segments are on the same host with the
coordinator. We can save some time when executing commands. E.g., with
this patch, we don't ssh connection when creating the demo cluster and
the creation time can be reduced from 18s to 10s on my laptop. Besides,
we can skip setting up ssh keys for deploying the demo cluster.
@yihong0618
Copy link
Contributor Author

fixed

@avamingli
Copy link
Contributor

fixed

Thanks for your help.

@yihong0618
Copy link
Contributor Author

failed diff

diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/full_stats.out /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/full_stats.out
--- /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/full_stats.out	2024-12-07 23:38:36.283181352 -0800
+++ /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/full_stats.out	2024-12-07 23:38:36.287181360 -0800
@@ -32,6 +32,6 @@
 SELECT relname, reltuples FROM pg_class WHERE relname = 'uao_full_stats_index';
        relname        | reltuples 
 ----------------------+-----------
- uao_full_stats_index |        85
+ uao_full_stats_index |       185
 (1 row)
 
diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/stats.out /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/stats.out
--- /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/stats.out	2024-12-07 23:38:36.663182147 -0800
+++ /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/stats.out	2024-12-07 23:38:36.667182155 -0800
@@ -38,7 +38,7 @@
 SELECT relname, reltuples FROM pg_class WHERE relname = 'uao_stats_index';
      relname     | reltuples 
 -----------------+-----------
- uao_stats_index |        88
+ uao_stats_index |       151
 (1 row)
 
 -- re-setup for next case

Copy link
Member

@yjhjstz yjhjstz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@my-ship-it my-ship-it left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@my-ship-it my-ship-it merged commit cd4a4d2 into apache:main Dec 9, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick cherry-pick upstream commts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] make create-demo-cluster问题
8 participants