-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: cherry-pick better local dev for ssh from gpdb close #746 #747
Conversation
@yihong0618 Thanks for doing this. @edespino CI jobs succeed immediately? |
@avamingli - Woah .. that is too fast! |
It is probably due to the PR template. It has: |
copy, will remember that next time |
Sorry about that ... I will need to submit a workflow PR tomorrow to only allow a skip in the title. |
Sorry, could you explain how start_new_session is used? I see the new parameter was added in 0aad1ca but it has the False value everywhere, how it is used? |
seems its not a necessary arg, but cherry-pick this commit to sync with upstream code is better |
Do we miss some commits around that param? |
Have a search, the param is true in commit avamingli/gpdb@01bbcb6 |
Thanks will cherry-pick it too |
Hi @yihong0618 , thanks for your help again. I see commit: 0aad1ca has the same commit title and message with 12d8566 but different codes. I guess GPDB has only one commit with that title and message, is there sth wrong? |
will try to figure maybe this weekend thanks~ |
Currently, gprecoverseg lacks the ability to handle interrupts. When the main gprecoverseg process is terminated, it leaves behind orphaned `gpsegrecovery, pg_rewind, or pg_basebackup` processes that the user would have to manually terminate. This commit aims to add signal handling capabilities to the gprecoverseg code so that we can gracefully terminate the gprecoverseg process by using appropriate signal handlers and performing necessary cleanup actions, instead of abruptly terminating only the main process. The overall process of termination using signal handlers is described below: - Signals are caught by the signal handlers in the `gprecoverseg` code. This signal handler then gets the PIDs of the `gpsegsetuprecovery/gpsegrecovery` process running on each segment host and sends a `SIGTERM` signal to them. - The `gpsegrecovery` process also has a signal handler that will be used to handle the `SIGTERM` signal sent by the main process. We do not need any signal handler for the `gpsegsetuprecovery` code as it does not spawn any new process and we can directly terminate it. - The signal handler in the `gpsegrecovery` code would then call `terminate_proc_tree` which would terminate the entire process tree of `gpsegrecovery` process, since it is the parent PID of all the child processes (either `pg_rewind/pg_basebackup/rsync`). We do not terminate the `gpsegrecovery` process here since we would need it to perform proper error handling and cleanup actions. - The above steps are repeated unitl the `WorkerPool` is done with its work. This is to make sure that we are not leaving behind any unfinished commands that might be waiting in the `WorkerPool` queue. - Termination of the `pg_rewind/pg_basebackup/rsync` process will raise an error, which will be propagated back to the user, and it would also trigger the necessary cleanup actions that are already in place whenever a failure occurs. The behavior of gprecoverseg on different termination scenarios will be as follows: 1. **Using CTRL-C**: This sends a `SIGINT` signal, and upon receiving this, the user would be given a prompt if they want to proceed further with the termination. ``` It is not recommended to terminate a recovery procedure midway. However, if you choose to proceed, you will need to run either gprecoverseg --differential or gprecoverseg -F to start a new recovery process the next time. Continue terminating gprecoverseg Yy|Nn (default=N): > ``` 2. **Using kill from the terminal**: This, by default, sends a `SIGTERM` signal, and upon receiving this signal, gprecoverseg would directly terminate without any confirmation from the user. 3. **SSH disconnections**: This sends a `SIGHUP` signal, and currently, this signal will be ignored as we do not want to terminate the recovery process due to such issues. The signal handlers are placed just before the recovery process is started so that it does not affect any other functionality of gprecoverseg (like rebalancing).
This commit adds a new parameter `start_new_session` to the Command class. When set to `True`, it spawns the processes created by the Popen object in a separate session from the parent process. This can be useful in cases where you want to affect only the parent process without affecting the child processes, such as sending a termination signal. By default the parameter is set to `False`.
…812) We can skip ssh connections if segments are on the same host with the coordinator. We can save some time when executing commands. E.g., with this patch, we don't ssh connection when creating the demo cluster and the creation time can be reduced from 18s to 10s on my laptop. Besides, we can skip setting up ssh keys for deploying the demo cluster.
77c7a28
to
baedf1f
Compare
fixed |
Thanks for your help. |
failed diff diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/full_stats.out /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/full_stats.out
--- /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/full_stats.out 2024-12-07 23:38:36.283181352 -0800
+++ /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/full_stats.out 2024-12-07 23:38:36.287181360 -0800
@@ -32,6 +32,6 @@
SELECT relname, reltuples FROM pg_class WHERE relname = 'uao_full_stats_index';
relname | reltuples
----------------------+-----------
- uao_full_stats_index | 85
+ uao_full_stats_index | 185
(1 row)
diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/stats.out /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/stats.out
--- /__w/cloudberry/cloudberry/src/test/regress/expected/uao_compaction/stats.out 2024-12-07 23:38:36.663182147 -0800
+++ /__w/cloudberry/cloudberry/src/test/regress/results/uao_compaction/stats.out 2024-12-07 23:38:36.667182155 -0800
@@ -38,7 +38,7 @@
SELECT relname, reltuples FROM pg_class WHERE relname = 'uao_stats_index';
relname | reltuples
-----------------+-----------
- uao_stats_index | 88
+ uao_stats_index | 151
(1 row)
-- re-setup for next case
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes #746
What does this PR do?
This PR cherry-pick three commits from gpdb for better local dev.
commit1: add start_new_session
commit2: do not need to ssh if developing cloudberry/gpdb in local env
commit3: base from coomit1
for commit2 benefit: as the test before, make test local can also speed up.
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
This comment should now re-trigger a CI run. The previous skip ci has been removed.