-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
graceful shutdown looks like lose the received events #2259
Labels
bug
Something isn't working
Comments
Hmm.... looks like in_forward stops event loop to shutdown and it causes pending requests are ignored when at-most-once. |
daipom
added a commit
to daipom/fluentd
that referenced
this issue
Nov 25, 2024
This replaces the current `GracefulReload` (`SIGUSR2`) (fluent#2716) with the new feature on non-Windows: * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/config.gracefulReload` * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (fluent#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see fluent#4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * fluent#4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional GracefulReload: * The traditional GracefulReload feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * fluent#2259 * fluent#3469 * fluent#3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 25, 2024
This replaces the current `GracefulReload` (`SIGUSR2`) (#2716) with the new feature on non-Windows: * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/config.gracefulReload` * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional GracefulReload: * The traditional GracefulReload feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 26, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 26, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 26, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 26, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 27, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 27, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 27, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Co-authored-by: Kentaro Hayashi <hayashi@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 27, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Co-authored-by: Kentaro Hayashi <hayashi@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 27, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Co-authored-by: Kentaro Hayashi <hayashi@clear-code.com> Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom
added a commit
that referenced
this issue
Nov 28, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows). * Restart the new process with zero downtime The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. Specification: * 2 ways to trigger this feature (non-Windows): * Signal: `SIGUSR2` to the supervisor. * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload. * (Leave the traditional way, just in case) * RPC: `/api/processes.zeroDowntimeRestart` * Leave `/api/config.gracefulReload` for the traditional feature. * This starts the new supervisor and workers with zero downtime for some plugins. * Input plugins with `zero_downtime_restart` supported work in parallel. * Supported input plugins: * `in_tcp` * `in_udp` * `in_syslog` * The old processes stop after 10s. * The new supervisor works in `source-only` mode (#4661) until the old processes stop. * After the old processes stop, the data handled by the new processes are loaded and processed. * If need, you can configure `source_only_buffer` (see #4661). * Windows: Not affected at all. Remains the traditional GracefulReload. Mechanism: 1. The supervisor receives SIGUSR2. 2. Spawn a new supervisor. 3. Take over shared sockets. 4. Launch new workers, and stop old processes in parallel. * Launch new workers with source-only mode * Limit to zero_downtime_restart_ready? input plugin * Send SIGTERM to the old supervisor after 10s delay from 3. 5. The old supervisor stops and sends SIGWINCH to the new one. 6. The new workers run fully. Note: need these feature * #4661 * treasure-data/serverengine#146 Conditions under which `zero_downtime_restart_ready?` can be enabled: * Must be able to work in parallel with another Fluentd instance. * Notes: * The sockets provided by server helper are shared with the new Fluentd instance. * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`. * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place. * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider. The appropriateness of replacing the traditional SIGUSR2: * The traditional SIGUSR2 feature has some limitations and issues. * Limitations: 1. A change to system_config is ignored because it needs to restart(kill/spawn) process. 2. All plugins must not use class variable when restarting. * Issues: * #2259 * #3469 * #3549 * This new feature allows restarts without downtime and such limitations. * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss). * This new feature has a big advantage that it can also be used to update Fluentd. * In the future, fluent-package will use this feature to allow update with zero downtime by default. * If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers. Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com> Co-authored-by: Shizuo Fujita <fujita@clear-code.com> Co-authored-by: Kentaro Hayashi <hayashi@clear-code.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
fluentd or td-agent version.
fluentd 1.3.3
Environment information
linux
ruby 2.5.1
configuration
graceful shutdown immediately after post a large number of events looks like lose received events.
case 1
post 10 events -> graceful shutdown immediately -> file output contains 10 events
case 2
post 100000 events -> graceful shutdown immediately -> file output doesn't contains 100000 events
desired behavior of case 2
post 100000 events -> graceful shutdown immediately -> file output contains 100000 events
The text was updated successfully, but these errors were encountered: