Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: adjust prow tide probe conf #1242

Closed

Conversation

purelind
Copy link
Contributor

@purelind purelind commented Aug 30, 2024

  • During periods of high traffic, the tide interface response will be very slow and there is a probability that it may cause Pod restarts. During the initial startup period, the healthy interface response will also be slow, but the service can still handle other requests normally. Therefore, modify the probe settings to perform liveness and readiness checks 10 minutes after the service starts.
  • To prevent resource shortages from causing service restarts, set resource limits to 1 core and 4GB.

@ti-chi-bot ti-chi-bot bot requested a review from wuhuizuo August 30, 2024 10:26
Copy link
Contributor

ti-chi-bot bot commented Aug 30, 2024

I have already done a preliminary review for you, and I hope to help you do a better job.

Based on the pull request, the key changes made are:

  1. Adjusting the probe settings to perform liveness and readiness checks five minutes after the service starts to prevent slow interface responses and Pod restarts.
  2. Setting resource limits to 1 core and 4GB to prevent resource shortages from causing service restarts.

There are no potential problems identified in the pull request. However, some suggestions for further improvement are:

  1. Instead of setting a fixed delay of 5 minutes for liveness and readiness probe, it is better to use a startupProbe that checks the status of the service before the main probes start running.
  2. The failure threshold for the liveness and readiness probes can be increased to reduce the number of unnecessary restarts.
  3. It is a good practice to use resource requests and limits that are closer to the actual usage of the service to prevent overprovisioning or underprovisioning.

Therefore, the following changes can be made to the pull request:

  1. Add a startupProbe to check the status of the service before the main probes start running.
  2. Increase the failure threshold for the liveness and readiness probes.
  3. Use resource requests and limits that are closer to the actual usage of the service.

These changes will improve the reliability and performance of the service.

@purelind
Copy link
Contributor Author

/hold

Copy link
Contributor

ti-chi-bot bot commented Aug 30, 2024

I have already done a preliminary review for you, and I hope to help you do a better job.

Key changes:

  • Added startupProbe to perform liveness and readiness checks five minutes after the service starts.
  • Increased the periodSeconds and failureThreshold of livenessProbe and readinessProbe.
  • Set resource limits to 1 core and 4GB.

Potential problems:

  • The increased periodSeconds and failureThreshold of livenessProbe and readinessProbe may cause delays in detecting unhealthy pods.
  • The resource limits may not be sufficient for high traffic periods.

Fixing suggestions:

  • Consider setting initialDelaySeconds to a lower value for livenessProbe and readinessProbe to detect unhealthy pods earlier.
  • Monitor the resource usage during high traffic periods and adjust the resource limits accordingly.

@purelind purelind marked this pull request as draft August 30, 2024 10:30
@purelind purelind marked this pull request as ready for review August 30, 2024 10:34
Copy link
Contributor

ti-chi-bot bot commented Sep 10, 2024

I have already done a preliminary review for you, and I hope to help you do a better job.

This pull request updates the tide component of the prow chart. There are two main changes:

  1. The livenessProbe and readinessProbe settings have been modified to wait for 10 minutes after service start before performing the checks, and the period between checks has been increased to 20 seconds. The startupProbe has also been added to perform a check immediately after the service starts, with a failure threshold of 60 seconds and a period of 10 seconds.

  2. The resource limits have been set to 1 core and 4GB to prevent service restarts due to resource shortages.

These changes seem reasonable and should improve the stability of the tide service. However, there are a few potential problems that should be addressed:

  1. The startupProbe failure threshold is quite high at 60 seconds. This means that if the service takes more than 60 seconds to start up, it will be restarted even if it eventually becomes healthy. This might be too aggressive and could lead to unnecessary restarts. A lower threshold, such as 30 seconds, might be more appropriate.

  2. The livenessProbe and readinessProbe settings have been modified to wait for 10 minutes after service start before performing the checks. This might be too long, especially if the service takes a long time to start up. A shorter delay, such as 1 or 2 minutes, might be more appropriate.

  3. The resource limits have been set to 1 core and 4GB, but it's not clear if these are appropriate for the workload. The limits might be too high or too low, depending on the actual resource usage of the service. It would be good to review the service's resource usage and adjust the limits accordingly.

To address these potential problems, I would suggest the following:

  1. Lower the startupProbe failure threshold to 30 seconds.

  2. Reduce the delay before performing the livenessProbe and readinessProbe checks to 1 or 2 minutes.

  3. Review the service's resource usage and adjust the limits accordingly.

Overall, this is a good pull request that should improve the stability of the tide service. However, it's important to review the settings carefully to ensure they are appropriate for the workload.

@wuhuizuo wuhuizuo self-assigned this Sep 10, 2024
@wuhuizuo wuhuizuo force-pushed the purelind/adjust-prow-tide-probe-conf branch from bebc930 to becc5f2 Compare September 10, 2024 07:06
Copy link
Contributor

ti-chi-bot bot commented Sep 10, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from wuhuizuo, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

ti-chi-bot bot commented Sep 10, 2024

I have already done a preliminary review for you, and I hope to help you do a better job.

This pull request makes two changes to the configuration of the Prow Tide component. Firstly, it adjusts the startup, liveness, and readiness probes to delay the checks until ten minutes after the service starts, and to increase the period between checks. Secondly, it sets resource limits to 1 core and 4GB to prevent service restarts due to resource shortages.

There are no potential problems with these changes, but it is worth noting that the modification of the probe settings may make it harder to detect issues with the service. However, this is a reasonable trade-off to prevent unnecessary restarts during high traffic periods.

As for fixing suggestions, it would be helpful to include comments in the values.yaml file to explain the purpose of the resource requests and limits. Additionally, it may be worth considering using a horizontal pod autoscaler to automatically adjust the number of pods in response to changes in traffic.

Signed-off-by: wuhuizuo <wuhuizuo@126.com>
Copy link
Contributor

ti-chi-bot bot commented Sep 10, 2024

I have already done a preliminary review for you, and I hope to help you do a better job.

Based on the pull request title and description, the following changes were made:

  • The probe settings have been modified to perform liveness and readiness checks 10 minutes after the service starts. This is to avoid slow response and pod restarts during high traffic periods.
  • Resource limits have been set to 1 core and 4GB to prevent resource shortages from causing service restarts.

Potential problems:

  • The initialDelaySeconds for the livenessProbe has been increased to 30 seconds which may cause the CI to fail if it takes longer than 30 seconds to start the service.
  • The periodSeconds for the livenessProbe has been decreased to 20 seconds which may cause unnecessary pod restarts if the probes are called too frequently.

Fixing suggestions:

  • Consider increasing the initialDelaySeconds to allow the service more time to start.
  • Consider increasing the periodSeconds for the livenessProbe to avoid unnecessary pod restarts.

Overall, the changes seem reasonable and should help improve the stability of the service during high traffic periods.

@wuhuizuo
Copy link
Collaborator

fixed in #1256

@wuhuizuo wuhuizuo closed this Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants