
Fix cooldown feature to properly manage several instances of emissary #4293

Merged: 3 commits merged into master on Jul 20, 2022

Conversation

knlambert (Contributor) commented Jun 20, 2022

Description

Previously, we added a cooldown step (30s) for metrics to avoid sending too many requests to the Ambassador Cloud API.

However, some values were not consistent, and we discovered that metrics are sent to the agent as one request per Emissary node, not a single request for all nodes.

This means the previous iteration of the code was wrong for two reasons:

  • If we forward the metrics directly to the cloud, we can't identify which node a given metric came from.
  • With a simple 30s cooldown and 3 nodes, we'll miss 2 out of 3 sets of metrics.

As a workaround with minimal changes, this PR changes the logic so the metrics relay handler accumulates metrics in a map keyed by node IP, and flushes them once the cooldown period has elapsed.
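A minimal sketch of the accumulate-then-flush idea described above. All names here are illustrative placeholders, not the actual emissary agent types or signatures:

```go
// Sketch only: hypothetical names, not the real pkg/agent code.
package relay

import (
	"sync"
	"time"
)

// metricsBatch stands in for one node's metrics payload.
type metricsBatch struct {
	nodeIP  string
	payload []byte
}

type metricsRelay struct {
	mu         sync.Mutex
	cooldown   time.Duration // e.g. 30 * time.Second
	nextReport time.Time
	pending    map[string]metricsBatch // latest snapshot per node IP
	send       func([]metricsBatch) error
}

// record stores the latest metrics for a node and, once the cooldown period
// has elapsed, flushes the accumulated snapshots for all nodes in one go.
func (r *metricsRelay) record(b metricsBatch) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Accumulate per node IP so each node's metrics stay identifiable.
	r.pending[b.nodeIP] = b

	now := time.Now()
	if now.Before(r.nextReport) {
		return // still cooling down: just accumulate
	}

	batches := make([]metricsBatch, 0, len(r.pending))
	for _, v := range r.pending {
		batches = append(batches, v)
	}
	// Reset regardless of the send result: these metrics are not critical,
	// and retaining them on failure would only add memory pressure.
	r.pending = make(map[string]metricsBatch, len(batches))
	r.nextReport = now.Add(r.cooldown)

	_ = r.send(batches) // send errors are intentionally not retried
}
```

With 3 nodes each reporting every ~5 seconds, this keeps the outbound rate at one batched request per cooldown window while still attributing each metric to its node.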


Testing

Adapted existing tests and tested the change in a dev cluster.

Checklist

  • I made sure to update CHANGELOG.md.

    Remember, the CHANGELOG needs to mention:

    • Any new features
    • Any changes to our included version of Envoy
    • Any non-backward-compatible changes
    • Any deprecations
  • This is unlikely to impact how Ambassador performs at scale.

    Remember, things that might have an impact at scale include:

    • Any significant changes in memory use that might require adjusting the memory limits
    • Any significant changes in CPU use that might require adjusting the CPU limits
    • Anything that might change how many replicas users should use
    • Changes that impact data-plane latency/scalability
  • My change is adequately tested.

    Remember when considering testing:

    • Your change needs to be specifically covered by tests.
      • Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
    • You also need to make sure that the entire area being changed has adequate test coverage.
      • If existing tests don't actually cover the entire area being changed, add tests.
      • This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
    • We should lean on the bulk of code being covered by unit tests, but...
    • ... an end-to-end test should cover the integration points
  • I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.

  • The changes in this PR have been reviewed for security concerns and adherence to security best practices.

@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch 7 times, most recently from e8fe26c to 5a83d04 on June 21, 2022 13:17
@knlambert knlambert requested a review from alexgervais June 21, 2022 13:40
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch from 5a83d04 to 287be1d on June 21, 2022 13:48
@knlambert knlambert marked this pull request as ready for review June 21, 2022 13:53
@knlambert knlambert changed the base branch from master to release/v2.3 June 22, 2022 13:43
@knlambert knlambert changed the base branch from release/v2.3 to master June 22, 2022 13:45
@knlambert knlambert changed the base branch from master to release/v2.3 June 22, 2022 13:47
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch 2 times, most recently from 597ee5f to a80a73a on June 22, 2022 13:49
alexgervais (Contributor) left a comment

The code LGTM.
@LanceEa is release/v2.3 the correct head branch? Should this target master instead?

LanceEa (Contributor) commented Jun 23, 2022

@knlambert - I'll have some time to look at this more later and think about it, but just from a quick glance:

  1. Does this need to land in the v2.3 branch, as Alex mentioned? I would rather it land in the v3 branch, which is master; then, if need be, we can backport it.
  2. Again, having only taken a few minutes to look at it, it appears that you delete metrics before they have successfully been sent, which means that on errors they will be dropped. Is this the intended behavior? Keeping them around would introduce memory pressure too, so I'm not advocating for any particular solution; I just wanted to understand better 😄.

knlambert (Contributor, Author) commented Jun 27, 2022

@knlambert - I'll have some time to look at this more later and think about it, but just from a quick glance:

  1. Does this need to land in the v2.3 branch, as Alex mentioned? I would rather it land in the v3 branch, which is master; then, if need be, we can backport it.
  2. Again, having only taken a few minutes to look at it, it appears that you delete metrics before they have successfully been sent, which means that on errors they will be dropped. Is this the intended behavior? Keeping them around would introduce memory pressure too, so I'm not advocating for any particular solution; I just wanted to understand better 😄.

@LanceEa

  1. The next one to be published :o) since it's a bug fix.
  2. It's OK; that's not critical data, and we don't want to spam the cloud if the link is broken, or saturate the memory, as you said :)

@knlambert knlambert requested a review from LanceEa June 29, 2022 11:34
@knlambert knlambert changed the base branch from release/v2.3 to master July 6, 2022 13:54
LanceEa (Contributor) left a comment

@knlambert - Overall it looks good. I've left a few style suggestions and a few test cases we should add coverage for.

Once addressed, we can get this merged into master and cherry-picked to the release/v2.3 branch as well.

Review comments were left on: docs/releaseNotes.yml, pkg/agent/agent.go, pkg/agent/agent_metrics_test.go
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch 3 times, most recently from d47e2fe to c121e99 on July 12, 2022 22:37
@knlambert knlambert requested review from LanceEa and alexgervais July 12, 2022 22:46
LanceEa (Contributor) commented Jul 14, 2022

It was added because otherwise it sends at least 3 requests every 5 seconds, which is too much for Ambassador Cloud. Moreover, we only need data every 30s.

I understood that for when we reset the backoff... I was more trying to understand why we initialized it so the value was in the future. My assumption is that it's so the first set of reported values comes 30s after startup, rather than trying to trigger a report right away.
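A tiny standalone illustration (hypothetical names, not the real agent code) of initializing the cooldown deadline in the future, so the first report happens one full cooldown period after startup rather than on the very first incoming metrics request:

```go
// Sketch only: illustrative names, assuming a 30s cooldown.
package cooldown

import "time"

type reporter struct {
	cooldown   time.Duration
	nextReport time.Time
}

func newReporter(cooldown time.Duration) *reporter {
	return &reporter{
		cooldown:   cooldown,
		nextReport: time.Now().Add(cooldown), // deadline starts in the future
	}
}

// shouldReport returns true at most once per cooldown period.
func (r *reporter) shouldReport(now time.Time) bool {
	if now.Before(r.nextReport) {
		return false
	}
	r.nextReport = now.Add(r.cooldown)
	return true
}
```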

LanceEa previously approved these changes Jul 14, 2022

LanceEa (Contributor) left a comment

lgtm

@LanceEa LanceEa requested a review from haq204 July 15, 2022 14:43
alexgervais previously approved these changes Jul 18, 2022

alexgervais (Contributor) left a comment

Let's cherry-pick this onto release/v2.3 as soon as it lands on master. We don't want the changelog to be lying!

haq204 left a comment

Out of curiosity, what's the expected rate of metrics per second?

Would it make sense to set a fixed buffer size and push when it's either full or 30 seconds have elapsed?

knlambert (Contributor, Author) commented
Out of curiosity, what's the expected rate of metrics per second?

Would it make sense to set a fixed buffer size and push when it's either full or 30 seconds have elapsed?

It depends on the number of running Edge Stack instances. With a 3-node setup, it's usually around 3 requests per second, so we would need to know the number of instances to calculate this buffer size.

Moreover, we don't want to push before 30s because it's not necessary and it causes a lot of stress on the system-a API. The 30s value is a necessary cooldown period implemented at the agent level, since we can't configure it from the Envoy metrics sync.
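For comparison, the buffer-based alternative suggested above might look roughly like the following sketch (illustrative names and sizes; this is not what the PR implements):

```go
// Sketch of the "flush on full buffer or timer" alternative; the buffer size
// and channel-based plumbing are assumptions, not the actual agent design.
package altflush

import "time"

type sample []byte

func run(in <-chan sample, flush func([]sample)) {
	const maxBuffer = 64 // would need to scale with the number of instances
	buf := make([]sample, 0, maxBuffer)
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case s, ok := <-in:
			if !ok {
				flush(buf)
				return
			}
			buf = append(buf, s)
			if len(buf) >= maxBuffer {
				flush(buf)
				buf = buf[:0]
			}
		case <-ticker.C:
			if len(buf) > 0 {
				flush(buf)
				buf = buf[:0]
			}
		}
	}
}
```

As the reply above explains, flushing whenever the buffer fills would push before the 30s cooldown, which is exactly the load the agent-level cooldown is meant to avoid.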

@knlambert knlambert dismissed stale reviews from alexgervais and LanceEa via 30f9c04 July 19, 2022 18:20
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch from c121e99 to 30f9c04 on July 19, 2022 18:20
LanceEa previously approved these changes Jul 19, 2022

LanceEa (Contributor) left a comment

lgtm...now that CI is green and merge conflicts are all good

@LanceEa LanceEa requested a review from alexgervais July 20, 2022 11:13
LanceEa (Contributor) commented Jul 20, 2022

Let's squash the commits and clean up the commit message when we merge, since they are all related.

3 commits added, each signed off by: Kévin Lambert <kevin.lambert.ca@gmail.com>
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch from 30f9c04 to 45c3df1 on July 20, 2022 13:26
@knlambert knlambert merged commit a73be1c into master Jul 20, 2022
@knlambert knlambert deleted the knlambert/perform-metrics-cooldown-per-node branch July 20, 2022 14:30
knlambert added a commit that referenced this pull request Jul 21, 2022
…#4293)

* Fix cooldown feature to properly manage several instances of emissary

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Apply style guidelines

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Add tests for edge cases

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>
knlambert added a commit that referenced this pull request Jul 21, 2022
…#4293)

* Fix cooldown feature to properly manage several instances of emissary

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Apply style guidelines

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Add tests for edge cases

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>
(cherry picked from commit a73be1c)
knlambert added a commit that referenced this pull request Jul 22, 2022
…#4293) (#4343)

* Fix cooldown feature to properly manage several instances of emissary

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Apply style guidelines

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Add tests for edge cases

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>
(cherry picked from commit a73be1c)