
Fix cooldown feature to properly manage several instances of emissary #4293

Merged: 3 commits merged into master on Jul 20, 2022

Conversation

knlambert (Contributor) commented Jun 20, 2022

Description

Previously, we added a cooldown step (30s) for metrics to avoid sending too many requests to the Ambassador Cloud API.

However, some values were not consistent, and we discovered that metrics are sent to the agent as one request per Emissary node, not a single request for all nodes.

This means the previous iteration of the code was wrong for two reasons:

  • If we forward the metrics directly to the cloud, we can't identify which node a given metric came from.
  • With a simple 30s cooldown and 3 nodes, we'll miss 2 out of 3 sets of metrics.

As a workaround with minimal changes, this PR changes the logic so the metrics relay handler accumulates metrics in a map keyed by node IP, and flushes them once the cooldown period has elapsed.
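A minimal sketch of the accumulate-then-flush idea described above. All names here are illustrative placeholders, not the actual emissary agent types or signatures:

```go
// Sketch only: hypothetical names, not the real pkg/agent code.
package relay

import (
	"sync"
	"time"
)

// metricsBatch stands in for one node's metrics payload.
type metricsBatch struct {
	nodeIP  string
	payload []byte
}

type metricsRelay struct {
	mu         sync.Mutex
	cooldown   time.Duration // e.g. 30 * time.Second
	nextReport time.Time
	pending    map[string]metricsBatch // latest snapshot per node IP
	send       func([]metricsBatch) error
}

// record stores the latest metrics for a node and, once the cooldown period
// has elapsed, flushes the accumulated snapshots for all nodes in one go.
func (r *metricsRelay) record(b metricsBatch) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Accumulate per node IP so each node's metrics stay identifiable.
	r.pending[b.nodeIP] = b

	now := time.Now()
	if now.Before(r.nextReport) {
		return // still cooling down: just accumulate
	}

	batches := make([]metricsBatch, 0, len(r.pending))
	for _, v := range r.pending {
		batches = append(batches, v)
	}
	// Reset regardless of the send result: these metrics are not critical,
	// and retaining them on failure would only add memory pressure.
	r.pending = make(map[string]metricsBatch, len(batches))
	r.nextReport = now.Add(r.cooldown)

	_ = r.send(batches) // send errors are intentionally not retried
}
```

With 3 nodes each reporting every ~5 seconds, this keeps the outbound rate at one batched request per cooldown window while still attributing each metric to its node.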


Testing

Adapted existing tests and tested the change in a dev cluster.

Checklist

  • I made sure to update CHANGELOG.md.

    Remember, the CHANGELOG needs to mention:

    • Any new features
    • Any changes to our included version of Envoy
    • Any non-backward-compatible changes
    • Any deprecations
  • This is unlikely to impact how Ambassador performs at scale.

    Remember, things that might have an impact at scale include:

    • Any significant changes in memory use that might require adjusting the memory limits
    • Any significant changes in CPU use that might require adjusting the CPU limits
    • Anything that might change how many replicas users should use
    • Changes that impact data-plane latency/scalability
  • My change is adequately tested.

    Remember when considering testing:

    • Your change needs to be specifically covered by tests.
      • Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
    • You also need to make sure that the entire area being changed has adequate test coverage.
      • If existing tests don't actually cover the entire area being changed, add tests.
      • This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
    • We should lean on the bulk of code being covered by unit tests, but...
    • ... an end-to-end test should cover the integration points
  • I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.

  • The changes in this PR have been reviewed for security concerns and adherence to security best practices.

@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch 7 times, most recently from e8fe26c to 5a83d04 on June 21, 2022 13:17
@knlambert knlambert requested a review from alexgervais June 21, 2022 13:40
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch from 5a83d04 to 287be1d on June 21, 2022 13:48
@knlambert knlambert marked this pull request as ready for review June 21, 2022 13:53
@knlambert knlambert changed the base branch from master to release/v2.3 June 22, 2022 13:43
@knlambert knlambert changed the base branch from release/v2.3 to master June 22, 2022 13:45
@knlambert knlambert changed the base branch from master to release/v2.3 June 22, 2022 13:47
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch 2 times, most recently from 597ee5f to a80a73a on June 22, 2022 13:49
alexgervais (Contributor) left a comment

The code LGTM.
@LanceEa is release/v2.3 the correct head branch? Should this target master instead?

LanceEa (Contributor) commented Jun 23, 2022

@knlambert - I'll have some time to look at this more later and think about it, but just from a quick glance:

  1. Does this need to land in the v2.3 branch, as Alex mentioned? I would rather it land in the v3 branch, which is master; then, if need be, we can backport it.
  2. Again, having only taken a few minutes to look at it, it appears that you delete metrics before they have successfully been sent, which means that on errors they will be dropped. Is this the intended behavior? Keeping them around would introduce memory pressure too, so I'm not advocating for any particular solution; I just wanted to understand better 😄.

knlambert (Contributor, Author) commented Jun 27, 2022

@knlambert - I'll have some time to look at this more later and think about it, but just from a quick glance:

  1. Does this need to land in the v2.3 branch, as Alex mentioned? I would rather it land in the v3 branch, which is master; then, if need be, we can backport it.
  2. Again, having only taken a few minutes to look at it, it appears that you delete metrics before they have successfully been sent, which means that on errors they will be dropped. Is this the intended behavior? Keeping them around would introduce memory pressure too, so I'm not advocating for any particular solution; I just wanted to understand better 😄.

@LanceEa

  1. The next one to be published :o) since it's a bug fix.
  2. It's OK; that's not critical data, and we don't want to spam the cloud if the link is broken, or saturate the memory, as you said :)

@knlambert knlambert requested a review from LanceEa June 29, 2022 11:34
@knlambert knlambert changed the base branch from release/v2.3 to master July 6, 2022 13:54
LanceEa (Contributor) left a comment

@knlambert - Overall it looks good. I've left a few style suggestions and a few test cases we should add coverage for.

Once addressed, we can get this merged into master and cherry-picked to the release/v2.3 branch as well.

Review comments were left on: docs/releaseNotes.yml, pkg/agent/agent.go, pkg/agent/agent_metrics_test.go
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch 3 times, most recently from d47e2fe to c121e99 on July 12, 2022 22:37
@knlambert knlambert requested review from LanceEa and alexgervais July 12, 2022 22:46
LanceEa (Contributor) commented Jul 14, 2022

It was added because otherwise it sends at least 3 requests every 5 seconds, which is too much for Ambassador Cloud. Moreover, we only need data every 30s.

I understood that for when we reset the backoff... I was more trying to understand why we initialized it so the value was in the future. My assumption is that it's so the first set of reported values comes 30s after startup, rather than trying to trigger a report right away.
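A tiny standalone illustration (hypothetical names, not the real agent code) of initializing the cooldown deadline in the future, so the first report happens one full cooldown period after startup rather than on the very first incoming metrics request:

```go
// Sketch only: illustrative names, assuming a 30s cooldown.
package cooldown

import "time"

type reporter struct {
	cooldown   time.Duration
	nextReport time.Time
}

func newReporter(cooldown time.Duration) *reporter {
	return &reporter{
		cooldown:   cooldown,
		nextReport: time.Now().Add(cooldown), // deadline starts in the future
	}
}

// shouldReport returns true at most once per cooldown period.
func (r *reporter) shouldReport(now time.Time) bool {
	if now.Before(r.nextReport) {
		return false
	}
	r.nextReport = now.Add(r.cooldown)
	return true
}
```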

LanceEa previously approved these changes Jul 14, 2022

LanceEa (Contributor) left a comment

lgtm

@LanceEa LanceEa requested a review from haq204 July 15, 2022 14:43
alexgervais previously approved these changes Jul 18, 2022

alexgervais (Contributor) left a comment

Let's cherry-pick this onto release/v2.3 as soon as it lands on master. We don't want the changelog to be lying!

haq204 left a comment

Out of curiosity, what's the expected rate of metrics per second?

Would it make sense to set a fixed buffer size and push when it's either full or 30 seconds have elapsed?

knlambert (Contributor, Author) commented
Out of curiosity, what's the expected rate of metrics per second?

Would it make sense to set a fixed buffer size and push when it's either full or 30 seconds have elapsed?

It depends on the number of running Edge Stack instances. With a 3-node setup, it's usually around 3 requests per second, so we would need to know the number of instances to calculate this buffer size.

Moreover, we don't want to push before 30s because it's not necessary and it causes a lot of stress on the system-a API. The 30s value is a necessary cooldown period implemented at the agent level, since we can't configure it from the Envoy metrics sync.
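For comparison, the buffer-based alternative suggested above might look roughly like the following sketch (illustrative names and sizes; this is not what the PR implements):

```go
// Sketch of the "flush on full buffer or timer" alternative; the buffer size
// and channel-based plumbing are assumptions, not the actual agent design.
package altflush

import "time"

type sample []byte

func run(in <-chan sample, flush func([]sample)) {
	const maxBuffer = 64 // would need to scale with the number of instances
	buf := make([]sample, 0, maxBuffer)
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case s, ok := <-in:
			if !ok {
				flush(buf)
				return
			}
			buf = append(buf, s)
			if len(buf) >= maxBuffer {
				flush(buf)
				buf = buf[:0]
			}
		case <-ticker.C:
			if len(buf) > 0 {
				flush(buf)
				buf = buf[:0]
			}
		}
	}
}
```

As the reply above explains, flushing whenever the buffer fills would push before the 30s cooldown, which is exactly the load the agent-level cooldown is meant to avoid.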

@knlambert knlambert dismissed stale reviews from alexgervais and LanceEa via 30f9c04 July 19, 2022 18:20
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch from c121e99 to 30f9c04 on July 19, 2022 18:20
LanceEa previously approved these changes Jul 19, 2022

LanceEa (Contributor) left a comment

lgtm...now that CI is green and merge conflicts are all good

@LanceEa LanceEa requested a review from alexgervais July 20, 2022 11:13
LanceEa (Contributor) commented Jul 20, 2022

Let's squash the commits and clean up the commit message when we merge, since they are all related.

3 commits added, each signed off by: Kévin Lambert <kevin.lambert.ca@gmail.com>
@knlambert knlambert force-pushed the knlambert/perform-metrics-cooldown-per-node branch from 30f9c04 to 45c3df1 on July 20, 2022 13:26
@knlambert knlambert merged commit a73be1c into master Jul 20, 2022
@knlambert knlambert deleted the knlambert/perform-metrics-cooldown-per-node branch July 20, 2022 14:30
knlambert added a commit that referenced this pull request Jul 21, 2022
…#4293)

* Fix cooldown feature to properly manage several instances of emissary

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Apply style guidelines

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Add tests for edge cases

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>
knlambert added a commit that referenced this pull request Jul 21, 2022
…#4293)

* Fix cooldown feature to properly manage several instances of emissary

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Apply style guidelines

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Add tests for edge cases

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>
(cherry picked from commit a73be1c)
knlambert added a commit that referenced this pull request Jul 22, 2022
…#4293) (#4343)

* Fix cooldown feature to properly manage several instances of emissary

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Apply style guidelines

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>

* Add tests for edge cases

Signed-off-by: Kévin Lambert <kevin.lambert.ca@gmail.com>
(cherry picked from commit a73be1c)