
Insights Rapid Recommendations proposal #1569

Merged

Conversation

Contributor

@tremes tremes commented Feb 22, 2024

No description provided.

@openshift-ci openshift-ci bot requested review from dmage and zaneb February 22, 2024 12:59
@tremes tremes force-pushed the rapid_recommendations_squashed branch 2 times, most recently from 5b85ec9 to 53f6298 (February 22, 2024 13:26)
This enables a fast recovery time in case a remote configuration change causes issues and has to be reverted.
The frequency is the same as with the
[conditional gathering configuration](../insights/conditional-data-gathering.md) (since OCP 4.10).
40k connected clusters requesting a remote configuration update every two hours mean roughly 5.6 requests per second on average (40,000 requests spread over 7,200 seconds).
Contributor

Do we use something like cronjobs (which would synchronize the whole fleet every 2 hours), or is it based on something that smears the requests over time?

No, the Insights Operator does not use cronjobs. The interval is basically measured from the start of the Insights Operator process.

The previous sentence about conditional gathering was meant to say that the same system is already in use and is not causing any issues (none that we know of anyway). We wanted to quantify the traffic explicitly because this proposal will increase the remote configuration service complexity. There are no inherent traffic spikes and this proposal doesn't add any.

Contributor

ok

6. Insights Operators in connected clusters download the new remote configuration
and start collecting the new data.
7. The data requester makes use of the new data in newly incoming Insights Operator archives.

I think the workflow should also include steps to decommission/drop particular data that has been requested and included. Otherwise, we will keep increasing the amount of gathered data forever.

Forgot to answer this earlier (we discussed it separately). The problem already exists today. This proposal should improve the situation.

The principle stays the same: As long as we are not hitting any limits, we can keep collecting data. When we start hitting any limits, we will need to find a solution (drop some data points or increase the limit).

Our current ability to drop data points is somewhat limited. Once the Insights Operator adds a data point, all consumers are invited to start using that data point. We don't know which consumer needs which data points.

With this feature, we will encourage consumers to specify a complete list of the data points they need. Duplicate/overlapping data requests will be resolved by the feature (see here). We will know better who to reach out to when we need to drop some data points.

This proposal does not require any specific details to account for HyperShift.
For now, we can simply extend the remote configuration file to
request gathering of specific HyperShift resources.
Member

How does this work, if you identify the (namespace, pod, container) you want logs from, but on HyperShift that controller happens to live on the management cluster? Where in the pipe does the request get shared between where-that-controller-lives-in-standalone vs. where-that-controller-lives-in-hosted?

Contributor Author

So the Insights Operator is installed on both the management and the worker cluster. I think the idea is simple: if the data is available in the cluster, collect it; otherwise provide information (in the archive) about why it was not collected.

The Insights Operator will add two new conditions to the ClusterOperator resource:

* `RemoteConfigurationUnavailable` will indicate whether the Insights Operator can access
the configured remote configuration endpoint (host unreachable or HTTP errors).
Member

@wking wking Mar 13, 2024

How does this get bubbled up to cluster admins, e.g. if they set up firewall rules that block or garble access? Do we not complain, because we assume that if config-retrieval fails, archive-upload will also fail? Or do we complain via Degraded=True and/or an alert, so the cluster-admin can select between "explicitly disable gathers" or "fix the networking issue"?

Contributor Author

Good question. Let's not assume anything. :) I think the scenario will be as follows (and yes, it's probably a good idea to describe it explicitly in the proposal):

  • the operator cannot connect to the remote config endpoint -> it uses the default fallback definition hardcoded in the binary (see the sketch after this list)
  • data gathering runs and there is an attempt to upload the archive (from this point on it's the same as today)
  • if the upload fails (after a few retries), then Degraded=True and it's up to the cluster admin to either disable Insights data gathering (i.e. no data, no upload) or fix the network configuration (there is still a risk that e.g. only the upload is allowed, so perhaps a new Insights recommendation then?)

with a new one:

```bash
https://console.redhat.com/api/gathering/v2/%s/gathering_rules.json
```
Contributor

The Monitoring team is considering a proposal for a very similar mechanism for telemetry metrics. Is it feasible to namespace the endpoint further? E.g. https://console.redhat.com/api/remote_config/gathering/v2/%s/gathering_rules.json
A telemetry endpoint could then use https://console.redhat.com/api/remote_config/telemetry or similar.

Contributor Author

Yes good point. I think we should be able to do this easily.

* [SchedulerLogs](https://github.com/openshift/insights-operator/blob/master/docs/gathered-data.md#schedulerlogs)


### Risks and Mitigations
Contributor

I think it's worth calling out adversarial risks here too. For example, a MITM attack seems worth considering and recording, even if the mitigation is "we control console.redhat.com and trust the existing certificate infrastructure".

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2024
@tremes
Contributor Author

tremes commented Apr 23, 2024

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2024
@tremes tremes force-pushed the rapid_recommendations_squashed branch from 9403533 to 67ec433 (April 26, 2024 06:01)
first proposal

answer some question and update the versioning section

Add a dedicated "Data Redaction/Obfuscation" section

I believe this aspect deserves a dedicated section. For the moment,
I added some notes on data redaction/obfuscation capabilities that
I think the solution must/should/could have.

some next updates and simple test plan section

Add stub section on air-gapped clusters

Move (incorporate) "air-gapped clusters" into "risks and mitigations"

fill in the graduation criteria

postpone the problem of the Prometheus metrics data & Removing a deprecated feature section

Fill in drawbacks sections & update metadata

Change wording from "external" to "remote" and start Upgrade/Downgrade section

next sections

failure modes section

implementation history

support procedures section

add two sections about the data limitations

questions update

Cleaned up Motivation section

Incorporate feedback on the Motivation section

Updated first part of "Proposal" section

address one TODO in the Risk & mitigations section

update title

More proposal changes (and todo items)

Assign todos for the next iterations

Add "Status Reporting and Monitoring" section, updated corresponding risks and mitigations

Added air-gapped and disconnected clusters to non-goals

addressing some TODOs

remove sentence about moving existing container log data in the archive

Better examples of aggregated data gatherers

Updated "Insights Archive Structure Changes" and removed corresponding todo

Updated "Limits on Processed and Gathered Data"

Fix minor issues in first few sections

Resolving and removing todos

Update "Alternatives" and "Required Infrastructure"

Update "Graduation Criteria" to focus on container logs

Open question about limits relative to number of nodes

Mention canary rollouts as an option

More structured monitoring

Resolve last todos

fix some typos & minor changes

Move notes about remote config request frequency
@tremes tremes force-pushed the rapid_recommendations_squashed branch from 8f91bf9 to 9ee9e27 (May 7, 2024 11:10)
Contributor

openshift-ci bot commented May 7, 2024

@tremes: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@deads2k
Contributor

deads2k commented May 7, 2024

an API that

  1. had a hard coded backup
  2. accommodated (though discouraged) schema changes in z-streams
  3. solved the need for specific matching for z-streams
  4. avoided the fleet synchronizing on refresh times

resolves my concerns. thanks.

/approve

Contributor

openshift-ci bot commented May 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2024
@jholecek-rh

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit e24b3c7 into openshift:master May 9, 2024
2 checks passed