Insights Rapid Recommendations proposal #1569
Conversation
Force-pushed from 5b85ec9 to 53f6298
This enables fast recovery time in case a remote configuration change caused issues and had to be reverted.
The frequency is the same as with the [conditional gathering configuration](../insights/conditional-data-gathering.md) (since OCP 4.10).
40k connected clusters requesting a remote configuration update every two hours mean 11 requests per second on average.
Do we use something like cronjobs (which would synchronize every 2 hrs), or do we base it on something that will smear the requests over time?
No, the Insights Operator does not use cronjobs. The interval is basically measured from the start of the Insights Operator process.
The previous sentence about conditional gathering was meant to say that the same system is already in use and is not causing any issues (none that we know of anyway). We wanted to quantify the traffic explicitly because this proposal will increase the remote configuration service complexity. There are no inherent traffic spikes and this proposal doesn't add any.
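A minimal sketch of that behaviour, assuming a plain ticker measured from process start; the helper name, interval wiring, and gather callback here are illustrative, not the operator's actual code. Because every cluster's operator process starts at a different time, the fleet's requests smear naturally instead of synchronizing on a wall-clock schedule.

```go
package main

import (
	"fmt"
	"time"
)

// runGatherLoop sketches an interval measured from process start rather than
// wall-clock aligned (no cron): each cluster's operator starts at a different
// time, so requests from the fleet are naturally spread out.
func runGatherLoop(interval time.Duration, gather func()) {
	gather() // initial run at startup
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		gather()
	}
}

func main() {
	// Illustrative only: the real operator uses its own controller machinery,
	// not this helper; the 2h period matches the interval discussed above.
	runGatherLoop(2*time.Hour, func() {
		fmt.Println("fetching remote configuration and gathering data at", time.Now())
	})
}
```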
ok
6. Insights Operators in connected clusters download the new remote configuration
   and start collecting the new data.
7. The data requester makes use of the new data in newly incoming Insights Operator archives.
I think the workflow should also include steps to decommission/drop particular data points that have been requested and included. Otherwise, we will keep increasing the amount of data gathered forever.
Forgot to answer this earlier (we discussed it separately). The problem already exists today. This proposal should improve the situation.
The principle stays the same: As long as we are not hitting any limits, we can keep collecting data. When we start hitting any limits, we will need to find a solution (drop some data points or increase the limit).
Our current ability to drop data points is somewhat limited. Once Insights Operator adds a data point, all consumers are invited to start using that data point. We don't know which consumer needs which data points.
With this feature, we will encourage consumers to specify a complete list of data points they need. Duplicate/overlapping data requests will be resolved by the feature (see here). We will know better who to reach out to when we need to drop some data points.
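A rough sketch of how overlapping requests could be collapsed while still remembering which consumer asked for what, so there is someone to contact before a data point is dropped. The LogRequest shape and field names are assumptions for illustration, not the proposal's actual remote-configuration format.

```go
package main

import (
	"fmt"
	"sort"
)

// LogRequest is a hypothetical shape for a consumer's data request.
type LogRequest struct {
	Namespace string
	PodRegex  string
	Consumer  string
}

// mergeRequests collapses duplicate (namespace, pod) requests while keeping
// every consumer that asked for them.
func mergeRequests(reqs []LogRequest) map[string][]string {
	merged := map[string][]string{}
	for _, r := range reqs {
		key := r.Namespace + "/" + r.PodRegex
		merged[key] = append(merged[key], r.Consumer)
	}
	return merged
}

func main() {
	reqs := []LogRequest{
		{"openshift-monitoring", "prometheus-.*", "team-a"},
		{"openshift-monitoring", "prometheus-.*", "team-b"},
		{"openshift-etcd", "etcd-.*", "team-a"},
	}
	merged := mergeRequests(reqs)
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "requested by", merged[k])
	}
}
```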
This proposal does not require any specific details to account for HyperShift.
For now, we can simply extend the remote configuration file to request gathering of specific HyperShift resources.
How does this work, if you identify the (namespace, pod, container) you want logs from, but on HyperShift that controller happens to live on the management cluster? Where in the pipeline does the request get shared between where-that-controller-lives-in-standalone vs. where-that-controller-lives-in-hosted?
So the Insights Operator is installed on both the management and the worker cluster. I think the idea is simply: if the data is available in the cluster, collect it; otherwise provide information (in the archive) about why it was not collected.
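A hedged sketch of that "collect it if it exists here, otherwise record why it was skipped" behaviour; the function names, the lookup callback, and the archive-note wording are purely illustrative, not the operator's actual gatherer code.

```go
package main

import "fmt"

// gatherIfPresent collects data when the target exists in this cluster, and
// otherwise returns a note explaining the gap, which would end up in the
// archive. lookup stands in for a real API call against whichever cluster
// (management or hosted) the operator runs in.
func gatherIfPresent(namespace, pod string, lookup func(ns, pod string) (string, bool)) (data, note string) {
	if logs, ok := lookup(namespace, pod); ok {
		return logs, ""
	}
	return "", fmt.Sprintf("%s/%s not found in this cluster; it likely runs on the other side of the HyperShift topology", namespace, pod)
}

func main() {
	// A hosted cluster that does not run the controller pods locally.
	lookup := func(ns, pod string) (string, bool) { return "", false }
	_, note := gatherIfPresent("openshift-kube-controller-manager", "kube-controller-manager-0", lookup)
	fmt.Println("archive note:", note)
}
```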
The Insights Operator will add two new conditions to the ClusterOperator resource:

* `RemoteConfigurationUnavailable` will indicate whether the Insights Operator can access
  the configured remote configuration endpoint (host unreachable or HTTP errors).
How does this get bubbled up to cluster admins, e.g. if they set up firewall rules that block or garble access? Do we not complain, because we assume that if config-retrieval fails, archive-upload will also fail? Or do we complain via Degraded=True and/or an alert, so the cluster-admin can select between "explicitly disable gathers" or "fix the networking issue"?
Good question. Let's not assume anything. :) I think the scenario will be as follows (and yes, it's probably a good idea to describe it explicitly in the proposal):
- The operator cannot connect to the remote config endpoint -> it uses the default fallback definition hardcoded in the binary.
- Data gathering runs and there's an attempt to upload the archive (from that point on it's the same as today).
- If the upload fails (after a few retries), then Degraded=True and it's up to the cluster-admin to either disable it (i.e. no data, no upload) or configure the network (there's still a risk that only e.g. the upload is allowed, so perhaps a new Insights recommendation then?).
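A minimal sketch of the fallback path described in the first bullet, assuming a plain HTTP fetch with an embedded default. The default payload, the timeout, and the version segment in the URL are assumptions; the real operator wires this into its own configuration handling and Degraded reporting.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// defaultConfig stands in for the gathering definition hardcoded in the
// operator binary.
const defaultConfig = `{"version":"1.0","container_logs":[]}`

// loadRemoteConfig tries the remote endpoint and, on any error, falls back to
// the embedded default so gathering still runs and the archive upload (and
// any Degraded reporting) proceeds as it does today.
func loadRemoteConfig(url string) string {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("remote config unreachable, using built-in default:", err)
		return defaultConfig
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Println("remote config returned", resp.Status, "- using built-in default")
		return defaultConfig
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return defaultConfig
	}
	return string(body)
}

func main() {
	// "4.16" is an illustrative version value for the %s placeholder.
	cfg := loadRemoteConfig("https://console.redhat.com/api/gathering/v2/4.16/gathering_rules.json")
	fmt.Println("using config:", cfg)
}
```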
with a new one:

```bash
https://console.redhat.com/api/gathering/v2/%s/gathering_rules.json
```
The Monitoring team is considering a proposal for a very similar mechanism for telemetry metrics. Is it feasible to namespace the endpoint further? E.g. https://console.redhat.com/api/remote_config/gathering/v2/%s/gathering_rules.json
A telemetry endpoint could then use https://console.redhat.com/api/remote_config/telemetry or similar.
Yes good point. I think we should be able to do this easily.
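For illustration, this is how the `%s` placeholder would be filled and how a namespaced variant could sit alongside the current one; the "4.16" value and the exact namespaced path are assumptions, not a committed API.

```go
package main

import "fmt"

func main() {
	// The placeholder is assumed to carry the cluster version here.
	version := "4.16"
	current := fmt.Sprintf("https://console.redhat.com/api/gathering/v2/%s/gathering_rules.json", version)
	namespaced := fmt.Sprintf("https://console.redhat.com/api/remote_config/gathering/v2/%s/gathering_rules.json", version)
	fmt.Println("current: ", current)
	fmt.Println("proposed:", namespaced)
}
```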
* [SchedulerLogs](https://github.com/openshift/insights-operator/blob/master/docs/gathered-data.md#schedulerlogs)

### Risks and Mitigations
I think it's worth calling out adversarial risks here too. For example, a MITM attack seems worth considering and recording, even if the mitigation is "we control console.redhat.com and trust the existing certificate infrastructure".
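As a sketch of that mitigation, Go's default HTTP client already verifies the server certificate against the system trust store; the snippet below only makes the TLS floor explicit and is not the operator's actual client setup.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// The default transport verifies console.redhat.com's certificate against
	// the system trust store; disabling verification is what must never happen
	// for the remote-config fetch. Pinning a CA pool would be optional hardening.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		},
	}
	resp, err := client.Get("https://console.redhat.com")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("TLS verified, status:", resp.Status)
}
```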
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close.

/lifecycle stale
/remove-lifecycle stale
Force-pushed from 9403533 to 67ec433
Commits:
- first proposal
- answer some question and update the versioning section
- Add a dedicated "Data Redaction/Obfuscation" section (I believe this aspect deserves a dedicated section. For the moment, I added some notes on data redaction/obfuscation capabilities that I think the solution must/should/could have.)
- some next updates and simple test plan section
- Add stub section on air-gapped clusters
- Move (incorporate) "air-gapped clusters" into "risks and mitigations"
- fill in the graduation criteria
- postpone the problem of the Prometheus metrics data & Removing a deprecated feature section
- Fill in drawbacks sections & update metadata
- Change wording from "external" to "remote" and start Upgrade/Downgrade section
- next sections
- failurer modes section
- implementation history
- support procedures section
- add two sections about the data limitations
- questions update
- Cleaned up Motivation section
- Incorporate feedback on the Motivation section
- Updated first part of "Proposal" section
- address one TODO in the Risk & mitigations section
- udpate tittle
- More proposal changes (and todo items)
- Assign todos for the next iterations
- Add "Status Reporting and Monitoring" section, updated corresponding risks and mitigations
- Added air-gapped and disconnected clusters to non-goals
- addressing some TODOs
- remove sentence about moving existing container log data in the archive
- Better examples of aggregated data gatherers
- Updated "Insights Archive Structure Changes" and removed corresponding todo
- Updated "Limits on Processed and Gathered Data"
- Fix minor issues in first few sections
- Resolving and removing todos
- Update "Alternatives" and "Required Infrastructure"
- Update "Graduation Criteria" to focus on container logs
- Open question about limits relative to number of nodes
- Mention canary rollouts as an option
- More structured monitoring
- Resolve last todos
- fix some typos & minor changes
- Move notes about remote config request frequency
Force-pushed from 8f91bf9 to 9ee9e27
@tremes: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
an API that resolves my concerns. thanks.

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/lgtm
No description provided.