Insights Rapid Recommendations proposal #1569
Conversation
Force-pushed from 5b85ec9 to 53f6298
This enables fast recovery time in case a remote configuration change caused issues and had to be reverted.
The frequency is the same as with the [conditional gathering configuration](../insights/conditional-data-gathering.md) (since OCP 4.10).
40k connected clusters requesting a remote configuration update every two hours mean 11 requests per second on average.
Do we use something like cronjobs (which would synchronize every 2 hrs), or do we base it on something that will smear the requests over time?
No, the Insights Operator does not use cronjobs. The interval is basically measured from the start of the Insights Operator process.
The previous sentence about conditional gathering was meant to say that the same system is already in use and is not causing any issues (none that we know of anyway). We wanted to quantify the traffic explicitly because this proposal will increase the remote configuration service complexity. There are no inherent traffic spikes and this proposal doesn't add any.
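A minimal sketch of that behaviour, assuming a plain ticker measured from process start; the helper name, interval wiring, and gather callback here are illustrative, not the operator's actual code. Because every cluster's operator process starts at a different time, the fleet's requests smear naturally instead of synchronizing on a wall-clock schedule.

```go
package main

import (
	"fmt"
	"time"
)

// runGatherLoop sketches an interval measured from process start rather than
// wall-clock aligned (no cron): each cluster's operator starts at a different
// time, so requests from the fleet are naturally spread out.
func runGatherLoop(interval time.Duration, gather func()) {
	gather() // initial run at startup
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		gather()
	}
}

func main() {
	// Illustrative only: the real operator uses its own controller machinery,
	// not this helper; the 2h period matches the interval discussed above.
	runGatherLoop(2*time.Hour, func() {
		fmt.Println("fetching remote configuration and gathering data at", time.Now())
	})
}
```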
ok
6. Insights Operators in connected clusters download the new remote configuration
   and start collecting the new data.
7. The data requester makes use of the new data in newly incoming Insights Operator archives.
I think the workflow should also include steps to decommission/drop particular data points that have been requested and included. Otherwise, we will keep increasing the amount of data gathered forever.
Forgot to answer this earlier (we discussed it separately). The problem already exists today. This proposal should improve the situation.
The principle stays the same: As long as we are not hitting any limits, we can keep collecting data. When we start hitting any limits, we will need to find a solution (drop some data points or increase the limit).
Our current ability to drop data points is somewhat limited. Once Insights Operator adds a data point, all consumers are invited to start using that data point. We don't know which consumer needs which data points.
With this feature, we will encourage consumers to specify a complete list of data points they need. Duplicate/overlapping data requests will be resolved by the feature (see here). We will know better who to reach out to when we need to drop some data points.
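A rough sketch of how overlapping requests could be collapsed while still remembering which consumer asked for what, so there is someone to contact before a data point is dropped. The LogRequest shape and field names are assumptions for illustration, not the proposal's actual remote-configuration format.

```go
package main

import (
	"fmt"
	"sort"
)

// LogRequest is a hypothetical shape for a consumer's data request.
type LogRequest struct {
	Namespace string
	PodRegex  string
	Consumer  string
}

// mergeRequests collapses duplicate (namespace, pod) requests while keeping
// every consumer that asked for them.
func mergeRequests(reqs []LogRequest) map[string][]string {
	merged := map[string][]string{}
	for _, r := range reqs {
		key := r.Namespace + "/" + r.PodRegex
		merged[key] = append(merged[key], r.Consumer)
	}
	return merged
}

func main() {
	reqs := []LogRequest{
		{"openshift-monitoring", "prometheus-.*", "team-a"},
		{"openshift-monitoring", "prometheus-.*", "team-b"},
		{"openshift-etcd", "etcd-.*", "team-a"},
	}
	merged := mergeRequests(reqs)
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "requested by", merged[k])
	}
}
```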
This proposal does not require any specific details to account for HyperShift.
For now, we can simply extend the remote configuration file to request gathering of specific HyperShift resources.
How does this work, if you identify the (namespace, pod, container) you want logs from, but on HyperShift that controller happens to live on the management cluster? Where in the pipeline does the request get shared between where-that-controller-lives-in-standalone vs. where-that-controller-lives-in-hosted?
So the Insights Operator is installed on both the management and the worker cluster. I think the idea is simply: if the data is available in the cluster, collect it; otherwise provide information (in the archive) about why it was not collected.
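A hedged sketch of that "collect it if it exists here, otherwise record why it was skipped" behaviour; the function names, the lookup callback, and the archive-note wording are purely illustrative, not the operator's actual gatherer code.

```go
package main

import "fmt"

// gatherIfPresent collects data when the target exists in this cluster, and
// otherwise returns a note explaining the gap, which would end up in the
// archive. lookup stands in for a real API call against whichever cluster
// (management or hosted) the operator runs in.
func gatherIfPresent(namespace, pod string, lookup func(ns, pod string) (string, bool)) (data, note string) {
	if logs, ok := lookup(namespace, pod); ok {
		return logs, ""
	}
	return "", fmt.Sprintf("%s/%s not found in this cluster; it likely runs on the other side of the HyperShift topology", namespace, pod)
}

func main() {
	// A hosted cluster that does not run the controller pods locally.
	lookup := func(ns, pod string) (string, bool) { return "", false }
	_, note := gatherIfPresent("openshift-kube-controller-manager", "kube-controller-manager-0", lookup)
	fmt.Println("archive note:", note)
}
```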
The Insights Operator will add two new conditions to the ClusterOperator resource:

* `RemoteConfigurationUnavailable` will indicate whether the Insights Operator can access
  the configured remote configuration endpoint (host unreachable or HTTP errors).
How does this get bubbled up to cluster admins, e.g. if they set up firewall rules that block or garble access? Do we not complain, because we assume that if config-retrieval fails, archive-upload will also fail? Or do we complain via Degraded=True and/or an alert, so the cluster-admin can select between "explicitly disable gathers" or "fix the networking issue"?
Good question. Let's not assume anything. :) I think the scenario will be as follows (and yes, it's probably a good idea to describe it explicitly in the proposal):
- The operator cannot connect to the remote config endpoint -> it uses the default fallback definition hardcoded in the binary.
- Data gathering runs and there's an attempt to upload the archive (from that point on it's the same as today).
- If the upload fails (after a few retries), then Degraded=True and it's up to the cluster-admin to either disable it (i.e. no data, no upload) or configure the network (there's still a risk that only e.g. the upload is allowed, so perhaps a new Insights recommendation then?).
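A minimal sketch of the fallback path described in the first bullet, assuming a plain HTTP fetch with an embedded default. The default payload, the timeout, and the version segment in the URL are assumptions; the real operator wires this into its own configuration handling and Degraded reporting.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// defaultConfig stands in for the gathering definition hardcoded in the
// operator binary.
const defaultConfig = `{"version":"1.0","container_logs":[]}`

// loadRemoteConfig tries the remote endpoint and, on any error, falls back to
// the embedded default so gathering still runs and the archive upload (and
// any Degraded reporting) proceeds as it does today.
func loadRemoteConfig(url string) string {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("remote config unreachable, using built-in default:", err)
		return defaultConfig
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Println("remote config returned", resp.Status, "- using built-in default")
		return defaultConfig
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return defaultConfig
	}
	return string(body)
}

func main() {
	// "4.16" is an illustrative version value for the %s placeholder.
	cfg := loadRemoteConfig("https://console.redhat.com/api/gathering/v2/4.16/gathering_rules.json")
	fmt.Println("using config:", cfg)
}
```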
with a new one:

```bash
https://console.redhat.com/api/gathering/v2/%s/gathering_rules.json
```
The Monitoring team is considering a proposal for a very similar mechanism for telemetry metrics. Is it feasible to namespace the endpoint further? E.g. https://console.redhat.com/api/remote_config/gathering/v2/%s/gathering_rules.json
A telemetry endpoint could then use https://console.redhat.com/api/remote_config/telemetry or similar.
Yes good point. I think we should be able to do this easily.
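For illustration, this is how the `%s` placeholder would be filled and how a namespaced variant could sit alongside the current one; the "4.16" value and the exact namespaced path are assumptions, not a committed API.

```go
package main

import "fmt"

func main() {
	// The placeholder is assumed to carry the cluster version here.
	version := "4.16"
	current := fmt.Sprintf("https://console.redhat.com/api/gathering/v2/%s/gathering_rules.json", version)
	namespaced := fmt.Sprintf("https://console.redhat.com/api/remote_config/gathering/v2/%s/gathering_rules.json", version)
	fmt.Println("current: ", current)
	fmt.Println("proposed:", namespaced)
}
```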
* [SchedulerLogs](https://github.com/openshift/insights-operator/blob/master/docs/gathered-data.md#schedulerlogs)

### Risks and Mitigations
I think it's worth calling out adversarial risks here too. For example, a MITM attack seems worth considering and recording, even if the mitigation is "we control console.redhat.com and trust the existing certificate infrastructure".
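As a sketch of that mitigation, Go's default HTTP client already verifies the server certificate against the system trust store; the snippet below only makes the TLS floor explicit and is not the operator's actual client setup.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// The default transport verifies console.redhat.com's certificate against
	// the system trust store; disabling verification is what must never happen
	// for the remote-config fetch. Pinning a CA pool would be optional hardening.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		},
	}
	resp, err := client.Get("https://console.redhat.com")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("TLS verified, status:", resp.Status)
}
```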
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close.

/lifecycle stale
/remove-lifecycle stale
Force-pushed from 9403533 to 67ec433
Commits:
- first proposal
- answer some question and update the versioning section
- Add a dedicated "Data Redaction/Obfuscation" section (I believe this aspect deserves a dedicated section. For the moment, I added some notes on data redaction/obfuscation capabilities that I think the solution must/should/could have.)
- some next updates and simple test plan section
- Add stub section on air-gapped clusters
- Move (incorporate) "air-gapped clusters" into "risks and mitigations"
- fill in the graduation criteria
- postpone the problem of the Prometheus metrics data & Removing a deprecated feature section
- Fill in drawbacks sections & update metadata
- Change wording from "external" to "remote" and start Upgrade/Downgrade section
- next sections
- failurer modes section
- implementation history
- support procedures section
- add two sections about the data limitations
- questions update
- Cleaned up Motivation section
- Incorporate feedback on the Motivation section
- Updated first part of "Proposal" section
- address one TODO in the Risk & mitigations section
- udpate tittle
- More proposal changes (and todo items)
- Assign todos for the next iterations
- Add "Status Reporting and Monitoring" section, updated corresponding risks and mitigations
- Added air-gapped and disconnected clusters to non-goals
- addressing some TODOs
- remove sentence about moving existing container log data in the archive
- Better examples of aggregated data gatherers
- Updated "Insights Archive Structure Changes" and removed corresponding todo
- Updated "Limits on Processed and Gathered Data"
- Fix minor issues in first few sections
- Resolving and removing todos
- Update "Alternatives" and "Required Infrastructure"
- Update "Graduation Criteria" to focus on container logs
- Open question about limits relative to number of nodes
- Mention canary rollouts as an option
- More structured monitoring
- Resolve last todos
- fix some typos & minor changes
- Move notes about remote config request frequency
Force-pushed from 8f91bf9 to 9ee9e27
@tremes: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
an API that resolves my concerns. thanks.

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/lgtm
No description provided.