Collect metrics from Fedora CoreOS machines #86

Open
bgilbert opened this issue Nov 27, 2018 · 19 comments
Labels
jira (for syncing to jira), kind/design

Comments

@bgilbert
Contributor

The idea of a Linux distribution collecting metrics about its installed base has long been controversial. Many Linux users do not want their operating system to phone home about the state of their system. At the same time, it is difficult to effectively allocate development resources for an operating system whose installed base is not well understood. Decisions often need to be made about what platforms or container runtimes to prioritize; what hardware, system services, or system or network configurations are commonly used; which corner cases need to be especially well tested during upgrades; and which third-party services provide important compatibility constraints. Metrics can inform these decisions, which benefits the operating system and the userbase as a whole. External mechanisms for measuring the installed base, such as analysis of logs from download mirrors, are inaccurate at best; much better data can be collected from installed machines directly.

We would like to collect metrics from Fedora CoreOS systems by default. We want to be clear about exactly what will be collected, how it will be used, and how to disable that collection. We may also allow opting into collection of additional metrics beyond the default set.

Background: Container Linux

Container Linux metrics are collected via the update system. CoreUpdate collects metrics about each machine that checks in: its update channel, its state in the client state machine, what OS version is running, what version was originally installed, the OEM ID (platform) of the machine, and its checkin history. This works okay but gives us an incomplete picture of the installed base: we do not receive any information about machines behind private CoreUpdate servers, behind a third-party update server such as CoreRoller, or which have updates disabled.

Fedora CoreOS

Fedora CoreOS will not couple metrics to its update system. This not only allows greater freedom in the design of both, but allows privacy-conscious users to disable metrics while continuing to receive automatic updates. (In any event, the Cincinnati update protocol (#83) is not designed to collect client metrics.) We will need a separate metrics-collection system, including:

  • A client daemon or timer unit to perform the collection
  • A cloud service to collect and aggregate metrics data

Initial metrics might include:

  • Random unique identifier for deduplicating reports
  • Platform (cloud environment or hypervisor)
  • On bare-metal systems, a summary of hardware
  • On cloud systems, the instance type
  • Original OS version
  • Current OS version
  • Container runtimes in use
  • Summary of network configuration

Metrics might be grouped into multiple levels, such as minimal and full, which collect different amounts of information. The default could be minimal, with the option to switch to full or off by writing a config file via Ignition. Corresponding documentation should explain what metrics are collected, how the metrics are used, and the consequences of disabling them.

Due to time constraints, functioning metrics collection may not be included in the first release of Fedora CoreOS. However, the first release should include at least the following:

  • A configuration file
  • A service that parses the config file and rejects invalid configs
  • Documentation for configuring metrics collection

In other words, metrics collection should be configurable and documented from day 1. This should reduce unpleasant surprises for the user community when the metrics infrastructure is deployed and enabled.
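As a rough illustration of the "service that parses the config file and rejects invalid configs" item, here is a minimal sketch. Everything in it is an assumption made for the example: the `key = value` file format, the `level` key, and the `parse_config` helper are hypothetical, since the thread does not settle on a format. Only the level names (`minimal`, `full`, `off`) come from the proposal.

```python
from dataclasses import dataclass

# Level names come from the proposal above; the file format and the
# "level" key are hypothetical and were not decided in this thread.
VALID_LEVELS = ("off", "minimal", "full")


@dataclass(frozen=True)
class MetricsConfig:
    level: str = "minimal"  # the proposed default


def parse_config(text: str) -> MetricsConfig:
    """Parse 'key = value' lines and reject anything unrecognized."""
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'key = value'")
        settings[key.strip()] = value.strip()

    # Unknown keys are errors so that typos do not silently fall back
    # to the default level.
    unknown = set(settings) - {"level"}
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")

    level = settings.get("level", "minimal")
    if level not in VALID_LEVELS:
        raise ValueError(f"invalid level {level!r}; expected one of {VALID_LEVELS}")
    return MetricsConfig(level=level)
```

Rejecting unknown keys and values, rather than ignoring them, means a misspelled `level = of` fails loudly instead of leaving collection enabled.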

@ajeddeloh
Contributor

Something that should be added: a way of clearly seeing what will be collected with a given configuration. Users should be able to run a program or look at a file and see the exact data that would be collected. This ensures they can check and make sure any privacy constraints they have will not be violated and ensures they can set the metrics level to whatever suits them best.
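One way to picture such a preview tool is sketched below. The field names, the mapping of fields to levels, and the `build_report` and `preview` helpers are all hypothetical; the point is only that the preview and the real sender share one function, so what the user sees is exactly what would be transmitted.

```python
import json

# Hypothetical field names and grouping, loosely based on the metrics
# listed in the proposal; this is not a real wire format.
MINIMAL_FIELDS = (
    "platform",
    "instance_type",
    "original_os_version",
    "current_os_version",
)
FULL_FIELDS = MINIMAL_FIELDS + (
    "hardware_summary",
    "container_runtimes",
    "network_summary",
)
FIELDS_BY_LEVEL = {"off": (), "minimal": MINIMAL_FIELDS, "full": FULL_FIELDS}


def build_report(level: str, collected: dict) -> dict:
    """Return only the fields that the given level is allowed to send."""
    allowed = FIELDS_BY_LEVEL[level]
    return {name: collected[name] for name in allowed if name in collected}


def preview(level: str, collected: dict) -> str:
    """Render the exact payload as text so it can be inspected before sending."""
    return json.dumps(build_report(level, collected), indent=2, sort_keys=True)
```

A command-line wrapper could print `preview(level, collected)` for the configured level, which would let users check the output against their privacy constraints before enabling anything.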

@arithx
Contributor

arithx commented Nov 28, 2018

One thought I had for a metric to track would be whether or not the user has layered a package. If they specify the full metrics level, we could also track which packages they're layering, so we can decide whether we need to add packages to the base image.

@bgilbert bgilbert removed the meeting topics for meetings label Nov 28, 2018
@dustymabe
Member

We discussed this in the meeting two weeks ago.

In general I'm 👍 for this, but I made a comment during that meeting which I'd like to also mention here: I think we need to be careful how we approach this. We need to be very vocal about this so that users have the necessary transparency. I think we can achieve this by having something like a Fedora Magazine article and a devel mailing list post where we reference our plans and get feedback from the community. My goal here is to make sure the community isn't surprised by this and that we talk about it early and often.

@MureDanta

As a user in a privacy-oriented regulatory environment, I'd like to suggest that data collection be off by default and opt-in. It's the only way to respect user privacy, because language issues, lack of expectations for such collection, etc., could easily lead to a situation where someone who doesn't want collection done has their information leaked by accident. For example, in some situations, depending on the granularity, things like network configuration would definitely be considered sensitive information, and from a pen testing point of view, some of those metrics would offer some useful hints.

Also, any time you talk about collecting data, you open Pandora's Box and have to think about how the individual host data will be transported, how it will be secured after it is collected, how long it will be kept, who will have access, how access will be limited/logged, etc. Sometimes it's just not worth it.

@dustymabe
Member

related: discussion on fedora devel list: F30: System-Wide Change proposal: DNF UUID

@bgilbert
Contributor Author

bgilbert commented Jan 8, 2019

That DNF change proposal includes a mechanism for instances to flag themselves as short-lived, which is interesting. It seems as though that could be done server-side, though.

@MureDanta

The DNF proposal is interesting, though it would be better if it were opt-in; otherwise I'm not sure how they will prevent people from being counted before they've had a chance to opt out. But the more important point in the context of this discussion is that all they're talking about is counting. So... some kind of UUID not used for any purpose other than to guarantee uniqueness of the counted systems. They would not (and indeed should not) even need to associate the UUID with an IP address. Just count them and that's it. So it's quite different from the richer data being discussed above.

What do people think of requiring opt-in? Ethically it seems like the only defensible position because it's the only way to be sure that people really want the data collected (and perhaps they will if it's clear that this will be used to prioritize development), and the system isn't taking advantage of people who are too busy, don't read Fedora Magazine, don't read the documentation carefully, don't read English, etc. Admittedly opt-in will result in less data because you'll lose data from the people who don't notice the issue or can't be bothered to click "yes," but that's sort of the root question: What should the world value most by default? Respecting privacy or collecting data?

Granted "privacy" may not be entirely the right term. We're (probably) not talking about personally identifiable information. But it's still data that belongs to people and/or companies, crosses that line between what is public and what is "inside" their domain, inside their personal property, and that they will not expect to be collected without their consent. Obviously I'm wary of this, but you knew that!

@bgilbert
Contributor Author

Hey @MureDanta, thanks for pursuing this!

The distinction between counting and tracking is an important one, and unfortunately my original writeup didn't talk about it at all. The goal here is entirely to obtain aggregate data about how many users Fedora CoreOS has and how the OS is being used. We do not want to associate individual checkins with a human, company, IP address, etc., and we should carefully avoid doing so. For example, the DNF proposal mentions rotating the UUID on a regular schedule to avoid long-term tracking, and we could do a similar thing here.

If we made the entire system opt-in, I suspect that the resulting data would be sufficiently nonrepresentative that the metrics wouldn't be worth collecting at all. And those metrics are actually important, for the reasons described in the original writeup. Fedora and Container Linux have historically both struggled to make good decisions without a solid understanding of the size and shape of their userbases, and we're trying to avoid making the same mistakes here.

However, you're right that some of the metrics listed above are inherently sensitive, and should be disabled by default. That's where the metrics levels come in. For example, we might group the original list into:

minimal (enabled by default):

  • Random unique identifier for deduplicating reports
  • Platform (cloud environment or hypervisor)
  • On cloud systems, the instance type
  • Original OS version
  • Current OS version

full (disabled by default):

  • On bare-metal systems, a summary of hardware
  • Container runtimes in use
  • Summary of network configuration

The minimal list would thus contain only generic items that couldn't readily be used to fingerprint individual systems.

Fedora CoreOS won't really have an installer: as with Container Linux, users will launch a generic image and use an Ignition config to customize the machine. That means Fedora CoreOS won't have the early-opt-out issue you mentioned, since Ignition would configure metrics before the system first booted. But it also means that we don't have a good way to prompt for permission to gather metrics. I expect we'd prominently mention metrics in the getting-started documentation, but you're right that those docs probably won't reach 100% of users. Given the current pervasive norms in the software world, though, I honestly don't think small amounts of telemetry will take users completely by surprise.

Do you have thoughts about additional ways to improve the privacy properties of the proposal? We want to do the right thing here, within the constraints of the Fedora CoreOS design.

@MureDanta

MureDanta commented Jan 18, 2019

At this point I'm a total newcomer, so I'm not familiar enough with the process of installing and configuring CL or Fedora CoreOS to propose anything concrete. All I can think of offhand would be perhaps a notice from the transpiler that data collection is enabled by default if the YAML file doesn't contain an explicit setting one way or another, but I have no idea how effective that would be. I imagine a lot of those files will be generated automatically from templates, so I don't know if a human being would even see a notice like that in a production environment.

Maybe the best answer is, as you say, just to prominently feature this function in the documentation, collect only what is truly meaningful/necessary, and then for the message protocol and the back end, do as much as possible to make sure the information isn't traceable? My main concern is probably that last part... that information could be traced back and then exploited by someone on the dark side. There's a tendency in instrumentation to try to collect/keep all kinds of stuff on the theory that it might be useful, but it's probably better to collect only what's really needed.

@bgilbert bgilbert added this to Proposed in Fedora CoreOS preview via automation Jan 22, 2019
@bgilbert
Contributor Author

> All I can think of offhand would be perhaps a notice from the transpiler that data collection is enabled by default if the YAML file doesn't contain an explicit setting one way or another, but have no idea how effective that would be. I imagine a lot of those files will be generated automatically from templates, so I don't know if a human being would even see a notice like that in a production environment.

I'm not sure either, but it's worth considering.

> Maybe the best answer is, as you say, just to prominently feature this function in the documentation, collect only what is truly meaningful/necessary, and then for the message protocol and the back end, do as much as possible to make sure the information isn't traceable?

👍

@bgilbert
Contributor Author

It seems as though Lennart's countme idea from the DNF UUID devel@ thread would work here too. In essence we'd drop the unique UUID in favor of trusting the client to report exactly once per time interval. We'd probably want to be able to aggregate data over multiple intervals, e.g. unique machines per day and per month, but the client could maintain multiple timers and tell the server whether a particular checkin is a daily or a monthly one.
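To make the "exactly once per time interval" idea concrete, here is a minimal client-side sketch. It assumes calendar-based daily and monthly windows and a small piece of local state recording the last window in which each kind of checkin was sent; the function names and the state layout are hypothetical. No identifier is generated or transmitted.

```python
from datetime import date


def window_ids(today: date) -> dict:
    """Name the current daily and monthly reporting windows."""
    return {"daily": today.isoformat(), "monthly": today.strftime("%Y-%m")}


def due_checkins(today: date, state: dict) -> list:
    """Return the checkin kinds not yet sent in their current window.

    `state` maps a checkin kind to the id of the last window in which
    that kind of checkin was sent.
    """
    return [
        kind
        for kind, window in window_ids(today).items()
        if state.get(kind) != window
    ]


def record_sent(today: date, state: dict, kinds: list) -> dict:
    """Return a copy of `state` marking `kinds` as sent for today's windows."""
    current = window_ids(today)
    updated = dict(state)
    for kind in kinds:
        updated[kind] = current[kind]
    return updated
```

The server would then count daily checkins to estimate unique machines per day and monthly checkins for unique machines per month, trusting each client to send each kind only once per window.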

@basvdlei

What are the thoughts about how the error should be handled when the system cannot 'call home'? I assume failing silently is the preferred option here (apart from some logging somewhere maybe).

As someone running Container Linux in a corporate environment, I've had to put some additional drop-ins in place for services like update-engine so they can reach the internet through an HTTP proxy. In the case of update-engine there is a clear incentive to do this; for a telemetry-style agent/process this will require some documentation or maybe even an option in the Ignition config.

As for implementing a clear opt-in, the thing that comes to my mind is to have it disabled by default when booting without any config and make it a mandatory option in a user Ignition config. But I'm not sure how I feel about provisioning failing on a missing telemetry setting...
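A sketch of the "fail silently, but log" behaviour with an optional proxy might look like the following. It uses only the Python standard library; the `make_opener` and `send_report` helpers, the JSON payload, and the idea of passing a proxy URL from configuration are assumptions for illustration, not a description of any existing agent.

```python
import json
import logging
import urllib.error
import urllib.request

log = logging.getLogger("metrics")


def make_opener(proxy_url=None):
    """Build an opener, routing through an explicit proxy if one is configured."""
    if proxy_url:
        handler = urllib.request.ProxyHandler(
            {"http": proxy_url, "https": proxy_url}
        )
        return urllib.request.build_opener(handler)
    # The default opener still honors the *_proxy environment variables.
    return urllib.request.build_opener()


def send_report(url, report, opener, timeout=10.0):
    """POST the report; on any network failure, log it and return False."""
    request = urllib.request.Request(
        url,
        data=json.dumps(report).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener.open(request, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError) as exc:
        # Never fail the unit or the boot because metrics could not be sent.
        log.warning("metrics report not sent: %s", exc)
        return False
```

Returning a boolean instead of raising keeps an unreachable collector from ever surfacing as a failed service, while the log line still gives an administrator something to find.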

@bgilbert bgilbert added this to Proposed in Fedora CoreOS stable via automation May 24, 2019
@bgilbert bgilbert removed this from Proposed in Fedora CoreOS preview May 24, 2019
@bgilbert bgilbert moved this from Proposed to Selected in Fedora CoreOS stable Jul 16, 2019
@LorbusChris
Contributor

LorbusChris commented Jul 25, 2019

It might be useful to align FCOS pinger and MCO prometheus host metrics (and their format) to facilitate using FCOS as OKD/Kubernetes base OS.

@bgilbert
Contributor Author

For the record, the plan is to proceed along the lines of #86 (comment). We no longer intend to transmit any unique identifiers.

@mattdm

mattdm commented Jun 20, 2020

An update here from my side: we have the DNF Countme stuff in place and running for non-ostree-based Fedora systems in Fedora 32, and a simple backend collector as well. I'd really love for CoreOS (and Silverblue and IoT) systems to have an exactly-compatible implementation (possibly even hitting similar URLs) so I have integrated data.

The best description of the implementation is in the man page for dnf.conf, and you can see the Behave tests here: https://github.com/rpm-software-management/ci-dnf-stack/blob/master/dnf-behave-tests/features/countme.feature.

Where's the best place for me to track this -- in the "pinger" service, here, or somewhere else?

Also of note: the Fedora Workstation team is looking into more detailed metrics as well, using the work from Endless Computing (guadec talk). Collaboration there might make sense, both for comparability and, y'know, shared work.
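For readers unfamiliar with Countme, the sketch below shows the "longevity bucket" idea as I understand it from the dnf.conf man page mentioned above: the client reports a small integer describing roughly how old the installation is, rather than any identifier. The week boundaries used here (first week, weeks 2-4, weeks 5-24, later) are my reading of that man page and should be checked against it; the `countme_bucket` function is purely illustrative and is not the libdnf or rpm-ostree implementation.

```python
# Exclusive upper bounds, in 1-based week numbers, for buckets 1..3.
# These boundaries are an assumption based on the dnf.conf man page.
BUCKET_BOUNDS = (2, 5, 25)


def countme_bucket(weeks_since_install: int) -> int:
    """Map system age to a longevity bucket from 1 to 4.

    `weeks_since_install` is the number of full weeks elapsed since the
    start of the week in which the system was installed (0 during the
    first week).
    """
    week_number = weeks_since_install + 1
    for bucket, bound in enumerate(BUCKET_BOUNDS, start=1):
        if week_number < bound:
            return bucket
    return len(BUCKET_BOUNDS) + 1
```

Because every system of a similar age reports the same value, the bucket helps distinguish short-lived instances from long-lived ones without making individual machines trackable.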

@cgwalters
Member

I lean a bit towards implementing the same dnf logic in ostree - that way it will naturally work across other distributions using ostree too. But there are a variety of other approaches; once the dnf logic is lowered into libdnf (a separate issue) then we could in theory change rpm-ostree to use libdnf to fetch just the toplevel repomd file or something.

@travier
Member

travier commented Jan 14, 2021

The rpm-ostree implementation of Count Me logic has been merged. Fedora infrastructure changes to support the new user agent are in progress: https://pagure.io/mirrors-countme/pull-request/2

@bgilbert
Contributor Author

We've decided to drop the fedora-coreos-pinger stub from the distro (#770), with the option to re-add it in the future if it becomes ready to deploy.

@travier
Member

travier commented Jul 17, 2023

I've posted a link to this discussion with more examples of data that could be collected and would be useful in the Fedora Change proposal discussion related to telemetry: https://discussion.fedoraproject.org/t/what-data-will-be-collected-exactly-a-breakout-topic-for-the-f40-change-request-on-privacy-preserving-telemetry-for-fedora-workstation/85417/47
