Collect metrics from Fedora CoreOS machines #86

bgilbert · 2018-11-27T20:48:28Z

The idea of a Linux distribution collecting metrics about its installed base has long been controversial. Many Linux users do not want their operating system to phone home about the state of their system. At the same time, it is difficult to effectively allocate development resources for an operating system whose installed base is not well understood. Decisions often need to be made about what platforms or container runtimes to prioritize; what hardware, system services, or system or network configurations are commonly used; which corner cases need to be especially well tested during upgrades; and which third-party services provide important compatibility constraints. Metrics can inform these decisions, which benefits the operating system and the userbase as a whole. External mechanisms for measuring the installed base, such as analysis of logs from download mirrors, are inaccurate at best; much better data can be collected from installed machines directly.

We would like to collect metrics from Fedora CoreOS systems by default. We want to be clear about exactly what will be collected, how it will be used, and how to disable that collection. We may also allow opting into collection of additional metrics beyond the default set.

Background: Container Linux

Container Linux metrics are collected via the update system. CoreUpdate collects metrics about each machine that checks in: its update channel, its state in the client state machine, what OS version is running, what version was originally installed, the OEM ID (platform) of the machine, and its checkin history. This works okay but gives us an incomplete picture of the installed base: we do not receive any information about machines behind private CoreUpdate servers, behind a third-party update server such as CoreRoller, or which have updates disabled.

Fedora CoreOS

Fedora CoreOS will not couple metrics to its update system. This not only allows greater freedom in the design of both, but allows privacy-conscious users to disable metrics while continuing to receive automatic updates. (In any event, the Cincinnati update protocol (#83) is not designed to collect client metrics.) We will need a separate metrics-collection system, including:

A client daemon or timer unit to perform the collection
A cloud service to collect and aggregate metrics data

Initial metrics might include:

Random unique identifier for deduplicating reports
Platform (cloud environment or hypervisor)
On bare-metal systems, a summary of hardware
On cloud systems, the instance type
Original OS version
Current OS version
Container runtimes in use
Summary of network configuration

Metrics might be grouped into multiple levels, such as minimal and full, which collect different amounts of information. The default could be minimal, with the option to switch to full or off by writing a config file via Ignition. Corresponding documentation should explain what metrics are collected, how the metrics are used, and the consequences of disabling them.

Due to time constraints, functioning metrics collection may not be included in the first release of Fedora CoreOS. However, the first release should include at least the following:

A configuration file
A service that parses the config file and rejects invalid configs
Documentation for configuring metrics collection

In other words, metrics collection should be configurable and documented from day 1. This should reduce unpleasant surprises for the user community when the metrics infrastructure is deployed and enabled.
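
As a rough illustration of the "service that parses the config file and rejects invalid configs" item, here is a minimal sketch. The INI-style format, the `[metrics]` section, the `level` key, and the names `parse_metrics_config` and `ConfigError` are all assumptions made for this example; the proposal does not specify a file format.

```python
import configparser

VALID_LEVELS = ("off", "minimal", "full")
DEFAULT_LEVEL = "minimal"  # default suggested in the proposal


class ConfigError(ValueError):
    """Raised when the metrics config is malformed or has unknown values."""


def parse_metrics_config(text: str) -> str:
    """Return the configured metrics level, or raise ConfigError."""
    # Hypothetical format: an INI file with a single [metrics] section.
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error as exc:
        raise ConfigError(f"cannot parse config: {exc}") from exc

    unknown_sections = set(parser.sections()) - {"metrics"}
    if unknown_sections:
        raise ConfigError(f"unknown sections: {sorted(unknown_sections)}")
    if not parser.has_section("metrics"):
        return DEFAULT_LEVEL  # nothing configured, so fall back to the default

    unknown_keys = set(parser["metrics"]) - {"level"}
    if unknown_keys:
        raise ConfigError(f"unknown keys: {sorted(unknown_keys)}")

    level = parser.get("metrics", "level", fallback=DEFAULT_LEVEL)
    if level not in VALID_LEVELS:
        raise ConfigError(f"invalid level {level!r}; expected one of {VALID_LEVELS}")
    return level
```

Rejecting unknown sections and keys, rather than ignoring them, means a typo such as `levle = off` fails loudly instead of silently leaving collection at the default.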

ajeddeloh · 2018-11-27T21:26:16Z

Something that should be added: a way of clearly seeing what will be collected with a given configuration. Users should be able to run a program or look at a file and see the exact data that would be collected. This ensures they can check and make sure any privacy constraints they have will not be violated and ensures they can set the metrics level to whatever suits them best.
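
One way to get that guarantee, sketched here under assumptions, is to build the preview and the real report from the same code path, so the preview cannot drift from what is actually sent. The collector names and values below are hypothetical placeholders, not part of any existing client.

```python
import json
from typing import Callable, Dict

Collector = Callable[[], object]


def build_report(collectors: Dict[str, Collector]) -> Dict[str, object]:
    """Run each enabled collector and return the exact payload to be sent."""
    return {name: collect() for name, collect in sorted(collectors.items())}


def render_preview(collectors: Dict[str, Collector]) -> str:
    """Render the same payload as text for the user, without sending anything."""
    return json.dumps(build_report(collectors), indent=2, sort_keys=True)


# Hypothetical collectors standing in for whatever the active config enables.
enabled = {
    "platform": lambda: "qemu",
    "current_os_version": lambda: "example-version",
}

if __name__ == "__main__":
    print(render_preview(enabled))
```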

arithx · 2018-11-28T16:57:09Z

One thought I had for a metric to track would be whether or not the user has layered a package. If they specify the full metrics level, we could also track which packages they're layering, so we can decide whether we need to add packages to the base image.

dustymabe · 2018-12-11T15:48:22Z

We discussed this in the meeting two weeks ago.

In general I'm 👍 for this but I made a comment during that meeting which I'd like to also mention here: I think we need to be careful how we approach this. We need to be very vocal about this so users have the necessary transparency. I think we can achieve this by having something like a Fedora Magazine article and devel mailing list post where we reference our plans and get feedback from the community. My goal here is to make sure the community isn't surprised by this and we talk about it early and often.

MureDanta · 2018-12-27T01:25:56Z

As a user in a privacy-oriented regulatory environment, I'd like to suggest that data collection be off by default and opt-in. It's the only way to respect user privacy because language issues, lack of expectations for such collection, etc., could easily lead to a situation where someone who doesn't want collection done has their information leaked by accident. For example, in some situations, depending on the granularity, things like network configuration would definitely be considered sensitive information, and from a pen testing point of view, some of those metrics would offer some useful hints.

Also, any time you talk about collecting data, you open Pandora's Box and have to think about how the individual host data will be transported, how it will be secured after it is collected, how long it will be kept, who will have access, how access will be limited/logged, etc. Sometimes it's just not worth it.

dustymabe · 2019-01-08T17:02:59Z

Related: discussion on the Fedora devel list: F30: System-Wide Change proposal: DNF UUID

bgilbert · 2019-01-08T20:35:06Z

That DNF change proposal includes a mechanism for instances to flag themselves as short-lived, which is interesting. It seems as though that could be done server-side, though.

MureDanta · 2019-01-11T15:04:09Z

The DNF proposal is interesting, though it would be better if it were opt-in; otherwise I'm not sure how they will prevent people from being counted before they've had a chance to opt out. But the more important point in the context of this discussion is that all they're talking about is counting. So... some kind of UUID not used for any purpose other than to guarantee uniqueness of the counted systems. They would not (and indeed should not) even need to associate the UUID with an IP address. Just count them and that's it. So it's quite different from the richer data being discussed above.

What do people think of requiring opt-in? Ethically it seems like the only defensible position because it's the only way to be sure that people really want the data collected (and perhaps they will if it's clear that this will be used to prioritize development), and that the system isn't taking advantage of people who are too busy, don't read Fedora Magazine, don't read the documentation carefully, don't read English, etc. Admittedly opt-in will result in less data because you'll lose data from the people who don't notice the issue or can't be bothered to click "yes," but that's sort of the root question: What should the world value most by default? Respecting privacy or collecting data?

Granted "privacy" may not be entirely the right term. We're (probably) not talking about personally identifiable information. But it's still data that belongs to people and/or companies, crosses that line between what is public and what is "inside" their domain, inside their personal property, and that they will not expect to be collected without their consent. Obviously I'm wary of this, but you knew that!

bgilbert · 2019-01-12T22:53:28Z

Hey @MureDanta, thanks for pursuing this!

The distinction between counting and tracking is an important one, and unfortunately my original writeup didn't talk about it at all. The goal here is entirely to obtain aggregate data about how many users Fedora CoreOS has and how the OS is being used. We do not want to associate individual checkins with a human, company, IP address, etc., and we should carefully avoid doing so. For example, the DNF proposal mentions rotating the UUID on a regular schedule to avoid long-term tracking, and we could do a similar thing here.

If we made the entire system opt-in, I suspect that the resulting data would be sufficiently nonrepresentative that the metrics wouldn't be worth collecting at all. And those metrics are actually important, for the reasons described in the original writeup. Fedora and Container Linux have historically both struggled to make good decisions without a solid understanding of the size and shape of their userbases, and we're trying to avoid making the same mistakes here.

However, you're right that some of the metrics listed above are inherently sensitive, and should be disabled by default. That's where the metrics levels come in. For example, we might group the original list into:

minimal (enabled by default):

Random unique identifier for deduplicating reports
Platform (cloud environment or hypervisor)
On cloud systems, the instance type
Original OS version
Current OS version

full (disabled by default):

On bare-metal systems, a summary of hardware
Container runtimes in use
Summary of network configuration

The minimal list would thus contain only generic items that couldn't readily be used to fingerprint individual systems.
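
The grouping above can be written down as data. This is only a sketch: the field identifiers are made-up names for the listed items, and it assumes that `full` is a superset of `minimal`, which the proposal implies but does not state outright.

```python
from typing import Dict, Tuple

# Hypothetical identifiers for the items in the two lists above.
MINIMAL_FIELDS: Tuple[str, ...] = (
    "report_id",            # random unique identifier for deduplication
    "platform",             # cloud environment or hypervisor
    "instance_type",        # cloud systems only
    "original_os_version",
    "current_os_version",
)
FULL_ONLY_FIELDS: Tuple[str, ...] = (
    "hardware_summary",     # bare-metal systems only
    "container_runtimes",
    "network_summary",
)


def fields_for_level(level: str) -> Tuple[str, ...]:
    """Return the fields that may be reported at the given level."""
    if level == "off":
        return ()
    if level == "minimal":
        return MINIMAL_FIELDS
    if level == "full":
        # Assumption: full includes everything in minimal.
        return MINIMAL_FIELDS + FULL_ONLY_FIELDS
    raise ValueError(f"unknown metrics level: {level!r}")


def filter_report(report: Dict[str, object], level: str) -> Dict[str, object]:
    """Drop any collected field that the level does not allow."""
    allowed = set(fields_for_level(level))
    return {name: value for name, value in report.items() if name in allowed}
```

Filtering against an allowlist, instead of trusting each collector to check the level itself, keeps a sensitive field from leaking into a `minimal` report by accident.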

Fedora CoreOS won't really have an installer: as with Container Linux, users will launch a generic image and use an Ignition config to customize the machine. That means Fedora CoreOS won't have the early-opt-out issue you mentioned, since Ignition would configure metrics before the system first booted. But it also means that we don't have a good way to prompt for permission to gather metrics. I expect we'd prominently mention metrics in the getting-started documentation, but you're right that those docs probably won't reach 100% of users. Given the current pervasive norms in the software world, though, I honestly don't think small amounts of telemetry will take users completely by surprise.

Do you have thoughts about additional ways to improve the privacy properties of the proposal? We want to do the right thing here, within the constraints of the Fedora CoreOS design.

MureDanta · 2019-01-18T02:24:03Z

At this point I'm a total newcomer so am not familiar enough with the process of installing and configuring CL or Fedora CoreOS to propose anything concrete. All I can think of offhand would be perhaps a notice from the transpiler that data collection is enabled by default if the YAML file doesn't contain an explicit setting one way or another, but have no idea how effective that would be. I imagine a lot of those files will be generated automatically from templates, so I don't know if a human being would even see a notice like that in a production environment.

Maybe the best answer is, as you say, just to prominently feature this function in the documentation, collect only what is truly meaningful/necessary, and then for the message protocol and the back end, do as much as possible to make sure the information isn't traceable? My main concern is probably that last part... that information could be traced back and then exploited by someone on the dark side. There's a tendency in instrumentation to try to collect/keep all kinds of stuff on the theory that it might be useful, but it's probably better to collect only what's really needed.

bgilbert · 2019-01-23T04:38:15Z

> All I can think of offhand would be perhaps a notice from the transpiler that data collection is enabled by default if the YAML file doesn't contain an explicit setting one way or another, but have no idea how effective that would be. I imagine a lot of those files will be generated automatically from templates, so I don't know if a human being would even see a notice like that in a production environment.

I'm not sure either, but it's worth considering.

> Maybe the best answer is, as you say, just to prominently feature this function in the documentation, collect only what is truly meaningful/necessary, and then for the message protocol and the back end, do as much as possible to make sure the information isn't traceable?

👍

bgilbert · 2019-01-23T05:07:33Z

It seems as though Lennart's countme idea from the DNF UUID devel@ thread would work here too. In essence we'd drop the unique UUID in favor of trusting the client to report exactly once per time interval. We'd probably want to be able to aggregate data over multiple intervals, e.g. unique machines per day and per month, but the client could maintain multiple timers and tell the server whether a particular checkin is a daily or a monthly one.
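
A minimal sketch of the client-side bookkeeping this would need, assuming fixed-length windows counted from the Unix epoch and a 30-day stand-in for "monthly". The function names and window lengths are illustrative only and are not taken from the DNF proposal.

```python
from typing import Dict, List

# Assumed window lengths in seconds; "monthly" is approximated as 30 days.
WINDOWS: Dict[str, int] = {"daily": 86400, "monthly": 30 * 86400}


def due_intervals(now: int, last_reported: Dict[str, int]) -> List[str]:
    """Return the intervals whose current window has not been reported yet."""
    due = []
    for name, length in WINDOWS.items():
        window = now // length  # index of the window that contains `now`
        if last_reported.get(name) != window:
            due.append(name)
    return due


def mark_reported(now: int, names: List[str], last_reported: Dict[str, int]) -> None:
    """Record a successful checkin so the same window is not counted twice."""
    for name in names:
        last_reported[name] = now // WINDOWS[name]
```

The client would send the list from `due_intervals` along with the checkin, so the server can count a machine once per day and once per month without any identifier. Only window indexes are stored locally, and nothing that identifies the machine leaves it.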

basvdlei · 2019-03-12T17:49:24Z

What are the thoughts about how the error should be handled when the system can not 'call home'? I assume failing silently is the preferred option here (apart from some logging somewhere maybe).

As someone running Container Linux in a corporate environment, I've had to put some additional drop-ins in place for services like update-engine so they can reach the internet through an HTTP proxy. In the case of update-engine there is a clear incentive to do this; for a telemetry-style agent/process this will require some documentation or maybe even an option in the Ignition config.

As for implementing a clear opt-in, the thing that comes to my mind is to have it disabled by default when booting without any config and make it a mandatory option in a user Ignition config. But I'm not sure how I feel about provisioning failing on a missing telemetry setting...
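
On the first question, "fail silently apart from some logging" could look roughly like the sketch below. The `send` callable is a placeholder for whatever transport the client ends up using, and the logger name is made up.

```python
import logging
from typing import Callable

log = logging.getLogger("metrics-client")  # hypothetical logger name


def try_report(send: Callable[[], None]) -> bool:
    """Attempt one checkin; never let a network failure propagate."""
    try:
        send()
    except OSError as exc:
        # OSError covers connection, DNS, and timeout errors raised by the
        # Python standard library, including urllib.error.URLError.
        log.warning("metrics report skipped: %s", exc)
        return False
    return True
```

A machine behind a proxy or without internet access would then log one warning per attempt and otherwise behave normally.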

LorbusChris · 2019-07-25T17:23:06Z

It might be useful to align FCOS pinger and MCO Prometheus host metrics (and their format) to facilitate using FCOS as the OKD/Kubernetes base OS.

bgilbert · 2019-07-26T04:50:22Z

For the record, the plan is to proceed along the lines of #86 (comment). We no longer intend to transmit any unique identifiers.

mattdm · 2020-06-20T15:25:24Z

An update here from my side: we have the DNF Countme stuff in place and running for non-ostree-based Fedora systems in Fedora 32, and a simple backend collector as well. I'd really love for CoreOS (and Silverblue and IoT) systems to have an exactly-compatible implementation (possibly even hitting similar URLs) so I have integrated data.

The best description of the implementation is in the man page for dnf.conf, and you can see the Behave tests here: https://github.com/rpm-software-management/ci-dnf-stack/blob/master/dnf-behave-tests/features/countme.feature.

Where's the best place for me to track this -- in the "pinger" service, here, or somewhere else?

Also of note: the Fedora Workstation team is looking into more detailed metrics as well, using the work from Endless Computing (guadec talk). Collaboration there might make sense, both for comparability and, y'know, shared work.

cgwalters · 2020-09-23T15:35:18Z

I lean a bit towards implementing the same dnf logic in ostree - that way it will naturally work across other distributions using ostree too. But there are a variety of other approaches; once the dnf logic is lowered into libdnf (a separate issue) then we could in theory change rpm-ostree to use libdnf to fetch just the toplevel repomd file or something.

travier · 2021-01-14T10:05:03Z

The rpm-ostree implementation of Count Me logic has been merged. Fedora infrastructure changes to support the new user agent are in progress: https://pagure.io/mirrors-countme/pull-request/2

bgilbert · 2022-06-22T18:50:25Z

We've decided to drop the fedora-coreos-pinger stub from the distro (#770), with the option to re-add it in the future if it becomes ready to deploy.

travier · 2023-07-17T16:57:41Z

I've posted a link to this discussion with more examples of data that could be collected and would be useful in the Fedora Change proposal discussion related to telemetry: https://discussion.fedoraproject.org/t/what-data-will-be-collected-exactly-a-breakout-topic-for-the-f40-change-request-on-privacy-preserving-telemetry-for-fedora-workstation/85417/47

bgilbert added meeting topics for meetings kind/design priority/medium labels Nov 27, 2018

dustymabe mentioned this issue Nov 28, 2018

Throttled update rollouts #83

Closed

bgilbert removed the meeting topics for meetings label Nov 28, 2018

dustymabe mentioned this issue Jan 9, 2019

Public communication to socialize plans for collecting metrics #109

Closed

bgilbert added this to Proposed in Fedora CoreOS preview via automation Jan 22, 2019

bgilbert mentioned this issue Mar 9, 2019

Container Linux migration documentation #159

Closed

36 tasks

bgilbert added this to Proposed in Fedora CoreOS stable via automation May 24, 2019

bgilbert removed this from Proposed in Fedora CoreOS preview May 24, 2019

rfairley mentioned this issue Jun 4, 2019

Package stub pinger for FCOS coreos/fedora-coreos-pinger#3

Closed

9 tasks

bgilbert moved this from Proposed to Selected in Fedora CoreOS stable Jul 16, 2019

This was referenced Sep 16, 2019

Client: Collect metrics from Local FCOS machines coreos/fedora-coreos-pinger#30

Open

Use DynamicUser=yes coreos/fedora-coreos-pinger#29

Merged

zonggen mentioned this issue Sep 23, 2019

Server: Fedora CoreOS Pinger Backend Design coreos/fedora-coreos-pinger#33

Open

travier added the jira for syncing to jira label Sep 28, 2020

mattdm mentioned this issue Oct 9, 2020

RFE: implement a weekly DNF-countme-compatible query coreos/rpm-ostree#2251

Closed

This was referenced Jan 14, 2021

Allow access to metalink/countme fields for external DNF Count Me implementations rpm-software-management/libdnf#1068

Closed

Enabling DNF Count Me support in Fedora CoreOS #717

Closed

travier removed the priority/medium label Jun 24, 2022
