Differential privacy and correlated data

A recent blog post announces:

Differential Privacy is Vulnerable to Correlated Data

The announced paper, Dependence Makes You Vulnerable, continues a line of thought initiated in the No Free Lunch paper we discussed just a post ago: "differential privacy has limitations when applied to data that are not independent of one another". Unlike the previous works, this paper is brave enough to make falsifiable statements (those specific enough that they could in principle be disproved) about differential privacy.

You can probably imagine what this post will be about.

I have several comments about the technical details of the paper, and in the interest of being friendly let's imagine I am the adversarial reviewer that NDSS apparently did not provide. I think a good deal of the paper could use some expository tune-up, in that the most natural readings of many of their statements do not seem to be correct (either my readings or the results). Some of their statements are flat out wrong, and I'll point these out first.

After this we will get to what I took away from the paper, which is the importance of "privacy expectations", especially as they apply to data with correlations. The examples in the paper include "social, behavioral, and genetic relationships", all of which clearly exist; we can't assume them away, and frankly anyone offering you assumption-based privacy guarantees has some serious explaining to do. Rather, when correlations exist what you think of as your "personal" data is not exactly your "private" data, in the sense that it is not known to you alone.

This point, that your personal data may not be private data, is best seen through one of the authors' examples: genetic privacy. People's genomic data are correlated, massively. Some 98% of our genetic data is common to all people, meaning (I think) that I can correctly announce some 98% of your genome. Scary! That 98% may have been your personal genomic data, but it was not your private genomic data.

In the case of genomic data we do have some privacy expectations: we understand that the 98% common to people is not private information and no outrage is due if someone publishes it. On the other hand, 100% accuracy about your genome is, with present day technology, likely private and worth complaining about if disclosed. How did we come to these numbers, 98% and 100%, and how do we fill in the space between them a bit more crisply?

I believe we should frame "expectations of data privacy" in terms of the difference between personal information and private information. Differential privacy provides one way to frame this distinction: secrets that you can keep by withholding your data are private.

Much of the hand-wringing about differential privacy and "correlated data" or "independent data generation" boils down to wishing that personal data were actually private data, when in fact it may not be (as I've defined things). The anxiety in the present paper is that your location data may be correlated with the data of others, allowing one to learn about your location from their data. This is creepy, and likely a "violation of privacy" in the social sense, but we need to distinguish this from the technical intent of data privacy if we want to complain about either.

By way of analogy, we use cryptography to protect our financial information, even though there is probably lots of correlation among our bank accounts. If I see you moving into your new house and call to ask about your mortgage, because often people with new houses have mortgages, have we violated the security guarantees of the cryptosystem your bank uses? Is it now time to write a paper: "cryptography vulnerable to correlated data", announcing that to provide its guarantees, modern cryptography assumes all messages are independent? Please don't.

Cryptography provides the guarantee that access to encrypted information provides no new capabilities to an adversary; they can do as much with the encryption as without. Differential privacy provides the same flavor of guarantee; an adversary can do roughly the same things with and without access to your data. I think the main difference is that we've sorted out our expectations for cryptography, and still need to work on our expectations for data privacy.

On words, which have meanings

Our subject today is a paper, whose abstract starts (emphasis mine):

Differential privacy (DP) is a widely accepted mathematical framework for protecting data privacy. Simply stated, it guarantees that the distribution of query results changes only slightly due to the modification of any one tuple in the database. This allows protection, even against powerful adversaries, who know the entire database except one tuple. For providing this guarantee, **differential privacy mechanisms assume independence of tuples in the database** - a vulnerable assumption that can lead to degradation in expected privacy levels especially when applied to real-world datasets that manifest natural dependence owing to various social, behavioral, and genetic relationships between users.

The bolded text is not correct. No differentially private mechanism assumes anything about the distribution of its input data, precisely because differential privacy does not permit such assumptions. The guarantee of differential privacy, "that the distribution of query results changes only slightly due to the modification of any one tuple in the database", must hold for all input datasets and all single-tuple modifications.
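
For concreteness, the definition being paraphrased is the following worst-case statement, quantified over every pair of databases D and D' differing in one tuple, and every set S of possible outputs of the mechanism M; note that no probability distribution over the data appears anywhere in it:

Pr[ M(D) in S ] <= exp(epsilon) * Pr[ M(D') in S ]

Whatever correlations exist in the data, this inequality is required to hold. Correlations can make the guarantee less comforting, as we will discuss, but they cannot invalidate it.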

The authors continue to write, later

To provide its guarantees, DP mechanisms assume that the data tuples (or records) in the database, each from a different user, are all independent.

and later still

However, the privacy guarantees provided by the existing DP mechanisms are valid only under the assumption that the data tuples forming the database are pairwise independent

Each of these statements is incorrect. None of them are supported in the paper, nor in any of their citations. The authors should perhaps be slightly embarrassed for making such statements in the scientific record, but not as much as NDSS should be embarrassed that their reviewing process fails to separate papers with incorrect statements from those without.

The authors surely have a point they would like to make about the analysis of correlated data, and how one might learn more than the authors initially suspected, but they'll need to find a better way to make it (and they will, in just a few sections).

An attack on differential privacy

Perhaps we can learn more about the authors' intentions (or prove my sass unfounded) by looking into the details of their attack on differential privacy.

In Section IV, we perform a real-world inference attack to demonstrate that a DDP-adversary can extract more private information than guaranteed by DP.

Ok, that sounds interesting. It certainly sounds like "guaranteed" isn't the right word, one way or the other. A DDP-adversary is one with access to all the data except yours, and optionally some side information about pairwise relationships between records (in the form of a social network, in their case). Once we get to Section IV, we read:

In this section, we demonstrate (1) a real-world inference attack on the LPM-based differential privacy mechanism, as a realistic confirmation of the feasibility of such attacks in practical scenarios; and (2) the capability of a DDP-adversary to use released data, satisfying DP definition, to build an inference attack which violates the security guarantees of DP mechanisms.

Here LPM is "Laplace Perturbation Mechanism", where you add some Laplace noise to low-sensitivity queries. I'm already a little worried that the released data both "satisfies the DP definition" and something then "violates the security guarantees of DP mechanisms". The definition and guarantees are ... the same thing.
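
For readers who haven't seen it, here is a minimal sketch of the Laplace mechanism (my own illustrative Python, not the authors' code): you add Laplace noise calibrated to the query's sensitivity, and the guarantee is worst-case over neighboring databases, with no reference to how the data were generated.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value plus Laplace noise with scale sensitivity / epsilon.

    This is epsilon-differentially private for any query whose value changes by
    at most `sensitivity` when a single record is modified; a worst-case
    property of the query, not a statistical assumption about the data.
    """
    rng = np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a count has sensitivity 1, so epsilon = 0.5 means noise of scale 2.
locations = [(40.7, -74.0), (40.8, -73.9), (40.7, -74.1)]
noisy_count = laplace_mechanism(len(locations), sensitivity=1.0, epsilon=0.5)
```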

Let's see what happens next!

We consider a scenario where the data provider uses the classical K-means clustering approach [20] on the Gowalla location dataset to compute cluster centroids, applies DP mechanism on the centroids, and publishes the perturbed centroids to other applications or researchers.

The authors try to pin down a value of your data by looking at the conditional probability that your data has some value, given access to the reported centroids and all other user data.

Pr[ your data | centroids + all other user data ]

If centroids and all other user data strongly imply something about your data, well then they might learn about it!
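
To see where the correlations will enter (my framing, not the paper's notation), expand that conditional probability with Bayes' rule:

Pr[ your data | centroids + all other user data ] is proportional to Pr[ centroids | your data + all other user data ] * Pr[ your data | all other user data ]

Differential privacy, when it actually holds, constrains only the first factor: as your data varies with everyone else's held fixed, the density of the released centroids changes by at most a factor of exp(epsilon). The second factor is the adversary's prior on your data given everyone else's, and that is exactly where a social network full of your friends' locations does its damage; no amount of noise in the release can shrink it.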

The authors consider two types of attacks: in the first the data are independent, and in the second the data are correlated through some side information, like a social network that identifies connected records. Indeed, when the data are treated as independent the authors can learn little about a record, but when correlations exist (here: providing the adversary with a social network between users) they can learn the value accurately!

Ok. At this point the authors have only shown that centroids plus some pretty significant side information can tell them a lot about your data. It isn't yet clear whose fault that is, or what could have prevented it.

The authors introduce a theorem relating differential privacy and "leaked information", the change in entropy due to showing someone all the other records (and the centroids).

[Figure: the authors' definition of "leaked information"]

[Figure: the authors' theorem relating epsilon-differential privacy to "leaked information"]

The authors do observe a substantial change in entropy due to revealing all other records, which makes sense if people hang out with their friends, and that observation plus this theorem would mean that differential privacy isn't doing its job. Here are their empirical measurements, which show that "leaked information" is much greater than epsilon when correlations exist.

[Figure: the authors' empirical measurements of "leaked information" against epsilon]

What is the conclusion? Well, if we believe their algorithm is differentially private it leads to a contradiction, meaning (i) perhaps their algorithm isn't differentially private, (ii) perhaps their theorem is incorrect, or (iii) perhaps their measurements are wonky.

Why not go for the hat trick?

  1. The k-means algorithm they describe is not differentially private. Clustering the data points and then adding noise does not provide differential privacy, because a single point can have an unbounded influence on the means (k-means is an unstable algorithm). There are differentially private k-means algorithms (a sketch of the general approach follows this list), but what the authors describe is not one of them.

  2. Their theorem is incorrect. This is most easily seen by imagining what happens if the centroids were computed with epsilon = 0 differential privacy. The centroid measurement would then be entirely noise and contribute literally zero information, about anything, and so "leaked information" would simplify to

    Leaked Information = avg_i ( H(Di) - H(Di | all except Di ) )
    

    The theorem says this should be zero. However, when there are correlations between Di and the other records it will not be, because revealing the remaining data changes the posterior distribution over Di (you are still likely near your friends, totally random centroids or not).

    Here is the "detailed proof" from the appendix:

    [Figure: the "detailed proof" from the paper's appendix]

    You can be excused for not understanding this. What I think the authors are saying is that because differential privacy relates the probabilities of adjacent datasets D and D' by a factor of exp(epsilon), the probabilities for any D are pretty close to whatever distribution they have in mind (which is what the . means, I think). But the relation of differential privacy holds only between adjacent datasets, those differing in just one record, not between D and arbitrary datasets, which is what you would need to perform their integration. I think this is what goes wrong in their proof. Authors?

  3. Their measurements are definitely wonky. For any epsilon-differentially private measurement, one can prove that

    H(Di | all except Di) <= H(Di | measurement + all except Di) + epsilon.

    because the probability density of the measurement changes only by a factor of exp(epsilon) as we vary Di while holding all except Di fixed; conditioning on the measurement can therefore shift the posterior distribution over Di by at most a factor of exp(epsilon), and change its entropy by at most epsilon. In other words, having already provided the attacker with all of the other input data, there is very little more information that the measurement can provide.

    This should mean that whatever observations are made about leaked information, the slope with respect to epsilon should be at most one. The curve doesn't have to start at zero, however, because as just argued information leaks even when epsilon is zero. Clearly the curve's slope is not at most one in their experiments; we see a relatively massive spike up from epsilon nearly zero to epsilon around 0.2.

    This could just be due to the algorithm not being differentially private, so perhaps the measurements aren't wonky, just reflecting other wonkiness.
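
Here is the sketch promised in item 1 above. The usual way to get a differentially private k-means-style update (a generic illustration in Python with my own function and parameter names, not the authors' mechanism or any specific paper's) is to bound each record's influence by clipping, and then add noise to per-cluster counts and sums rather than to centroids computed from the raw data:

```python
import numpy as np

def noisy_kmeans_step(points, centroids, epsilon, clip_radius, rng=None):
    """One k-means-style update built from noisy per-cluster counts and sums.

    A sketch only: a real differentially private k-means needs a careful
    sensitivity analysis and a privacy budget split across the counts, the
    sums, and the iterations.  The point is simply that noise is added to
    bounded-sensitivity aggregates, not to centroids of the raw data.
    """
    rng = rng or np.random.default_rng()
    k, d = centroids.shape

    # Bound any one record's influence by clipping it into a ball of radius clip_radius.
    norms = np.maximum(np.linalg.norm(points, axis=1, keepdims=True), 1e-12)
    clipped = points * np.minimum(1.0, clip_radius / norms)

    # Assign each clipped point to its nearest current centroid.
    dists = np.linalg.norm(clipped[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)

    new_centroids = np.zeros((k, d))
    for j in range(k):
        members = clipped[labels == j]
        # One record changes a cluster's count by at most 1 ...
        noisy_count = len(members) + rng.laplace(scale=1.0 / epsilon)
        # ... and changes a cluster's coordinate sum by at most clip_radius in norm.
        noisy_sum = members.sum(axis=0) + rng.laplace(scale=clip_radius / epsilon, size=d)
        new_centroids[j] = noisy_sum / max(noisy_count, 1.0)
    return new_centroids
```

The contrast with "run k-means, then perturb the centroids" is exactly the unbounded-influence problem from item 1: without clipping and per-aggregate noise there is no finite sensitivity against which to calibrate the Laplace noise.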

Something is definitely "up" with this section. There are no further details to follow up on, so for the moment let's not try to sort out exactly what is wrong, other than to push back on summarizing the section as evidence of differential privacy's shortcomings.

The section does leave us with some questions. If "leaked information" can be significant even when your privacy mechanism totally ignores the data, how could you invent a privacy mechanism that bounds it by epsilon? If I can predict your location just by looking at your friends lists and where your friends are, how could you invent a k-means perturbation that keeps me from guessing "wherever your friends are"?

The authors claim to do this in the next sections, but don't explain why their "leaked information" curves do not start at the same offset H(Di) - H(Di | all except Di) that the differentially private computations do. I hope they can clear that up, as I am totally mystified by what guarantees these mechanisms would actually have. If I learn more from them, I will let you all know.

Accidental insight

The authors conclude this section with what I think is the most insightful bit in the paper, though not for the reason I think it was written:

From our analysis, we can see that a differential privacy technique performed on a dependent data set will disclose more information than expected, and this is a serious privacy violation which hinders its applications to real-world data that may be inherently dependent.

The problem absolutely lies in what was expected. I can't actually tell what the authors expected; I suspect bounded "leaked information", which isn't what differential privacy provides. And indeed, differential privacy's application is surely hindered, but by mismatched privacy expectations rather than by serious privacy violations.

I think it is really important for people, both authors and subjects, to work on sorting out their privacy expectations. If you publish your list of friends and your friends publish their locations, do you expect that your location will remain a secret? You might. Do you expect that publishing differentially private k-means centroids in addition will somehow fix your privacy problems? If that is really the hold-up, then what we need to fix is your expectations.

I think differential privacy provides a good basis for expectations: you should expect to maintain all the secrets you would keep by withholding your data. If you would like me to keep secrets that you yourself cannot keep, even if very noble, we need to at least have a discussion about autonomy and why you are deciding what others can do with their information.

A possible explanation

The authors have some point they would like to make and have just, in my opinion, made it quite badly. Here is a re-written version of the abstract (changes bolded) which isn't fundamentally inaccurate, and gets closer to the actual point of their paper:

Differential privacy (DP) is a widely accepted mathematical framework for protecting data privacy. Simply stated, it guarantees that the distribution of query results changes only slightly due to the modification of any one tuple in the database. This allows protection, even against powerful adversaries, who know the entire database except one tuple. **However, differential privacy's guarantees only mask the presence of those records received from each user; they do not mask larger statistical trends that may reveal information about each user. This distinction** can lead to degradation in expected privacy levels especially when applied to real-world datasets that manifest natural dependence owing to various social, behavioral, and genetic relationships between users.

This gets at a really important point: differential privacy does not guarantee that secrets about you will remain secrets. This may make differential privacy a non-answer when you need to maintain secrets about people beyond the secret of what data they contributed.

What the authors study in their paper (the attack described above is just their motivation) are algorithms that suppress statistical trends across many records in the data. If you are tasked with suppressing statistical trends (perhaps a protected class has not collectively consented to some disclosure), it is worth understanding what is going on here (and in related papers).

Expectations of privacy

The cited correlations (social, behavioral, genetic, and more) are great examples of where we need to carefully evaluate our privacy expectations. Should we expect that information held by others will remain secret? It is a bit out of our control. Obviously we would want our genomic data to be a total secret, right? Well, 98% of it sure isn't (thanks, monkeys; read the papers next time). Well, we sure expect "the rest" to be kept secret then. Really, what's that?

How much of your genome is currently still a secret, and for how long will that hold?

Those are really hard questions to answer, because specific answers rely on understanding real dependences in genomic data, and we aren't there yet scientifically (in genomics, at least). The answers will also change with time (as genomics advances) and depend on who you ask (a geneticist may know much more than your barber). It is difficult even to define what is currently a secret about you, and much more difficult to enforce that it remains so.

On the other hand, we do not need to answer these hard questions to provide actionable privacy guarantees: your contribution of genomic data, when guarded with differential privacy, will reveal no further secrets about you. We can't tell you what secrets you started with or for how long you'll keep them, nor can anyone else, but we can guarantee that you aren't disclosing any new secrets by participating in a differentially private computation. This is a privacy expectation you should have: your contribution to a privacy-respecting computation should disclose no new secrets about you, no matter what secrets you happen to have.