Behavior of censor_dims=true that I don't understand #568
Comments
Thanks for such a concise repro code example. One way to see what is happening is to add a If you use One issue here is that This approach would have the drawback that the curator would need to have a bit more domain knowledge about the population, and this dual reservoir sampling could introduce kinds of bias beyond those introduced by population-wide reservoir sampling. However, this could be manageable with good documentation, and wouldn't be hard to implement. Both are variants of "count Laplace" or "count Gaussian" described in section 1.1 of [1]. If you check Table 2 (on page 19) of that paper, you can see that count Laplace and count Gaussian work well when Having said that, I'm not sure if your use case is really one where you want a large We could categorize the common scenarios into 4 groups, depending on whether A.
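To make the "count Laplace" idea mentioned above concrete, here is a minimal sketch of how a censoring threshold can grow with max_ids. It assumes each user adds 1 to the user count of at most max_ids bins, Laplace noise of scale max_ids/epsilon, and a threshold derived from a simple union bound. The function names and the formula are illustrative assumptions, not necessarily what smartnoise-sql computes internally.

```python
import math

def laplace_scale(max_ids: int, epsilon: float) -> float:
    # Assumption: one user changes up to max_ids bin counts by 1 each,
    # so the L1 sensitivity is max_ids.
    return max_ids / epsilon

def censor_threshold(max_ids: int, epsilon: float, delta: float) -> float:
    # Choose tau so that a bin supported by a single user survives with
    # probability at most delta across all max_ids bins (union bound):
    #   max_ids * 0.5 * exp(-(tau - 1) / b) <= delta
    b = laplace_scale(max_ids, epsilon)
    return 1.0 + b * math.log(max_ids / (2.0 * delta))

for k in (2, 100):
    print(k, round(censor_threshold(k, epsilon=1.0, delta=1e-5), 1))
```

Under these assumptions the threshold for max_ids=100 is far larger than for max_ids=2, which is one plausible reason a bin with a modest number of distinct users gets censored in one configuration and not in the other.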
We wouldn't likely be able to address this in metadata at the table level, but would instead need to address it at the column level. Of course, the simplest pattern would be to assume (and enforce through sampling) that We could mark the columns where max_ids=20

    region:
      type: string
      max_bins_per_user: 1
    sales_month:
      type: string
      max_bins_per_user: 12

And then set
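As a rough sketch of what "enforce through sampling" could look like for a per-column bound, the snippet below keeps at most max_bins distinct bin values per user using reservoir sampling (Algorithm R). Note that max_bins_per_user is only a metadata key proposed in this thread, and limit_bins_per_user is a hypothetical helper written for illustration, not part of smartnoise-sql.

```python
import random
from collections import defaultdict

def limit_bins_per_user(rows, max_bins, seed=None):
    """rows: list of (user_id, bin_value) tuples.

    Keeps, for each user, a uniform random sample of at most max_bins
    distinct bin values, and drops rows that fall in any other bin.
    """
    rng = random.Random(seed)
    seen = defaultdict(set)    # user -> distinct bins encountered so far
    kept = defaultdict(list)   # user -> reservoir of sampled bins
    for user, bin_value in rows:
        if bin_value in seen[user]:
            continue
        seen[user].add(bin_value)
        if len(kept[user]) < max_bins:
            kept[user].append(bin_value)
        else:
            # Algorithm R: the i-th distinct bin replaces a slot
            # with probability max_bins / i.
            j = rng.randrange(len(seen[user]))
            if j < max_bins:
                kept[user][j] = bin_value
    return [(u, b) for u, b in rows if b in kept[u]]

# Hypothetical data: user 1 has sales in 12 months, user 2 in one month.
rows = [(1, f"2023-{m:02d}") for m in range(1, 13)] + [(2, "2023-01")]
limited = limit_bins_per_user(rows, max_bins=3, seed=0)
print(sorted({b for u, b in limited if u == 1}))
```

After this step each user touches at most max_bins bins of that column, so a noise scale and censoring threshold could be based on that smaller bound instead of a table-wide max_ids.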
Yes, it is the latter. I can work around this with some preprocessing like in my example, but I was confused as to why the preprocessing affected Thanks for your very comprehensive answer. I think I get the gist of it, the 4 groups of cases are very illustrative, but I'll have to go chew on this a bit. Feel free to close this since I think you've answered my question, or leave open if you think you might want to implement some of the
Thanks; I'll keep this open as a work item to do at least two things:
In the docs we should also provide some advice about selecting max_ids, how to address by pivoting, handling public dimensions, etc. Please add more comments if you think of other heuristics we could use to provide better defaults.
I have a case in which I don't understand the behavior of censor_dims. Namely, I run essentially the same query in two different ways but get different results. The data is a table with two columns, category and id. The two ways are:

1. A COUNT(*) ... GROUP BY category query with max_ids=100.
2. A COUNT(*) AS counts ... GROUP BY category, id query, and then on the result of that a SUM(counts) ... GROUP BY category query with max_ids=2 (because category takes two different values) but upper=100 for counts.

I would expect the results to be the same, and they roughly are, except for censor_dims, which consistently censors the results in the first case but not in the second.

I'm not sure if this is an issue with smartnoise-sql or with my understanding of DP/the censor dims mechanism, but I'm hoping to get some clarity on that. Here's a code snippet that illustrates:
Typical output of running the above is
Tagging @fhoussiau because we had a chat about this earlier today.
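For readers without the original snippet, the two query shapes described in this issue can be compared without any noise using plain Python. This is only an illustrative sketch on made-up data (the rows below are hypothetical, not the data from the report): it shows that the exact answers agree, while the number of rows each id contributes differs sharply between the raw table and the pre-aggregated one.

```python
from collections import Counter

# Hypothetical table of (category, id) rows.
rows = []
for uid in range(5):
    rows += [("a", uid)] * (uid + 1) + [("b", uid)] * 2

# Way 1: COUNT(*) ... GROUP BY category on the raw rows.
direct = Counter(category for category, _ in rows)

# Way 2: COUNT(*) AS counts ... GROUP BY category, id,
# then SUM(counts) ... GROUP BY category on that result.
per_pair = Counter(rows)
summed = Counter()
for (category, _), counts in per_pair.items():
    summed[category] += counts

# Rows contributed per id in each table (what max_ids has to bound).
max_rows_raw = max(Counter(uid for _, uid in rows).values())
max_rows_pre = max(Counter(uid for _, uid in per_pair).values())

print(direct == summed, max_rows_raw, max_rows_pre)
```

Without noise the two approaches give identical totals. What changes is the per-id row bound: it is large on the raw table and at most the number of categories after pre-aggregation, with the magnitude moved into the upper bound on counts.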