`by` drops zero-length categorical groups #2106
There are two issues:
How do you see the second one? For now what you want can be achieved by something like:
or
and then you can work with the indices to compute what you need.
Thanks @bkamins - I only raise it for discussion as it's sort of cropped up in #2104 and I think it's an important thing to design around and consider early. I fully appreciate that there are ways to address this already (although your code is much cleaner than whatever I would have stumbled upon) - I wasn't raising it because it's impossible, but rather just inconvenient.
From my perspective, the purpose here is to treat your data as observations of a system, not as a complete dataset. If you want to summarize observations of possible outcomes, you often also want to reflect which outcomes weren't observed. When grouping by multiple factor columns, you would then want to characterize all the permutations of those columns. This can produce some enormous datasets, so I wouldn't want this behavior to be the default, but I think that a convenient way of retaining that information would be helpful when desired.
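For comparison outside the Julia ecosystem (this is pandas, not the DataFrames.jl API under discussion): pandas exposes the same design choice through the `observed` keyword of `groupby`, and `observed=False` keeps zero-length groups for unrepresented levels, including the full cross product over several categorical grouping columns:

```python
import pandas as pd

df = pd.DataFrame({
    "x": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "z": pd.Categorical(["u", "v", "u"], categories=["u", "v"]),
    "y": [1, 1, 2],
})

# observed=False retains every combination of category levels,
# even combinations with no observations (count 0)
counts = df.groupby(["x", "z"], observed=False).size()

# 3 levels of x times 2 levels of z -> 6 groups, only 3 observed
assert len(counts) == 6
assert counts[("b", "v")] == 0
```

As noted above, the cross product over all levels can get enormous, which is a good argument for making such behavior opt-in rather than the default.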
OK, now I get the idea. However, if this is the case, maybe it is better to add this option to #1864 (i.e. to add an additional kwarg that would request expansion over all levels of categorical columns) rather than to
Just an additional comment on why this is problematic. Potentially, as an extension to #2095, we might add this option.
Something like that would certainly work. My first reaction is that it feels comfortable when you expect one observation per set of indexing variables, but it feels like a bit of a hack if the goal is to summarize over possible values. In both cases, it assumes that

**Expanding observations**

When the goal is just to reflect unobserved possible values as `missing`:

```julia
x = categorical(["a", "b"])
levels!(x, ["a", "b", "c"])
df = DataFrame(x = x, y = [1, 2])
expanddf!(df, [:x])
# 3×2 DataFrame
# │ Row │ x            │ y       │
# │     │ Categorical… │ Int64⍰  │
# ├─────┼──────────────┼─────────┤
# │ 1   │ a            │ 1       │
# │ 2   │ b            │ 2       │
# │ 3   │ c            │ missing │
```

**Summarizing unobserved groups**

To do some simple summary statistics where you also want to reflect unobserved groups, this feels like a bit of a hack: you have to introduce new missing data only to summarize it.

```julia
x = categorical(["a", "a", "b"])
levels!(x, ["a", "b", "c"])
df = DataFrame(x = x, y = [1, 1, 2])
expanddf!(df, [:x])
by(df, :x, n_observations = :y => y -> count(!ismissing, y))
# 3×2 DataFrame
# │ Row │ x            │ n_observations │
# │     │ Categorical… │ Int64          │
# ├─────┼──────────────┼────────────────┤
# │ 1   │ a            │ 2              │
# │ 2   │ b            │ 1              │
# │ 3   │ c            │ 0              │
```
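The same summary falls out directly, without first injecting `missing` rows, in ecosystems where zero-length groups are retained. A pandas sketch mirroring the example above (for comparison only, not the DataFrames.jl API):

```python
import pandas as pd

x = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"x": x, "y": [1, 1, 2]})

# keeping unobserved levels makes the zero counts appear directly:
# a -> 2, b -> 1, c -> 0
n_observations = df.groupby("x", observed=False).size()
```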
Do you mean with what you write as

If yes, then note that its signature is:

and you can choose the value of

if we added a keyword like

CC @nalimilan - in this case the column
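The group-first-then-expand approach suggested here can be sketched in pandas terms (an analogy, not DataFrames.jl code): aggregate over the observed groups only, then reindex the result over the full level set with an explicit fill value:

```python
import pandas as pd

x = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"x": x, "y": [1, 1, 2]})

# aggregate over observed groups only ...
observed = df.groupby("x", observed=True).size()

# ... then expand to the full set of levels, choosing the fill value
# explicitly: a -> 2, b -> 1, c -> 0
expanded = observed.reindex(x.categories, fill_value=0)
```

This avoids creating rows of missing data before aggregating, at the cost of having to pick a fill value per aggregated column.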
Good catch - I didn't think to group first and then expand, but that's probably a more suitable approach. The reason expanding first seems more idiomatic to me is that you will often summarize with more than one function at a time, and the handling of the "zero-observation" case might produce different values for each summarizing function. For instance, if someone were to summarize over an unobserved group within their data and try to get back both the number of records in that category and the maximum of a single value column, they may expect that the number of records gets filled with 0 and the maximum gets filled with `missing`.
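This point can be illustrated with a pandas comparison (again not the DataFrames.jl API): when zero-length groups are retained during aggregation, the count naturally comes out as 0 while `max` naturally comes out missing, with no per-function fill value to choose at all:

```python
import pandas as pd

x = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"x": x, "y": [1, 1, 2]})

summary = df.groupby("x", observed=False)["y"].agg(["size", "max"])

# group "c" was never observed: its size is 0 and its max is NaN,
# i.e. each aggregation produces its own natural "empty" result
assert summary.loc["c", "size"] == 0
assert pd.isna(summary.loc["c", "max"])
```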
I can see that being able to do

As long as we don't change the default, we can introduce this feature at any point in the 1.x series. Though, as @bkamins noted, it would require adapting the internals. That's tricky, since we don't want to make a copy of all unique values when there are many groups.
Just to add - I expect that after #2095 we will see |
I am giving it the 2.0 milestone since, after more thought, I think it is nice to have also in
Related to #2104 and #1256.

One of the recently revised behaviors within `dplyr` (as of 0.8.0) was the move to grouping by categorical variables producing zero-length groups for unrepresented category levels. I haven't seen this behavior touched on in other issues, and wanted to raise it as a topic of consideration. They've added a field called `.drop` which, when set to `FALSE`, will retain groupings for unrepresented categorical levels.

Just a couple of ideas - this could possibly use `skipmissing=false`, which could be interpreted colloquially as "missing", although I understand this is a bit of conceptual conflation with the value of `missing`. Alternatively, it might be nice to introduce something analogous to `.drop` which specifies the behavior of zero-length groups specifically.

There are certainly times when you want to retain the fact that a dataset doesn't contain values of a specific level, where this can be very handy.