Groupby + count append 0 if not exists #2136

Closed
darrencl opened this issue Mar 3, 2020 · 1 comment

darrencl commented Mar 3, 2020

Hi,

I would like to count rows for every possible combination of values in two columns using by, as shown below.

julia> using RDatasets, DataFrames

julia> ds = dataset("MASS", "biopsy");

julia> sort(by(ds, [:V1, :Class], nrow))
18×3 DataFrame
│ Row │ V1    │ Class       │ x1    │
│     │ Int32 │ Categorical │ Int64 │
├─────┼───────┼─────────────┼───────┤
│ 1   │ 1     │ benign      │ 142   │
│ 2   │ 1     │ malignant   │ 3     │
│ 3   │ 2     │ benign      │ 46    │
│ 4   │ 2     │ malignant   │ 4     │
│ 5   │ 3     │ benign      │ 96    │
│ 6   │ 3     │ malignant   │ 12    │
│ 7   │ 4     │ benign      │ 68    │
│ 8   │ 4     │ malignant   │ 12    │
│ 9   │ 5     │ benign      │ 85    │
│ 10  │ 5     │ malignant   │ 45    │
│ 11  │ 6     │ benign      │ 16    │
│ 12  │ 6     │ malignant   │ 18    │
│ 13  │ 7     │ benign      │ 1     │
│ 14  │ 7     │ malignant   │ 22    │
│ 15  │ 8     │ benign      │ 4     │
│ 16  │ 8     │ malignant   │ 42    │
│ 17  │ 9     │ malignant   │ 14    │
│ 18  │ 10    │ malignant   │ 69    │

As can be seen, the combinations that don't exist are benign with V1 = 9 and V1 = 10. Is it possible to use the by function to add a row with a count of 0 when a combination doesn't exist?

Thanks.

bkamins (Member) commented Mar 3, 2020

No - unfortunately it is not possible currently.

This is a duplicate of #2106, so I will close this issue (if you feel that something should be added, can you please comment there?).

For the time being you can do the following as a half-measure (you will get missing instead of 0 in the target column but it can be easily fixed with coalesce):

julia> df = DataFrame(x=rand(1:3, 10), y=rand(1:4, 10), z=1:10)
10×3 DataFrame
│ Row │ x     │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 1     │
│ 2   │ 1     │ 4     │ 2     │
│ 3   │ 1     │ 3     │ 3     │
│ 4   │ 2     │ 4     │ 4     │
│ 5   │ 3     │ 4     │ 5     │
│ 6   │ 2     │ 1     │ 6     │
│ 7   │ 2     │ 2     │ 7     │
│ 8   │ 2     │ 4     │ 8     │
│ 9   │ 2     │ 4     │ 9     │
│ 10  │ 1     │ 4     │ 10    │

julia> join(join(DataFrame(x=unique(df.x)), DataFrame(y=unique(df.y)), kind=:cross), df, on=[:x,:y], kind=:outer)
16×3 DataFrame
│ Row │ x      │ y      │ z       │
│     │ Int64⍰ │ Int64⍰ │ Int64⍰  │
├─────┼────────┼────────┼─────────┤
│ 1   │ 2      │ 1      │ 1       │
│ 2   │ 2      │ 1      │ 6       │
│ 3   │ 2      │ 4      │ 4       │
│ 4   │ 2      │ 4      │ 8       │
│ 5   │ 2      │ 4      │ 9       │
│ 6   │ 2      │ 3      │ missing │
│ 7   │ 2      │ 2      │ 7       │
│ 8   │ 1      │ 1      │ missing │
│ 9   │ 1      │ 4      │ 2       │
│ 10  │ 1      │ 4      │ 10      │
│ 11  │ 1      │ 3      │ 3       │
│ 12  │ 1      │ 2      │ missing │
│ 13  │ 3      │ 1      │ missing │
│ 14  │ 3      │ 4      │ 5       │
│ 15  │ 3      │ 3      │ missing │
│ 16  │ 3      │ 2      │ missing │
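
To get actual counts with 0 for the missing combinations, the same idea can be sketched end to end as below. This is only a sketch and it assumes a newer DataFrames.jl (1.x), where by and join(...; kind=...) are no longer available and combine(groupby(...)), crossjoin and leftjoin are used instead. The small df and the column name count are made up for illustration and are not from the thread.

```julia
using DataFrames

# Hypothetical data: the pairs (1, 2) and (2, 1) never occur.
df = DataFrame(x = [1, 1, 2, 3, 3, 3], y = [1, 1, 2, 1, 2, 2])

# Count the rows of each (x, y) combination that is actually present.
counts = combine(groupby(df, [:x, :y]), nrow => :count)

# Build every possible (x, y) pair from the observed values.
grid = crossjoin(DataFrame(x = unique(df.x)), DataFrame(y = unique(df.y)))

# A left join keeps all grid rows; pairs without a match get `missing`.
result = leftjoin(grid, counts, on = [:x, :y])

# Replace `missing` with 0, then sort so the output order is predictable.
result.count = coalesce.(result.count, 0)
sort!(result, [:x, :y])
```

Joining the grid against the aggregated counts rather than against the raw rows keeps one row per combination, so coalesce only has to touch the count column.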

bkamins closed this as completed Mar 3, 2020