Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of dropmissing #3256

Merged
merged 41 commits into from
Jan 27, 2023
Merged
Show file tree
Hide file tree
Changes from 38 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
940c1cd
init
svilupp Dec 23, 2022
e3ae2b0
Merge remote-tracking branch 'upstream/main' into improve-dropmissing…
svilupp Jan 14, 2023
472a715
add dropmissing8 implementation
svilupp Jan 14, 2023
7b5e010
Update src/other/utils.jl
svilupp Jan 15, 2023
dc705e8
move last_ind to separate line, add inbounds
svilupp Jan 15, 2023
f52ed41
support general vectors in `getindex_disallowmissing`
svilupp Jan 15, 2023
72171c9
add general fallback for getindex_disallowmissing
svilupp Jan 15, 2023
4a596ee
method for `getindex_disallowmissing` for abstractdataframe
svilupp Jan 15, 2023
6326f32
changed inner to disallowmissing+view
svilupp Jan 21, 2023
080c018
Clean up
svilupp Jan 21, 2023
a0bf440
updated comments
svilupp Jan 22, 2023
dcfa4e1
update view format
svilupp Jan 22, 2023
9b4478f
Change back to @view, due to errors with `view()`
svilupp Jan 22, 2023
afef173
import view() explicitly
svilupp Jan 22, 2023
fc643bb
add note on dropmissing parallelism
svilupp Jan 22, 2023
d8248a4
added test for zero-column and CategoricalArrays
svilupp Jan 22, 2023
b63ee37
added test for view=true in dropmissing
svilupp Jan 22, 2023
bd343a3
add case for dropmissing!
svilupp Jan 22, 2023
16c278d
add disallowmissing=false test
svilupp Jan 22, 2023
eb998cd
removed duplicative test case
svilupp Jan 22, 2023
20526dd
add test case that disallowmissing=true targets only the required column
svilupp Jan 22, 2023
13deb84
simplified the docstring
svilupp Jan 22, 2023
fdc8e60
Update docs/src/lib/functions.md
svilupp Jan 22, 2023
504e4cd
Update test/data.jl
svilupp Jan 22, 2023
c004a91
Update src/abstractdataframe/abstractdataframe.jl
svilupp Jan 22, 2023
b895692
Update src/abstractdataframe/abstractdataframe.jl
svilupp Jan 22, 2023
ba2760a
Update src/other/utils.jl
svilupp Jan 22, 2023
d0f1187
Update test/data.jl
svilupp Jan 22, 2023
8c51985
Update test/data.jl
svilupp Jan 22, 2023
267e566
Update test/data.jl
svilupp Jan 22, 2023
94b091a
use `@spawn_or_run` to avoid duplication
svilupp Jan 23, 2023
fa288f6
add test for multithreaded path
svilupp Jan 23, 2023
5172e72
Reduced test case memory requirement
svilupp Jan 23, 2023
301bdea
Apply suggestions from code review
bkamins Jan 24, 2023
c6477bf
Update src/abstractdataframe/abstractdataframe.jl
bkamins Jan 24, 2023
3d82b90
Apply suggestions from code review
bkamins Jan 24, 2023
f26c99a
Apply suggestions from code review
bkamins Jan 24, 2023
fffda53
Update test/data.jl
bkamins Jan 24, 2023
82f193c
Update test/data.jl
bkamins Jan 24, 2023
1426e3a
Apply suggestions from code review
bkamins Jan 26, 2023
39580a2
Apply suggestions from code review
bkamins Jan 27, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/src/lib/functions.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@ This is a list of operations that currently make use of multi-threading:
* a transformation produces one row per group and the passed transformation
is a custom function (i.e. not for standard reductions, which use
optimized single-threaded methods).
- `dropmissing` when the provided data frame has more than 1 column and `view=false`
(subsetting of individual columns is spawned in separate tasks).

In general at least Julia 1.4 is required to ensure that multi-threading is used
and the Julia process must be started with more than one thread. Some operations
Expand Down
23 changes: 21 additions & 2 deletions src/abstractdataframe/abstractdataframe.jl
Original file line number Diff line number Diff line change
Expand Up @@ -970,15 +970,34 @@ julia> dropmissing(df, [:x, :y])
@inline function dropmissing(df::AbstractDataFrame,
cols::Union{ColumnIndex, MultiColumnIndex}=:;
view::Bool=false, disallowmissing::Bool=!view)
# Identify Bool mask of which rows have no Missings
bkamins marked this conversation as resolved.
Show resolved Hide resolved
rowidxs = completecases(df, cols)
if view
if disallowmissing
throw(ArgumentError("disallowmissing=true is incompatible with view=true"))
end
return Base.view(df, rowidxs, :)
else
newdf = df[rowidxs, :]
disallowmissing && disallowmissing!(newdf, cols)
# Faster when there are many columns (indexing with integers than via Bool mask)
# or when there are many Missing (as we skip a lot of iterations)
bkamins marked this conversation as resolved.
Show resolved Hide resolved
selected_rows = _findall(rowidxs)
new_columns = Vector{AbstractVector}(undef, ncol(df))

# What column indices should disallowmissing be applied to
cols_inds = index(df)[cols]
bkamins marked this conversation as resolved.
Show resolved Hide resolved

use_threads = Threads.nthreads() > 1 && ncol(df) > 1 && length(selected_rows) >= 100_000
@sync for (i, col) in enumerate(eachcol(df))
@spawn_or_run use_threads if disallowmissing && (i in cols_inds)
new_columns[i] = Missings.disallowmissing(Base.view(col, selected_rows))
else
new_columns[i] = col[selected_rows]
end
end
nalimilan marked this conversation as resolved.
Show resolved Hide resolved

newdf = DataFrame(new_columns, copy(index(df)), copycols=false)

_copy_all_note_metadata!(newdf, df)
return newdf
end
end
Expand Down
61 changes: 61 additions & 0 deletions test/data.jl
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,17 @@ end
@test df1b == df1
end

# Zero column case
@test isempty(dropmissing(DataFrame())) && dropmissing(DataFrame()) isa DataFrame
@test isempty(dropmissing!(DataFrame())) && dropmissing!(DataFrame()) isa DataFrame
df = DataFrame(a=1:3, b=4:6)
dfv = @view df[:, 2:1]
# TODO: re-enable after https://github.com/JuliaData/DataFrames.jl/issues/3272 is resolved
# @test isempty(dropmissing(dfv)) && dropmissing(dfv) isa DataFrame
@test_throws ArgumentError dropmissing!(DataFrame())
bkamins marked this conversation as resolved.
Show resolved Hide resolved
@test_throws ArgumentError dropmissing(df1, [])
@test_throws ArgumentError dropmissing!(df1, [])

df = DataFrame(a=[1, missing, 3])
sdf = view(df, :, :)
@test dropmissing(sdf) == DataFrame(a=[1, 3])
Expand All @@ -186,6 +197,16 @@ end
@test eltype(df2.a) == Union{Int, Missing}
@test df.a == df2.a == [1, 3]

# view=true
df = DataFrame(a=[1, missing, 3])
@test dropmissing(df, view=false) == DataFrame(a=[1, 3])
@test dropmissing(df, view=true) == view(df, [1, 3], :)
@test typeof(dropmissing(df, view=true)) <: SubDataFrame
@test eltype(dropmissing(df, view=true, disallowmissing=false).a) == Union{Int, Missing}
@test_throws ArgumentError dropmissing(df, view=true, disallowmissing=true)
@test eltype(dropmissing(df, view=false, disallowmissing=false).a) == Union{Int, Missing}
@test eltype(dropmissing(df, view=false, disallowmissing=true).a) == Int

a = [1, 2]
df = DataFrame(a=a, copycols=false)
@test dropmissing!(df) === df
Expand All @@ -200,6 +221,46 @@ end
df = DataFrame(b=b)
@test eltype(dropmissing(df).b) == Int
@test eltype(dropmissing!(df).b) == Int

# disallowmissing argument
a = Union{Int, Missing}[3, 4]
b = Union{Int, Missing}[1, 2]
df = DataFrame(;a,b)
@test eltype(dropmissing(df, disallowmissing=false).a) == Union{Int, Missing}
@test eltype(dropmissing!(copy(df), disallowmissing=false).a) == Union{Int, Missing}
@test eltype(dropmissing(df, disallowmissing=true).a) == Int
@test eltype(dropmissing!(copy(df), disallowmissing=true).a) == Int
@test eltype(dropmissing(df, :a, disallowmissing=true).a) == Int
@test eltype(dropmissing!(copy(df), :a, disallowmissing=true).a) == Int
@test eltype(dropmissing(df, :b, disallowmissing=true).a) == Union{Int, Missing}
@test eltype(dropmissing!(copy(df), :b, disallowmissing=true).a) == Union{Int, Missing}

# CategoricalArrays
c = categorical([1, 2, 1, missing])
df = DataFrame(c=c)
@test dropmissing(df) == DataFrame(c=categorical([1, 2, 1]))
@test eltype(dropmissing(df).c) == CategoricalValue{Int, UInt32}
@test eltype(dropmissing!(df).c) == CategoricalValue{Int, UInt32}

# Multithreaded execution test (must be at least ncol > 1, nrow > 100_000)
N_rows, N_cols = 110_000, 3
df = DataFrame([rand(N_rows) for i in 1:N_cols], :auto) |> allowmissing
# Deterministic drop mask: IF remainder of index position divided by 10 == column index THEN missing
for i in 1:ncol(df)
missing_mask = (eachindex(df[!, i]) .% 10) .== i
df[missing_mask, i] .= missing
end

notmissing_rows = [i for i in 1:N_rows if i % 10 == 0 || i % 10 > ncol(df)]
@test isequal(dropmissing(df), df[notmissing_rows, :])
bkamins marked this conversation as resolved.
Show resolved Hide resolved

cols = [:x1, :x2]
notmissing_rows = [i for i in 1:N_rows if i % 10 == 0 || i % 10 > length(cols)]
returned = dropmissing(df, cols)
@test isequal(returned, df[notmissing_rows, :])
bkamins marked this conversation as resolved.
Show resolved Hide resolved
@test eltype(returned[:, cols[1]]) == nonmissingtype(eltype(df[:, cols[1]]))
@test eltype(returned[:, cols[2]]) == nonmissingtype(eltype(df[:, cols[2]]))
@test eltype(returned[:, ncol(df)]) == eltype(df[:, ncol(df)])
end

@testset "deleteat! https://github.com/JuliaLang/julia/pull/41646 bug workaround" begin
Expand Down