find performance #63

davidagold · 2015-09-13T17:21:01Z

I'm not thrilled about the current implementation of find(X::NullableArray{Bool}); it seems inefficient to loop through everything twice:

function Base.find(X::NullableArray{Bool}) # -> Array{Int}
    ntrue = 0
    @inbounds for (i, isnull) in enumerate(X.isnull)
        ntrue += !isnull && X.values[i]
    end
    target = Array(Int, ntrue)
    ind = 1
    @inbounds for (i, isnull) in enumerate(X.isnull)
        if !isnull && X.values[i]
            target[ind] = i
            ind += 1
        end
    end
    return target
end

I coded an example alternative, which is faster but uses more memory -- I assume due to the allocation of a larger Array:

function f(X)
    res = Array(Int, length(X))
    ind = 1
    @inbounds for i in eachindex(X)
        !X.isnull[i] && X.values[i] ? (res[ind] = i; ind += 1) : nothing
    end
    resize!(res, ind-1)
    return res
end

(After warm-up)

julia> Y = NullableArray(rand(Bool, 50_000_000), rand(Bool, 50_000_000));

julia> @time find(Y);
  0.681382 seconds (8 allocations: 95.358 MB, 4.81% gc time)

julia> @time f(Y);
  0.404130 seconds (6 allocations: 381.470 MB, 2.50% gc time)

Thoughts anybody?

The text was updated successfully, but these errors were encountered:

davidagold · 2015-09-13T17:31:17Z

Actually, I think it's worth discussing, if briefly, whether or not this is the desired default behavior for find(::NullableArray). The extension of isnan that we provide returns a NullableArray{Bool} and reflects the positions of null entries in the argument NullableArray. I wonder if that should be default behavior for find methods as well and if we should then provide a skipnull kwarg.

nalimilan · 2015-09-13T20:32:56Z

It looks like the implementation in Base for StridedArray also goes over the data twice. It sounds reasonable to do the same for NullableArray, as allocating a vector of the same size as the input can be really wasteful.

Regarding the behaviour of find, I don't really see what you're suggesting. isnan can return NULL for missing elements. But it wouldn't make much sense to insert a NULL for each missing element in the array passed to find. Or do would you simply want to raise an error? This would be consistent with how other functions (like sum) currently behave.

johnmyleswhite · 2015-09-13T21:16:18Z

I don't think this is a big perf win and it would be inconsistent with Base, where only the minimal memory is allocated.

On a private branch I've improved perf here by eliminating irrelevant branches.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

find performance #63

find performance #63

davidagold commented Sep 13, 2015

davidagold commented Sep 13, 2015

nalimilan commented Sep 13, 2015

johnmyleswhite commented Sep 13, 2015

find performance #63

find performance #63

Comments

davidagold commented Sep 13, 2015

davidagold commented Sep 13, 2015

nalimilan commented Sep 13, 2015

johnmyleswhite commented Sep 13, 2015