Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

find performance #63

Open
davidagold opened this issue Sep 13, 2015 · 3 comments
Open

find performance #63

davidagold opened this issue Sep 13, 2015 · 3 comments

Comments

@davidagold
Copy link
Contributor

I'm not thrilled about the current implementation of find(X::NullableArray{Bool}); it seems inefficient to loop through everything twice:

function Base.find(X::NullableArray{Bool}) # -> Array{Int}
    ntrue = 0
    @inbounds for (i, isnull) in enumerate(X.isnull)
        ntrue += !isnull && X.values[i]
    end
    target = Array(Int, ntrue)
    ind = 1
    @inbounds for (i, isnull) in enumerate(X.isnull)
        if !isnull && X.values[i]
            target[ind] = i
            ind += 1
        end
    end
    return target
end

I coded an example alternative, which is faster but uses more memory -- I assume due to the allocation of a larger Array:

function f(X)
    res = Array(Int, length(X))
    ind = 1
    @inbounds for i in eachindex(X)
        !X.isnull[i] && X.values[i] ? (res[ind] = i; ind += 1) : nothing
    end
    resize!(res, ind-1)
    return res
end

(After warm-up)

julia> Y = NullableArray(rand(Bool, 50_000_000), rand(Bool, 50_000_000));

julia> @time find(Y);
  0.681382 seconds (8 allocations: 95.358 MB, 4.81% gc time)

julia> @time f(Y);
  0.404130 seconds (6 allocations: 381.470 MB, 2.50% gc time)

Thoughts anybody?

@davidagold
Copy link
Contributor Author

Actually, I think it's worth discussing, if briefly, whether or not this is the desired default behavior for find(::NullableArray). The extension of isnan that we provide returns a NullableArray{Bool} and reflects the positions of null entries in the argument NullableArray. I wonder if that should be default behavior for find methods as well and if we should then provide a skipnull kwarg.

@nalimilan
Copy link
Member

It looks like the implementation in Base for StridedArray also goes over the data twice. It sounds reasonable to do the same for NullableArray, as allocating a vector of the same size as the input can be really wasteful.

Regarding the behaviour of find, I don't really see what you're suggesting. isnan can return NULL for missing elements. But it wouldn't make much sense to insert a NULL for each missing element in the array passed to find. Or do would you simply want to raise an error? This would be consistent with how other functions (like sum) currently behave.

@johnmyleswhite
Copy link
Member

I don't think this is a big perf win and it would be inconsistent with Base, where only the minimal memory is allocated.

On a private branch I've improved perf here by eliminating irrelevant branches.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants