Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base.getindex over allocating #104

Open
m-wells opened this issue Nov 4, 2022 · 10 comments
Open

Base.getindex over allocating #104

m-wells opened this issue Nov 4, 2022 · 10 comments

Comments

@m-wells
Copy link
Collaborator

m-wells commented Nov 4, 2022

I have a pull request that addresses this issue.

I came across this while developing my package LazyTables.jl

This should greatly improve performance of TypedTables (although for complex row by row operations LazyTables should still outperform).

As you can see for a simply indexing operation TypedTables makes 24 allocations when it only needs to make one.

(@v1.8) pkg> free TypedTables
   Resolving package versions...
    Updating `~/.julia/environments/v1.8/Project.toml`
  [9d95f2ec] ~ TypedTables v1.4.1 `~/.julia/dev/TypedTables`  v1.4.1
    Updating `~/.julia/environments/v1.8/Manifest.toml`
  [9d95f2ec] ~ TypedTables v1.4.1 `~/.julia/dev/TypedTables`  v1.4.1

julia> using TypedTables
[ Info: Precompiling TypedTables [9d95f2ec-7b3d-5a63-8d20-e2491e220bb9]
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:301 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:305 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:309 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:313 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:317 declares type variable N but does not use it.

julia> table = Table(NamedTuple(Symbol(s) => rand(rand((Bool, Float32, Int, Float64)), 1000) for s in Char.(97:122)));

julia> table[3]; @time table[3];
  0.000011 seconds (24 allocations: 1.250 KiB)
(@v1.8) pkg> dev TypedTables
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.8/Project.toml`
  No Changes to `~/.julia/environments/v1.8/Manifest.toml`

julia> using TypedTables

julia> table = Table(NamedTuple(Symbol(s) => rand(rand((Bool, Float32, Int, Float64)), 1000) for s in Char.(97:122)));

julia> table[3]; @time table[3];
  0.000004 seconds (1 allocation: 144 bytes)
@m-wells
Copy link
Collaborator Author

m-wells commented Nov 10, 2022

Pinging @quinnj and @andyferris for visibility.

@andyferris
Copy link
Member

Thanks for noticing this and the contribution @m-wells

@andyferris
Copy link
Member

@m-wells LazyTables seems really cool - some great ideas there. I've been wanting a "lazy" row for a long time but never got around to it.

To be honest I haven't had the bandwidth to give many of the packages I started the love they deserve. If you really want "All the good of TypedTables.jl but FASTER and without as many allocations!" that sounds really great to me - I'd probably use it myself! :) All that is to say, if you want contribute to and help maintain TypedTables you would be very welcome to work here (I don't have much attachment to the existing code and hopefully wouldn't get in your way, and a breaking change to the eltype of Table would likely be worth it in the long run (we have #105 to fix anyway)). Does that interest you? (If on the other hand you want to make many deeply breaking changes, or are just exploring, that's totally understandable).

@m-wells
Copy link
Collaborator Author

m-wells commented Nov 17, 2022

@andyferris

To be honest I haven't had the bandwidth to give many of the packages I started the love they deserve.

I understand that 100%.

All that is to say, if you want contribute to and help maintain TypedTables you would be very welcome to work here (I don't have much attachment to the existing code and hopefully wouldn't get in your way, and a breaking change to the eltype of Table would likely be worth it in the long run (we have #105 to fix anyway)). Does that interest you?

Sure.

(If on the other hand you want to make many deeply breaking changes, or are just exploring, that's totally understandable).

If you are willing to drop NamedTuple as the eltype then most of the wrapper code in src/columnops.jl can go away. That is only needed because of the generation of NamedTuples for each iteration. If we implemented a TypedRow (analogous to how I implemented the LazyRow) then we could substantially simplify the backend.
This wouldn't be needed anymore

# Resultant element type of given column arrays
@generated function _eltypes(a::NamedTuple{names, T}) where {names, T <: Tuple}
Ts = []
for V in T.parameters
push!(Ts, eltype(V))
end
return NamedTuple{names, Tuple{Ts...}}
end
_ndims(::NamedTuple{<:Any, T}) where {T} = _ndims(T)
_ndims(::Type{<:Tuple{Vararg{AbstractArray{<:Any, n}}}}) where {n} = n
# The following code causes newer versions of Julia to hang in precompilation
# and the workaround should not be needed after JuliaLang/julia#30577 was
# merged.
if VERSION < v"1.2.0-DEV.291"
# Workaround for JuliaLang/julia#29970
let
for n in 1:32
Ts = [Symbol("T$i") for i in 1:n]
xs = [:(x[$i]) for i in 1:n]
NT = :(Core.NamedTuple{names, Tuple{$(Ts...)}})
eval(quote
$NT(x::Tuple{$(Ts...)}) where {names, $(Ts...)} = $(Expr(:new, NT, xs...))
end)
end
end
end
as the element type would be fully inferrable from compile time just from the column types. (This would also make LazyTables redundant, which is okay with me.)

The definition of Table could become

struct Table{K, N, V<:Tuple{Vararg{AbstractArray{<:Any,N}}}} <: AbstractArray{TypedRow{K, N, V}, N}
    data::NamedTuple{K, V}
end

I have other ideas and none of them involve changing how users interface with Table, just backend stuff.

@andyferris
Copy link
Member

If you are willing to drop NamedTuple as the eltype then most of the wrapper code in src/columnops.jl can go away

True that. And yes, I am willing.

The only thing I might add is it would be nice if there were some kind of row type shared between different different tables (like the other tables here like the dictionary ones, but also it would be good to at least theoretically be able to support other Tables.jl tables or even DataFrames DataFrameRow in future). Do you think it's worth having an AbstractRow{names, T <: Tuple} type? And in general would Union{NamedTuple{names, T}, AbstractRow{names{T}} be usable for inserting rows, comparing rows (things like ==, isequal, isless and hash), and share the named tuple interface (meaning propertynames + getproperty)?

@andyferris
Copy link
Member

In the meantime @m-wells I have added you as a maintainer. Welcome, and thank you! :)

Please free to merge #103. You are also welcome to make releases (following semver) - let me know if you aren't sure how that works. As a general development process it would be nice to follow a convention of having pull requests reviewed/approved, but if people are not available for please continue to make progress rather than getting stuck (use your judegement how long is too long).

@m-wells
Copy link
Collaborator Author

m-wells commented Nov 22, 2022

The only thing I might add is it would be nice if there were some kind of row type shared between different different tables (like the other tables here like the dictionary ones, but also it would be good to at least theoretically be able to support other Tables.jl tables or even DataFrames DataFrameRow in future). Do you think it's worth having an AbstractRow{names, T <: Tuple} type? And in general would Union{NamedTuple{names, T}, AbstractRow{names{T}} be usable for inserting rows, comparing rows (things like ==, isequal, isless and hash), and share the named tuple interface (meaning propertynames + getproperty)?

I don't see a particular use case for an AbstractRow (there is already the Tables.Row struct, which is essentially what you are describing). In fact any Tables.jl object can be called as a Tables.columntable which would return the columns as a NamedTuple. This is all you need to make a typed table. In fact because TypedRow has to be defined before Table

struct TypedRow{K, V} <: Tables.AbstractRow
    data::NamedTuple{K, V}
    row::Int
end

struct Table{K, N, V} <: AbstractArray{TypedRow{K, V}, N}
    data::NamedTuple{K, V}
    # inner constructor to check equal length columns perhaps
end

TypedRow can be reused throughout TypedTables.jl although if the underlying table (like FlexTable) is not type stable it might make more sense to have a generic Row.

struct FlexRow{A<:AbstractArray} <: Tables.AbstractRow
    table::A
    row::Int
end

I haven't played with Dictionaries enough to mess with DictTable but it looks like it could also use the TypedRow.

@anirudh2
Copy link
Contributor

I'm tacking onto this issue since it's related, but please let me know if you'd prefer that I open a new issue.

I noticed that CI fails for julia 1.0.5 with @m-wells's awesome updates to Base.getindex. In particular, it fails on the inferred type checking of @test @inferred(mapreduce(s, (acc, row) -> acc + row.sum, t; init = 0.0)) === 18.0.

I locally modified the Base.getindex implementations as follows:

@inline function Base.getindex(t::Table{T}, i::Int) where {T}
    @boundscheck checkbounds(t, i)
    old = convert(T, map(col -> @inbounds(getindex(col, i)), columns(t)))::T
    new = T(@inbounds(getindex(col, i)) for col in columns(t))
    @assert old === new
    return new
end

@inline function Base.getindex(t::Table{T}, i::Int...) where {T}
    @boundscheck checkbounds(t, i...)
    old = convert(T, map(col -> @inbounds(getindex(col, i...)), columns(t)))::T
    new = T(@inbounds(getindex(col, i...) for col in columns(t)))
    @assert old === new
    return new
end

Technically, for the failing test, only the first one matters, but I did both just for completeness.

Both @assert checks pass perfectly fine. However, when I run the test suite, returning old passes, but returning new fails.

@Select and @Compute on Tables: Error During Test at /home/anirudh/projects/TypedTables.jl/test/Table.jl:169
  Test threw exception
  Expression: #= /home/anirudh/projects/TypedTables.jl/test/Table.jl:169 =# @inferred(mapreduce(s, ((acc, row)->begin
                    acc + row.sum
                end), t; init=0.0)) === 18.0
  return type Float64 does not match inferred return type Any
  Stacktrace:
   [1] error(::String) at ./error.jl:33
   [2] macro expansion at /home/anirudh/projects/TypedTables.jl/test/Table.jl:169 [inlined]
   [3] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
   [4] macro expansion at /home/anirudh/projects/TypedTables.jl/test/Table.jl:151 [inlined]
   [5] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
   [6] top-level scope at /home/anirudh/projects/TypedTables.jl/test/Table.jl:2

Honestly, I'm a bit stuck here. My understanding of a === b is that a and b are indistinguishable from one another, so I'm not sure how the assert could pass but the test could fail in this case. If anyone understands what's going on here, I'd love some help!

@anirudh2
Copy link
Contributor

anirudh2 commented Mar 22, 2023

I'm still not sure why the changed submitted by @m-wells broke that test of Julia 1.0.5, but adding ::T as follows seems to have fixed the problem for me.

T(@inbounds(getindex(col, i)) for col in columns(t))::T

On Julia 1.0.5, this does add an allocation for some reason, but it doesn't add an allocation on Julia 1.8 (still just 1 allocation total as shown in the earlier example by @m-wells). If you think this is an acceptable solution, I can open a PR. I'd still be curious if someone has an explanation for what was wrong before though.

@lewisl
Copy link

lewisl commented Dec 3, 2024

Any update on the overall proposal to combine work with LazyTables? I note that TypedTables has many more recent commits. What is entailed in "merging" the code to get LazyRows and other benefits with the cost of some breaking changes that Andy seems willing to accept. How can I help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants