Generic discussions about using metaprogramming with DataFrames #1
Comments
hey @tshort can I just briefly state that this is one of the most important packages for my daily work. the combination with Lazy is hard to beat. do you have any plans of integrating this with DataFrames at some point? |
Hi Florian. Thanks for the feedback. This is still an experimental package. DataFrames needs something extra to improve performance, tighten up syntax, and add LINQ-like features. Is DataFramesMeta the best way to do that? I'm not sure. More feedback would be great! (I haven't used it much for real projects.)
As to plans for integrating into DataFrames, that's up to (mainly) @johnmyleswhite and @simonster. It'll need more testing and user input. Maybe we start small with |
I think encouraging people to try this package out would be a good idea. I'm still really hung up on making sure we get the lower level stuff like Nullable right, but there's no reason people can't start trying out this package to see how it helps them. |
Hi Tom and John, I am actually using it for real work. It allows me to plough through large dataframes very quickly and very intuitively. My typical usage is with Lazy.jl to do something like this:
|
Glad that works so well for you. I personally find that code really hard to make sense of because of all the implicit arguments. |
yeah i can see why you say that. it took me a little while to get used to it. it's always the first argument that gets piped in from the previous expression. |
Thanks, that helps me understand. Is there a version where the piping is explicit? That's what I find most confusing. |
Also, if you'd like to put this in METADATA.jl, I think that's a good idea. |
@johnmyleswhite As far as I could tell, the It at least makes the arguments explicit:
@as _ begin
    sim1
    @where(_, (:j .== j) & (:year .> 1997))
    @transform(_, move_own = :move .* :own,
                  move_rent = :move .* (!:own),
                  buy = (:h .== 0) .* (:hh .== 1))
    @by(_, :year, move_own = mean(:move_own.data, WeightVec(:density.data)),
                  move_rent = mean(:move_rent.data, WeightVec(:density.data)))
end
More than worth the extra three characters per line IMO. I'm undecided how I feel about the lack of pipes there. Fortunately, I'm pretty sure the Lazy.jl version below won't make it in without pipes:
@as _ sim1 @where(_, (:j .== j) & (:year .> 1997)) @transform(_, move_own = :move .* :own)
|
This package is awesome, and should definitely be in METADATA.jl. I really like the macros and As for performance and increased expressiveness, there may be some optimization opportunities: (please correct me if I'm not being correct/reasonable)
There are some concerns with these points though:
|
I would be REALLY happy to see anything that made composition of functional On Fri, Oct 24, 2014 at 2:48 PM, Shashi Gowda notifications@github.com
|
I guess I don't really understand why iterators aren't fast right now. When you nest iterators, do they no longer allow inlining? |
I haven't looked into it enough to have much to say, but another thing is that when it comes to things like filtering, there are extra loops in the |
|
I definitely think that using I guess I assume you need nested loops for certain compositions of iterators. In particular, I was thinking we should remove |
Oh yes, that makes sense, and On Sat, Oct 25, 2014 at 5:24 PM, John Myles White notifications@github.com
|
DataFramesMeta is now in METADATA... |
With regards to JuliaData/DataFrames.jl#369 I have been working on a simple
You can also use
The code is here. I haven't tested performance yet. My approach is very simple and doesn't generate a new function for which type inference is unimpeded, as in @tshort's implementation of I'm very happy to continue working on this functionality for DataFramesMeta if folks here like it.
@byrow df begin
    if :a > ^x
        :a = 2(^x)
    end
end
to get
for row in 1:length(df[1])
    if df[row, :a] > x[row]
        df[row, :a] = 2x[row]
    end
end
EDIT: Actually, the above turns out to be unnecessary. If
If anybody has thoughts, please do share them! @nalimilan, is this along the lines of what you had in mind in 369 above? (Thank you for humoring my reference to your posts from over a year ago...) |
@davidagold Interesting. Indeed it looks like what I was describing in JuliaData/DataFrames.jl#369. I don't think you need to worry about Devec.jl, I see it as orthogonal to this kind of macro. |
I like this idea, @davidagold--it would make a good addition to DataFramesMeta. I don't think it will perform well with the indexing inside the loop. For better performance, you could try to convert to something like the following:
@with df for row in 1:length(df[1])
    if :a[row] > x[row]
        :a[row] = 2x[row]
    end
end
|
Thank you both for your inputs! Tom, you're right about performance:
using DataArrays, DataFrames, DataFramesMeta
srand(1)
n = 10_000_000
a = rand(n)
b = rand(n)
c = rand(n)
d = zeros(n)
df = DataFrame(a=a, b=b, c=c, d=d)
function f1()
    x = 0.0
    @byrow df (begin
        if :a < :b
            x += :b * :c
        end
    end)
    return x
end
function f2()
    x = 0.0
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:10_000_000
        if a[row] < b[row]
            x += b[row] * c[row]
        end
    end
    return x
end
function f3()
    x = 0.0
    @with df (begin
        for row in 1:length(df[1])
            if :a[row] < :b[row]
                x += :b[row] * :c[row]
            end
        end
    end)
    return x
end
function g1()
    @byrow df begin
        if :a < :b
            :d = :b * :c
        end
    end
end
function g2()
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:10_000_000
        if a[row] < b[row]
            df[row, :d] = b[row] * c[row]
        end
    end
end
function g3()
    @with df begin
        for row in 1:length(df[1])
            if :a[row] < :b[row]
                :d[row] = :b[row] * :c[row]
            end
        end
    end
end
f1()
f2()
f3()
g1()
g2()
g3()
println("f1: ", @time(f1()))
println("f2: ", @time(f2()))
println("f3: ", @time(f3()))
println("g1: ", @time(g1()))
println("g2: ", @time(g2()))
println("g3: ", @time(g3()))
gives
I'll go work on your suggestion. May I submit it as a PR when it's ready? |
@tshort After implementing your suggestion the same tests as above now give:
where (as a reminder) f1/g1 use |
Aaactually, I've run into an issue. Going to file. |
Yes to the PR when ready, @davidagold. As far as your tests, try them out without using globals. They should all be faster. |
Here are several issues that discuss metaprogramming and/or different
approaches to query and manipulate DataFrames:
Please add additional comments here on better approaches to querying DataFrames.