Avoid hard-coding for concrete implementations #14
Comments
The Actually, it won't be too hard to stop hard-coding Then the other easy step is to replace Longer-term, I think it would indeed be great to move to a tuple-based approach like Relations.jl, which would avoid lots of copies and allow supporting any kind of source. But for now we don't really need it to support both DataFrames and DataTables. EDIT: Regarding the |
I agree about moving to a tuple-based interface, but generalizing on DataFrames and DataTables is a great first step. |
I think I'm super close to having everything in place for a very generic solution to this problem in Query.jl. Query is based on iterators of NamedTuples. I have a branch where the old modeling features in DataFrames all work with any source that is an iterator of NamedTuples, so one can use that already to essentially use any of the Query data sources with the modeling features (i.e. DataFrames, DataTables, TypedTables, any DataStreams source etc.). This is also a completely streaming interface, so it works really nicely for non-in-memory sources. The main thing I still need to do is move this over to a traits-based dispatch story. I think I've figured out pretty much everything for that; the main problem is to find an afternoon where I can finalize all of this. |
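To make the "iterator of NamedTuples" idea concrete, here is a minimal sketch in current Julia, where `NamedTuple` is built into the language (at the time of this thread it came from the NamedTuples.jl package). The helper `collect_columns` is hypothetical and is not part of Query.jl; it only illustrates how a sink could consume any row iterator in a single streaming pass.

```julia
# Hypothetical helper: materialize any iterator of NamedTuples into a
# NamedTuple of column vectors, reading the rows in a single pass.
function collect_columns(rows)
    next = iterate(rows)
    next === nothing && error("cannot infer column names from an empty source")
    first_row, state = next
    colnames = keys(first_row)
    # One vector per column, typed after the values of the first row.
    columns = map(v -> [v], values(first_row))
    next = iterate(rows, state)
    while next !== nothing
        row, state = next
        for (col, v) in zip(columns, values(row))
            push!(col, v)
        end
        next = iterate(rows, state)
    end
    return NamedTuple{colnames}(columns)
end

# A lazy source: rows are generated on demand, never stored as a table.
rows = ((a = i, b = 2.0 * i) for i in 1:3)
table = collect_columns(rows)
```

Because the helper only relies on `iterate`, the same code would accept any source that yields NamedTuples, in-memory or not.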
Nice work, @davidanthoff! @quinnj, perhaps inspiration could be taken from there for defining things in AbstractTables. |
I think what I should do is actually move all of that functionality out into a package called |
That's precisely what the plan is for AbstractTables, actually. 🙂 |
It would be good to coordinate this effort. As I understand it, @quinnj had been planning to lead the effort on this end (Jacob--sorry if that's incorrect and I've volunteered you). |
Alright, I took a first stab at a traits based iterable tables implementation in Query in queryverse/Query.jl#93. Here is what this does so far:
I'm being a bit sloppy with the term "iterable table" here, that is actually not a trait in my current implementation, the details are slightly more complicated, but probably not super important. |
And I just added this trait support for any
using Query, DataTables, DataFrames, CSV
df = DataFrame(CSV.Source("test.csv"))
mf = ModelFrame(@formula(a~b), CSV.Source("test.csv"))
dt = DataTable(CSV.Source("test.csv"))
and pretty much all permutations of this. And this should work for any |
I pushed one more thing, so that now things like
using Query, DataFrames, DataTables, GLM
dt = DataTable(X=[1,2,3], Y=[2,4,7])
lm(@formula(Y~X), dt) |
It looks almost too good to be true. Can you clarify the implied dependencies that this would require? A new |
We would have one package called Packages that want to support this trait would add that package, SimpleTraits and NamedTuples as a dependency. If a type wants to have the source trait we would add code like this to the package where that type lives. If a type wants to be able to consume something with this trait, we would add code like this. There is no need for a common type hierarchy or anything like that, essentially you just add a bunch of new methods to a package to enable integration with this, you don't have to change the core design of your package at all. There would be no dependency on Query, but a fair amount of the code in Query would move into the packages that declare the sources and sinks (which would be good in any case). There is one complication: right now all my sources and sinks replace any But for now, I think the easiest way to handle this is to just keep everything in Query. Essentially, if you do a For a package like StatsModels, this would imply the following strategy: code it up using one concrete type for input data, most likely |
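The pattern described in this comment (a source type opts in by adding a method, a sink dispatches on the trait, and no shared supertype is needed) can be sketched without any packages using the plain Holy trait pattern. All names here (`IsIterableTable`, `NotIterableTable`, `table_trait`, `getrows`, `MyTable`, `consume`) are hypothetical and are not the API of Query.jl or SimpleTraits.jl.

```julia
# Trait values: plain singleton types, no shared supertype for table types.
struct IsIterableTable end
struct NotIterableTable end

# Default: nothing is a table unless its package opts in.
table_trait(::Type) = NotIterableTable()
table_trait(x) = table_trait(typeof(x))

# --- In the package that owns a source type ---
struct MyTable
    x::Vector{Int}
    y::Vector{Float64}
end
table_trait(::Type{MyTable}) = IsIterableTable()
getrows(t::MyTable) = ((x = a, y = b) for (a, b) in zip(t.x, t.y))

# --- In a package that consumes tables ---
consume(source) = consume(table_trait(source), source)
consume(::IsIterableTable, source) = [row.x + row.y for row in getrows(source)]
consume(::NotIterableTable, source) =
    throw(ArgumentError("$(typeof(source)) is not an iterable table"))

result = consume(MyTable([1, 2], [0.5, 1.5]))
```

In this sketch the source package only adds two methods, and the sink package never mentions `MyTable`, which is the decoupling the comment argues for.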
And I just merged everything into |
Sounds great! Though as regards StatsModels, I think we'd rather use directly the Anyway, would you be fine with defining the trait in AbstractTables.jl? We'll likely need more function definitions there. For example, for categorical variables, we need a way to get the levels in their user-defined order (with a fallback to Regarding the |
Yes, you are right, that would make more sense. I think the story is probably this: it is super little work to make a package that currently consumes
So, the trait is actually not an iterable table trait, I was quite misleading above. The trait is
Looking at categorical variables is on my todo list. I don't understand them well enough at this point to have a clear picture how they fit into this whole story.
I think at the end of the day the interface I'm using in Query (and here) is an alternative to the field based streaming interface in
I thought about this, and I don't see how I can do this... There is essentially no extra layer between these iterable table sources and the Query machinery, so there is no point at which I could do that conversion. Given that I will just keep everything in Query for now, until I can move back to |
Yes, this definitely looks like great progress, thanks for jumping in here @davidanthoff. I'm not a huge fan of the SimpleTraits usage, partly because I have a hard time following all the macros and curly brackets, but also because I don't think we really need to use traits here. I've actually been in the design-phase of adding strongly-typed row iterating to DataStreams, so I'll definitely dig into this more to see how they compare. Ideally, I see a package like |
@quinnj If you don't want to use traits nor a type hierarchy, you essentially lose the ability to dispatch on tabular data, i.e. a function that really needs a tabular iterator would have to look like We should definitely coordinate this. I guess the main question is whether there is actually a need for a second strong-typed row iterating interface in |
I think we still want an |
I don't think we need to subtype/inherit, it's too demanding of a requirement that wouldn't buy us much. @davidanthoff, I'll need to dig in a bit more to better understand the need for a trait. I'm more thinking the latter (DataStreams could adopt Query's row-iteration approach), since DataStreams already houses 2 other table iteration protocols, I think that makes the most sense. |
Yeah, so I picked traits so that you can a) dispatch on tabular data, but b) don't impose a common super-type requirement (I agree with @quinnj that that would be a way too strong requirement). Traits are really great for this kind of situation, note that I'm even able to use the traits based dispatch system without making any change to any of the sources, i.e. |
Actually, that was again imprecise. You can dispatch on a strongly typed iterator, not on tabular data, in my scheme... One could probably add another trait
@quinnj Here is an example why I think the trait is useful. Take the
@traitfn DataFrames.ModelFrame{X; IsTypedIterable{X}}(f::DataFrames.Formula, d::X; kwargs...) =
    DataFrames.ModelFrame(f, DataFrames.DataFrame(d); kwargs...)
If I didn't have the trait, I would have to define it like this:
DataFrames.ModelFrame(f::DataFrames.Formula, d; kwargs...) =
    DataFrames.ModelFrame(f, DataFrames.DataFrame(d); kwargs...)
That works, but it seems not ideal that you now have a method that captures data of type |
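The difference between the two definitions can be illustrated with a small self-contained sketch in plain Julia. `Frame` and `Model` are hypothetical stand-ins for `DataFrame` and `ModelFrame`, and the trait is hand-written rather than declared through SimpleTraits, so this only approximates the behavior under discussion.

```julia
# Stand-ins for a concrete table type and a model frame (not the real types).
struct Frame
    rows::Vector{Any}
end
struct Model
    data::Frame
end

# Variant 1: untyped fallback. Any value at all reaches this method.
model_any(d::Frame) = Model(d)
model_any(d) = model_any(Frame(collect(d)))

# Variant 2: trait-gated. Only types that opted in are converted.
struct IsTypedIterable end
struct NotTypedIterable end
typed_iterable(::Type) = NotTypedIterable()
typed_iterable(::Type{<:Base.Generator}) = IsTypedIterable()  # opt-in for this sketch

model_trait(d::Frame) = Model(d)
model_trait(d) = model_trait(typed_iterable(typeof(d)), d)
model_trait(::IsTypedIterable, d) = Model(Frame(collect(d)))
model_trait(::NotTypedIterable, d) = throw(MethodError(model_trait, (d,)))

rows = ((a = i,) for i in 1:2)
ok = model_trait(rows)        # accepted: the generator type opted in
odd = model_any("oops")       # also accepted: the string is collected into characters
```

With the untyped fallback, the string is silently turned into a "table" of four characters, whereas the trait-gated variant rejects it up front.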
Ok, @nalimilan and I chatted a little more about this offline and I'm convinced that we can't just rely on duck-typing for cases like |
We also noted that if necessary, we could provide a |
Ok, so I think I've figured out how we can actually have a proper
That is exactly what
What would we win over the design that I have in Query right now that is based on traits? |
I've added a StatsModels sink to Query (or will, once tests pass and queryverse/Query.jl#96 is merged). So in theory you should now also be able to use StatsModels with any iterable table source, if you have Query loaded. This is the same functionality that I had previously already added for the DataFrames modeling stuff, now this just also works for the StatsModels version. |
Just that it is simpler and more standard, so that it's easier for everybody to understand this design. This is a pretty fundamental piece of the tabular data ecosystem, which will be used by many packages (e.g. Gadfly), so we'd rather not require all users to learn SimpleTraits.jl (whose interface will probably evolve over time, especially if these features are integrated into Base at some point). That doesn't seem to make a big difference for implementations right now, and lets us focus on the hard parts of the design. (Actually even this debate about traits and inheritance is already a distraction from the core issues we need to discuss.) |
SimpleTraits does not introduce any new traits interface, it simply provides some macros that make it easier to implement the holy trait pattern that is used for the whole indexing story for arrays in base. So unless that story gets completely changed again (which seems extremely unlikely), I don't think we would have to worry about this not working at some point.
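For reference, the Holy trait pattern mentioned here is visible in Base's array indexing. In current Julia the trait function is `IndexStyle` (the name differed in the Julia versions contemporary with this thread). The function `sum_elements` below is a hypothetical illustration of dispatching on that trait, not Base code.

```julia
# Base exposes the indexing trait through IndexStyle.
A = [1 2; 3 4]
style = IndexStyle(A)            # an IndexLinear() instance for Array

# Dispatch on the trait value rather than on the array's concrete type.
sum_elements(A::AbstractArray) = sum_elements(IndexStyle(A), A)

function sum_elements(::IndexLinear, A)
    total = zero(eltype(A))
    for i in eachindex(A)        # linear indices
        total += A[i]
    end
    return total
end

function sum_elements(::IndexCartesian, A)
    total = zero(eltype(A))
    for I in CartesianIndices(A) # one index per dimension
        total += A[I]
    end
    return total
end

# A view indexed by a vector; expected to take the IndexCartesian path.
V = view(A, [1, 2], :)
```

Neither array type inherits from a "linear" or "cartesian" supertype; the choice of method is driven entirely by the trait value.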
Hm, I'm not sure I agree. If you do something based on inheritance, I don't see how you can for example make a Query query that returns an iterable table play in this system (I can't inherit from something like an
What do you have in mind there? What I have in Query seems to cover a pretty broad range of use-cases and works right now, and the design of that is done and finished and implemented. There are some minor questions about how that could be distributed over various packages, but that really is just a packaging story, not a design question. |
Just a short update: I managed to factor the iterable table design out of Query into its own package: https://github.com/davidanthoff/IterableTables.jl. It is still a bit rough and you need I'll try to clean up everything this week, so fingers crossed I'll register the package at the end of the week. Any feedback or help is of course welcome! |
Can we close this after #71? That uses the Tables API (although it does convert everything to |
Closing this as Terms 2.0 largely addresses it. If someone wants to add better support for streaming data / |
I'm wondering to what extent it would be practical to avoid hard-coding for specific implementations in this package. In particular for NullableArrays and DataTables since I'd like to use the functionality here with DataFrames and DataArrays. Commenting out specific uses of NullableArrays and removing the uses of DataTable signatures (could be some kind of AbstractTable in the future) made some of the basic functionality work. Some definitions like https://github.com/JuliaStats/StatsModels.jl/blob/master/src/modelframe.jl#L108 seem misplaced here in any case. I believe they should be defined in their home packages. @nalimilan why do you define the underscore versions of these functions?
Other uses might be harder to get rid of. In particular the use of the Categorical types. I wondered if we could define an abstract categorical type which PooledDataArrays could also be subtypes of.
A very different approach could be to use https://github.com/davidagold/Relations.jl/ as the unifying framework here. I think it is very interesting and might be a way to go beyond in-memory data frame representations. However, there is probably a lot to sort out before we could consider such a change, whereas the changes discussed above are pretty simple.