Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid hard-coding for concrete implementations #14

Closed
andreasnoack opened this issue Feb 22, 2017 · 30 comments
Closed

Avoid hard-coding for concrete implementations #14

andreasnoack opened this issue Feb 22, 2017 · 30 comments

Comments

@andreasnoack
Copy link
Member

I'm wondering to what extent it would be practical to avoid hard-coding for specific implementations in this package. In particular for NullableArrays and DataTables since I'd like to use the functionality here with DataFrames and DataArrays. Commenting out specific uses of NullableArrays and removing the uses of DataTable signatures (could be some kind of AbstractTable in the future) made some of the basic functionality work. Some definitions like https://github.com/JuliaStats/StatsModels.jl/blob/master/src/modelframe.jl#L108 seem misplaced here in any case. I believe they should be defined in their home packages. @nalimilan why do you define the underscore versions of these functions? Other uses might be harder to get rid of. In particular the use of the Categorical types. I wondered if we could define an abstract categorical which also PooledDataArrays could be subtypes of.

A very different approach could be to use https://github.com/davidagold/Relations.jl/ as the unifying framework here. I think it is very interesting and might be a way to go beyond in-memory data frame representations. However, there are probably a lot to sort out before could consider such a change whereas the changes discussed above are pretty simple.

@andreasnoack andreasnoack changed the title Avoid hardcoding for concrete implementations Avoid hard-coding for concrete implementations Feb 22, 2017
@nalimilan
Copy link
Member

nalimilan commented Feb 22, 2017

The _unique methods are needed to unwrap nullables. Fortunately, special-casing Nullable does not interfere at all with DataArray support, which does not need unwrapping. Since this code is working on complete cases, it will never see a null or a NA and should work fine with either. See also JuliaStats/DataArrays.jl#237.

Actually, it won't be too hard to stop hard-coding CategoricalArray. In #15, I remove half of the uses, which are no longer needed. As you suggest, it would make sense to define AbstractCategoricalArray (probably in AbstractTables again), since the remaining occurrence of CategoricalArray in the code base is for modelmat_cols, where it's much more efficient to use the integer codes than to compare strings one by one to get the index of the column they correspond to. That's pretty easy change to make, with the slight complication that we also need AbstractNullableCategoricalArray, which won't be used by PooledDataArray since no unwrapping is required for it.

Then the other easy step is to replace AbstractDataTable with AbstractTable and make DataFrame and DataTable inherit from the latter.

Longer-term, I think it would indeed be great to move to a tuple-based approach like Relations.jl, which would avoid lots of copies and allow supporting any kind of source. But for now we don't really need it to support both DataFrames and DataTables.

EDIT: Regarding the _unique(x::CategoricalArray) method you refer to, it can be made more generic by dispatching on AbstractCategoricalArray. This method is needed since we want to sort levels of e.g. Vector{String} in alphabetical order, but we want to keep the user-defined order for categorical arrays. Let's discuss another possible approach at JuliaStats/DataArrays.jl#237.

@kleinschmidt
Copy link
Member

I agree about moving to a tuple-based interface, but generalizing on DataFrames and DataTables is a great first step.

@davidanthoff
Copy link
Contributor

I think I'm super close to have everything in place for a very generic solution to this problem in Query.jl.

Query is based on iterators of NamedTuples. I have a branch where the old modeling features in DataFrames all work with any source that is an iterator of NamedTuples, so one can use that already to essentially use any of the Query data sources with the modeling features (i.e. DataFrames, DataTables, TypedTables, any DataStreams source etc.). This is also a completely streaming interface, so it works really nicely for non-in-memory sources.

The main thing I still need to do is move this over to a traits based dispatch story. I think I've figured pretty much everything out for that, main problem is to find an afternoon where I can finalize all of this.

@ararslan
Copy link
Member

Nice work, @davidanthoff! @quinnj, perhaps inspiration could be taken from there for defining things in AbstractTables.

@davidanthoff
Copy link
Contributor

I think what I should do is actually move all of that functionality out into a package called IterableTables. It would be a super minimal package that provides a minimum interface for tabular data exchange.

@ararslan
Copy link
Member

That's precisely what the plan is for AbstractTables, actually. 🙂

@ararslan
Copy link
Member

It would be good to coordinate this effort. As I understand it, @quinnj had been planning to lead the effort on this end (Jacob--sorry if that's incorrect and I've volunteered you).

@davidanthoff
Copy link
Contributor

Alright, I took a first stab at a traits based iterable tables implementation in Query in queryverse/Query.jl#93.

Here is what this does so far:

  • DataTable, DataFrame and TypedTables are now iterable table sources. And any uncollected Query query (haha) that produces NamedTuples will also work.
  • DataTable and DataFrame have new constructors that dispatch on this iterable table trait, so you can call them with DataFrame(x), and as long as x is an iterable table (right now any of the three iterable table sources from the previous bullet point, plus any iterable that produces NamedTuples).
  • I also added a DataFrames.ModelFrame constructor that dispatches on the iterable table trait. So you can now call ModelFrame(@formula(a~b), q), and as long as q has the iterable table trait, it will just work. If we want functions like lm in GLM to work with this, we would just need to add a method there that dispatches on that trait.

I'm being a bit sloppy with the term "iterable table" here, that is actually not a trait in my current implementation, the details are slightly more complicated, but probably not super important.

@davidanthoff
Copy link
Contributor

And I just added this trait support for any DataStream source, so you can also do things like this now:

using Query, DataTables, DataFrames, CSV

df = DataFrame(CSV.Source("test.csv"))

mf = ModelFrame(@formula(a~b), CSV.Source("test.csv"))

dt = DataTable(CSV.Source("test.csv"))

and pretty much all permutations of this. And this should work for any DataStreams source, i.e. feather, SQLite etc.

@davidanthoff
Copy link
Contributor

I pushed one more thing, so that now things like lm from GLM.jl work with any data source that has the table trait. So this now works:

using Query, DataFrames, DataTables, GLM

dt = DataTable(X=[1,2,3], Y=[2,4,7])

lm(@formula(Y~X), dt)

@andreasnoack
Copy link
Member Author

It looks almost too good to be true. Can you clarify the implied dependencies that this would require? A new AbstractTables/IterableTables that both DataFrames and DataTables should depend on, right? Where is Query in this chain?

@davidanthoff
Copy link
Contributor

We would have one package called IterableTables, the only thing in it would be this.

Packages that want to support this trait would add that package, SimpleTraits and NamedTuples as a dependency. If a type wants to have the source trait we would add code like this to the package where that type lives. If a type wants to be able to consume something with this trait, we would add code like this. There is no need for a common type hierarchy or anything like that, essentially you just add a bunch of new methods to a package to enable integration with this, you don't have to change the core design of your package at all. There would be no dependency on Query, but a fair amount of the code in Query would move into the packages that declare the sources and sinks (which would be good in any case).

There is one complication: right now all my sources and sinks replace any Nullable with a DataValue type. DataValue is almost like a Nullable, but it has some slightly different semantics. I'm not keen on DataValue, but until JuliaLang/julia#20504 is in base I won't be able to move everything back to Nullable. So, once again, the missing values are a pure joy ;)

But for now, I think the easiest way to handle this is to just keep everything in Query. Essentially, if you do a using Query everything works, at least if you are on master :) Because there is no need to modify anything in packages like DataFrame or DataTable for this to work, this seems a fairly simple way forward.

For a package like StatsModels, this would imply the following strategy: code it up using one concrete type for input data, most likely DataTable. Then rely on Query (for now) to provide additional methods that work with any other tabular data source; these methods then essentially just convert things into a DataTable. That is the strategy here (although it works the other way around, everything gets converted into a DataFrame because that extends the implementation in DataFrames).

@davidanthoff
Copy link
Contributor

And I just merged everything into master on the Query side, so if you are on master for Query, this should all be more or less functional.

@nalimilan
Copy link
Member

Sounds great! Though as regards StatsModels, I think we'd rather use directly the IterableTable interface to fill the model matrix, as converting (i.e. copying) a whole table would be really wasteful.

Anyway, would you be fine with defining the trait in AbstractTables.jl? We'll likely need more function definitions there. For example, for categorical variables, we need a way to get the levels in their user-defined order (with a fallback to unique). That will also help coordination with DataStreams, which currently has its own interface for tables.

Regarding the Nullable/DataValue issue, would it be possible to define the IterableTable interface only in terms of Nullable, and convert to DataValue from inside Query on the fly? Nullable is currently the only common type between all packages since it's defined in Base; it doesn't need to be exposed to the user in all cases.

@davidanthoff
Copy link
Contributor

Though as regards StatsModels, I think we'd rather use directly the IterableTable interface to fill the model matrix, as converting (i.e. copying) a whole table would be really wasteful.

Yes, you are right, that would make more sense. I think the story is probably this: it is super little work to make a package that currently consumes DataFrames work with iterable tables, essentially all you have to do is add a one-liner like this. So it should be very simple to spread this pattern widely (plotting packages and all the other gazillion packages that consume DataFrames right now). And then, if you want better performance, you can implement something that doesn't go through an extra copy, and that would look more like this.

Anyway, would you be fine with defining the trait in AbstractTables.jl?

So, the trait is actually not an iterable table trait, I was quite misleading above. The trait is IsTypedIterable. This is simply a very small pattern that enables dispatching on iterables that can potentially have one extra step of indirection so that one ends up with an iterator that has type-stable start, next and done methods, even if the original type one is iterating wouldn't enable this (like DataFrames and DataTables). That code actually has nothing to do with tabular data. My current thinking is that it should probably be located in a package called TypedIterators or something like that (or we might even just put it into SimpleTraits). Other than that there is no need for shared code, i.e. everything else is essentially just a convention on how to use the normal julia iterator protocol with NamedTuples for tabular data.

For example, for categorical variables, we need a way to get the levels in their user-defined order (with a fallback to unique).

Looking at categorical variables is on my todo list. I don't understand them well enough at this point to have a clear picture how they fit into this whole story.

That will also help coordination with DataStreams, which currently has its own interface for tables.

I think at the end of the day the interface I'm using in Query (and here) is an alternative to the field based streaming interface in DataStreams. I can wrap the DataStreams interface in my iterable table scheme, so that we get interop, but in principle one could also imagine a world where things like CSV or Feather etc. just implemented the iterable tables interface directly.

Regarding the Nullable/DataValue issue, would it be possible to define the IterableTable interface only in terms of Nullable, and convert to DataValue from inside Query on the fly? Nullable is currently the only common type between all packages since it's defined in Base; it doesn't need to be exposed to the user in all cases.

I thought about this, and I don't see how I can do this... There is essentially no extra layer between these iterable table sources and the Query machinery, so there is no point at which I could do that conversion.

Given that I will just keep everything in Query for now, until I can move back to Nullables, I guess another option would be for packages like StatsModels etc. to just take a dependency on Query. That doesn't pull in a whole lot of stuff (see REQUIRE, and I think I can actually get rid of FunctionWrappers soon).

@quinnj
Copy link
Member

quinnj commented Mar 16, 2017

Yes, this definitely looks like great progress, thanks for jumping in here @davidanthoff. I'm not a huge fan of the SimpleTraits usage, partly because I have a hard time following all the macros and curly brackets, but also because I don't think we really need to use traits here. I've actually been in the design-phase of adding strongly-typed row iterating to DataStreams, so I'll definitely dig into this more to see how they compare. Ideally, I see a package like AbstractTables.jl being a collection of empty function definitions with documentation around how "table types" should implement each and the kind of functionality they get. This avoids any need for subtyping or traits. The functions I have in mind are the various value iterating protocols, field/row/column.

@davidanthoff
Copy link
Contributor

@quinnj If you don't want to use traits nor a type hierarchy, you essentially lose the ability to dispatch on tabular data, i.e. a function that really needs a tabular iterator would have to look like function f(x), i.e. it would be the function that has a Any type signature. That might work for some cases, but I'd be surprised if there wouldn't be cases where people actually want to dispatch on "tabular data". Or do you have another idea how to do that, without a type hierarchy or traits?

We should definitely coordinate this. I guess the main question is whether there is actually a need for a second strong-typed row iterating interface in DataStreams, or whether the stuff I've defined in Query does the job and you could just adopt that?

@ararslan
Copy link
Member

I think we still want an AbstractTable type from which DataFrame and DataTable both inherit for the purposes of dispatch. At least that was my understanding.

@quinnj
Copy link
Member

quinnj commented Mar 16, 2017

I don't think we need to subtype/inherit, it's too demanding of a requirement that wouldn't buy us much. @davidanthoff, I'll need to dig in a bit more to better understand the need for a trait. I'm more thinking the latter (DataStreams could adopt Query's row-iteration approach), since DataStreams already houses 2 other table iteration protocols, I think that makes the most sense.

@davidanthoff
Copy link
Contributor

Yeah, so I picked traits so that you can a) dispatch on tabular data, but b) don't impose a common super-type requirement (I agree with @quinnj that that would be a way too strong requirement). Traits are really great for this kind of situation, note that I'm even able to use the traits based dispatch system without making any change to any of the sources, i.e. DataFrames, DataTables, TypedTables, DataStreams and all the other sources I support all participate in my interface without any requirement to change them from how they are right now.

@davidanthoff
Copy link
Contributor

Actually, that was again imprecise. You can dispatch on a strongly typed iterator, not on tabular data, in my scheme... One could probably add another trait IsTypedIterableTable or something like that, but I'd have to think a bit about that.

@quinnj Here is an example why I think the trait is useful. Take the ModelFrame constructor. Right now I define that like this:

@traitfn DataFrames.ModelFrame{X; IsTypedIterable{X}}(f::DataFrames.Formula, d::X; kwargs...) = 
    DataFrames.ModelFrame(f, DataFrames.DataFrame(d); kwargs...)

If I didn't have the trait, I would have to define it like this:

DataFrames.ModelFrame(f::DataFrames.Formula, d; kwargs...) = 
    DataFrames.ModelFrame(f, DataFrames.DataFrame(d); kwargs...)

That works, but it seems not ideal that you now have a method that captures data of type Any...

@quinnj
Copy link
Member

quinnj commented Mar 16, 2017

Ok, @nalimilan and I chatted a little more about this offline and I'm convinced that we can't just rely on duck-typing for cases like plot(x, ...) since we can't always assume an ::Any item is a table structure. The ideal solution here is formal interface types; where any structure x implementing specific interface methods will satisfy isa(x, MyInterfaceType). In the short-term, I'm inclined to go with a subtyping solution, since it's the simplest, most direct way to both restrict and abstract tabular data structures. I'm certainly going to try and push Base julia towards formal interface types in the mean time, however, to ensure that future tabular data structures don't need to subtype.

@nalimilan
Copy link
Member

We also noted that if necessary, we could provide a TableWrapper type which would inherit AbstractType and could be used to wrap non-standard table types so that they work for dispatch and support the same interface.

@davidanthoff
Copy link
Contributor

Ok, so I think I've figured out how we can actually have a proper IsIterableTable trait. With that you can dispatch specifically on iterable tables.

The ideal solution here is formal interface types; where any structure x implementing specific interface methods will satisfy isa(x, MyInterfaceType).

That is exactly what SimpleTraits provides, and how the whole dispatch story works in my framework now. SimpleTraits is just a nicer interface for writing holy-traits, so this is also a pattern that is quite widely used in Base.

In the short-term, I'm inclined to go with a subtyping solution, since it's the simplest, most direct way to both restrict and abstract tabular data structures.

We also noted that if necessary, we could provide a TableWrapper type which would inherit AbstractType and could be used to wrap non-standard table types so that they work for dispatch and support the same interface.

What would we win over the design that I have in Query right now that is based on traits?

@davidanthoff
Copy link
Contributor

I've added a StatsModels sink to Query (or will, once tests pass and queryverse/Query.jl#96 is merged).

So in theory you should now also be able to use StatsModels with any iterable table source, if you have Query loaded. This is the same functionality that I had previously already added for the DataFrames modeling stuff, now this just also works for the StatsModels version.

@nalimilan
Copy link
Member

What would we win over the design that I have in Query right now that is based on traits?

Just that it is simpler and more standard, so that it's easier for everybody to understand this design. This is a pretty fundamental piece of the tabular data ecosystem, which will be used by many packages (e.g. Gadfly), so we'd rather not require all users to learn SimpleTraits.jl (whose interface will probably evolve over time, especially if these features are integrated to Base at some point). That doesn't seem to make a big difference for implementations right now, and let's us focus on the hard parts of the design. (Actually even this debate about traits and inheritance is already a distraction from the core issues we need to discuss.)

@davidanthoff
Copy link
Contributor

so we'd rather not require all users to learn SimpleTraits.jl (whose interface will probably evolve over time, especially if these features are integrated to Base at some point).

SimpleTraits does not introduce any new traits interface, it simply provides some macros that make it easier to implement the holy trait pattern that is used for the whole indexing story for arrays in base. So unless that story gets completely changed again (which seems extremely unlikely), I don't think we would have to worry about this not working at some point.

That doesn't seem to make a big difference for implementations right now

Hm, I'm not sure I agree. If you do something based on inheritance, I don't see how you can for example make a Query query that returns an iterable table play in this system (I can't inherit from something like an AbstractTable for a query result because I'm using inheritance for something else already). That seems really not ideal at all...

and let's us focus on the hard parts of the design

What do you have in mind there? What I have in Query seems to cover a pretty broad range of use-cases and works right now, and the design of that is done and finished and implemented. There are some minor question how that could be distributed over various packages, but that really is just a packaging story, not a design question.

@davidanthoff
Copy link
Contributor

Just a short update: I managed to factor the iterable table design out of Query into its own package: https://github.com/davidanthoff/IterableTables.jl. It is still a bit rough and you need master of SimpleTraits right now, but the difficult things are done. So essentially the design is now that IterableTables has the general tabular data exchange stuff, and Query just has the real query code. The dependency is now that Query depends on IterableTable, and if you just want the tabular data conversion stuff you only need IterableTables, not Query.

I'll try to clean up everything this week, so fingers crossed I'll register the package at the end of the week.

Any feedback or help is of course welcome!

@kleinschmidt
Copy link
Member

Can we close this after #71? That uses the Tables API (although it does convert everything to ColumnTable under the hood so that's not idea for truly streaming data, but perhaps that deserves its own issue though)

@palday
Copy link
Member

palday commented Oct 13, 2020

Closing this as Terms 2.0 largely addresses it.

If someone wants to add better support for streaming data / RowTable, that can be a new issue.

@palday palday closed this as completed Oct 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants