
DataFrames seems a little bloated with modeling-specific functionality #1018

Closed
kleinschmidt opened this issue Jul 21, 2016 · 49 comments

@kleinschmidt
Contributor

There's lots of functionality here that's not specific to data frames, like the formula language, contrast coding (if #870 is merged), model matrix construction, etc. Is it time to refactor the modeling-oriented bits out into one or more JuliaStats packages? I'd suggest something like

  • Formulas: model specification DSL
  • Contrasts: converting categorical data into numerical matrices for modeling, in a way that's agnostic to the underlying data type (e.g., PooledDataArray, CategoricalArray, etc.)
  • DataFramesModels: ModelFrame, ModelMatrix, DataFrameRegressionModel, etc. types.

Alternatively, we could just have one big DataFramesModels package that has all the modeling stuff that's currently in DataFrames.jl. It's not immediately clear to me how to cleanly separate the ModelMatrix and ModelFrame logic from the formula and contrasts logic, but that might be doable.
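For concreteness, the pipeline being discussed looks roughly like this. This is only a sketch: `@formula` is the spelling the DSL later took on in StatsModels; at the time of this thread, `ModelFrame` and `ModelMatrix` lived in DataFrames.jl itself.

```julia
# Sketch only: @formula is the later StatsModels spelling of the DSL;
# when this thread was written these types lived in DataFrames.jl.
using DataFrames, StatsModels

df = DataFrame(y = [1.0, 2.0, 3.0, 4.0],
               x = [1.5, 2.5, 3.5, 4.5],
               g = ["a", "b", "a", "b"])

f  = @formula(y ~ 1 + x + g)  # "Formulas": the model-specification DSL
mf = ModelFrame(f, df)        # the formula bound to a tabular data source
mm = ModelMatrix(mf)          # "Contrasts" turn g into numeric columns
mm.m                          # a plain numeric matrix, ready for regression
```

The proposed split corresponds to the three steps above: the formula DSL, the contrast coding, and the table-to-matrix machinery.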

@kleinschmidt kleinschmidt changed the title DataFrames seems a little bloated. DataFrames seems a little bloated with modeling-specific functionality Jul 21, 2016
@nalimilan
Member

Let's move these to StatsBase?

@ararslan
Member

Then wouldn't StatsBase have a dependency on DataFrames? That seems a little weird to me; IMO StatsBase should be agnostic of how the underlying data is stored. Unless I'm misunderstanding you, @nalimilan.

(Welcome back from vacation, by the way! ☀️)

@quinnj
Member

quinnj commented Jul 21, 2016

What we should really do is build the Formulas, Contrasts, and Models stuff to all be based on an AbstractDataFrame type (or the as-yet-unannounced AbstractTable).

With all the "AbstractTable" code able to live on its own, it makes things like Formulas/Models much easier to split out, because they only have a dependency on AbstractTable (small, simple definitions of the Table interface) instead of all of DataFrames (which would be a full implementation of the AbstractTable interface).

@ararslan
Member

So if someone has data stored in an actual matrix, as in Array{Whatever,2}, they'll have to convert their data to a table type for use in modeling? Hm, I guess that does make sense, otherwise you don't have a clear way of referring to specific columns in the data for modeling purposes. (Well, you do; you have the column's position. But specifying a model in terms of matrix column positions sounds like a disaster.)

@ararslan
Member

Btw @quinnj, I'd love to help out with JuliaData stuff. 😄

@kleinschmidt
Contributor Author

If someone already has data stored in a Matrix, they don't need any of the formula machinery to use it in modeling. The essential function of the modeling bits of DataFrames.jl is to convert data in tabular form to a matrix suitable for, e.g., regression.

(please pardon my thumb-typing)


@kleinschmidt
Contributor Author

@quinnj I agree. I think these packages should provide a backend-agnostic modeling API.

@johnmyleswhite
Contributor

Really glad to see people trying to think about this from a data-source agnostic approach. This is probably obvious, but moving to a backend-agnostic API will eventually create some tensions between categorical arrays and working with data sources that have no representation of categorical data (which covers most SQL databases in their practical use in my experience).

@davidagold

davidagold commented Aug 3, 2016

Bumping this discussion now that #870 is merged and since interest in revising the interface for tabular data types seems to be at a high.

EDIT: I'm still working through modeling logic and don't have strong opinions yet, but my initial sense is that it would be handy to have things break down according to a StatsBase, AbstractTables, DataFrames, StatsModels package structure, where the latter includes ModelMatrix, Formula and contrasts logic.

@datnamer

datnamer commented Aug 3, 2016

@tbreloff is working on some general learning stuff in JuliaML. Pinging him here in the event he has any opinions on this, since the goal is to have abstractions that subsume both conventional stats and machine learning.

@tbreloff

tbreloff commented Aug 3, 2016

+1 to slimming down DataFrames in a major way. -1000000 to adding all the modeling stuff into StatsBase.

I really like the AbstractTable concept, and I hope everyone can start getting behind @quinnj's efforts there. DataFrames should be just one implementation of an AbstractTable, and all that modeling stuff should be based on AbstractTable (and it should be in a separate package).

@tbreloff is working on some general learning stuff in Juliaml.

I'm not the only one!! But yes we're actively working on experimental designs for general learning tools. If the project is a success I'll try to convince everyone to switch their workflow, but until then, carry on.

@kleinschmidt
Contributor Author

I think the modeling stuff that's currently in DataFrames is best thought of as transforming tabular data into a matrix-like format that's suitable for ingesting into models. As such, building it as a separate package that depends on an AbstractTable interface seems like the right way to go. @quinnj, have you made any progress on defining such an interface (even just for data frames)?

@quinnj
Member

quinnj commented Aug 3, 2016

The AbstractTable progress has been coming along, though somewhat informally. I'm almost done with another round of updates to the DataStreams framework, where Sources and Sinks are actually decoupled, to avoid the combinatorial explosion of required Data.stream! methods for new Sources/Sinks. Once the update is done, a new Source/Sink will just have to "register" the kind of streaming it supports (row/field-based, and/or column-based) and it will get the rest of the DataStreams ecosystem for free (and fast!). Right now, this interface work is happening in DataStreams, but my plan is to move some of that abstraction into the AbstractTables.jl package that could then be depended on downstream (and improved upon).

@davidagold

Once the update is done, a new Source/Sink will just have to "register" the kind of streaming it supports (row/field-based, and/or column-based) and it will get the rest of the DataStreams ecosystem for free (and fast!).

GitHub needs a "yaaaas" reaction.

@datnamer

datnamer commented Aug 5, 2016

So what would this transformation mean for an out-of-core DB? Chunks? A Dagger lazy array?

Out of curiosity, have you done any benchmarks, @quinnj? Would Julia be up there with Go for this kind of ETL stuff?

@kleinschmidt
Contributor Author

I think the first priority should be to port the existing modeling stuff in DataFrames to the DataStreams/AbstractTable interface, but still producing an in-memory model matrix. Ultimately I think it should be possible to create a streaming model matrix replacement (e.g., a Source/Sink) that can stream rows/columns/chunks (including out of core). But that depends first on decoupling the formulas/ModelMatrix stuff from DataFrames.

@davidagold

Is the idea that ModelMatrix or the like would be a viable type of Sink, with the logic for filling it in encoded in a Formula? And the creation of a ModelMatrix would fall back on Data.Stream?


@kleinschmidt
Contributor Author

Yes, I think that's one possibility. Especially since, as far as I know, a DataFrame can be seen as a Source. The central logic for constructing a model matrix at the moment involves iterating over columns of the data frame (which is one of the streaming modalities).

@johnmyleswhite
Contributor

Given the discussions in #1025, I think we might want to consider the possibility of moving towards a tuple-based model. The model matrix transformation will need some column-level invariants, but it should be possible to formulate a plan based on those invariants that can be applied row-by-row given a source of tuples.

@kleinschmidt
Contributor Author

Yes, I think that's right. All you really need at the column level (if I understand correctly) is the type and (for categorical data) the levels.
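As a sketch of that idea (the names and helper functions here are illustrative, not an existing API): knowing only each column's type, plus the levels of any categorical column, is enough of a "plan" to map rows to model-matrix rows one at a time.

```julia
# Illustrative sketch, not an existing API: the only column-level
# invariant needed is the set of levels for the categorical column g.
levels_g = ["a", "b", "c"]

# dummy-code one value against the known levels (first level as reference)
dummy_code(x, levels) = Float64[x == l for l in levels[2:end]]

# map a single row (here a named tuple) to one row of the model matrix
row_to_modelrow(row) = vcat(1.0, row.x, dummy_code(row.g, levels_g))

rows = [(x = 1.0, g = "b"), (x = 2.0, g = "c")]
M = reduce(vcat, (row_to_modelrow(r)' for r in rows))
# each row of M: [intercept, x, g == "b", g == "c"]
```

Because `row_to_modelrow` never looks at the whole table, the same plan could be applied to a stream of tuples from any source.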

@kleinschmidt
Contributor Author

Given that we're aiming for something that generalizes to other tabular-like data stores, what about calling the top-level package TableModels.jl?

@nalimilan
Member

I guess the question is whether we also want to move general definitions like AIC and BIC from StatsBase to the new package or not. I would think centralizing all modeling functions in a single package would be a good idea.

@kleinschmidt
Contributor Author

That seems like a good compromise to me (vs. putting all the table-models stuff in StatsBase). The idea is that GLM.jl etc. would then re-export these?

@davidagold

Given that we're aiming for something that generalizes to other tabular-like data stores, what about calling the top-level package TableModels.jl?

I will plug for StatsModels.jl.

@nalimilan
Member

Then the idea is that GLM.jl etc. would then re-export these?

Yes.

StatsModels also sounds more standard to me than TableModels, which could be understood as a special class of models at first.

@kleinschmidt
Contributor Author

kleinschmidt commented Oct 14, 2016

I've taken a very crude first stab at pulling all the modeling-related code out and putting it in a StatsModels.jl package, which passes all the relevant tests from DataFrames.jl; I've also confirmed that the remaining DataFrames.jl tests still pass. Since I've based it on the master branch, I can't figure out how to get the tests to pass on Travis.

Still need to see about the modeling-related StatsBase stuff, too, and documentation.

@ararslan
Member

Awesome, thanks for taking the initiative, @kleinschmidt! For Travis, you can modify the YAML to do Pkg.checkout("DataFrames", "master"). You're also welcome to transfer that to JuliaStats if you'd like.
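A hypothetical `.travis.yml` fragment along those lines (Julia 0.5-era `Pkg` API; the package name here is assumed):

```yaml
# hypothetical: check out DataFrames master before testing StatsModels
script:
  - julia -e 'Pkg.clone(pwd()); Pkg.checkout("DataFrames", "master");
              Pkg.test("StatsModels")'
```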

@nalimilan
Member

Cool. Though please preserve git history, it shouldn't be too hard. See for example http://gbayer.com/development/moving-files-from-one-git-repository-to-another-preserving-history/. Then moving it to JuliaStats sounds logical.

@kleinschmidt
Contributor Author

Preserving history is a good idea. It gets slightly tricky to do for both the tests and the src (since the tests are not in their own subdirectory). But it should at least be easy to preserve the src/statsmodels history (although it doesn't look to me like that approach handles renames and reorganizations of files in that directory...)

@kleinschmidt
Contributor Author

(On the off chance this will help anyone in the future, this SO answer is a good way to filter history for any arbitrary subset of files/subfolders.)
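A self-contained toy demo of that filtering technique, using `git filter-branch` with an `--index-filter` in the spirit of the linked answer. The repo layout and grep patterns here are illustrative stand-ins, not the actual DataFrames.jl layout:

```shell
# Toy repo standing in for DataFrames.jl; the grep patterns decide
# which files' history is kept (everything else is rewritten away).
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
rm -rf /tmp/histdemo && mkdir -p /tmp/histdemo && cd /tmp/histdemo
git init -q .
git config user.email demo@example.com
git config user.name demo
mkdir -p src/statsmodels test
echo "formula code"   > src/statsmodels/formula.jl
echo "dataframe code" > src/dataframe.jl
echo "formula tests"  > test/formula.jl
git add -A && git commit -qm "initial"

# Keep only the modeling files, rewriting all refs' history:
git filter-branch --prune-empty --index-filter '
  git ls-files -s | grep -vE "src/statsmodels/|test/formula" \
    | cut -f2- | xargs -r git rm --cached --ignore-unmatch -q
' -- --all

git reset --hard -q
git ls-files
```

The same pattern, with the real paths substituted in, filters an arbitrary subset of files/subfolders into the new repo while keeping their commit history.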

@kleinschmidt
Contributor Author

I've updated that repo with the history, and the tests pass (at least on linux, still waiting on the mac builds). Transferring ownership to JuliaStats sounds good now that things are reasonably stable. What's the procedure for that?

@ararslan
Member

Repo settings -> Transfer ownership

@davidagold

Perhaps we should start a roadmap issue for systematically decoupling functionality from the DataFrame type? Time to dust off the ol' Roadmap.jl repo??

@nalimilan
Member

Would be nice to add StatsModels to the JuliaStats webpage.

@ararslan
Member

I agree that it would be good to have it on the website, but maybe after it's registered?

@Evizero
Contributor

Evizero commented Oct 19, 2016

Just wanted to say that I also love the AbstractTable idea and hope it hasn't disappeared in the meantime. I would love to support DataFrames in the new MLDataUtils refactor for eachobs, eachbatch, splitobs, etc., but am a bit hesitant to require the full DataFrames.jl repo, since all I would need is nrow, getindex, and a type to dispatch on.

@nalimilan
Member

@Evizero "Meanwhile" what? We're rather making progress on that front. I think the plan is to replace JuliaData/AbstractTables with davidagold/AbstractTables.jl soon.

@tbreloff

I think he's just saying exactly what I'm thinking, which is that we can't wait for tabular data to be as flexible and extensible as AbstractArray. Looking forward to the results of AbstractTables!


@davidanthoff
Contributor

I'd like to suggest another, super simple interface for tabular data: an iterator of immutables, where the convention is that you iterate through the rows of a table. Each row would be represented by an immutable type. This kind of interface would not require any abstract base types; it can work purely in terms of existing Julia Base conventions.

A function that consumes such an iterable and wants to explore the schema of the data source can simply use the following standard Julia methods:

  • to get the names of the columns use fieldnames(eltype(iter)).
  • to get the types of the columns use eltype(iter).types.
  • to get the number of columns use length(eltype(iter).types) or length(fieldnames(eltype(iter)))
  • to get the number of rows use length(iter) (but check first whether that is supported by using the usual julia base methods)
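In modern syntax (the thread predates `struct`; `immutable` was the keyword at the time), those four inspections look like this toy sketch:

```julia
# Sketch: a toy "table" as an iterator of immutables, inspected with
# nothing but Base reflection (modern `struct` syntax shown).
struct Row
    a::Int
    b::Float64
end

iter = [Row(1, 2.0), Row(3, 4.0)]

colnames = fieldnames(eltype(iter))          # (:a, :b)
coltypes = eltype(iter).types                # the columns' element types
ncols    = length(fieldnames(eltype(iter)))  # 2
nrows    = length(iter)                      # 2, when the iterator supports it
```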

We could of course define some helper functions that wrap these, but they would be helpers, not part of the interface contract. It might also make sense to define a SimpleTrait that indicates that a type supports this interface, so that one can dispatch on it.

This kind of interface is pervasive in Query.jl, so if for example ModelFrame could work with this interface, there would be a really, really nice integration between the query framework and the model estimation framework. I stumbled over this yesterday when trying to do everything in this chapter using Query.jl, and literally the only thing that is missing right now is the ability to pass an iterable of immutables to something like lm in GLM.jl to make everything work (and I actually think in a more natural syntax than the R original).

This kind of interface might be in addition to something like AbstractTables, which might provide additional benefits. But it would be really great if the most simple interface in this universe did not require types to inherit from some base type, because that doesn't square at all with e.g. the design of Query.jl.

@quinnj
Member

quinnj commented Oct 19, 2016

I don't think the idea was ever to require types to subtype a certain abstract type. AbstractTable would indeed be an abstract type, but interface methods would be carefully defined around it so that as long as types implement the required interface methods, they'd be able to participate in any provided interface functionality. I think David Gold has been making some great progress on the required interface methods in his AbstractTables.jl package and it'd be great to coordinate with Query.jl as well. I think the kind of RowIterator interface you're talking about would definitely fit with what David's already put together.

@davidanthoff
Contributor

I'm suggesting we don't define new functions for this most basic interface, but instead just rely on what is in base already, i.e. plug into the existing iterator and type interface in julia. The major benefit would be that any iterator in base is automatically a data source and could e.g. be passed to ModelFrame or plot etc. If we define new functions for this interface, like the ones in davidagold/AbstractTables.jl, then we would have to implement these new functions for all the iterators in base for them to work in this framework. That seems unnecessary, right?

@quinnj
Member

quinnj commented Oct 19, 2016

No, in my mind we would make something like the helper functions you mentioned above a part of the "official" AbstractTable interface. Downstream packages would then code to the AbstractTable interface. I don't think it would require re-implementing anything for all iterators anyway; if needed, we define the helper functions once, taking any iterable, and then it's good to go.
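A sketch of such once-defined helpers (the names are hypothetical), written against any iterator without requiring a subtype relationship; modern named tuples stand in for the immutables here:

```julia
# Hypothetical helper names; defined once, they work for any iterator
# whose eltype is a concrete immutable (or named tuple).
table_colnames(itr) = collect(fieldnames(eltype(itr)))
table_coltypes(itr) = collect(eltype(itr).types)
table_ncol(itr)     = length(eltype(itr).types)

rows = [(x = 1, y = "a"), (x = 2, y = "b")]
table_colnames(rows)  # [:x, :y]
table_ncol(rows)      # 2
```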

@davidanthoff
Contributor

Ah, yes, I don't mind if we make the helper functions the API for clients. But it would be great if sources don't even have to depend on AbstractTable for everything to work.

@ararslan
Member

I'm with @quinnj on this one. I think packages that want to support tabular data formats should write things in terms of AbstractTables, then DataFrames and whatever else will "just work" when plugged in. While I understand the appeal, I don't think we should have a table type masquerading as something from Base, or bloat Base with something specific to tables. I think the use of a tabular data structure should be explicit and having it come from a package seems 👌 to me.

@davidanthoff
Contributor

My proposal would add nothing to Base. All I'm suggesting is that the most basic way we think about a table is as an iterator of named tuples. For that interface there is no need to define any new methods or types; one can just use the standard methods in Base to inquire about the complete schema of a data source.

@ararslan
Member

Oh sorry, I misunderstood.

@kleinschmidt
Contributor Author

@davidanthoff, making ModelMatrix work for something like an iterator of named tuples is on my TODO list for StatsModels.

@davidagold

I agree with David that a subtype declaration is overly restrictive, and that an interface contract would be more appropriate. I also agree that an iterator over row-like objects should be able to satisfy this interface, hence the interface oughtn't to require anything over and above what such an iterator can provide. Indeed, the most basic AbstractTable interface should really just allow you to extract schema information from a "table". Here's some (constructive, I hope) criticism of the proposal:

  • The interface David suggests essentially requires that the immutables be named tuples. This seems overly restrictive. It seems as though one ought to be able to satisfy the interface contract with an iterator over plain tuples that stores the field -> column index mapping in the iterator itself.
  • The selector methods David describes are kind of verbose and unclear. For instance, it seems preferable to just be able to do ncol(itr) as opposed to length(eltype(iter).types).
  • Immutable-returning iteration perhaps shouldn't be part of the most basic tabular interface, since this would preclude a table type that just wraps a database connection.
  • An abstract AbstractTable type is useful for hooking into generic functionality that relies on dispatch, e.g. show.

I think that AbstractTables.jl would be an appropriate place to formalize and document this interface contract and house the AbstractTable type, for whatever the latter may be useful for. I also realize I house backend support for SQ there. I'm happy to move that code elsewhere and make the AbstractTables package more neutral.

@ararslan
Member

ararslan commented Sep 7, 2017

DataFrames master no longer has modeling functionality; that's been moved to StatsModels.


10 participants