DataFrames seems a little bloated with modeling-specific functionality #1018
Comments
Let's move these to StatsBase? |
Then wouldn't StatsBase have a dependency on DataFrames? That seems a little weird to me; IMO StatsBase should be agnostic of how the underlying data is stored. Unless I'm misunderstanding you, @nalimilan. (Welcome back from vacation, by the way! ☀️) |
What we should really do is build the Formulas, Contrasts, and Models stuff to all be based on an AbstractDataFrame type (or, as yet unannounced, AbstractTable). With all the "AbstractTable" code able to live on its own, it makes things like Formulas/Models much easier to split out because they only have a dependency on the AbstractTable (small, simple definitions of the Table interface) instead of all of DataFrames (which would be a full implementation of the AbstractTable interface). |
So if someone has data stored in an actual matrix, as in |
Btw @quinnj, I'd love to help out with JuliaData stuff. 😄 |
If someone already has data stored in a Matrix, they don't need any of the formulas etc. stuff to use it in modeling. The essential function of the modeling bits of DataFrames.jl is to convert data in a tabular form to a matrix suitable for, e.g., regression.
|
@quinnj I agree. I think these packages should provide a backend-agnostic modeling API. |
Really glad to see people trying to think about this from a data-source agnostic approach. This is probably obvious, but moving to a backend-agnostic API will eventually create some tensions between categorical arrays and working with data sources that have no representation of categorical data (which covers most SQL databases in their practical use in my experience). |
Bumping this discussion now that #870 is merged and since interest in revising the interface for tabular data types seems to be at a high. EDIT: I'm still working through modeling logic and don't have strong opinions yet, but my initial sense is that it would be handy to have things break down according to a StatsBase, AbstractTables, DataFrames, StatsModels package structure, where the latter includes |
@tbreloff is working on some general learning stuff in Juliaml. Pinging him here in the event he has any opinions on this, since the goal is to have abstractions that subsume both conventional stats and machine learning. |
+1 to slimming down DataFrames in a major way. -1000000 to adding all the modeling stuff into StatsBase. I really like the AbstractTable concept, and I hope everyone can start getting behind @quinnj's efforts there. DataFrames should be just one implementation of an AbstractTable, and all that modeling stuff should be based on AbstractTable (and it should be in a separate package).
I'm not the only one!! But yes we're actively working on experimental designs for general learning tools. If the project is a success I'll try to convince everyone to switch their workflow, but until then, carry on. |
I think the modeling stuff that's currently in DataFrames is best thought of as transforming tabular data into a matrix-like format that's suitable for ingesting into models. As such, building it as a separate package that depends on an |
The AbstractTable progress has been coming along, though somewhat informally. I'm almost done with another round of updates to the DataStreams framework, where Sources and Sinks are actually decoupled; to avoid the combinatorial explosion of required |
GitHub needs a "yaaaas" reaction. |
So what would this transformation mean to an out of core db? Chunks? Dagger lazyarray? Out of curiosity, have you done any benchmarks @quinnj ? Would julia be up there with go for this kind of etl stuff? |
I think the first priority should be to port the existing modeling stuff in |
Is the idea that ModelMatrix or some such would be a viable type of Sink?
|
Yes, I think that's one possibility. Especially since as far as I know |
Given the discussions in #1025, I think we might want to consider the possibility of moving towards a tuple-based model. The model matrix transformation will need some column-level invariants, but it should be possible to formulate a plan based on those invariants that can be applied row-by-row given a source of tuples. |
Yes, I think that's right. All you really need at the column level (if I understand correctly) is the type and (for categorical data) the levels. |
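As a rough illustration of the plan discussed in the two comments above, the sketch below first collects the column-level invariants (numeric vs. categorical, plus the levels) and then encodes one row at a time from a source of named tuples. The names `plan_columns` and `encode_row` are hypothetical and not part of DataFrames.jl or StatsModels.jl, dummy coding is just one possible contrast choice, and the code assumes a Julia version with built-in named tuples.

```julia
# Hypothetical sketch; `plan_columns` and `encode_row` are illustrative names only.

# Pass 1: column-level invariants. For each column, store `nothing` if it is
# numeric, or the sorted unique levels if it is treated as categorical.
function plan_columns(rows)
    first_row = first(rows)
    map(keys(first_row)) do name
        if first_row[name] isa Number
            name => nothing
        else
            name => sort(unique(row[name] for row in rows))
        end
    end
end

# Pass 2: encode a single row using only the plan, so rows can be streamed.
function encode_row(row, plan)
    out = Float64[1.0]                      # intercept term
    for (name, levels) in plan
        value = row[name]
        if levels === nothing
            push!(out, float(value))        # numeric column: copy the value
        else
            # Dummy coding: the first level is the reference and gets no column.
            for level in levels[2:end]
                push!(out, value == level ? 1.0 : 0.0)
            end
        end
    end
    return out
end

rows = [(x = 1.5, g = "a"), (x = 2.0, g = "b"), (x = 0.5, g = "c")]
plan = plan_columns(rows)

# Stack the encoded rows into a 3×4 matrix: intercept, x, g == "b", g == "c".
X = permutedims(reduce(hcat, [encode_row(r, plan) for r in rows]))
```

Only `plan_columns` needs to see the whole column; `encode_row` touches one tuple at a time, which is what would let the second pass run over a streaming source.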
Given that we're aiming for something that generalizes to other tabular-like data stores, what about calling the top-level package TableModels.jl? |
I guess the question is whether we also want to move general definitions like |
That seems like a good compromise to me (vs. putting all the table models stuff in |
I will plug for StatsModels.jl. |
Yes. StatsModels also sounds more standard to me than TableModels, which could be understood as a special class of models at first. |
I've taken a very crude first stab at pulling all the modeling-related code out, and putting it in a StatsModels.jl package, which passes all the tests from DataFrames.jl, and confirmed that the remaining DataFrames.jl tests pass there, too. Since I've based it on the master branch I can't figure out how to get the tests to pass on travis. Still need to see about the modeling-related StatsBase stuff, too, and documentation. |
Awesome, thanks for taking the initiative, @kleinschmidt! For Travis, you can modify the YAML to do |
Cool. Though please preserve git history, it shouldn't be too hard. See for example http://gbayer.com/development/moving-files-from-one-git-repository-to-another-preserving-history/. Then moving it to JuliaStats sounds logical. |
Preserving history is a good idea. It gets slightly tricky to do it for both the tests and the src (since the tests are not in their own subdirectory). But at least it should be easy to preserve the |
(On the off chance this will help anyone in the future, this SO answer is a good way to filter history for any arbitrary subset of files/subfolders.) |
I've updated that repo with the history, and the tests pass (at least on linux, still waiting on the mac builds). Transferring ownership to JuliaStats sounds good now that things are reasonably stable. What's the procedure for that? |
Repo settings -> Transfer ownership |
Perhaps we should start a roadmap issue for systematically decoupling functionality from the |
Would be nice to add StatsModels to the JuliaStats webpage. |
I agree that it would be good to have it on the website, but maybe after it's registered? |
Just wanted to say that I also love the AbstractTable idea and hope it hasn't disappeared |
@Evizero "Meanwhile" what? We're rather making progress on that front. I think the plan is to replace JuliaData/AbstractTables with davidagold/AbstractTables.jl soon. |
I think he's just saying exactly what I'm thinking, which is that we can't
|
I'd like to suggest another, super simple interface for tabular data: an iterator of immutables, where the convention is that you iterate through the rows of a table. Each row would be represented by an immutable type. This kind of interface would not require any abstract base types, it can purely work based on utilizing existing julia base conventions. A function that consumes such an iterable and wants to explore the schema of the data source can simply use the following standard julia methods for doing so:
We could of course define some helper functions that wrap these, but they would be helpers, not part of the interface contract. It might also make sense to define a SimpleTrait that indicates that a type supports this interface, so that one can dispatch on it. This kind of interface is pervasive in Query.jl, so if for example This kind of interface might be in addition to something like |
I don't think the idea was ever to require types to subtype a certain abstract type. AbstractTable would indeed be an abstract type, but interface methods would be carefully defined around it so that as long as types implement the required interface methods, they'd be able to participate in any provided interface functionality. I think David Gold has been making some great progress on the required interface methods in his AbstractTables.jl package and it'd be great to coordinate with Query.jl as well. I think the kind of RowIterator interface you're talking about would definitely fit with what David's already put together. |
I'm suggesting we don't define new functions for this most basic interface, but instead just rely on what is in base already, i.e. plug into the existing iterator and type interface in julia. The major benefit would be that any iterator in base is automatically a data source and could e.g. be passed to |
No, in my mind, we would make something like the helper functions you mentioned above a part of the "official" AbstractTable interface. Downstream packages will then code to the AbstractTable interface. I don't think it would require re-implementing anything for all iterators anyway; if needed, we define the helper functions once that take any iterable and then it's good to go. |
Ah, yes, I don't mind if we make the helper functions the API for clients. But it would be great if sources don't even have to depend on AbstractTable for everything to work. |
I'm with @quinnj on this one. I think packages that want to support tabular data formats should write things in terms of AbstractTables, then DataFrames and whatever else will "just work" when plugged in. While I understand the appeal, I don't think we should have a table type masquerading as something from Base, or bloat Base with something specific to tables. I think the use of a tabular data structure should be explicit and having it come from a package seems 👌 to me. |
My proposal would add nothing to base. All I'm suggesting is that the most basic way we think about a table is as an iterator of named tuples. For that interface, there is no need to define any new methods or types; one can just use the standard methods in base to inquire about the complete schema of a data source. |
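A minimal sketch of what querying such a source might look like. The exact Base methods the proposal had in mind are not preserved in this thread, so the choice of `eltype`, `fieldnames`, and `fieldtypes` here is an assumption, as is the use of a Julia version where named tuples are built in.

```julia
# Assumed example data: any iterator whose element type is a concrete
# NamedTuple type would do; a plain Vector is used here for simplicity.
rows = [(name = "a", value = 1.0), (name = "b", value = 2.5)]

row_type  = eltype(rows)            # the NamedTuple type of each row
col_names = fieldnames(row_type)    # column names as a tuple of Symbols
col_types = fieldtypes(row_type)    # column element types as a tuple

# Consuming the data is ordinary iteration; fields are accessed by name.
for row in rows
    println(row.name, " => ", row.value)
end
```

Nothing in this sketch is table-specific: the source needs no abstract supertype and no package dependency, which is the point the proposal is making.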
Oh sorry, I misunderstood. |
@davidanthoff, making |
I agree with David that a subtype declaration is overly restrictive, and that an interface contract would be more appropriate. I also agree that an iterator over row-like objects should be able to satisfy this interface, hence the interface oughtn't to require anything over and above what such an iterator can provide. Indeed, the most basic AbstractTable interface should really just allow you to extract schema information from a "table". Here's some (constructive, I hope) criticism of the proposal:
I think that AbstractTables.jl would be an appropriate place to formalize and document this interface contract and house the |
DataFrames master no longer has modeling functionality; that's been moved to StatsModels. |
There's lots of functionality here that's not specific to data frames, like the formula language, contrast coding (if #870 is merged), model matrix construction, etc. Is it time to refactor the modeling-oriented bits out into one or more JuliaStats packages? I'd suggest something like
ModelFrame, ModelMatrix, DataFrameRegressionModel, etc. types.
Alternatively, we could just have one big DataFramesModels package that has all the modeling stuff that's currently in DataFrames.jl. It's not immediately clear to me how to cleanly separate the ModelMatrix and ModelFrame logic from the formula and contrasts logic, but that might be doable.