WIP: First steps in defining a common interface for Tables/Frames #16

andreasnoack · 2017-02-24T04:17:26Z

There is almost nothing here yet. I just wanted to open the PR so that we have a place to discuss the details of #14. This will require a new package with the common interface. So far I've named it TableBase. Both DataFrames and DataTables should then become subtypes of this new Table, and some subset of the functions in the two packages should be defined as erroring methods in TableBase and overloaded in DataFrames and DataTables.
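The "erroring methods overloaded by each backend" pattern could look roughly like the sketch below. This is only an illustration in current Julia 1.x syntax: `Table` is the name used in this PR, but `nrows`, `colnames`, `getcolumn`, `ToyColumnTable`, and `describe_shape` are hypothetical names, not the actual TableBase API.

```julia
# Hypothetical sketch of a common table interface; not the real TableBase code.

# Common supertype that concrete table types would subtype.
abstract type Table end

# Generic functions with erroring fallbacks. A backend that forgets to
# overload one of them gets an informative error instead of a MethodError.
nrows(t::Table) = error("nrows is not implemented for $(typeof(t))")
colnames(t::Table) = error("colnames is not implemented for $(typeof(t))")
getcolumn(t::Table, name::Symbol) = error("getcolumn is not implemented for $(typeof(t))")

# A toy backend standing in for DataFrames or DataTables.
struct ToyColumnTable <: Table
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

nrows(t::ToyColumnTable) = isempty(t.columns) ? 0 : length(t.columns[1])
colnames(t::ToyColumnTable) = t.names
getcolumn(t::ToyColumnTable, name::Symbol) =
    t.columns[findfirst(isequal(name), t.names)]

# Downstream code is written against the interface only, so it never
# touches the fields of a concrete implementation.
describe_shape(t::Table) = (nrows(t), length(colnames(t)))

toy = ToyColumnTable([:x, :y], AbstractVector[[1, 2, 3], ["a", "b", "c"]])
shape = describe_shape(toy)
```

Here `describe_shape` plays the role of a function in this package: it depends on `Table` and the generic functions, not on DataTables.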

nalimilan · 2017-02-24T14:54:32Z

Why not call this AbstractTable in the end? I find it more explicit, and that's the name we've been using for a long time now.

andreasnoack · 2017-02-24T15:10:48Z

That was also my first idea, but I realized that @davidagold had developed AbstractTable in https://github.com/davidagold/AbstractTables.jl with a dependency on NullableArrays and StructuredQueries, which seemed different from the XBase-like package needed here. If I'm wrong and the dependency on NullableArrays isn't critical, then it is just a search-and-replace change. The main task for now is to figure out how to write the methods so that they avoid exploiting details of concrete implementations.

kleinschmidt · 2017-02-24T15:11:54Z

docs/src/formula.md

@@ -14,7 +14,7 @@ fields with possibly heterogeneous types.  One of the primary goals of
 `StatsModels` is to make it simpler to transform tabular data into matrix format
 suitable for statistical modeling.

-At the moment, "tabular data" means an `AbstractDataTable`.  Ultimately, the
+At the moment, "tabular data" means an `Table`.  Ultimately, the


a Table, not an

...or change Table to AbstractTable. :-)

nalimilan · 2017-02-24T15:18:37Z

We would need to ask David, but I don't think we want to provide different table abstractions. And there's only one use of NullableArray in the src/ dir. The described interface is very general (and he said it wasn't completely settled yet).

daemonomania · 2017-02-25T20:32:55Z

So, this is David, who has permanently locked himself out of his davidagold GH account. Alas.

I think it should be fine to remove those dependencies from AbstractTables. I'd be hesitant, however, to start changing all the method signatures in this PR from DataTable to (Abstract)Table, since most of them do make assumptions about the underlying storage pattern. It will probably be more useful, as we think about abstractions, to have DataTable in these signatures so it's immediately obvious which methods make such assumptions, and we're not trying to recover that information each time we look at the method.

andreasnoack · 2017-02-25T20:41:38Z

The idea here is to remove DataTables as a dependency, which requires that DataTable is not in the signature. If methods in this package require a specific storage, then in the longer run we should try to loosen such assumptions. In the short run, we'd have to make these assumptions for the abstract type. If that is too strong for AbstractTable, then it could either be a subtype of AbstractTable or a completely new type.

daemonomania · 2017-02-25T20:46:17Z

Seems like traits would be useful here.
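As a rough illustration of how traits could record storage assumptions without putting a concrete type in the signature, here is a minimal "Holy traits" sketch in current Julia 1.x syntax. Every name in it (`StorageStyle`, `ColumnAccess`, `CellAccess`, `VecTable`, `RowTable`, `column`) is hypothetical and not taken from any of the packages discussed.

```julia
# Hypothetical trait describing how a table stores its data.
abstract type StorageStyle end
struct ColumnAccess <: StorageStyle end   # columns are available as vectors
struct CellAccess <: StorageStyle end     # only cell-by-cell access is assumed

# Conservative default: assume nothing about the storage.
StorageStyle(::Type) = CellAccess()

# A column-oriented toy table opts in to the stronger trait.
struct VecTable
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end
StorageStyle(::Type{VecTable}) = ColumnAccess()
rawcolumn(t::VecTable, j::Integer) = t.columns[j]
nrows(t::VecTable) = length(t.columns[1])
getcell(t::VecTable, i::Integer, j::Integer) = t.columns[j][i]

# A row-oriented toy table keeps the default CellAccess trait.
struct RowTable
    names::Vector{Symbol}
    rows::Vector{Vector{Any}}
end
nrows(t::RowTable) = length(t.rows)
getcell(t::RowTable, i::Integer, j::Integer) = t.rows[i][j]

# The public entry point dispatches on the trait, so the storage
# assumption is visible in the method signature.
column(t, j::Integer) = column(StorageStyle(typeof(t)), t, j)
column(::ColumnAccess, t, j::Integer) = rawcolumn(t, j)   # no copy
column(::CellAccess, t, j::Integer) = [getcell(t, i, j) for i in 1:nrows(t)]

vt = VecTable([:a, :b], AbstractVector[[1, 2], [3.0, 4.0]])
rt = RowTable([:a, :b], [Any[1, 3.0], Any[2, 4.0]])
```

With this layout, methods that need column vectors take the `ColumnAccess` branch, and tables that cannot provide them still work through the slower `CellAccess` branch.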

quinnj · 2017-03-09T05:57:17Z

@andreasnoack, where is the current TableBase code? I'm starting to actively compile the various "abstracttable" attempts into a single package (taking parts of DataStreams, the AbstractDataFrame code, David's AbstractTables as well as his Relations.jl package).

For everyone else, I think the biggest thing that would help as we try to converge on a single "AbstractTable" interface is the strongest set of use-cases for an AbstractTable. I'm going to try and dig more into the code here in StatsModels to figure out what exactly it "needs" from an AbstractTable interface, but it'd be great to have other strong use-cases. For me, I'm coming from the context of DataStreams, which requires functionality like getting the "schema" of a table and being able to get/set individual cells, as well as entire columns at a time. I think the ideal goal is to come up with:

The minimum set of required and optional methods a table implementation would need to implement
The "interface" functionality that would be fully provided (and cover all necessary use-cases) once a table has implemented the interface
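To make the DataStreams-style requirements concrete (a schema plus cell-level get/set), here is a small sketch in current Julia 1.x syntax. It is not the DataStreams API: `SchemaSketch`, `MemTable`, `schema`, `getcell`, `setcell!`, `getcolumn`, and `copytable!` are all hypothetical names chosen for illustration.

```julia
# Hypothetical schema: column names, element types, and row count.
struct SchemaSketch
    names::Vector{Symbol}
    types::Vector{DataType}
    rows::Int
end

# Toy in-memory table used as both source and sink.
struct MemTable
    schema::SchemaSketch
    columns::Vector{AbstractVector}
end

# Allocate an uninitialized table that matches a schema.
MemTable(s::SchemaSketch) =
    MemTable(s, AbstractVector[Vector{T}(undef, s.rows) for T in s.types])

# "Required" methods in this sketch.
schema(t::MemTable) = t.schema
getcell(t::MemTable, row::Integer, col::Integer) = t.columns[col][row]
setcell!(t::MemTable, value, row::Integer, col::Integer) =
    (t.columns[col][row] = value)

# "Optional" method: whole-column access for tables that can offer it.
getcolumn(t::MemTable, col::Integer) = t.columns[col]

# Functionality provided for free once the required methods exist:
# copy any source into any sink, one cell at a time.
function copytable!(sink, source)
    s = schema(source)
    for col in 1:length(s.names), row in 1:s.rows
        setcell!(sink, getcell(source, row, col), row, col)
    end
    return sink
end

src = MemTable(SchemaSketch([:id, :name], [Int, String], 2),
               AbstractVector[[1, 2], ["a", "b"]])
dst = copytable!(MemTable(schema(src)), src)
```

`copytable!` never looks inside `MemTable`, so it would work unchanged for any other type that implements the three required methods.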

In terms of a starter list of use-cases to consider, I can think of:

DataStreams: generalizing tabular data IO across storage/transport formats
StatsModels: encoding data + terms + formulas in a representation amenable to various statistics operations
Query/StructuredQueries: performing SQL-like "get" operations on tabular data; selecting, filtering, joining, grouping, etc.
Other data processing packages? I'm thinking like Clustering.jl, which could potentially work directly on any table-type, though I'm less clear on how generally applicable this might be because different processing libraries might require more specific data manipulation routines.

nalimilan · 2017-03-09T10:14:37Z

I think in theory the required interface for StatsModels is quite limited:

get the number and types of the variables
get the number of observations
skip observations with missing values in one of the (selected) variables
optionally get the ordering of levels for categorical variables (and possibly the name of the reference level, so that we don't necessarily assume it's the first one)
access the values of cells one by one (to fill the model matrix); variable/column-wise would be faster but row-wise would work too
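The list above can be sketched as a handful of functions plus a model-matrix builder that uses only them. This is an illustration in current Julia 1.x syntax, not StatsModels code: `ToyTable`, `nobservations`, `varnames`, `vartypes`, `cell`, `completerows`, and `modelmatrix_sketch` are hypothetical names, `missing` stands in for whatever missing-value representation the table uses, and categorical levels are left out for brevity.

```julia
# Toy table used only to exercise the interface.
struct ToyTable
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

# Number and types of the variables, and number of observations.
varnames(t::ToyTable) = t.names
vartypes(t::ToyTable) = [eltype(c) for c in t.columns]
nobservations(t::ToyTable) = length(first(t.columns))

# Cell-by-cell access by row index and variable name.
cell(t::ToyTable, row::Integer, var::Symbol) =
    t.columns[findfirst(isequal(var), t.names)][row]

# Rows with no missing value in any of the selected variables.
completerows(t, vars) =
    [i for i in 1:nobservations(t) if !any(ismissing(cell(t, i, v)) for v in vars)]

# Fill a numeric model matrix (intercept plus one column per variable)
# using cell access only, skipping incomplete rows.
function modelmatrix_sketch(t, vars::Vector{Symbol})
    rows = completerows(t, vars)
    X = Matrix{Float64}(undef, length(rows), length(vars) + 1)
    for (k, i) in enumerate(rows)
        X[k, 1] = 1.0
        for (j, v) in enumerate(vars)
            X[k, j + 1] = cell(t, i, v)
        end
    end
    return X
end

tbl = ToyTable([:x, :z], AbstractVector[[1.0, missing, 3.0], [10, 20, 30]])
X = modelmatrix_sketch(tbl, [:x, :z])
```

The second observation is dropped because `x` is missing there, so `X` has two rows.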

The issue is that the current code makes stronger assumptions about being able to access columns as vectors. I think @kleinschmidt had plans to change this, but it will take some work.

So in the short term I think we should just write an abstraction which specifies that the abstract data table (or whatever we call that particular interface/trait) can be indexed with variable names, and that it returns vectors. We can always remove that requirement later, or provide inefficient fallbacks which would create such vectors for data tables which use a different storage.
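A minimal sketch of that short-term contract, written in current Julia 1.x syntax: indexing by variable name returns a vector, with an inefficient fallback for tables that do not store columns. `AbstractTableSketch`, `rowcount`, `cellvalue`, `DictRowTable`, and `ColumnStore` are hypothetical names used only for this example.

```julia
abstract type AbstractTableSketch end

# Inefficient generic fallback: materialize a new vector from single cells.
# It only needs `rowcount` and `cellvalue` from the concrete type.
function Base.getindex(t::AbstractTableSketch, name::Symbol)
    return [cellvalue(t, i, name) for i in 1:rowcount(t)]
end

# Row-oriented toy table: relies on the fallback.
struct DictRowTable <: AbstractTableSketch
    rows::Vector{Dict{Symbol,Any}}
end
rowcount(t::DictRowTable) = length(t.rows)
cellvalue(t::DictRowTable, i::Integer, name::Symbol) = t.rows[i][name]

# Column-oriented toy table: overrides indexing to return its own storage.
struct ColumnStore <: AbstractTableSketch
    columns::Dict{Symbol,Vector{Float64}}
end
Base.getindex(t::ColumnStore, name::Symbol) = t.columns[name]   # no copy

rowtable = DictRowTable([Dict{Symbol,Any}(:x => 1.0, :y => 2.0),
                         Dict{Symbol,Any}(:x => 3.0, :y => 4.0)])
coltable = ColumnStore(Dict(:x => [1.0, 3.0]))

xs_from_rows = rowtable[:x]   # freshly allocated vector
xs_from_cols = coltable[:x]   # the stored vector itself
```

Code that only ever writes `table[:x]` works with both types; the cost difference stays hidden behind the fallback and can be removed later without changing callers.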
