WIP: First steps in defining a common interface for Tables/Frames #16

Closed
wants to merge 1 commit into from

andreasnoack
Member

There is almost nothing here yet. Just wanted to open the PR so that we have a place to discuss details of #14. This will require a new package with the common interface. So far I've named it TableBase. Both DataFrames and DataTables should then become subtypes of this new Table, and some subset of the functions in the two packages should be defined as erroring methods in TableBase and overloaded in DataFrames and DataTables.
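
For illustration only, the "erroring methods overloaded in concrete packages" idea could look roughly like the sketch below. It is not code from this PR: the function names `nrows`, `ncols`, and `columnnames` and the `ToyTable` type are hypothetical, and the syntax is Julia 1.x (`abstract type`), which postdates this thread.

```julia
# Hypothetical stand-in for what a TableBase-style package might define.
abstract type Table end

# Generic functions with erroring fallbacks; concrete table packages
# would overload these for their own types.
nrows(t::Table) = error("nrows is not implemented for $(typeof(t))")
ncols(t::Table) = error("ncols is not implemented for $(typeof(t))")
columnnames(t::Table) = error("columnnames is not implemented for $(typeof(t))")

# Toy concrete table standing in for DataFrame or DataTable (assumption).
struct ToyTable <: Table
    names::Vector{Symbol}
    columns::Vector{Vector{Float64}}
end

# The concrete package overloads only what it supports.
nrows(t::ToyTable) = isempty(t.columns) ? 0 : length(first(t.columns))
ncols(t::ToyTable) = length(t.columns)
# columnnames is deliberately not overloaded, so the fallback error fires.

toy = ToyTable([:x, :y], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

Calling `nrows(toy)` dispatches to the `ToyTable` method, while `columnnames(toy)` hits the fallback and raises an error naming the concrete type, which makes missing parts of the interface easy to spot.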

@nalimilan
Member

Why not call this AbstractTable in the end? I find it more explicit, and that's the name we've been using for a long time now.

@andreasnoack
Member Author

andreasnoack commented Feb 24, 2017

That was also my first idea, but I realized that @davidagold had developed AbstractTable in https://github.com/davidagold/AbstractTables.jl with a dependency on NullableArrays and StructuredQueries, which seemed different from the XBase-like package needed here. If I'm wrong and the dependency on NullableArrays isn't critical, then it is just a search-and-replace change. The main task for now is to figure out how to write the methods such that they avoid exploiting details of concrete implementations.

@@ -14,7 +14,7 @@ fields with possibly heterogeneous types. One of the primary goals of
`StatsModels` is to make it simpler to transform tabular data into matrix format
suitable for statistical modeling.

-At the moment, "tabular data" means an `AbstractDataTable`. Ultimately, the
+At the moment, "tabular data" means an `Table`. Ultimately, the
Member

a Table, not an

Member

...or change Table to AbstractTable. :-)

@nalimilan
Member

We would need to ask David, but I don't think we want to provide different table abstractions. And there's only one use of NullableArray in the src/ dir. The described interface is very general (and he said it wasn't completely settled yet).

@daemonomania

So, this is David, who has permanently locked himself out of his davidagold GH account. Alas.

I think it should be fine to remove those dependencies from AbstractTables. I'd be hesitant, however, to start changing all the method signatures in this PR from DataTable to (Abstract)Table, since most of them do make assumptions about the underlying storage pattern. It will probably be more useful, as we think about abstractions, to have DataTable in these signatures so it's immediately obvious which methods make such assumptions, and we're not trying to recover that information each time we look at the method.

@andreasnoack
Member Author

The idea here is to remove DataTables as a dependency which requires that DataTable is not in the signature. If methods in this package require a specific storage then we should in the longer run try to loosen such assumptions. In the short run, we'd have to make these assumptions for the abstract type. If that is too strong for AbstractTable then it could either be a subtype of AbstractTable or a completely new type.

@daemonomania

Seems like traits would be useful here.
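
One way traits might apply here is sketched below with the common "Holy traits" pattern: a table type declares its storage layout, and methods that depend on that layout dispatch on the trait instead of on a concrete type such as DataTable. Every name in the sketch (`StorageStyle`, `storagestyle`, `getcolumn`, `ColTable`, `RowTable`) is hypothetical and not part of any package mentioned in this thread; the syntax is Julia 1.x.

```julia
# Trait types describing how a table stores its data (hypothetical names).
abstract type StorageStyle end
struct ColumnStorage <: StorageStyle end
struct RowStorage <: StorageStyle end

# Conservative default: assume nothing about column-wise storage.
storagestyle(::Type) = RowStorage()

# A column-oriented toy table opts in to the ColumnStorage trait.
struct ColTable
    names::Vector{Symbol}
    columns::Vector{Vector{Float64}}
end
storagestyle(::Type{ColTable}) = ColumnStorage()

# A row-oriented toy table keeps the default trait.
struct RowTable
    names::Vector{Symbol}
    rows::Vector{Vector{Float64}}
end

# Public entry point: look up the trait, then dispatch on it.
getcolumn(t, name::Symbol) = getcolumn(storagestyle(typeof(t)), t, name)

# Fast path: the column already exists as a vector.
function getcolumn(::ColumnStorage, t, name::Symbol)
    return t.columns[findfirst(isequal(name), t.names)]
end

# Fallback: build the column by walking the rows.
function getcolumn(::RowStorage, t, name::Symbol)
    j = findfirst(isequal(name), t.names)
    return [row[j] for row in t.rows]
end

ct = ColTable([:a, :b], [[1.0, 2.0], [3.0, 4.0]])
rt = RowTable([:a, :b], [[1.0, 3.0], [2.0, 4.0]])
```

With this layout, the storage assumption of each method is visible in its signature through the trait argument, without tying the signature to a concrete table type.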

@quinnj
Member

quinnj commented Mar 9, 2017

@andreasnoack, where is the current TableBase code? I'm starting to actively compile the various "abstracttable" attempts into a single package (taking parts of DataStreams, the AbstractDataFrame code, David's AbstractTables as well as his Relations.jl package).

For everyone else, I think the biggest thing that would help as we try to converge on a single "AbstractTable" interface is the strongest set of use-cases for an AbstractTable. I'm going to try and dig more into the code here in StatsModels to figure out what exactly it "needs" from an AbstractTable interface, but it'd be great to have other strong use-cases. For me, I'm coming from the context of DataStreams, which requires functionality like getting the "schema" of a table and being able to get/set individual cells, as well as entire columns at a time. I think the ideal goal is to come up with:

  • The minimum set of required and optional methods a table implementation would need to implement
  • The "interface" functionality that would be fully provided (and cover all necessary use-cases) once a table has implemented the interface

In terms of a starter list of use-cases to consider, I can think of:

  • DataStreams: generalizing tabular data IO across storage/transport formats
  • StatsModels: encoding data + terms + formulas in a representation amenable to various statistics operations
  • Query/StructuredQueries: performing SQL-like "get" operations on tabular data; selecting, filtering, joining, grouping, etc.
  • Other data processing packages? I'm thinking like Clustering.jl, which could potentially work directly on any table-type, though I'm less clear on how generally applicable this might be because different processing libraries might require more specific data manipulation routines.

@nalimilan
Member

I think in theory the required interface for StatsModels is quite limited:

  • get the number and types of the variables
  • get the number of observations
  • skip observations with missing values in one of the (selected) variables
  • optionally get the ordering of levels for categorical variables (and possibly the name of the reference level, so that we don't necessarily assume it's the first one)
  • access the values of cells one by one (to fill the model matrix); variable/column-wise would be faster but row-wise would work too

The issue is that the current code makes stronger assumptions about being able to access columns as vectors. I think @kleinschmidt had plans to change this, but it will take some work.

So in the short term I think we should just write an abstraction which specifies that the abstract data table (or whatever we call that particular interface/trait) can be indexed with variable names, and that it returns vectors. We can always remove that requirement later, or provide inefficient fallbacks which would create such vectors for data tables which use a different storage.
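
A minimal sketch of that short-term abstraction, under stated assumptions: the table is indexed by variable name and returns a vector, rows with a missing value in any selected variable are skipped, and the matrix is filled cell by cell. `VecTable` and `modelmatrix_sketch` are hypothetical names, not StatsModels API, and the sketch uses Julia 1.x `missing` rather than the Nullable-based storage current at the time of this thread.

```julia
# Toy table that can be indexed with a variable name and returns a vector.
struct VecTable
    names::Vector{Symbol}
    columns::Vector{Vector{Union{Float64, Missing}}}
end

Base.getindex(t::VecTable, name::Symbol) =
    t.columns[findfirst(isequal(name), t.names)]

# Build a numeric matrix from the selected variables, skipping any
# observation with a missing value in one of them.
function modelmatrix_sketch(t, vars::Vector{Symbol})
    cols = [t[v] for v in vars]          # relies only on name indexing
    n = length(first(cols))              # number of observations
    keep = [all(c -> !ismissing(c[i]), cols) for i in 1:n]
    X = Matrix{Float64}(undef, count(keep), length(vars))
    row = 0
    for i in 1:n
        keep[i] || continue              # skip incomplete observations
        row += 1
        for (j, c) in enumerate(cols)
            X[row, j] = c[i]             # fill the matrix one cell at a time
        end
    end
    return X
end

vt = VecTable([:x, :y, :z],
              [[1.0, missing, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, missing]])
```

Only `getindex` with a `Symbol` is required of the table here, so a table with different storage could participate by providing a (possibly inefficient) method that materializes the column as a vector.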

See also #14.

@kleinschmidt
Member

Superseded by #71

5 participants