-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial column API functions #71
Add initial column API functions #71
Conversation
* added some docstrings * re-organized a little
([] | ||
(col/new-column nil [])) | ||
([data] | ||
(column data {:name nil})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tech.ml.dataset
makes heavy use of column metadata
for implementing "factors" and other annotations, which is used by scicloj.ml
.
So I think we need to think , if a constructor taking additional "meta" is needed.
Otherwise asked: Should tablecloth.column.api
become aware/support the tech.ml.dataset
ML related functions
in this three namespaces:
tech.v3.dataset.categorical
tech.v3.dataset.modelling
tech.v3.dataset.column-filters
So far, we have decided to keep this out of tablecloth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tabelcloth allows column filter by meta-data:
(tc/column-names DS (fn [meta]
(and (= :int64 (:datatype meta))
(clojure.string/ends-with? (:name meta) "1"))) :all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just thinking that it is not very important, that metadata can be given at column construction time.
In practice it will be added later to a column either by:
- inferring datatype during parsing of values
- use the 'modelling/factor' related function on the existing column
So the above column
function does not need a way to specify metadata, I think now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option to create a column whould be use a dummy dataset. It will pass then all of the inference patterns we need. Something like:
(-> (dataset {:some-name [1 2 3]}) :some-name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding metadata. For me it's completely different angle. Metadata like categorical or metamorph pipeline information is a context based and I don't think it's a scope of this task here. So I still think it's out of scope the TC.
However, we can think about implementing factors. But I don't have an opinion now about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behrica thank you for the comments here. I think this is something that we can keep in mind going forward. This PR is just an initial step, not at all a final API. It can and will change before release. Having talked a bit more about logistics with @genmeblog, we will actually not put this PR into main. We are going to maintain a dev branch for the column API. So in summary let's keep this in mind. I will find a way of centralizing the conversation about this axis of the API in the next few days...
@genmeblog per our discussion I changed the target for this PR to be a new dev branch: |
([] | ||
(col/new-column nil [])) | ||
([data] | ||
(column data {:name nil})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option to create a column whould be use a dummy dataset. It will pass then all of the inference patterns we need. Something like:
(-> (dataset {:some-name [1 2 3]}) :some-name)
[col] | ||
(dtype/elemwise-datatype col)) | ||
|
||
(defn typeof? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is more complicated story. I've created already the hierarchy of types (eg. numerical
covers or numeric types.
Take a look at: https://github.com/scicloj/tablecloth/blob/master/src/tablecloth/api/utils.clj#L69
src/tablecloth/column/api/column.clj
Outdated
(defn zeros | ||
"Create a new column filled wth `n-zeros`." | ||
[n-zeros] | ||
(column (repeatedly n-zeros (constantly 0)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to use dtype-next buffer instead of clojure seq. It will be more efficient.
Generally, let's create colums from as much primitive stuff as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good call!
([] | ||
(col/new-column nil [])) | ||
([data] | ||
(column data {:name nil})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding metadata. For me it's completely different angle. Metadata like categorical or metamorph pipeline information is a context based and I don't think it's a scope of this task here. So I still think it's out of scope the TC.
However, we can think about implementing factors. But I don't have an opinion now about this.
@genmeblog this is an interesting idea:
I know we'd discussed this in relation to one way of supporting 2-d columns. Can you explain what you are referring to by "inference patterns"? |
src/tablecloth/column/api/column.clj
Outdated
(defn zeros | ||
"Create a new column filled wth `n-zeros`." | ||
[n-zeros] | ||
(column (emap (constantly 0) :int64 (range n-zeros)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For zeros
and ones
I would use a constant reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes that makes sense!
@genmeblog I'm going to merge this PR into the column API dev branch. I want to address the types in a separate PR. For the moment, I don't see a problem with type inference on the |
Goal
This PR aims just to add the start of a basic API related to enhancing support for the
column
. So we want to be able have some basic functions for creating columns, for knowing that they are columns, and what's in them.Solution Overview
The PR adds a new namespace/api:
tablecloth.column.api
. As @genmeblog said, this separate API can help provide a "separation between dataset operations vs column operations."Within this new namespace, we start with these functions:
column
- creates a column from a vector or listcolumn?
- determines whether an item is a column or nottypeof
- returns the data type of the column's elementstypeof?
- check to see if elements match a datatypezeros
- returns a column filled with zerosones
- returns a column filled with onesThe ideas behind this small set of functions are as follows:
columns
whereas up until nowtablecloth
has been only about the dataset.dataset
, hencecolumn
. This gives thecolumn
some independence.column
, we should be able to identify that an item is a column, hencecolumn?
.:object
). So we should be able to identify the datatype of the column's elements:typeof
. This naming mimics R's vectors.zeros
&ones
for functions that can fill a column with some values.How to test
There's a test suite add with this PR @
test/tablecloth/column/api/column_test.clj
.lein midje tablecloth.column.api.column-test