Add initial column API functions #71

Merged

Collaborator

@ezmiller ezmiller commented Jun 21, 2022

Goal

This PR aims only to start a basic API that improves support for the column. We want some basic functions for creating columns, for recognizing that something is a column, and for finding out what is in it.

Solution Overview

The PR adds a new namespace/api: tablecloth.column.api. As @genmeblog said, this separate API can help provide a "separation between dataset operations vs column operations."

Within this new namespace, we start with these functions:

  • column - creates a column from a vector or list
  • column? - determines whether an item is a column or not
  • typeof - returns the data type of the column's elements
  • typeof? - checks whether the column's elements match a datatype
  • zeros - returns a column filled with zeros
  • ones - returns a column filled with ones

The ideas behind this small set of functions are as follows:

  • We are basically creating a new API for columns whereas up until now tablecloth has been only about the dataset.
  • We want to be able to create a column independently of a dataset, hence column. This gives the column some independence.
  • Since we have an independent column, we should be able to identify that an item is a column, hence column?.
  • Columns are uniformly typed entities (a column holding mixed values is typed uniformly as :object). So we should be able to identify the datatype of the column's elements: typeof. This naming mimics R's typeof for vectors.
  • From NumPy and MATLAB, we take the function names zeros and ones for functions that fill a column with a given value.
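
A minimal usage sketch of the functions listed above, assuming the tablecloth.column.api namespace from this PR is on the classpath. The `tcc` alias and the argument order of `typeof?` (column first, datatype second) are assumptions for illustration, not something this description confirms.

```clojure
(require '[tablecloth.column.api :as tcc]) ; alias `tcc` is an assumption

;; Create a column on its own, without going through a dataset.
(def c (tcc/column [1 2 3]))

(tcc/column? c)         ; is `c` a column?
(tcc/column? [1 2 3])   ; a plain vector is not a column

(tcc/typeof c)          ; datatype keyword of the column's elements
(tcc/typeof? c :int64)  ; assumed argument order: column, then datatype

;; Columns pre-filled with a constant, named after NumPy/MATLAB.
(def zs (tcc/zeros 5))
(def os (tcc/ones 5))
```

The point of the sketch is that none of these calls needs a dataset: the column is created, inspected, and typed by itself.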

How to test

There's a test suite added with this PR at test/tablecloth/column/api/column_test.clj.

  1. Take a look to see whether the tests make sense
  2. Run the tests: lein midje tablecloth.column.api.column-test
    • confirm the tests pass

@ezmiller ezmiller self-assigned this Jun 26, 2022
@ezmiller ezmiller marked this pull request as ready for review June 26, 2022 08:49
@ezmiller ezmiller requested a review from genmeblog June 27, 2022 10:26
  ([]
   (col/new-column nil []))
  ([data]
   (column data {:name nil}))
Member

@behrica behrica Jun 30, 2022

tech.ml.dataset makes heavy use of column metadata to implement "factors" and other annotations, which scicloj.ml relies on.
So I think we need to consider whether a constructor taking additional "meta" is needed.

Put another way: should tablecloth.column.api be aware of, or support, the ML-related tech.ml.dataset functions
in these three namespaces:
tech.v3.dataset.categorical
tech.v3.dataset.modelling
tech.v3.dataset.column-filters

So far, we have decided to keep this out of tablecloth.

Member

Tablecloth allows filtering columns by metadata:

(tc/column-names DS (fn [meta]
                       (and (= :int64 (:datatype meta))
                            (clojure.string/ends-with? (:name meta) "1"))) :all)

Member

On reflection, I don't think it is very important that metadata can be given at column construction time.
In practice it will be added to a column later, either by:

  • inferring the datatype while parsing values
  • using the 'modelling/factor' related functions on the existing column

So I now think the column function above does not need a way to specify metadata.
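
As a rough illustration of metadata being attached after construction rather than through the constructor: the sketch below assumes tech.ml.dataset columns support the standard Clojure metadata functions, and `:my-annotation` is a made-up key, not one used by tech.ml.dataset or scicloj.ml.

```clojure
(require '[tech.v3.dataset.column :as col])

;; Build the column with no extra metadata.
(def c (col/new-column :species ["a" "b" "a"]))

;; The column already carries some metadata, such as its name.
(meta c)

;; Attach an annotation later. vary-meta keeps the existing entries;
;; :my-annotation is a hypothetical key used only for this example.
(def annotated (vary-meta c assoc :my-annotation true))

(:my-annotation (meta annotated))
```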

Member

The other option for creating a column would be to use a dummy dataset. The data would then pass through all of the inference patterns we need. Something like:

(-> (dataset {:some-name [1 2 3]}) :some-name)

Member

Regarding metadata: for me it's a completely different angle. Metadata such as categorical or metamorph pipeline information is context-dependent, and I don't think it is within the scope of this task. So I still think it's out of scope for TC.
However, we can think about implementing factors. I don't have an opinion on that yet.

Collaborator Author

@behrica thank you for the comments here. I think this is something that we can keep in mind going forward. This PR is just an initial step, not at all a final API. It can and will change before release. Having talked a bit more about logistics with @genmeblog, we will actually not put this PR into main. We are going to maintain a dev branch for the column API. So in summary let's keep this in mind. I will find a way of centralizing the conversation about this axis of the API in the next few days...

@ezmiller ezmiller changed the base branch from master to ethan/column-api-dev-branch-1 July 3, 2022 12:33
Collaborator Author

ezmiller commented Jul 3, 2022

@genmeblog per our discussion I changed the target for this PR to be a new dev branch: ethan/column-api-dev-branch-1.

  [col]
  (dtype/elemwise-datatype col))

(defn typeof?
Member

OK, this is a more complicated story. I've already created a hierarchy of types (e.g. numerical covers all numeric types).

Take a look at: https://github.com/scicloj/tablecloth/blob/master/src/tablecloth/api/utils.clj#L69

(defn zeros
  "Create a new column filled with `n-zeros`."
  [n-zeros]
  (column (repeatedly n-zeros (constantly 0))))
Member

I propose using a dtype-next buffer instead of a Clojure seq. It will be more efficient.
Generally, let's create columns from primitive data as much as possible.

Collaborator Author

Yes good call!
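
One way to read the suggestion above, as a sketch only: allocate a primitive dtype-next container and wrap it in a column, instead of building a lazy Clojure seq first. The name `zeros-from-buffer` is hypothetical and this is not the code that was merged.

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tech.v3.dataset.column :as col])

(defn zeros-from-buffer
  "Hypothetical variant of `zeros`: returns a column of `n` zeros
  backed by a primitive :int64 container rather than a Clojure seq."
  [n]
  ;; A freshly allocated :int64 container is zero-initialized,
  ;; so no per-element work is needed to fill it.
  (col/new-column nil (dtype/make-container :jvm-heap :int64 n)))

(def zs (zeros-from-buffer 4))
```

Compared with `(repeatedly n (constantly 0))`, this avoids creating boxed values and a lazy seq that the column constructor then has to scan.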

Collaborator Author

ezmiller commented Jul 4, 2022

@genmeblog this is an interesting idea:

The other option for creating a column would be to use a dummy dataset. The data would then pass through all of the inference patterns we need. Something like:

(-> (dataset {:some-name [1 2 3]}) :some-name)

I know we'd discussed this in relation to one way of supporting 2-d columns. Can you explain what you are referring to by "inference patterns"?

(defn zeros
  "Create a new column filled with `n-zeros`."
  [n-zeros]
  (column (emap (constantly 0) :int64 (range n-zeros))))
Member

For zeros and ones I would use a constant reader

Collaborator Author

Oh yes that makes sense!
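
A sketch of what the constant-reader suggestion could look like, assuming dtype-next's `const-reader`, which presents a single value as a read-only buffer of a given length. The helper name `const-column` is hypothetical, and this is not necessarily how the merged code is written.

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tech.v3.dataset.column :as col])

(defn const-column
  "Hypothetical helper: a column of `n` copies of `value`,
  backed by a constant reader instead of a materialized seq."
  [value n]
  (col/new-column nil (dtype/const-reader value n)))

(defn zeros [n] (const-column 0 n))
(defn ones  [n] (const-column 1 n))

(def zs (zeros 3))
(def os (ones 3))
```

With a constant reader, `zeros` and `ones` differ only in the value passed, and no mapping over `(range n)` is needed.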

@ezmiller
Collaborator Author

@genmeblog I'm going to merge this PR into the column API dev branch. I want to address the types in a separate PR. For the moment, I don't see a problem with type inference in the new-column function, so I'm sticking with that for now. See the discussion of that latter issue here: #74

@ezmiller ezmiller merged commit 3b4cec4 into ethan/column-api-dev-branch-1 Aug 12, 2022
@ezmiller ezmiller deleted the ethan/add-initial-column-api-fns branch August 12, 2022 17:18