Add initial column API functions #71

ezmiller · 2022-06-21T13:24:34Z

Goal

This PR adds the start of a basic API for working with columns directly. We want some basic functions for creating columns, for checking whether a value is a column, and for inspecting what a column contains.

Solution Overview

The PR adds a new namespace/api: tablecloth.column.api. As @genmeblog said, this separate API can help provide a "separation between dataset operations vs column operations."

Within this new namespace, we start with these functions:

column - creates a column from a vector or list
column? - determines whether an item is a column or not
typeof - returns the data type of the column's elements
typeof? - checks whether the column's elements match a given datatype
zeros - returns a column filled with zeros
ones - returns a column filled with ones

The ideas behind this small set of functions are as follows:

We are creating a new API for columns, whereas up until now tablecloth has only been about the dataset.
We want to be able to create a column independently of a dataset, hence column. This gives the column some independence.
Since we have an independent column, we should be able to identify that an item is a column, hence column?.
Columns are uniformly typed entities (a column holding mixed values is uniformly typed as :object). So we should be able to identify the datatype of the column's elements: typeof. This naming mimics R's vectors.
From numpy & matlab, we take the function names zeros & ones for functions that create a column filled with zeros or ones.
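
A minimal sketch of how these functions might be used together, based only on the descriptions in this PR. The alias `tcc` is an arbitrary choice, and the argument order of `typeof?` (column first, datatype second) is an assumption rather than something stated above.

```clojure
(require '[tablecloth.column.api :as tcc])

;; Create a column independently of any dataset.
(def c (tcc/column [1 2 3]))

;; Check that a value is a column.
(tcc/column? c)

;; Ask for the datatype of the column's elements.
(tcc/typeof c)

;; Check the elements against a datatype (argument order assumed).
(tcc/typeof? c :int64)

;; Columns of a given length filled with zeros or ones.
(tcc/zeros 5)
(tcc/ones 5)
```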

How to test

A test suite is added with this PR at test/tablecloth/column/api/column_test.clj.

Take a look to see whether the tests make sense
Run tests: lein midje tablecloth.column.api.column-test
- confirm tests pass


behrica · 2022-06-30T12:36:26Z

src/tablecloth/column/api/column.clj

+  ([]
+   (col/new-column nil []))
+  ([data]
+   (column data {:name nil}))


tech.ml.dataset makes heavy use of column metadata
for implementing "factors" and other annotations, which are used by scicloj.ml.
So I think we need to consider whether a constructor taking additional "meta" is needed.

Put another way: should tablecloth.column.api be aware of, or support, the ML-related tech.ml.dataset functions
in these three namespaces:
in this three namespaces:
tech.v3.dataset.categorical
tech.v3.dataset.modelling
tech.v3.dataset.column-filters

So far, we have decided to keep this out of tablecloth.

Tablecloth allows filtering columns by metadata:

(tc/column-names DS (fn [meta] (and (= :int64 (:datatype meta)) (clojure.string/ends-with? (:name meta) "1"))) :all)
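
To make the selector concrete, here is a small sketch in plain Clojure of the same predicate applied to metadata maps. The maps in `column-metas` are hypothetical stand-ins for the metadata each column would carry, with string names chosen for illustration; no tablecloth call is involved.

```clojure
(require '[clojure.string :as str])

;; Hypothetical metadata maps, one per column, standing in for what
;; the selector function receives when the meta field is :all.
(def column-metas
  [{:name "V1" :datatype :int64}
   {:name "V2" :datatype :int64}
   {:name "W1" :datatype :float64}])

;; The same predicate as in the call above, given a name.
(defn int64-ending-in-1? [meta]
  (and (= :int64 (:datatype meta))
       (str/ends-with? (:name meta) "1")))

;; Keep the names of the columns whose metadata satisfies the predicate.
(def selected-names
  (mapv :name (filter int64-ending-in-1? column-metas)))
;; selected-names is ["V1"]
```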

On reflection, I don't think it is very important that metadata can be given at column construction time.
In practice it will be added to a column later, either by:

inferring the datatype while parsing values

using the 'modelling/factor'-related functions on the existing column

So I now think the column function above does not need a way to specify metadata.


@behrica thank you for the comments here. I think this is something that we can keep in mind going forward. This PR is just an initial step, not at all a final API. It can and will change before release. Having talked a bit more about logistics with @genmeblog, we will actually not put this PR into main. We are going to maintain a dev branch for the column API. So in summary let's keep this in mind. I will find a way of centralizing the conversation about this axis of the API in the next few days...

ezmiller · 2022-07-03T12:34:26Z

@genmeblog per our discussion I changed the target for this PR to be a new dev branch: ethan/column-api-dev-branch-1.

genmeblog · 2022-07-03T10:19:01Z

src/tablecloth/column/api/column.clj

+  ([]
+   (col/new-column nil []))
+  ([data]
+   (column data {:name nil}))


Another option for creating a column would be to use a dummy dataset. It would then go through all of the inference patterns we need. Something like:

(-> (dataset {:some-name [1 2 3]}) :some-name)
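
A short sketch of what inspecting such a column could look like, assuming `dataset` here is `tablecloth.api/dataset` and that the datatype is read with `tech.v3.datatype/elemwise-datatype`, as `typeof` does in this PR's diff. The aliases are arbitrary.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype])

;; Build a one-column dataset and pull the column out by name.
(def c (-> (tc/dataset {:some-name [1 2 3]}) :some-name))

;; The datatype was inferred while the dataset was constructed.
(dtype/elemwise-datatype c)

;; The column carries metadata, including the name it had in the dataset.
(meta c)
```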

genmeblog · 2022-07-03T10:24:59Z

src/tablecloth/column/api/column.clj

+  [col]
+  (dtype/elemwise-datatype col))
+
+(defn typeof?


OK, this is a more complicated story. I've already created a hierarchy of types (e.g. numerical covers all numeric types).

Take a look at: https://github.com/scicloj/tablecloth/blob/master/src/tablecloth/api/utils.clj#L69
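
As an illustration of the idea only, a type hierarchy of this kind can be sketched with plain sets. The names `type-sets`, `general-types` and `general-type?` are hypothetical and are not taken from the linked utils.clj, whose actual groupings may differ.

```clojure
;; Hypothetical grouping of concrete datatypes into general types.
(def type-sets
  {:integer #{:int8 :int16 :int32 :int64}
   :float   #{:float32 :float64}})

;; :numerical is the union of the integer and float groups.
(def general-types
  (assoc type-sets
         :numerical (into (:integer type-sets) (:float type-sets))))

(defn general-type?
  "True when `datatype` belongs to the group named by `general`."
  [general datatype]
  (contains? (get general-types general #{}) datatype))

(general-type? :numerical :int64)   ; true
(general-type? :numerical :string)  ; false
(general-type? :integer :float64)   ; false
```

With a lookup like this, a check such as typeof? could accept either a concrete datatype or a general one.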

genmeblog · 2022-07-03T10:27:35Z

src/tablecloth/column/api/column.clj

+(defn zeros
+  "Create a new column filled wth `n-zeros`."
+  [n-zeros]
+  (column (repeatedly n-zeros (constantly 0))))


I propose using a dtype-next buffer instead of a Clojure seq. It will be more efficient.
Generally, let's create columns from primitive-backed data wherever possible.

Yes good call!

genmeblog · 2022-07-03T10:34:12Z

src/tablecloth/column/api/column.clj

+  ([]
+   (col/new-column nil []))
+  ([data]
+   (column data {:name nil}))


Regarding metadata: for me it's a completely different angle. Metadata like categorical or metamorph pipeline information is context-based, and I don't think it's in the scope of this task. So I still think it's out of scope for TC.
However, we can think about implementing factors. I don't have an opinion on that yet.

ezmiller · 2022-07-04T11:53:34Z

@genmeblog this is an interesting idea:

Another option for creating a column would be to use a dummy dataset. It would then go through all of the inference patterns we need. Something like:

(-> (dataset {:some-name [1 2 3]}) :some-name)

I know we'd discussed this in relation to one way of supporting 2-d columns. Can you explain what you are referring to by "inference patterns"?

genmeblog · 2022-07-05T08:53:59Z

src/tablecloth/column/api/column.clj

+(defn zeros
+  "Create a new column filled wth `n-zeros`."
+  [n-zeros]
+  (column (emap (constantly 0) :int64 (range n-zeros))))


For zeros and ones I would use a constant reader

Oh yes that makes sense!
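
A rough sketch contrasting the first approach from this review with the suggested one. The function names `zeros-seq` and `zeros-const` are made up for the comparison; `col/new-column` with a nil name follows the diff above, and `dtype/const-reader` is the function named in the later commit message. This is not necessarily the code that was merged.

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tech.v3.dataset.column :as col])

;; First version in this PR: a lazy Clojure seq of boxed values
;; that has to be realized and scanned to build the column.
(defn zeros-seq [n]
  (col/new-column nil (repeatedly n (constantly 0))))

;; Suggested version: a constant reader that reports `n` elements
;; and returns 0 for every index, without materializing a seq.
(defn zeros-const [n]
  (col/new-column nil (dtype/const-reader 0 n)))

;; The same idea gives a column of ones.
(defn ones-const [n]
  (col/new-column nil (dtype/const-reader 1 n)))
```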

ezmiller · 2022-08-12T17:18:23Z

@genmeblog I'm going to merge this PR into the column API dev branch. I want to address the types in a separate PR. For the moment, I don't see a problem with type inference on the new-column function, so sticking with that for now. See discussion on that latter issue here: #74

ezmiller added 18 commits April 22, 2022 14:22

Add namespace stub

da07f71

Add super naive colunn fn

9eae7fc

Merge remote-tracking branch 'origin/master' into ethan/columns-proje…

2c93f87

…ct-playground

Add some simple column fns

cfaffc6

Add typeof function for column

04c9b41

Save work on column exploration doc

2eeeee8

Upgrade to latest clay version

e5ef843

Save scratch work in column.clj

bcc3582

Merge remote-tracking branch 'origin/master' into ethan/columns-proje…

c0a7cb9

…ct-playground

Polishing up existing column fns

fb07581

* added some docstrings * re-organized a little

Move column ns into own domain tablecloth.column.api

d433c63

Add tests for tablecloth.column.api/column

6a08d61

Merge branch 'master' into ethan/columns-project-playground

f86911f

Add tests for zeros and ones

186e764

Use api template to write public api

851819f

Write tests against tablecloth.column.api.column ns

1d2cef3

Add column exploration html

c81d13c

Add typeof? function to check datatype of column els

2788e3e

ezmiller self-assigned this Jun 26, 2022

ezmiller marked this pull request as ready for review June 26, 2022 08:49

ezmiller requested a review from genmeblog June 27, 2022 10:26

ezmiller mentioned this pull request Jun 29, 2022

added printing of meta to Column techascent/tech.ml.dataset#308

Closed

behrica reviewed Jun 30, 2022


ezmiller changed the base branch from master to ethan/column-api-dev-branch-1 July 3, 2022 12:33

genmeblog reviewed Jul 4, 2022


Use buffer when creating zeros & ones columns

14ca935

genmeblog reviewed Jul 5, 2022


Use dtype alias in ns

2d1d07e

ezmiller added 3 commits July 5, 2022 12:05

Add comment to code snippet generating column api

e5c8322

Fix comment syntax

e42a3d9

Use tech.v3.datatype/const-reader for zeros and ones function

35ec106

ezmiller merged commit 3b4cec4 into ethan/column-api-dev-branch-1 Aug 12, 2022

ezmiller deleted the ethan/add-initial-column-api-fns branch August 12, 2022 17:18

ezmiller mentioned this pull request Aug 19, 2022

Update type interface to use type hierarchy in tablecloth.api.util #76

Merged
