Update type interface to use type hierarchy in tablecloth.api.util #76

Merged on Nov 14, 2022 (21 commits)

Conversation

@ezmiller (Collaborator) commented Aug 19, 2022

Goal

In #71, @genmeblog pointed out that there's a type hierarchy in the tablecloth.api.utils namespace. Following some discussion in #73, this PR aims to set up a type interface where users mainly interact with the types of the elements in a column in terms of the "general" (rather than concrete) types. In other words, a user will usually want to know whether an array has :integer rather than :int64 values.

Solution

I added some new functions in tablecloth.api.utils to help work with the type hierarchy tablecloth.api.utils/type-sets:

  • ->general-types - Given a concrete type this function returns the set of general types, e.g.: (->general-types :int64) ;; => #{:integer :numerical}
  • types - Returns the set of general types, i.e. the keys of tablecloth.api.utils/type-sets
  • concrete-types - Returns the set of concrete types, i.e. the types that TMD uses.
  • concrete-type? - Returns true if the given type is a concrete type.

Then I adjusted typeof and typeof? to use this hierarchy, so now:

  • typeof? - Will return true if the elements of the given column match the given general type or concrete type
  • typeof - Returns the set of general types that describe the concrete type of the elements of the column

Please see the tests in this PR for examples.
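
As a rough illustration of the behaviour described above (not the code in this PR), the sketch below models the hierarchy with a small, made-up `type-sets` map and uses a plain map as a stand-in for a column. The trimmed map contents, the `:datatype` key, and the function bodies are assumptions; only the function names and the `:int64` example come from the description.

```clojure
;; Assumption: a trimmed stand-in for tablecloth.api.utils/type-sets.
;; Keys are general types, values are sets of concrete types.
(def type-sets
  {:integer   #{:int16 :int32 :int64}
   :float     #{:float32 :float64}
   :numerical #{:int16 :int32 :int64 :float32 :float64}
   :textual   #{:string :text}})

(defn types
  "The set of general types, i.e. the keys of `type-sets`."
  []
  (set (keys type-sets)))

(defn concrete-types
  "The union of all concrete types listed in `type-sets`."
  []
  (reduce into #{} (vals type-sets)))

(defn concrete-type? [t]
  (contains? (concrete-types) t))

(defn ->general-types
  "The set of general types whose member set contains `concrete-type`."
  [concrete-type]
  (into #{}
        (keep (fn [[general members]]
                (when (contains? members concrete-type) general)))
        type-sets))

;; Assumption: a column is modelled as a map with a :datatype key.
(defn typeof [col]
  (->general-types (:datatype col)))

(defn typeof? [col t]
  (let [dt (:datatype col)]
    (or (= dt t)
        (contains? (->general-types dt) t))))

(def ints {:datatype :int64 :data [1 2 3]})

(typeof ints)            ;; => #{:integer :numerical}
(typeof? ints :integer)  ;; => true
(typeof? ints :int64)    ;; => true
(typeof? ints :textual)  ;; => false
```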

Open Questions

  • Is there any reason to consider using Clojure's native hierarchies instead of the map for tablecloth.api.utils/type-sets?
  • Is there any reason not to allow typeof? also to support concrete types?
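
For the first question, here is a minimal sketch of what the same idea could look like with Clojure's native hierarchies (`make-hierarchy`, `derive`, `isa?`). The specific relationships, in particular placing :integer and :float under :numerical, are assumptions made for illustration and are not taken from `type-sets`.

```clojure
;; Assumption: an illustrative hierarchy, not the contents of type-sets.
;; Using an explicit hierarchy (3-arity derive) allows non-namespaced keywords.
(def type-hierarchy
  (-> (make-hierarchy)
      (derive :int64 :integer)
      (derive :int32 :integer)
      (derive :float64 :float)
      (derive :integer :numerical)
      (derive :float :numerical)))

(isa? type-hierarchy :int64 :integer)    ;; => true
(isa? type-hierarchy :int64 :numerical)  ;; => true (relationships are transitive)
(isa? type-hierarchy :int64 :int64)      ;; => true (isa? also holds for equal values)
(isa? type-hierarchy :float64 :integer)  ;; => false
(ancestors type-hierarchy :int64)        ;; => #{:integer :numerical}
```

One difference worth weighing: `descendants` of :numerical in this hierarchy would include the intermediate general types (:integer, :float) as well as the concrete ones, whereas a plain map of sets keeps each general type's members strictly concrete.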

@ezmiller force-pushed the ethan/column-type-api branch from 484673d to ab54530 August 19, 2022 20:29
@ezmiller changed the title from "Ethan/column type api" to "Update type interface to use type hierarchy in tablecloth.api.util" Aug 19, 2022
@ezmiller ezmiller marked this pull request as ready for review August 19, 2022 21:08
@ezmiller ezmiller self-assigned this Aug 19, 2022
@ezmiller ezmiller requested a review from genmeblog August 19, 2022 21:10
@ezmiller (Collaborator, Author) commented:

@genmeblog just a reminder to take a look at this when you get a chance. :)

src/tablecloth/api/utils.clj (outdated, resolved)
src/tablecloth/api/utils.clj (resolved)
src/tablecloth/column/api/column.clj (outdated, resolved)
@genmeblog (Member) commented:

I did a review. I see mainly one issue, with changing the typeof contract. The other two are just optimization proposals.

@ezmiller (Collaborator, Author) commented Oct 3, 2022

@genmeblog: I changed things here so that:

  • types - returns the set of concrete types;
  • general-types - returns the set of general types;
  • typeof - returns the concrete type of the column;
  • typeof? - can validate the concrete or general type of a column; and,
  • there is a general-types-lookup table.
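
To make the revised contract concrete, here is a hedged sketch (not the merged code). The trimmed `type-sets` contents, the map-based column with a `:datatype` key, and all function bodies are assumptions; the function names follow the list above. The lookup table is built once from `type-sets`, so `->general-types` becomes a plain map lookup.

```clojure
;; Assumption: a trimmed stand-in for tablecloth.api.utils/type-sets.
(def type-sets
  {:integer   #{:int16 :int32 :int64}
   :float     #{:float32 :float64}
   :numerical #{:int16 :int32 :int64 :float32 :float64}
   :textual   #{:string :text}})

;; Built once: concrete type -> set of general types it belongs to.
(def general-types-lookup
  (reduce-kv (fn [lookup general members]
               (reduce (fn [m concrete]
                         (update m concrete (fnil conj #{}) general))
                       lookup
                       members))
             {}
             type-sets))

(defn types
  "The set of concrete types."
  []
  (set (keys general-types-lookup)))

(defn general-types
  "The set of general types."
  []
  (set (keys type-sets)))

(defn ->general-types [concrete-type]
  (get general-types-lookup concrete-type))

;; Assumption: a column is modelled as a map with a :datatype key.
(defn typeof
  "Returns the concrete type of the column."
  [col]
  (:datatype col))

(defn typeof?
  "True when `t` is the column's concrete type or one of its general types."
  [col t]
  (let [dt (typeof col)]
    (or (= dt t)
        (contains? (->general-types dt) t))))

(def ints {:datatype :int64 :data [1 2 3]})

(typeof ints)              ;; => :int64
(typeof? ints :int64)      ;; => true
(typeof? ints :numerical)  ;; => true
(typeof? ints :float)      ;; => false
```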

@ezmiller ezmiller requested a review from genmeblog October 3, 2022 15:31
@ezmiller (Collaborator, Author) commented Nov 8, 2022

@genmeblog if you have a moment, would you like to take another look at this?

@genmeblog (Member) left a comment

Looks ok

@ezmiller ezmiller merged commit 6e7413b into ethan/column-api-dev-branch-1 Nov 14, 2022
@ezmiller ezmiller deleted the ethan/column-type-api branch November 14, 2022 05:36
ezmiller added a commit that referenced this pull request Apr 13, 2024
* Add namespace stub

* Add super naive column fn

* Add some simple column fns

* Add typeof function for column

* Save work on column exploration doc

* Upgrade to latest clay version

* Save scratch work in column.clj

* Polishing up existing column fns

* added some docstrings
* re-organized a little

* Move column ns into own domain tablecloth.column.api

* Add tests for `tablecloth.column.api/column`

* Add tests for `zeros` and `ones`

* Use api template to write public api

* Write tests against `tablecloth.column.api.column` ns

* Add column exploration html

* Add `typeof?` function to check datatype of column els

* Use buffer when creating zeros & ones columns

* Use `dtype` alias in ns

* Add comment to code snippet generating column api

* Fix comment syntax

* Use `tech.v3.datatype/const-reader` for `zeros` and `ones` function

* Update type interface to use type hierarchy in tablecloth.api.util (#76)

* Add ->general-types function

* Add a general type :logical

* Use type hierarchy in tablecloth.api.utils for `typeof` functions

* Add column dev branch to pr workflow

* Add tests for typeof

* Fix tests for typeof

* Return the concrete type from `typeof`

* Simplify `concrete-types` fn

* Optimize ->general-types by using static lookup

* Adjust fns listing types

* We decided that the default meaning of type points to the "concrete"
type, and not the general type.
* So `types` now returns the set of concrete types and `general-types`
returns the general types.

* Revert "Adjust fns listing types"

This reverts commit d93e34f.

* Fix `typeof` test to test for concrete types

* Reorganize `typeof?` tests

* Reword docstring for `typeof?` slightly

* Update column api template and add missing `typeof?`

* Add comment to `general-types-lookup`

* Improve `->general-types` docstring

* Add `general-types` fn that returns sets of general types

* Adjust util `types` fn to return concrete types

* Lift `tech.v3.datatype.functional` operations (#90)

* Add ->general-types function

* Add a general type :logical

* Use type hierarchy in tablecloth.api.utils for `typeof` functions

* Add column dev branch to pr workflow

* Add tests for typeof

* Fix tests for typeof

* Return the concrete type from `typeof`

* Simplify `concrete-types` fn

* Optimize ->general-types by using static lookup

* Adjust fns listing types

* We decided that the default meaning of type points to the "concrete"
type, and not the general type.
* So `types` now returns the set of concrete types and `general-types`
returns the general types.

* Revert "Adjust fns listing types"

This reverts commit d93e34f.

* Fix `typeof` test to test for concrete types

* Reorganize `typeof?` tests

* Reword docstring for `typeof?` slightly

* Update column api template and add missing `typeof?`

* Add comment to `general-types-lookup`

* Improve `->general-types` docstring

* Add `general-types` fn that returns sets of general types

* Adjust util `types` fn to return concrete types

* Save changes to column api.clj

* Save ongoing experiments with lifting

* Save ongoing work on lifting

* Adjust lift-ops-1 to handle any number of args with rest arg

* Working `rearrange-args` fn

* Save work actually writing lifted fns

* Saving first attempt to write operators

* Add `percentiles` test

* Adjust `rearrange-args` to take new-args in option map

* Unify two lift functions

* Add in docstrings when present

* Move lift utils into utils ns

* Rename lifting namespaces

* Lift some more fns

* Make exclusions for ns header helper an arg

* Add new operators and tests

* Add ops with lhs rhs arg pattern

* Lift '*

* Add require to operators ns for utils

* Update test to make it more complete

* Lift `equals

* Make test more accurate

* Reorganize tests

* Fix grammar

* Lift 'shift

* Uncomment 'or test

* Lift 'normalize op

* Lift 'magnitude

* Lifting bit manipulation ops

* lift ieee-remainder

* Lifting more functions

* Add excludes

* Lift a bunch of new functions

* Alphabetize some lists

* More alphabetization

* Clean up

* Instead of using `col` as arg, conform to using `x` and `y`

* Temporarily disable failing test fix in 7.000-beta23

* Disable the correct test

* Just some minor cleanup in op tests

* Some more cleanup/reorg in op tests

* Update generated operators namespace with switch from col -> x etc

* Lift 'descriptive-statistics

* Fix messed up test layout

* Lift 'quartiles

* Lift 'fill-range and a bunch of reduce operations

* Lift 'mean-fast 'sum-fast 'magnitude-squared

* Lift correlation fns

kendalls, pearsons, and spearmans

* Lift cumulative ops

* cleanup

* Bring column exploration doc up-to-date (#95)

* Upgrade to latest clay version

* Show using tablecloth.column.api.operators ns

* Cleanup whitespace

* Add method for subsetting (#96)

* Export tech.ml.dataset `select` fn for column api

* Update docstring exported to api

* Update column-exploration with basic illustration of select

* Add `slice`

* clean up tests a bit

* Improve `slice` docstring slightly

* Export `slice` to column api

* Add stuff about `slice` to column exploration doc

* Move accessing & subsetting section above basic ops

* Update column_exploration.html

* Update comment block

* Add iteration support by wrapping tech.v3.dataset.column/column-map (#97)

* Export tech.ml.dataset `select` fn for column api

* Update docstring exported to api

* Update column-exploration with basic illustration of select

* Add `slice`

* clean up tests a bit

* Improve `slice` docstring slightly

* Export `slice` to column api

* Add stuff about `slice` to column exploration doc

* Move accessing & subsetting section above basic ops

* Update column_exploration.html

* Update comment block

* Add column-map wrapper over tech.v3.dataset.column/column-mapping

* Accepts columns in the first position to support use with pipes
* If `col` is a vector of columns, then map-fn is run on all

* Fix arg name

* Clean up

* Add iteration to column exploration and reorganize

* Add column-map to column api_template

* Add example of using column-map with multiple columns

* Update column_exploration html doc

* Update column_exploration html doc

* Add sorting support for column (#99)

* Add rough version of `sort-column` with some tests

* Add basic docstring

* Add support for `:asc` and `:desc` to sort-column

* Add note to handle missing values

* Make slight improvement to sort-column docstring

* Improve support for missing values for column api (#101)

* Export tech.ml.dataset `select` fn for column api

* Update docstring exported to api

* Update column-exploration with basic illustration of select

* Add `slice`

* clean up tests a bit

* Improve `slice` docstring slightly

* Export `slice` to column api

* Add stuff about `slice` to column exploration doc

* Move accessing & subsetting section above basic ops

* Update column_exploration.html

* Update comment block

* Add column-map wrapper over tech.v3.dataset.column/column-mapping

* Accepts columns in the first position to support use with pipes
* If `col` is a vector of columns, then map-fn is run on all

* Fix arg name

* Clean up

* Add iteration to column exploration and reorganize

* Add column-map to column api_template

* Add example of using column-map with multiple columns

* Update column_exploration html doc

* Update column_exploration html doc

* Export tech.v3.dataset.column's missing fns

* Remove `set-missing`

I think this may be more of an internal fn

* Add `count-missing` function

* Add test for `sort-column` for missing values

* Activate test that will now pass due to tmd upgrade

* Add sort-column to api-template

* Add sort-column section to column_exploration doc

* Add more missing apidoc

* move fns to their own namespace to mirror main tc api
* add `drop-missing` and `replace-missing`

* Add details about missing api to column exploration

* Add an example of using count to column exploration

* Add a few simple tests for missing ns

* Fix docstrings

* Add proof of concept

* Consolidate tablecloth.column.api/operators args (#106)

* Consolidate ops args to x y z

* Fix lift op for comparison ops

* Update lift-op fn to handle multiple arg lookups

The case that required this was the comparison ops. We
want (> x y z) from (> lhs rhs) (> lhs mid rhs). We
can't universally map y to rhs because it would be
wrong for the 3-arity option.

* Lift column ops to the dataset level (#107)

* Readme: Replace `lein test` with `lein midje`

* Add proof of concept for lifting

* Clean up

* Fix magnitude arguments

* Fix typo breaking lift operation for `magnitude

* Save prototype working example that handles optional arguments

* Clean up

* Reorganize codegen utilities

* moved hopefully common utilities up  into 'tablecloth.utils.codegen
* retooled those helpers in that ns to be a bit more accessible (WIP)

* Clean up

* Clean up

* Rejigger codegen for column ops to take just fn-sym arglists

* Try lifting all column ops to ds (no tests yet)

* Exclude ops that do not potentially return column

* Do not lift options that do not return columns

* Add docstrings for some codegen

Also regenerated operators to make sure tests pass.

* Add docstring to ds col ops

* version bump and small fix

* Modify ds-level lift op to also return fn that returns column

This is a breaking change for the column api lifting until I adapt
the lift-op to the changes made in the codegen where the argument
is supplied in data rather than within a fn.

* example added for replace-missing

* Add tests for ops that take inf number of cols

* Add tests for ops returning ds taking max of three cols

* Add tests for ops returning ds and taking two columns max

* Test for ops returning ds and max of one column

* Add more functions to test for ops taking one col

* Clean up

* Lifted ops taking one column and returning a scalar

* Lift functions taking two columns and returning a scalar

* Clean up

* Clean up

* bump to 7.000-beta-50

* fixes #108

* hashing in joins enabled for every case

* 7.000-beta-51

* Clean up

* Lift functions taking 1 col and returning scalar

* Adjust column api lift ops to new declarative syntax

* Adjust lift plan for tablecloth.column.api for tmd v7

* Remove mention of tech.ml.datatype

* Add missing word

* Bump tmd version to 7.006 for fix to fns that were erroring

fns are: quartiles-1, quartiles-3 and median

* Fixing more tests

* Comment some code to keep around for a spell

* Remove special lift op for 'round

Its arguments were fixed.

* Cleanup

* 7.007

---------

Co-authored-by: Teodor Heggelund <git@teod.eu>
Co-authored-by: genmeblog <38646601+genmeblog@users.noreply.github.com>
Co-authored-by: GenerateMe <generateme.blog@gmail.com>
Co-authored-by: adham-omran <git@adham-omran.com>

* Ethan/lift scalar ops to ds as aggregators (#118)

* Fix indentation

* Save rough working example

Not fully tested

* Fix tests for new aggregator form of ops that return scalar

* Add `column` API documentation (#120)

* Add a sample notebook file

* Save draft work on column api doc

* Add doc entry for tcc/select boolean select

This appears to be broken now, but it shouldn't be.

* Export column api operators in column api ns

* Add in some documentation of operations

* Hide namespace expression from generated doc

* Fix circular dependency

* Update generated docs

* Update text in column operations section

* More updates to the docs

* Remove "Functionality" header in TOC

This way Dataset is an entry, and I can add Column after that.

* Add Column API documentation

* Add an indication of column op signature to docs

* Export lifted column operators in dataset api template

* Add documentation for column operations on datasets

* Some minor changes

* Rename the two headers for Dataset and Column, adding API onto the
end.
* A few small fixes.

* Remove the `Functions` section

This is essentially replaced by the Column API that lifts these
functions into Tablecloth

* Try to remove cyclical dependency

* Revert "Try to remove cyclical dependency"

This reverts commit fcb16c4.

* Fix circular dependency

* Actually fix cyclical dependency

* Undo added line

* Try deploying a documentation preview

* Add preview-branch to docs preview action

Default was gh-pages, we use master.

* Try adding umbrella-dir setting

* Try removing docs folder in umbrella-dir

* Remove old pr docs preview workflow

* Regenerated docs after merge from master

* Add section about column missing values to docs

* Regenerated docs after merge from master

* Remove draft notebook

* Remove temporary trigger for dev branch since it was target of prs

---------

Co-authored-by: Teodor Heggelund <git@teod.eu>
Co-authored-by: genmeblog <38646601+genmeblog@users.noreply.github.com>
Co-authored-by: GenerateMe <generateme.blog@gmail.com>
Co-authored-by: adham-omran <git@adham-omran.com>