Build API for array processing built on dtype-next #48

Open
ezmiller opened this issue Aug 14, 2021 · 0 comments
Labels
enhancement New feature or request

ezmiller commented Aug 14, 2021

Goal

Currently, tablecloth provides an easy-to-use wrapper over tech.ml.dataset’s high-performance dataset processing constructs. One part of the tech.ml stack that tablecloth has not directly covered is dtype-next, which provides a highly performant basis for array-like numerical processing, similar to NumPy. The project I am proposing aims to wrap dtype-next within tablecloth, providing a new easy-to-use API for numerical structures for the emerging Clojure data processing ecosystem.

Rough Outline of Steps

During this project, I will focus on the following tasks:

  • Add a new function to tablecloth (perhaps named column or array) that will return a typed, countable, random-access data structure backed by dtype-next’s abstractions;
  • Design two API pathways to interact with this structure: one that realizes the data fully at each step, providing more straightforward but less efficient interaction; and another, more performant but slightly harder to use, that allows users to wrap processing steps in a "transaction";
  • Mimic the NumPy (and possibly R vector) APIs, ensuring an equivalently complete functional interface for numerical processing;
  • Ensure support for a reading-friendly format for printing columns in the Clojure REPL (see "reading-friendly format for printing columns", techascent/tech.ml.dataset#203);
  • Validate the usefulness of the API by implementing real-world examples with various characteristics (missing values, various data types, challenging sizes, etc.) and comparing the ergonomics with other platforms such as Python and R.
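
As a rough, language-neutral illustration of the two API pathways described in the list above, here is a minimal sketch written in Python rather than Clojure. It does not use dtype-next, and every name in it (`Column`, `Transaction`, `map`, `commit`) is hypothetical, chosen only for this example; the real API, its naming, and its semantics are exactly what this project would need to work out. The eager pathway allocates a new backing array at every step, while the "transaction" pathway records the steps and applies them in a single pass when committed.

```python
from array import array
from typing import Callable, Iterable, List


class Column:
    """Hypothetical typed, countable, random-access column (eager pathway)."""

    def __init__(self, typecode: str, values: Iterable[float]):
        # array.array enforces a single element type, standing in for a typed buffer.
        self._data = array(typecode, values)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> float:
        return self._data[index]

    @property
    def typecode(self) -> str:
        return self._data.typecode

    def map(self, fn: Callable[[float], float]) -> "Column":
        # Eager: every call realizes a brand-new backing array.
        return Column(self.typecode, (fn(x) for x in self._data))

    def to_list(self) -> List[float]:
        return list(self._data)


class Transaction:
    """Hypothetical deferred pathway: record steps, then run them in one pass."""

    def __init__(self, column: Column):
        self._column = column
        self._steps: List[Callable[[float], float]] = []

    def map(self, fn: Callable[[float], float]) -> "Transaction":
        # Nothing is computed here; the step is only recorded.
        self._steps.append(fn)
        return self

    def commit(self) -> Column:
        def apply_all(x: float) -> float:
            for step in self._steps:
                x = step(x)
            return x

        # One pass over the data and one allocation, however many steps were recorded.
        return Column(self._column.typecode,
                      (apply_all(x) for x in self._column.to_list()))


col = Column("d", [1.0, 2.0, 3.0])

# Eager pathway: two intermediate arrays are realized.
eager = col.map(lambda x: x + 1).map(lambda x: x * 2)

# Transaction pathway: the same two steps, realized once at commit.
deferred = Transaction(col).map(lambda x: x + 1).map(lambda x: x * 2).commit()
```

Both pathways produce the same values here; the difference is only in how many intermediate structures get realized along the way, which is the trade-off between ease of use and performance that the second task is meant to explore.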

Open Questions

  • What will the name of this entity be? Some options could be: array, column, buffer, column-vector.
  • Does it make sense for this API to live within tablecloth or might we want to break it out into its own library?
  • Are there ways that this work needs to align with the work that @ribelo and @genmeblog are doing to define a syntax for operations on dataset columns (e.g. "Expose dtype-next column functions in tablecloth.api ns", #47)?

ezmiller added the enhancement label on Aug 14, 2021