Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Expr architecture #2562

Open
st-pasha opened this issue Aug 6, 2020 · 5 comments
Open

Refactor Expr architecture #2562

st-pasha opened this issue Aug 6, 2020 · 5 comments
Labels
EPIC ⭐ Big task that may encompass many smaller ones refactor Internal code changes, clean-ups or reorganizations that are not externally visible

Comments

@st-pasha
Copy link
Contributor

st-pasha commented Aug 6, 2020

Our current approach for creating new Exprs is overly complicated, as outlined here: https://github.com/h2oai/datatable/blob/master/src/core/expr/!readme.md. This complexity stems mostly from the fact that the Expr class which performs arithmetic on f-expressions is defined in pure python, and then needs to be bridged into the C++ core.

A more sane approach would be to define everything in C++, eliminating most of the "middle-man" code. In particular, the following architecture is proposed:

  • C++ class py::FExpr to replace current python Expr class;
  • C++ class py::ColumnNamespace to replace current python FrameProxy class;
  • C++ class dt::expr::FExpr is a merged version of current dt::expr::Expr and dt::expr::Head. The class is virtual, with the hierarchy following that of the Head class;
  • Each py::FExpr contains a shared_ptr<dt::expr::FExpr>;
  • The dt::expr::FExpr class defines virtual methods for evaluation and reproing;
  • The Op enum is removed.

subtasks

  • Add support for numeric and comparison methods in py::XObject<C>;
  • Create class py::FExpr (which will eventually replace the pure-python datatable.expr.Expr);
  • Create class dt::expr::FExpr which is a backend for py::FExpr;
  • Arrange so that new FExprs can be used alongside old pure-python Exprs;
  • Create class py::Namespace to replace pure-python datatable.expr.FrameProxy;
  • Convert existing OldExpr-based functionality into FExprs:
    • Frame-expr;
    • List-expr;
    • Dict-expr;
    • Literal exprs:
      • None;
      • bool;
      • int;
      • float;
      • str;
      • type;
      • range;
      • slice (all);
      • slice (numeric);
      • slice (string);
    • Column selectors f.A / f[0];
    • f.extend();
    • f.remove();
    • Cast functions;
    • shift();
    • ifelse();
    • cut();
    • qcut();
    • Arithmetic binary operators
      • +;
      • -;
      • *;
      • /;
      • //;
      • %;
      • **;
    • Bitwise binary operators
      • &;
      • |;
      • ^;
      • <<;
      • >>;
    • Unary operations
      • +;
      • -;
      • ~;
    • Comparison operators
      • <;
      • >;
      • <=;
      • >=;
      • ==;
      • !=;
    • String methods
      • len()
      • re_match();
    • Reducers
      • mean,
      • min,
      • max,
      • stdev,
      • first,
      • last,
      • sum,
      • count,
      • count0,
      • median,
      • cov,
      • corr;
    • Math functions:
      • Trigonometric
        • sin,
        • cos,
        • tan,
        • arcsin,
        • arccos,
        • arctan,
        • arctan2,
        • hypot,
        • deg2rad,
        • rad2deg;
      • Hyperbolic
        • sinh,
        • cosh,
        • tanh,
        • arsinh,
        • arcosh,
        • arcosh;
      • Exponential
        • cbrt,
        • exp,
        • exp2,
        • expm1,
        • log,
        • log10,
        • log1p,
        • log2,
        • logaddexp,
        • logaddexp2,
        • pow,
        • sqrt,
        • square;
      • Special
        • erf,
        • erfc,
        • gamma,
        • lgamma;
      • Floating
        • abs,
        • ceil,
        • copysign,
        • fabs,
        • floor,
        • frexp,
        • isclose,
        • isfinite,
        • isinf,
        • isna,
        • ldexp,
        • modf,
        • rint,
        • sign,
        • signbit,
        • trunc;
      • Miscellaneous
        • clip,
        • divmod,
        • fmod,
        • maximum,
        • minimum;
    • Row-functions:
      • rowall,
      • rowany,
      • rowcount,
      • rowfirst,
      • rowlast,
      • rowmin,
      • rowmax,
      • rowmean,
      • rowsum,
      • rowsd;
  • Documentation:
    • Update documentation on how to work with new FExpr infrastructure ("expr/!readme.md");
    • Add API documentation for the py::Namespace class;
    • Add API documentation for the py::FExpr class;
  • Final cleanup:
    • Remove python class datatable.expr.FrameProxy;
    • Remove python class datatable.expr.Expr;
    • Remove python enum datatable.expr.OpCodes;
    • Remove the dt::expr::Op enum;
    • Remove the dt::expr::OldExpr class;
    • Remove args_registry.
@st-pasha st-pasha added refactor Internal code changes, clean-ups or reorganizations that are not externally visible EPIC ⭐ Big task that may encompass many smaller ones labels Aug 6, 2020
@st-pasha st-pasha added this to the Release 0.11.0 milestone Aug 6, 2020
@st-pasha st-pasha self-assigned this Aug 6, 2020
@samukweku
Copy link
Contributor

samukweku commented Aug 8, 2020

Hi @st-pasha, I think it would be helpful as well if a documentation could be created for code contributors , to know what the building blocks are, so to speak. That way when dabbling in and writing the code, I can sort of know which parts to use and what. Admittedly, my knowledge of C++ is limited, so maybe the documentation is clear and the challenge is with me.

Also, with this new refactoring, is it advisable for code contributors to wait, till the refactoring is complete?

@st-pasha
Copy link
Contributor Author

st-pasha commented Aug 8, 2020

Hi @samukweku, this new refactoring effort aims at simplifying code inside the core/expr folder. It is quite a significant part of the codebase, but it's not the only part. So if you want to try and contribute code, feel free to look around. For example, #2166 is both relatively simple and completely independent of the Expr architecture. On the other hand if you wanted to do any PR that involves creating new functions that act within DT[...], then it's best to wait a week or so until this upgrade is complete.

st-pasha added a commit that referenced this issue Aug 12, 2020
- Created C++ class `py::FExpr`, which is equivalent to current pure-python class `datatable.expr.Expr`;
- Created C++ class `py::Namespace`, replacing the current pure-python class `datatable.expr.FrameProxy`;
- Class `dt::expr::Expr` renamed into `OldExpr`, in anticipation that it will be removed in the near future;
- Added C++ class `dt::expr::FExpr`, which will eventually replace `dt::expr::Expr` (and `dt::expr::Head`), but currently serves as its base class;
- Extended class `py::XObject` so that it now supports adding numeric methods (such as `__add__()`, `__mul__()`, etc) and comparison methods (`__lt__()`, `__eq__()`, etc);
- The evaluation methods were altered so that the boolean flag `allow_new` is no longer necessary.

The new `py::FExpr` classes is still "dormant" in the sense that it is not used by the library anywhere. Thus, the current functionality was not changed (mostly) just yet. Instead, I'm laying down a foundation where the new classes can gradually supplant the current python class `Expr` without causing any large-scale disturbances.

WIP for #2562
@st-pasha st-pasha mentioned this issue Aug 13, 2020
st-pasha added a commit that referenced this issue Aug 13, 2020
New classes `FExpr_ColumnAsAttr` and `FExpr_ColumnAsArg` (derived directly from `FExpr`) replace the previous `Head_Func_Column`.
With these changes the new FExpr-based internal API is now actually usable and can be used to implement any new functionality.

WIP for #2562
This was referenced Aug 13, 2020
st-pasha added a commit that referenced this issue Aug 14, 2020
Old `Expr`s corresponding to various python literals are hereby converted into the new `FExpr` format.

WIP for #2562
st-pasha added a commit that referenced this issue Aug 14, 2020
Converted `ifelse()` function into the new `FExpr` format.

WIP for #2562
st-pasha added a commit that referenced this issue Aug 16, 2020
- Implemented standard arithmetic operators such as `+`, `-`, `*`, `/`, `//`, `%` and `**` in the new FExpr API;
- Added tentative documentation for `__add__` into the docs build;
- Class py::FExpr renamed into `PyFExpr` and moved into the `dt::expr` namespace;
- The old `datatable.expr.Expr` now uses the new `FExpr` approach;
- `FExpr` objects can now be constructed from python.

Overall there were more substantial changes when implementing the `FExpr__add__`, `FExpr__sub__`, etc. classes, compared to previous PRs. This is because our current implementation of binary operators is overly complicated, and I'm trying to simplify it by removing all the layers of indirection.

WIP for #2562
st-pasha added a commit that referenced this issue Aug 17, 2020
FExpr implementation for `<`, `>`, `<=`, `>=`, `==` and `!=`.

WIP for #2562
oleksiyskononenko added a commit that referenced this issue Aug 19, 2020
st-pasha added a commit that referenced this issue Aug 19, 2020
This is a tutorial-style detailed documentation, which replaces old `!readme.md` (whose information is now completely out-of-date).

WIP for #2562
@st-pasha st-pasha removed this from the Release 0.11.0 milestone Aug 24, 2020
oleksiyskononenko added a commit that referenced this issue Sep 9, 2022
While #2562 is still WIP, we need to have proper a documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.
samukweku pushed a commit that referenced this issue Sep 15, 2022
While #2562 is still WIP, we need to have proper a documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.
samukweku pushed a commit that referenced this issue Sep 17, 2022
While #2562 is still WIP, we need to have proper a documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.
samukweku pushed a commit that referenced this issue Sep 18, 2022
While #2562 is still WIP, we need to have proper a documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.
oleksiyskononenko pushed a commit that referenced this issue Dec 12, 2022
oleksiyskononenko pushed a commit that referenced this issue Dec 15, 2022
samukweku added a commit that referenced this issue Jan 3, 2023
samukweku added a commit that referenced this issue Jan 3, 2023
samukweku pushed a commit that referenced this issue Jan 3, 2023
While #2562 is still WIP, we need to have proper a documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.
oleksiyskononenko pushed a commit that referenced this issue Jan 20, 2023
- refactor `mean()` reducer to use `FExpr`;
- fix `median()` to use `float64` for void columns.

WIP for #2562
oleksiyskononenko added a commit that referenced this issue Jan 21, 2023
- refactor existing unary `FExpr` reducers to inherit from `ReduceUnary_ColumnImpl`;
- make `is_grouped` to be a template parameter.

WIP for #2562
oleksiyskononenko pushed a commit that referenced this issue Mar 6, 2023
oleksiyskononenko pushed a commit that referenced this issue Apr 22, 2023
- convert `count()` and `countna()` to `FExpr`;
- fix `countna()` when applied to grouped columns;
- change`ReduceUnary_ColumnImpl` for more type flexibility.

WIP for #2562 
Closes #3441
oleksiyskononenko added a commit that referenced this issue May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
EPIC ⭐ Big task that may encompass many smaller ones refactor Internal code changes, clean-ups or reorganizations that are not externally visible
Projects
None yet
Development

No branches or pull requests

2 participants