Column-wise Iteration Benchmarks #316

Closed
stanbrub opened this issue Jul 11, 2024 · 0 comments · Fixed by #328

stanbrub commented Jul 11, 2024

There are several ways in Python to iterate over the columns of Deephaven tables: dictionary per row, tuple per row, jpy column vectors, and pandas tuples.

  • Add nightly benchmarks for 'dictionary per row' and 'tuple per row' (these are the two fastest Deephaven approaches)
  • Add comparison benchmarks of 'tuple per row' against pandas and pyarrow

import timeit
from deephaven import empty_table
t = empty_table(1_000_000).update(["X=i", "Y=ii"])

iter_func = lambda t: sum((r["X"] + r["Y"] for r in t.iter_dict()))
print("Dict iteration:", timeit.timeit(lambda: iter_func(t), number=10))

iter_func1 = lambda t: sum((r.X + r.Y for r in t.iter_tuple()))
print("Tuple iteration:", timeit.timeit(lambda: iter_func1(t), number=10))

######## Direct jpy column vectors
import jpy
_JColumnVectors = jpy.get_type("io.deephaven.engine.table.vectors.ColumnVectors")
_j_column_vector = _JColumnVectors.ofInt(t.j_table, "X")
_j_column_vector1 = _JColumnVectors.ofLong(t.j_table, "Y")
iter_func_direct = lambda t: sum((_j_column_vector.get(i) + _j_column_vector1.get(i) for i in range(t.size)))
print("Direct (DH Vector) iteration:", timeit.timeit(lambda: iter_func_direct(t), number=10))

######## Pandas
from deephaven.pandas import to_pandas
df = to_pandas(t)
iter_func_pandas = lambda t: sum((r["X"] + r["Y"] for i, r in df.iterrows()))
print("Pandas iteration:", timeit.timeit(lambda: iter_func_pandas(t), number=10))

# Separate name so the iterrows() version above is not overwritten
iter_func_pandas_tuples = lambda t: sum((r.X + r.Y for r in df.itertuples()))
print("Pandas iteration (tuples):", timeit.timeit(lambda: iter_func_pandas_tuples(t), number=10))
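
Each snippet above times a full pass over the table with timeit. For anyone wanting to try the measurement pattern without a Deephaven session, here is a minimal sketch that assumes plain Python lists as stand-in columns. The names `rows_as_dicts`, `rows_as_tuples`, and `best_seconds` are hypothetical helpers for illustration, not Deephaven APIs, and the timings it prints say nothing about `iter_dict()` or `iter_tuple()` themselves.

```python
import timeit
from collections import namedtuple

N_ROWS = 100_000  # smaller than the issue's 1_000_000 to keep the sketch quick

# Stand-in columns (assumption: plain lists, not a Deephaven table).
xs = list(range(N_ROWS))
ys = list(range(N_ROWS))

Row = namedtuple("Row", ["X", "Y"])


def rows_as_dicts():
    """Yield one dict per row, mimicking the shape of dict-per-row iteration."""
    for x, y in zip(xs, ys):
        yield {"X": x, "Y": y}


def rows_as_tuples():
    """Yield one named tuple per row, mimicking the shape of tuple-per-row iteration."""
    for x, y in zip(xs, ys):
        yield Row(x, y)


def sum_dicts():
    return sum(r["X"] + r["Y"] for r in rows_as_dicts())


def sum_tuples():
    return sum(r.X + r.Y for r in rows_as_tuples())


def best_seconds(func, repeat=3, number=1):
    """Return the fastest of `repeat` timings, which is less noisy than a single run."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


print("Dict iteration:", best_seconds(sum_dicts))
print("Tuple iteration:", best_seconds(sum_tuples))
```

Taking the minimum over several repeats is one common way to reduce noise from other processes; a nightly benchmark might prefer a median or a full distribution instead.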

Some recent results from Jianfeng (total seconds for 10 passes, as reported by timeit):

Dict iteration: 2.7872319179587066
Tuple iteration: 1.736757001024671
Direct (DH Vector) iteration: 4.774696293985471
Pandas iteration (tuples): 3.3441969179548323
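
To make these totals easier to compare, the sketch below converts each one to a per-row cost and a ratio against tuple iteration. It assumes the numbers came from the snippet as posted (1,000,000 rows and `number=10`); `summarize` is a hypothetical helper name.

```python
# Timings copied from the results above: total seconds for number=10 passes.
results = {
    "Dict iteration": 2.7872319179587066,
    "Tuple iteration": 1.736757001024671,
    "Direct (DH Vector) iteration": 4.774696293985471,
    "Pandas iteration (tuples)": 3.3441969179548323,
}

ROWS = 1_000_000  # assumption: empty_table(1_000_000), as in the snippet
PASSES = 10       # assumption: number=10 in each timeit call


def summarize(results, baseline):
    """Map each label to (nanoseconds per row, ratio to the baseline total)."""
    out = {}
    for name, total in results.items():
        ns_per_row = total / (ROWS * PASSES) * 1e9
        out[name] = (ns_per_row, total / baseline)
    return out


summary = summarize(results, results["Tuple iteration"])
for name, (ns, ratio) in summary.items():
    print(f"{name}: {ns:.0f} ns/row, {ratio:.2f}x tuple time")
```

On these numbers, dict iteration takes roughly 1.6x as long as tuple iteration, pandas `itertuples()` roughly 1.9x, and direct column-vector access roughly 2.75x.
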

stanbrub added the enhancement label (New feature or request) Jul 11, 2024
stanbrub self-assigned this Jul 11, 2024
stanbrub linked a pull request Aug 12, 2024 that will close this issue