
The auto conversion feature of the Python UDF decorator has a performance problem. #5112

Closed
jmao-denver opened this issue Feb 3, 2024 · 14 comments · Fixed by #5291
Labels: bug, core, python, python-server-side

@jmao-denver
Contributor

This was found by the latest benchmarking effort to measure the performance impact of the Python UDF usability improvements.

@jmao-denver jmao-denver added bug Something isn't working triage python-server-side labels Feb 3, 2024
@jmao-denver jmao-denver self-assigned this Feb 3, 2024
@rcaudy rcaudy added core Core development tasks python and removed triage labels Feb 4, 2024
@jmao-denver
Contributor Author

jmao-denver commented Feb 5, 2024

I have done quite a bit of digging and experimentation (starting from looking for obviously leaky code, then disabling auto-conversion, then simplifying the UDF, and finally bypassing the UDF decorator completely), and I now believe the 'memory leak' may have something to do with the default liveness scope mishandling tables created in the global scope. This is of course only speculation without diving into the actual implementation. @niloc132, @rcaudy are the resident experts/creators of the liveness scope, and should be able to tell whether I am talking nonsense here after a quick look at the simple code examples/results below.
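To illustrate the suspected mechanism in isolation (a toy sketch only; the class and method names below are hypothetical and do not correspond to Deephaven's actual liveness implementation): a default scope that retains everything created at top level and is never released will accumulate objects across runs, while a scope opened and released per run frees them.

```python
# Toy "liveness scope": retains every resource created while active.
# Illustrative only -- not Deephaven's real implementation.
class ToyScope:
    def __init__(self):
        self.retained = []

    def manage(self, resource):
        # Retain a reference so the resource stays alive with the scope.
        self.retained.append(resource)
        return resource

    def release(self):
        # Drop all retained references at once.
        self.retained.clear()

# Top-level work managed by a scope that is never released: objects pile up,
# analogous to running the script repeatedly in the IDE.
default_scope = ToyScope()
for _ in range(3):
    default_scope.manage(bytearray(1024))
leaked = len(default_scope.retained)   # 3 -- nothing was freed

# The same work wrapped in a per-run scope that is released each time,
# analogous to wrapping the benchmark in one_run().
def one_run():
    scope = ToyScope()
    scope.manage(bytearray(1024))
    scope.release()
    return len(scope.retained)

freed = one_run()                      # 0 -- released at end of run
```

This mirrors the two results below: the function-wrapped variant holds steady, while the top-level variant grows by a near-constant amount per run.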

1. Wrap @stanbrub's benchmarking script in a function and call it multiple times: no memory leak.

import time, jpy
from deephaven import empty_table, garbage_collect
from numpy import typing as npt
import numpy as np


def one_run():
    row_count = 100_000

    def why(arr):    
        arr = np.array(arr)
        return arr[0]

    source = empty_table(row_count).update(["X = repeat(ii % 250, 100)"])

    begin_time = time.perf_counter_ns()
    result = source.select('Y = why(X)')
    print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))


    del result
    del source
    del why
    del row_count
    del begin_time
    for i in range(10):
        garbage_collect()

    Runtime = jpy.get_type('java.lang.Runtime')
    print('Gigs Used After GC:',
            (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024 / 1024 / 1024)
    del Runtime

for i in range(10):
    one_run()
Rows / Sec: 363735.1236744706
Gigs Used After GC: 0.5138832032680511
Rows / Sec: 359119.1875069157
Gigs Used After GC: 0.5182483717799187
Rows / Sec: 375674.1011544443
Gigs Used After GC: 0.5190707519650459
Rows / Sec: 366915.53177368885
Gigs Used After GC: 0.5190758258104324
Rows / Sec: 381117.53188286355
Gigs Used After GC: 0.5191915258765221
Rows / Sec: 386750.756027146
Gigs Used After GC: 0.5192193910479546
Rows / Sec: 384669.07850075746
Gigs Used After GC: 0.5193070024251938
Rows / Sec: 364130.9196308441
Gigs Used After GC: 0.5193022042512894
Rows / Sec: 353877.2863909734
Gigs Used After GC: 0.5194044783711433
Rows / Sec: 363178.2083835716
Gigs Used After GC: 0.5193197578191757

2. Run @stanbrub's script as is multiple times manually in the IDE: an almost constant amount of memory leaks each time. In fact, we don't even need to involve the Python UDF; just replacing the formula in the select op with "Y = 1" renders exactly the same result.

import time, jpy
from deephaven import empty_table, garbage_collect
from numpy import typing as npt
import numpy as np

row_count = 100_000

def why(arr):    
    arr = np.array(arr)
    return arr[0]

source = empty_table(row_count).update(["X = repeat(ii % 250, 100)"])

begin_time = time.perf_counter_ns()
result = source.select('Y = why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))


del result
del source
del why
del row_count
del begin_time
for i in range(10):
    garbage_collect()

Runtime = jpy.get_type('java.lang.Runtime')
print('Gigs Used After GC:',
        (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024 / 1024 / 1024)
del Runtime
Rows / Sec: 348757.2688643897
Gigs Used After GC: 0.5188859179615974
Rows / Sec: 363289.9712724459
Gigs Used After GC: 0.5953610986471176
Rows / Sec: 355113.5833992525
Gigs Used After GC: 0.6713495403528214
Rows / Sec: 352858.5133444862
Gigs Used After GC: 0.7472281381487846
Rows / Sec: 343005.6148955841
Gigs Used After GC: 0.8226156905293465
Rows / Sec: 338097.21665624983
Gigs Used After GC: 0.8969474732875824
Rows / Sec: 360966.3063689902
Gigs Used After GC: 0.9720757678151131
Rows / Sec: 357905.9099864775
Gigs Used After GC: 1.0456845089793205
Rows / Sec: 351107.51704972837
Gigs Used After GC: 1.1266669183969498
Rows / Sec: 356261.26827671967
Gigs Used After GC: 1.2034849897027016

@stanbrub
Contributor

stanbrub commented Feb 5, 2024

Just to be clear, there are multiple issues that can be seen in testing between 0.24.0 and 0.32.0. (Though the regression happened somewhere after 0.24.0 and before 0.32.0, these are the versions that best match what HH is seeing.)

Performance:

  • Between 0.24.0 and 0.32.0, some UDFs regressed significantly (35% to 45%)
  • For 0.32.0 (and probably the previous release) adding hints to UDF args makes things worse
  • The performance regression happens for both arrays and scalars
  • Unable to reproduce UDF regression in Groovy

Memory:

  • Running the same UDF test back-to-back causes OOM, even if the test explicitly deletes table and runs GC
  • The OOM happens on UDFs that have scalar or array arguments
  • Unable to reproduce memory issue in Groovy

@jmao-denver
Contributor Author

@stanbrub Can you confirm that the performance degradation for scalars is on the same scale as for arrays?

@chipkent
Member

chipkent commented Feb 5, 2024

The title lists a "memory leak" and a "performance problem". The example in the thread clearly shows a memory leak, but it looks like the rows/sec remains constant. Is there another reproducer of the "performance problem", or is the performance problem just a slowdown that happens as the process runs out of memory?

@stanbrub
Contributor

stanbrub commented Feb 6, 2024

Here are some results showing the UDF performance regression between 0.24.0 and 0.32.0 for both scalar and array values:

  • There are large performance regressions for UDFs with no type hints
  • Adding type hints makes the performance loss even bigger
  • All tests were run with released docker images in the console w/ 24G Heap on x86 w/ 24 CPU threads
  • None of the below benchmarks came close to memory limits

(attached chart: udf-scalar-vs-array, 0.24.0 vs 0.32.0)
udf-array-no-hints.py.txt
udf-double-scalar-no-hints.py.txt
udf-single-scalar-no-hints.py.txt
udf-single-scalar-with-hints.py.txt
udf-double-scalar-with-hints.py.txt
udf-array-with-hints.py.txt

@stanbrub
Contributor

stanbrub commented Feb 8, 2024

Here's some more supporting info on the performance regression. I ran some Benchmark UDF tests on >= 0.28.0.

  • There are two regressions: one with the release of 0.29.0 and another with the release of 0.31.0.
  • Using hints in 0.28.0 greatly improves performance over equivalent UDFs without hints
  • Using hints in 0.31.0 greatly diminishes performance over equivalent UDFs without hints
  • The rates used for the below charts are an average over several scalar and array UDFs

(attached chart: 2024-02-07, UDF regression since 0.28.0)

@jmao-denver
Contributor Author

No hints, no vectorization

import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v1):
    return v1 + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=(int)why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))

Rows / Sec: 500045.2228398406
Rows / Sec: 467877.1441906515
Rows / Sec: 507475.29067793896

Return type hint, vectorization

import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v) -> np.int32:
    return v + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))

Rows / Sec: 750895.9363673003
Rows / Sec: 750593.4614114742
Rows / Sec: 749491.9379651789

Numpy type hints for input, vectorization

import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v: np.int32) -> np.int32:
    return v + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))

Rows / Sec: 246714.64247366833
Rows / Sec: 246335.93767897843
Rows / Sec: 245319.00670334187

Python built-in type for input, vectorization
import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v: int) -> np.int32:
    return v + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
Rows / Sec: 345183.85462927143
Rows / Sec: 341366.03654191043
Rows / Sec: 340745.41210793675

@jmao-denver jmao-denver changed the title The auto conversion feature of the Python UDF decorator has a memory leak and performance problem. The auto conversion feature of the Python UDF decorator has a performance problem. Feb 16, 2024
@jmao-denver
Contributor Author

jmao-denver commented Feb 16, 2024

TLDR: auto-conversion of the args and return value of a Python UDF really kills performance.

Auto-conversion disabled in Python. With type hints the call is vectorized; without type hints it is not vectorized and a cast is needed. Formula: "Y = why(X)"

type hints, Rows / Sec: 7198272.270689589. <==  14 times faster
no type hints, Rows / Sec: 496143.91961906385

Auto-conversion disabled in Python, no vectorization at all. Formula: "Y = why(X + 1)"

type hints, Rows / Sec: 722282.9272291749            <==  same performance within the margin of error
no type hints, Rows / Sec: 732129.3625082999

Auto-conversion enabled in Python. With type hints the call is vectorized; without type hints it is not vectorized and a cast is needed. Formula: "Y = why(X)"

type hints, Rows / Sec: 400916.5270778188  <==  ~18% slower; the auto-conversion overhead more than offsets the gain from vectorization
no type hints, Rows / Sec: 488018.12092333037

Auto-conversion enabled in Python, no vectorization at all. Formula: "Y = why(X + 1)"

type hints, Rows / Sec: 239400.20135381163 <== ~50% slower
no type hints, Rows / Sec: 479758.51428106957 

type hints, Rows / Sec: 505862.9620497612  <==  only return-value conversion, ~18% slower
no type hints, Rows / Sec: 611064.2156193163 

type hints, Rows / Sec: 272818.1017999679 <== only input args converted,  ~60% slower
no type hints, Rows / Sec: 474922.4556231638

Auto-conversion disabled in Java. With type hints the call is vectorized; without type hints it is not vectorized and a cast is needed. Formula: "Y = why(X)"

type hints, Rows / Sec: 7229849.004798741  <== similar performance because vectorization always goes through the vectorization decorator in Python
no type hints, Rows / Sec: 784986.622918931 <== better performance because there is no vectorization and no UDF decorator

Auto-conversion disabled in Java, no vectorization. Better performance than disabling it in Python only, because the call doesn't go through the UDF decorator.

type hints, Rows / Sec: 788237.3750420611
no type hints, Rows / Sec: 767679.3823834389
import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 1_000_000
source = empty_table(row_count).update(["X = (int)(ii % 250)"])

def run_test():

    # with type-hints
    def why(v: int) -> np.int32:
        return v

    begin_time = time.perf_counter_ns()
    for i in range(5):
        result = source.select('Y=why(X + 1)')
    print("type hints,", 'Rows / Sec:', row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)
    
    # try to restore the worker to the same state
    result = None
    for i in range(5):
        garbage_collect()
        time.sleep(0.1)   
    time.sleep(5)

    # without type-hints
    def why(v):
        return v

    begin_time = time.perf_counter_ns()
    for i in range(5):
        # result = source.select('Y=why(X)').select("Y = (int)Y")
        result = source.select("Y = (int)why(X + 1)")
    print("no type hints,", 'Rows / Sec:', row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)


    result = None
    for i in range(5):
        garbage_collect()    

run_test()

@chipkent
Member

I'm trying to get my head around the relevant changes that lead to the problem. Am I correct that these are the key files?
https://github.com/deephaven/deephaven-core/blob/main/py/server/deephaven/_udf.py
https://github.com/deephaven/deephaven-core/blob/main/engine/table/src/main/java/io/deephaven/engine/util/PyCallableWrapperJpyImpl.java

@jmao-denver
Contributor Author

jmao-denver commented Feb 18, 2024

Array input: vectorization with type hints, no vectorization without type hints

--- arg/return value  auto-conversion enabled ---
type hints, Rows / Sec: 1231949.3024970691
no type hints, Rows / Sec: 1820989.1977622057
--- no return value auto-conversion ---
type hints, Rows / Sec: 1365070.3849798716
no type hints, Rows / Sec: 1862219.3173830567
--- no arg auto-conversion ---
type hints, Rows / Sec: 2411131.7068459815
no type hints, Rows / Sec: 1854127.122524789
--- no arg or return value auto-conversion ---
type hints, Rows / Sec: 2948846.317343872
no type hints, Rows / Sec: 1840094.2422902663

Array input: no vectorization in either case

--- arg/return value auto-conversion enabled ---
type hints, Rows / Sec: 1034499.0853991724
no type hints, Rows / Sec: 1832391.606097296
--- no return-value auto-conversion ---
type hints, Rows / Sec: 1108208.866524745
no type hints, Rows / Sec: 1777854.1193077555
--- no arg auto-conversion ---
type hints, Rows / Sec: 1754613.7791632663
no type hints, Rows / Sec: 2071235.161504206
--- no arg/return value auto-conversion ---
type hints, Rows / Sec: 2098357.0185048156
no type hints, Rows / Sec: 2104465.867133651
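The no-hints variants above convert the incoming array by hand with np.frombuffer, which wraps the underlying buffer in a single C-level call, whereas per-element conversion pays a Python-level loop over every value. A standalone sketch of that cost difference (pure NumPy, no Deephaven; assumes a little-endian host for the byte decoding):

```python
import numpy as np

# Simulate a primitive Java long[] arriving as a raw buffer (jpy can expose
# primitive arrays via the buffer protocol; plain bytes stand in here).
raw = np.arange(1_000, dtype=np.int64).tobytes()

# Buffer-level conversion: one C-level call, no Python loop per element.
fast = np.frombuffer(raw, dtype=np.int64)

# Per-element conversion: a Python-level loop over every value, roughly
# the kind of work a naive per-cell auto-conversion pays.
slow = np.array([int.from_bytes(raw[i * 8:(i + 1) * 8], "little", signed=True)
                 for i in range(len(raw) // 8)], dtype=np.int64)

assert (fast == slow).all()
```

Both produce the same array; only the first avoids touching each element from Python, which is why the "no arg auto-conversion" rows above roughly double the type-hinted throughput.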

@jmao-denver
Contributor Author

jmao-denver commented Feb 19, 2024

Experiments to speed up the auto-conversion

Scalar input/output, auto-conversion on, removed: null/Optional check

type hints, Rows / Sec: 275570.23424915934
no type hints, Rows / Sec: 504334.2759002709

Scalar input/output, auto-conversion on, removed: null/Optional check, numpy scalar support in input

type hints, Rows / Sec: 449091.78075960197
no type hints, Rows / Sec: 520391.05714783334

Array input, auto-conversion on, removed: null/Optional check

type hints, Rows / Sec: 133654.1541810492
no type hints, Rows / Sec: 179031.90280712221

Array input, auto-conversion on, removed: null/Optional check; lookup function call replaced with a map lookup

type hints, Rows / Sec: 157456.28193470957
no type hints, Rows / Sec: 175074.40249280503
import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 10_000_000
source = empty_table(row_count).update(["X = (int)(ii % 1_000_000)", "Y = ii"]).group_by("X")

def run_test():

    # with type-hints
    def why(v: np.ndarray[np.int64]) -> np.float64:
        return np.average(v)

    begin_time = time.perf_counter_ns()
    for i in range(5):
        result = source.select("Z = why(Y)")
    print("type hints,", 'Rows / Sec:', row_count * 5 / 10 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)
    
    # try to restore the worker to the same state
    result = None
    for i in range(5):
        garbage_collect()
        time.sleep(0.1)   
    time.sleep(5)

    # without type-hints
    def why(v):
        v = np.frombuffer(v)
        return np.average(v)

    begin_time = time.perf_counter_ns()
    for i in range(5):
        result = source.select("Z = (double) why(Y)")
    print("no type hints,", 'Rows / Sec:', row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)
    
    # try to restore the worker to the same state
    result = None
    for i in range(5):
        garbage_collect()
        time.sleep(0.1)   
    time.sleep(5)

run_test()

Some rough numbers, UDF vs. pre-UDF:

Py UDF decorator on a non-type-hinted UDF: 6-10% slower (extra function calls and arg type check)
Py UDF decorator on a type-hinted UDF with full auto-conversion: 50-60% slower (the above + DH null handling + np scalar support + Java array -> np.array conversion)
Py UDF decorator on a type-hinted UDF with auto-conversion minus the DH null check and np scalar support, plus optimization: 20-26% slower (base cost + Java array -> np.array conversion)
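The 6-10% base cost can be demonstrated without Deephaven: it is the price of one extra Python call frame plus a per-argument pass. A minimal stand-in decorator (illustrative only; `py_udf` is a hypothetical name, not the actual `_udf.py` code) measured with timeit:

```python
import timeit

def udf(v):
    return v + 1

def py_udf(fn):
    # Minimal stand-in for the UDF decorator's base overhead: one extra
    # call frame plus a per-argument pass, with no actual conversion done.
    def wrapper(*args):
        converted = [a for a in args]  # placeholder for the converter loop
        return fn(*converted)
    return wrapper

wrapped = py_udf(udf)

bare_t = timeit.timeit(lambda: udf(1), number=200_000)
deco_t = timeit.timeit(lambda: wrapped(1), number=200_000)
# The wrapped call pays a measurable per-call penalty even though it
# converts nothing; the exact percentage is machine-dependent.
```

This is the floor: any real conversion work (null checks, numpy scalar wrapping, array copies) stacks on top of it, which is how the 50-60% figure for full auto-conversion arises.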

@chipkent
Member

I'm looking at the perf data above.

  1. I'm not totally understanding what is going on. Here it looks like the scalar input case is only processing 500k rows per sec while the array input case is processing 2m rows per sec. What am I missing? The example code is just the array case code, so I can't guess. Maybe it is because the rows per sec is being calculated based on the input table, and the array case is only processing 10 rows?
  2. The array case is only processing 10 rows, which seems extreme. How bad are things if you process a lot of rows with smaller arrays?

I have other specific questions about what is going on. These questions revolve around the performance of built-in python types vs numpy types (e.g. float vs np.float64), conversion times in hand written code vs annotations, etc. I wrote this benchmark to try to dig in a bit more. I'm not sure how to disable vectorization, so the numbers I am getting are not apples to apples.

New benchmarks

import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 1_000_000
source = empty_table(row_count).update(["Y = ii"])

# both type-hints numpy
def why_n_n(v: np.int64) -> np.float64:
    return v+1.2

# both type-hints native
def why_p_p(v: int) -> float:
    return v+1.2

# return type-hints numpy
def why_x_n(v) -> np.float64:
    return v+1.2

# return type-hints native
def why_x_p(v) -> float:
    return v+1.2

# param type-hints numpy
def why_n_x(v: np.int64):
    return v+1.2

# param type-hints native
def why_p_x(v: int):
    return v+1.2

# no type-hints
def why_x_x(v):
    return v+1.2

# no type-hints convert the input
def why_c_x(v):
    v = np.int64(v)
    return v+1.2

# no type-hints convert the output
def why_x_c(v):
    return np.float64(v+1.2)

# no type-hints convert both
def why_c_c(v):
    v = np.int64(v)
    return np.float64(v+1.2)

jobs = [
    ["Z = why_n_n(Y)", "numpy/numpy"],
    ["Z = why_p_p(Y)", "python/python"],
    ["Z = why_x_n(Y)", "none/numpy"],
    ["Z = why_x_p(Y)", "none/python"],
    ["Z = (double) why_n_x(Y)", "numpy/none"],
    ["Z = (double) why_p_x(Y)", "python/none"],
    ["Z = (double) why_x_x(Y)", "none/none"],
    ["Z = (double) why_c_x(Y)", "convert/none"],
    ["Z = (double) why_x_c(Y)", "none/convert"],
    ["Z = (double) why_c_c(Y)", "convert/convert"],
]

def run_test():
    for query, msg in jobs:
        begin_time = time.perf_counter_ns()
        for i in range(5):
            result = source.select(query)
        print(f"{msg} hints: {row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000)} rows/sec ({result.columns[0].data_type})")

        # try to restore the worker to the same state
        result = None
        for i in range(5):
            garbage_collect()
        time.sleep(0.1)
        time.sleep(5)

run_test()

Running this code on the demo system with 0.32.0 produces:

numpy/numpy hints: 68825.7609555498 rows/sec (double)
python/python hints: 99940.31391955742 rows/sec (double)
none/numpy hints: 235329.6939374642 rows/sec (double)
none/python hints: 235613.83023603584 rows/sec (double)
numpy/none hints: 53120.79882927409 rows/sec (double)
python/none hints: 72962.35161478045 rows/sec (double)
none/none hints: 132157.69190853115 rows/sec (double)
convert/none hints: 84414.73315296792 rows/sec (double)
none/convert hints: 123055.80861453459 rows/sec (double)
convert/convert hints: 83034.57429840925 rows/sec (double)

ALL of the rows/sec seem low. I do not know what hardware they are on...

Observation 1: Numpy scalar conversions are ultra slow (not a DH problem)

Numpy scalar types are much slower than python scalar types.

Converting both inputs and outputs to numpy scalars is 31% slower than using the native python types.

numpy/numpy hints: 68825.7609555498 rows/sec (double)
python/python hints: 99940.31391955742 rows/sec (double)

Floating point return conversion seems comparable.

none/numpy hints: 235329.6939374642 rows/sec (double)
none/python hints: 235613.83023603584 rows/sec (double)

Integer inputs are 27% slower.

numpy/none hints: 53120.79882927409 rows/sec (double)
python/none hints: 72962.35161478045 rows/sec (double)

Converting inputs and outputs to numpy scalars by hand is slow.

Here the functions have no type hints and convert by hand. Converting both values is 37% slower than the unconverted case. The float conversion is around 6%, while the bulk of the slowdown is from the integer conversion.

none/none hints: 132157.69190853115 rows/sec (double)
convert/convert hints: 83034.57429840925 rows/sec (double)
convert/none hints: 84414.73315296792 rows/sec (double)
none/convert hints: 123055.80861453459 rows/sec (double)
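Observation 1 can be reproduced without Deephaven at all: arithmetic on a NumPy scalar goes through NumPy's scalar dispatch machinery on every operation, while native int/float arithmetic stays on the fast interpreter path. A quick timeit check (timings are machine-dependent, so only the values are asserted here):

```python
import timeit
import numpy as np

n = 200_000
# Native Python int + float: fast interpreter path.
py_t = timeit.timeit("v + 1.2", setup="v = 7", number=n)
# np.int64 + float: routed through NumPy's scalar dispatch, typically
# several times slower per operation than the native case.
np_t = timeit.timeit("v + 1.2",
                     setup="import numpy as np; v = np.int64(7)",
                     number=n)

# Both compute the same value; only the per-operation cost differs.
assert abs(float(np.int64(7) + 1.2) - 8.2) < 1e-9
```

This is why converting UDF inputs to numpy scalars (whether by annotation or by hand) shows up so prominently in the numbers above, independent of any Deephaven overhead.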

Observation 2: DH wrapper is way too slow (DH problem)

The 0.32.0 annotations are very slow. Using an annotation to convert a Python type is 44% slower than no conversion, even though the result is identical.

none/none hints: 132157.69190853115 rows/sec (double)
python/none hints: 72962.35161478045 rows/sec (double)

Similarly, converting the input to numpy by hand is 36% slower vs 60% slower when using the annotation.

convert/none hints: 84414.73315296792 rows/sec (double)
numpy/none hints: 53120.79882927409 rows/sec (double)

This all indicates that the annotation has too much overhead.

Considerations

  1. Even unannotated cases seem excessively slow on the demo system.
  2. Numpy scalars are incredibly slow. Should we support them? Probably not.
  3. The DH annotation has way too much overhead. It needs to be reengineered for speed, or key parts of what it is doing need to happen in the query language.
  4. Since I don't know how to turn off vectorization, I could not assess the overhead of a return type hint vs having a query string cast. This experiment should be done since it may uncover some other clear performance problems.
  5. None of the above has looked at performance tradeoffs for various ways nulls could be handled. This concept needs to be understood in detail.
  6. None of the above looks at costs associated with input or output arrays.

@jmao-denver
Contributor Author

I'm looking at the perf data above.

  1. I'm not totally understanding what is going on. Here it looks like the scalar input case is only processing 500k rows per sec while the array input case is processing 2m rows per sec. What am I missing? The example code is just the array case code, so I can't guess. Maybe it is because the rows per sec is being calculated based on the input table, and the array case is only processing 10 rows?

It is a simple math mistake. The row count after the group_by is 10 times smaller, so the rate should be computed against the grouped row count.
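Concretely: the benchmark builds 10,000,000 rows but `X = ii % 1_000_000` followed by `group_by("X")` leaves 1,000,000 grouped rows, so the select only processes one-tenth of row_count. The arithmetic of the correction (elapsed_s is a hypothetical stand-in for the measured time):

```python
# Parameters mirror the array benchmark script above.
row_count = 10_000_000
group_count = 1_000_000   # X = ii % 1_000_000 -> 1_000_000 groups after group_by
runs = 5
elapsed_s = 14.2          # hypothetical wall-clock seconds for the 5 runs

# Wrong: rates the select against the pre-group_by row count.
wrong_rate = row_count * runs / elapsed_s

# Right: after group_by, the select processes one row per group.
right_rate = group_count * runs / elapsed_s

# The uncorrected figures are inflated by exactly row_count / group_count = 10x.
assert abs(wrong_rate / right_rate - 10) < 1e-9
```

This is the `row_count * 5 / 10` divisor in the type-hinted branch of the script; the no-hints branch omitted it, which produced the apparent scalar-vs-array discrepancy Chip noticed.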

@jmao-denver
Contributor Author

jmao-denver commented Feb 23, 2024

Summary:

1. The weight of the extra function calls themselves (the Py UDF decorator and its dependencies) costs ~7% in performance (see the difference between Exp. 1 and Exp. 7).
2. Eliminating branching provides a big improvement, though from a very low baseline (compare Exp. 2 and Exp. 3).
3. Exp. 4 probably makes the most sense in terms of functionality and correctness, but it is still significantly slower than the pre-UDF numbers (compare Exp. 1 and Exp. 4).
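The "typed converter" idea behind Experiments 3-6 is to move all branching to decoration time: each parameter gets a converter closure built once, so the per-call path is a single closure invocation with no isinstance/subclass checks left in it. A sketch of that factory pattern (hypothetical names; not the actual `_udf.py` code, though `NULL_LONG` mirrors Deephaven's long null sentinel):

```python
import numpy as np

NULL_LONG = -2**63   # Deephaven's null sentinel for Java long

def make_long_converter(wants_numpy: bool, none_allowed: bool):
    """Build a per-parameter converter ONCE, at decoration time, so the
    per-call path carries no type-inspection branching."""
    if wants_numpy:
        to_np = np.int64
        if none_allowed:
            return lambda v: None if v == NULL_LONG else to_np(v)
        return to_np
    if none_allowed:
        return lambda v: None if v == NULL_LONG else v
    return lambda v: v   # identity: nothing to convert

# Decoration time: the branching runs once per parameter...
conv = make_long_converter(wants_numpy=True, none_allowed=True)

# ...call time: a single closure invocation per argument.
assert conv(NULL_LONG) is None
assert conv(5) == 5 and isinstance(conv(5), np.integer)
```

This is why Exp. 3's hard-coded converter roughly doubles Exp. 2's throughput: the work per call shrinks to one dispatch that was precomputed from the annotation.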

Test code (Chip's, but with only two targeted UDFs):

import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 1_000_000
source = empty_table(row_count).update(["Y = ii"])

# both type-hints numpy
def why_n_n(v: np.int64) -> np.float64:
    return v+1.2

# both type-hints native
def why_p_p(v: int) -> float:
    return v+1.2

# return type-hints numpy
def why_x_n(v) -> np.float64:
    return v+1.2

# return type-hints native
def why_x_p(v) -> float:
    return v+1.2

# param type-hints numpy
def why_n_x(v: np.int64):
    return v+1.2

# param type-hints native
def why_p_x(v: int):
    return v+1.2

# no type-hints
def why_x_x(v):
    return v+1.2

# no type-hints convert the input
def why_c_x(v):
    v = np.int64(v)
    return v+1.2

# no type-hints convert the output
def why_x_c(v):
    return np.float64(v+1.2)

# no type-hints convert both
def why_c_c(v):
    v = np.int64(v)
    return np.float64(v+1.2)

jobs = [
    ["Z = why_n_n(Y + 1)", "numpy/numpy"],
    ["Z = why_p_p(Y + 1)", "python/python"],
    # ["Z = why_x_n(Y)", "none/numpy"],
    # ["Z = why_x_p(Y)", "none/python"],
    # ["Z = (double) why_n_x(Y)", "numpy/none"],
    # ["Z = (double) why_p_x(Y)", "python/none"],
    # ["Z = (double) why_x_x(Y)", "none/none"],
    # ["Z = (double) why_c_x(Y)", "convert/none"],
    # ["Z = (double) why_x_c(Y)", "none/convert"],
    # ["Z = (double) why_c_c(Y)", "convert/convert"],
]

def run_test():
    global source
    for query, msg in jobs:
        begin_time = time.perf_counter_ns()
        for i in range(5):
            result = source.select(query)
        print(f"{msg} hints: {row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000)} rows/sec ({result.columns[0].data_type})")

        # try to restore the worker to the same state
        result = None
        for i in range(5):
            garbage_collect()
            time.sleep(0.1)
        time.sleep(5)
    source = None
run_test()

1. Py UDF decorator bypassed entirely

numpy/numpy hints: 767347.5798621533 rows/sec (double)
python/python hints: 765443.6285388595 rows/sec (double)

2. Py UDF decorator, non-optimized (no typed converter)

numpy/numpy hints: 186339.32544498434 rows/sec (double)
python/python hints: 259044.24304883406 rows/sec (double)

3. Hard-coded long/float conversion for input/output (null check, NumPy type check)

numpy/numpy hints: 284009.91974102164 rows/sec (double)
python/python hints: 430263.8651539863 rows/sec (double)
np_long_type = np.dtype("l")
def _convert_long_arg(param: _ParsedParamAnnotation, arg: int) -> Any:
    """ Convert an integer argument to the type specified by the annotation """
    if arg == NULL_LONG:
        if param.none_allowed:
            return None
        else:
            raise DHError(f"Argument {arg} is not compatible with annotation {param.orig_types}")
    else:
        for t in param.orig_types:
            if issubclass(t, np.generic):
                return np_long_type.type(arg)
        else:
            return arg
    # return arg
def _convert_args(p_sig: _ParsedSignature, args: Tuple[Any, ...]) -> List[Any]:
    """ Convert all arguments to the types specified by the annotations.
    Given that the number of arguments and the number of parameters may not match (in the presence of keyword,
    var-positional, or var-keyword parameters), we have the following rules:
     If the number of arguments is less than the number of parameters, the remaining parameters are left as is.
     If the number of arguments is greater than the number of parameters, the extra arguments are left as is.

    Python's function call mechanism will raise an exception if it can't resolve the parameters with the arguments.
    """
    converted_args = [_convert_long_arg(param, arg) for param, arg in zip(p_sig.params, args)] ####### call the typed convertor
    # converted_args = [_convert_arg(param, arg) for param, arg in zip(p_sig.params, args)]
    converted_args.extend(args[len(converted_args):])  
    return converted_args
def _scalar(x: Any, dtype: DType) -> Any:
    """Converts a Python value to a Java scalar value. It converts the numpy primitive types, string to
    their Python equivalents so that JPY can handle them. For datetime values, it converts them to Java Instant.
    Otherwise, it returns the value as is."""

    # NULL_BOOL will appear in Java as a byte value which causes a cast error. We just let JPY convert it to Java null
    # and the engine has casting logic to handle it.
    # if (dt := _PRIMITIVE_DTYPE_NULL_MAP.get(dtype)) and _is_py_null(x) and dtype not in (bool_, char):
    #     return dt

    return NULL_DOUBLE if _is_py_null(x) else float(x) ######### skip all the branching

    # try:
    #     if hasattr(x, "dtype"):
    #         if x.dtype.char == 'H':  # np.uint16 maps to Java char
    #             return Character(int(x))
    #         elif x.dtype.char in _NUMPY_INT_TYPE_CODES:
    #             return int(x)
    #         elif x.dtype.char in _NUMPY_FLOATING_TYPE_CODES:
    #             return float(x)
    #         elif x.dtype.char == '?':
    #             return bool(x)
    #         elif x.dtype.char == 'U':
    #             return str(x)
    #         elif x.dtype.char == 'O':
    #             return x
    #         elif x.dtype.char == 'M':
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     elif isinstance(x, (datetime.datetime, pd.Timestamp)):
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     return x
    # except:
    #     return x

4. Hard-coded long/float conversion for input/output (null check, no Numpy type check)

numpy/numpy hints: 506900.8422251575 rows/sec (double)
python/python hints: 517587.5559497151 rows/sec (double)
def _convert_long_arg(param: _ParsedParamAnnotation, arg: int) -> Any:
    """ Convert an integer argument to the type specified by the annotation """
    if arg == NULL_LONG:
        if param.none_allowed:
            return None
        else:
            raise DHError(f"Argument {arg} is not compatible with annotation {param.orig_types}")
    else:
        return arg
    # else:
    #     for t in param.orig_types:
    #         if issubclass(t, np.generic):
    #             return np_long_type.type(arg)
    #     else:
    #         return arg
    # return arg

5. Hard-coded long/float conversion for input/output (no null check, no NumPy type check)

numpy/numpy hints: 537996.0974778825 rows/sec (double)
python/python hints: 545296.1069144575 rows/sec (double)
def _convert_long_arg(param: _ParsedParamAnnotation, arg: int) -> Any:
    """ Convert an integer argument to the type specified by the annotation """
    # if arg == NULL_LONG:
    #     if param.none_allowed:
    #         return None
    #     else:
    #         raise DHError(f"Argument {arg} is not compatible with annotation {param.orig_types}")
    # else:
    #     for t in param.orig_types:
    #         if issubclass(t, np.generic):
    #             return np_long_type.type(arg)
    #     else:
    #         return arg
    return arg ###### skip null check/numpy scalar
def _scalar(x: Any, dtype: DType) -> Any:
    """Converts a Python value to a Java scalar value. It converts the numpy primitive types, string to
    their Python equivalents so that JPY can handle them. For datetime values, it converts them to Java Instant.
    Otherwise, it returns the value as is."""

    # NULL_BOOL will appear in Java as a byte value, which causes a cast error. We just let JPY convert it to Java null
    # and rely on the engine's casting logic to handle it.
    # if (dt := _PRIMITIVE_DTYPE_NULL_MAP.get(dtype)) and _is_py_null(x) and dtype not in (bool_, char):
    #     return dt

    return float(x)

    # try:
    #     if hasattr(x, "dtype"):
    #         if x.dtype.char == 'H':  # np.uint16 maps to Java char
    #             return Character(int(x))
    #         elif x.dtype.char in _NUMPY_INT_TYPE_CODES:
    #             return int(x)
    #         elif x.dtype.char in _NUMPY_FLOATING_TYPE_CODES:
    #             return float(x)
    #         elif x.dtype.char == '?':
    #             return bool(x)
    #         elif x.dtype.char == 'U':
    #             return str(x)
    #         elif x.dtype.char == 'O':
    #             return x
    #         elif x.dtype.char == 'M':
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     elif isinstance(x, (datetime.datetime, pd.Timestamp)):
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     return x
    # except:
    #     return x

6. Hard-coded long/float conversion for input/output (no null check, no NumPy type check, _scalar() skipped)

numpy/numpy hints: 545560.2847189392 rows/sec (double)
python/python hints: 547478.740932669 rows/sec (double)
def _py_udf(fn: Callable):
    """A decorator that acts as a transparent translator for Python UDFs used in Deephaven query formulas between
    Python and Java. This decorator is intended for use by the Deephaven query engine and should not be used by
    users.

    It carries out two conversions:
    1. convert Python function return values to Java values.
        For properly annotated functions, including numba vectorized and guvectorized ones, this decorator inspects the
        signature of the function and determines its return type, including supported primitive types and arrays of
        the supported primitive types. It then converts the return value of the function to the corresponding Java value
        of the same type. For unsupported types, the decorator returns the original Python value which appears as
        org.jpy.PyObject in Java.
    2. convert Java function arguments to Python values based on the signature of the function.
    """
    if hasattr(fn, "return_type"):
        return fn
    p_sig = _parse_signature(fn)
    # build a signature string for vectorization by removing NoneType, array char '[', and comma from the encoded types
    # since vectorization only supports UDFs with a single signature and enforces an exact match, any non-compliant
    # signature (e.g. Union with more than 1 non-NoneType) will be rejected by the vectorizer.
    sig_str_vectorization = re.sub(r"[\[N,]", "", p_sig.encoded)
    return_array = p_sig.ret_annotation.has_array
    ret_dtype = dtypes.from_np_dtype(np.dtype(p_sig.ret_annotation.encoded_type[-1]))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        converted_args = _convert_args(p_sig, args)
        # converted_args = args
        # kwargs are not converted because they are not used in the UDFs
        ret = fn(*converted_args, **kwargs)
        if return_array:
            return dtypes.array(ret_dtype, ret)
        elif ret_dtype == dtypes.PyObject:
            return ret
        else:
            return ret ####### skip _scalar() call
            # return _scalar(ret, ret_dtype)

    wrapper.j_name = ret_dtype.j_name
    real_ret_dtype = _BUILDABLE_ARRAY_DTYPE_MAP.get(ret_dtype, dtypes.PyObject) if return_array else ret_dtype

    if hasattr(ret_dtype.j_type, 'jclass'):
        j_class = real_ret_dtype.j_type.jclass
    else:
        j_class = real_ret_dtype.qst_type.clazz()

    wrapper.return_type = j_class
    wrapper.signature = sig_str_vectorization

    return wrapper
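For readers skimming the decorator above: the shape is `functools.wraps` plus metadata attributes (`return_type`, `signature`) that the query engine reads off the wrapper, with the `hasattr` guard making the decorator idempotent. A minimal self-contained sketch of just that pattern (the metadata values here are dummies, not Deephaven's real type codes):

```python
from functools import wraps

def py_udf_sketch(fn):
    # Idempotency guard: a function that already carries engine metadata
    # is returned unchanged, exactly like the hasattr check in _py_udf.
    if hasattr(fn, "return_type"):
        return fn

    @wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)  # argument/return conversion elided

    wrapper.return_type = "J"   # dummy metadata, stands in for a Java class
    wrapper.signature = "J-J"   # dummy encoded signature
    return wrapper

@py_udf_sketch
def plus_one(x: int) -> int:
    return x + 1

print(plus_one(41), plus_one.return_type)
```

Because of `@wraps`, the wrapper keeps `plus_one`'s name and docstring, and decorating it a second time is a no-op.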

7. _convert_args() skipped, _scalar() skipped

numpy/numpy hints: 668722.7697534269 rows/sec (double)
python/python hints: 695122.2607676275 rows/sec (double)
def _py_udf(fn: Callable):
    """A decorator that acts as a transparent translator for Python UDFs used in Deephaven query formulas between
    Python and Java. This decorator is intended for use by the Deephaven query engine and should not be used by
    users.

    It carries out two conversions:
    1. convert Python function return values to Java values.
        For properly annotated functions, including numba vectorized and guvectorized ones, this decorator inspects the
        signature of the function and determines its return type, including supported primitive types and arrays of
        the supported primitive types. It then converts the return value of the function to the corresponding Java value
        of the same type. For unsupported types, the decorator returns the original Python value which appears as
        org.jpy.PyObject in Java.
    2. convert Java function arguments to Python values based on the signature of the function.
    """
    if hasattr(fn, "return_type"):
        return fn
    p_sig = _parse_signature(fn)
    # build a signature string for vectorization by removing NoneType, array char '[', and comma from the encoded types
    # since vectorization only supports UDFs with a single signature and enforces an exact match, any non-compliant
    # signature (e.g. Union with more than 1 non-NoneType) will be rejected by the vectorizer.
    sig_str_vectorization = re.sub(r"[\[N,]", "", p_sig.encoded)
    return_array = p_sig.ret_annotation.has_array
    ret_dtype = dtypes.from_np_dtype(np.dtype(p_sig.ret_annotation.encoded_type[-1]))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        # converted_args = _convert_args(p_sig, args)  ###### skip _convert_args() entirely
        converted_args = args
        # kwargs are not converted because they are not used in the UDFs
        ret = fn(*converted_args, **kwargs)
        if return_array:
            return dtypes.array(ret_dtype, ret)
        elif ret_dtype == dtypes.PyObject:
            return ret
        else:
            return ret ###### skip _scalar() entirely
            # return _scalar(ret, ret_dtype) 

    wrapper.j_name = ret_dtype.j_name
    real_ret_dtype = _BUILDABLE_ARRAY_DTYPE_MAP.get(ret_dtype, dtypes.PyObject) if return_array else ret_dtype

    if hasattr(ret_dtype.j_type, 'jclass'):
        j_class = real_ret_dtype.j_type.jclass
    else:
        j_class = real_ret_dtype.qst_type.clazz()

    wrapper.return_type = j_class
    wrapper.signature = sig_str_vectorization

    return wrapper
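Putting the four experiments side by side, each successive cut can be expressed as a throughput ratio over experiment 4 (the slowest variant shown here). This is plain arithmetic over the numpy-hints rows/sec figures quoted above:

```python
# Relative throughput of experiments 4-7 (numpy/numpy hints column),
# using the rows/sec figures quoted earlier in this comment.
rates = {
    "4: null check kept": 506900.8422251575,
    "5: null check removed": 537996.0974778825,
    "6: _scalar() also skipped": 545560.2847189392,
    "7: _convert_args() also skipped": 668722.7697534269,
}
base = rates["4: null check kept"]
for name, rate in rates.items():
    print(f"{name}: {rate / base:.2%} of experiment 4")
```

The jump from experiment 6 to 7 dwarfs the earlier steps, which points at `_convert_args()` (per-argument inspection on every row) as the dominant cost rather than the null check or `_scalar()`.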
