
The auto conversion feature of the Python UDF decorator has a performance problem. #5112

Closed
jmao-denver opened this issue Feb 3, 2024 · 14 comments · Fixed by #5291
Labels: bug, core, python, python-server-side

@jmao-denver
Contributor

This was found by the latest benchmarking effort to measure the performance impact of the Python UDF usability improvements.

@jmao-denver jmao-denver added bug Something isn't working triage python-server-side labels Feb 3, 2024
@jmao-denver jmao-denver self-assigned this Feb 3, 2024
@rcaudy rcaudy added core Core development tasks python and removed triage labels Feb 4, 2024
@jmao-denver
Contributor Author

jmao-denver commented Feb 5, 2024

I have done quite a bit of digging and experimentation (starting from looking for obviously leaky code, then disabling auto-conversion, then simplifying the UDF, and finally bypassing the UDF decorator completely), and I now believe the 'memory leak' may have something to do with the default liveness scope mishandling tables created in the global scope. This is of course only speculation without diving into the actual implementation. @niloc132, @rcaudy are the resident experts/creators of the liveness scope, and should be able to tell whether I am talking nonsense here after a quick look at the simple code examples/results below.
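To illustrate the suspected mechanism in isolation (a toy sketch only; the class and method names below are hypothetical and do not correspond to Deephaven's actual liveness implementation): a default scope that retains everything created at top level and is never released will accumulate objects across runs, while a scope opened and released per run frees them.

```python
# Toy "liveness scope": retains every resource created while active.
# Illustrative only -- not Deephaven's real implementation.
class ToyScope:
    def __init__(self):
        self.retained = []

    def manage(self, resource):
        # Retain a reference so the resource stays alive with the scope.
        self.retained.append(resource)
        return resource

    def release(self):
        # Drop all retained references at once.
        self.retained.clear()

# Top-level work managed by a scope that is never released: objects pile up,
# analogous to running the script repeatedly in the IDE.
default_scope = ToyScope()
for _ in range(3):
    default_scope.manage(bytearray(1024))
leaked = len(default_scope.retained)   # 3 -- nothing was freed

# The same work wrapped in a per-run scope that is released each time,
# analogous to wrapping the benchmark in one_run().
def one_run():
    scope = ToyScope()
    scope.manage(bytearray(1024))
    scope.release()
    return len(scope.retained)

freed = one_run()                      # 0 -- released at end of run
```

This mirrors the two results below: the function-wrapped variant holds steady, while the top-level variant grows by a near-constant amount per run.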

1. Wrap @stanbrub's benchmarking script in a function and call it multiple times: no memory leak.

import time, jpy
from deephaven import empty_table, garbage_collect
from numpy import typing as npt
import numpy as np


def one_run():
    row_count = 100_000

    def why(arr):    
        arr = np.array(arr)
        return arr[0]

    source = empty_table(row_count).update(["X = repeat(ii % 250, 100)"])

    begin_time = time.perf_counter_ns()
    result = source.select('Y = why(X)')
    print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))


    del result
    del source
    del why
    del row_count
    del begin_time
    for i in range(10):
        garbage_collect()

    Runtime = jpy.get_type('java.lang.Runtime')
    print('Gigs Used After GC:',
            (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024 / 1024 / 1024)
    del Runtime

for i in range(10):
    one_run()
Rows / Sec: 363735.1236744706
Gigs Used After GC: 0.5138832032680511
Rows / Sec: 359119.1875069157
Gigs Used After GC: 0.5182483717799187
Rows / Sec: 375674.1011544443
Gigs Used After GC: 0.5190707519650459
Rows / Sec: 366915.53177368885
Gigs Used After GC: 0.5190758258104324
Rows / Sec: 381117.53188286355
Gigs Used After GC: 0.5191915258765221
Rows / Sec: 386750.756027146
Gigs Used After GC: 0.5192193910479546
Rows / Sec: 384669.07850075746
Gigs Used After GC: 0.5193070024251938
Rows / Sec: 364130.9196308441
Gigs Used After GC: 0.5193022042512894
Rows / Sec: 353877.2863909734
Gigs Used After GC: 0.5194044783711433
Rows / Sec: 363178.2083835716
Gigs Used After GC: 0.5193197578191757

2. Run @stanbrub's script as is multiple times manually in the IDE: an almost constant amount of memory leaks each time. In fact, we don't even need to involve the Python UDF; just replacing the formula in the select op with "Y = 1" renders exactly the same result.

import time, jpy
from deephaven import empty_table, garbage_collect
from numpy import typing as npt
import numpy as np

row_count = 100_000

def why(arr):    
    arr = np.array(arr)
    return arr[0]

source = empty_table(row_count).update(["X = repeat(ii % 250, 100)"])

begin_time = time.perf_counter_ns()
result = source.select('Y = why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))


del result
del source
del why
del row_count
del begin_time
for i in range(10):
    garbage_collect()

Runtime = jpy.get_type('java.lang.Runtime')
print('Gigs Used After GC:',
        (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024 / 1024 / 1024)
del Runtime
Rows / Sec: 348757.2688643897
Gigs Used After GC: 0.5188859179615974
Rows / Sec: 363289.9712724459
Gigs Used After GC: 0.5953610986471176
Rows / Sec: 355113.5833992525
Gigs Used After GC: 0.6713495403528214
Rows / Sec: 352858.5133444862
Gigs Used After GC: 0.7472281381487846
Rows / Sec: 343005.6148955841
Gigs Used After GC: 0.8226156905293465
Rows / Sec: 338097.21665624983
Gigs Used After GC: 0.8969474732875824
Rows / Sec: 360966.3063689902
Gigs Used After GC: 0.9720757678151131
Rows / Sec: 357905.9099864775
Gigs Used After GC: 1.0456845089793205
Rows / Sec: 351107.51704972837
Gigs Used After GC: 1.1266669183969498
Rows / Sec: 356261.26827671967
Gigs Used After GC: 1.2034849897027016

@stanbrub
Contributor

stanbrub commented Feb 5, 2024

Just to be clear, there are multiple issues that can be seen in testing between 0.24.0 and 0.32.0. (Though the regression happened somewhere after 0.24.0 and before 0.32.0, these are the versions that best match what HH is seeing.)

Performance:

  • Between 0.24.0 and 0.32.0, some UDFs regressed significantly (35% to 45%)
  • For 0.32.0 (and probably the previous release) adding hints to UDF args makes things worse
  • The performance regression happens for both arrays and scalars
  • Unable to reproduce UDF regression in Groovy

Memory:

  • Running the same UDF test back-to-back causes OOM, even if the test explicitly deletes table and runs GC
  • The OOM happens on UDFs that have scalar or array arguments
  • Unable to reproduce memory issue in Groovy

@jmao-denver
Contributor Author

@stanbrub Can you confirm that the performance degradation for scalars is on the same scale as for arrays?

@chipkent
Member

chipkent commented Feb 5, 2024

The title lists a "memory leak" and a "performance problem". The example in the thread clearly shows a memory leak, but it looks like the rows/sec remains constant. Is there another reproducer of the "performance problem", or is the performance problem just a slowdown that happens as the process runs out of memory?

@stanbrub
Contributor

stanbrub commented Feb 6, 2024

Here are some results showing the UDF performance regression between 0.24.0 and 0.32.0 for both scalar and array values:

  • There are large performance regressions for UDFs with no type hints
  • Adding type hints makes the performance loss even bigger
  • All tests were run with released docker images in the console w/ 24G Heap on x86 w/ 24 CPU threads
  • None of the below benchmarks came close to memory limits

(attached chart: udf-scalar-vs-array, 0.24.0 vs 0.32.0)
udf-array-no-hints.py.txt
udf-double-scalar-no-hints.py.txt
udf-single-scalar-no-hints.py.txt
udf-single-scalar-with-hints.py.txt
udf-double-scalar-with-hints.py.txt
udf-array-with-hints.py.txt

@stanbrub
Contributor

stanbrub commented Feb 8, 2024

Here's some more supporting info on the performance regression. I ran some Benchmark UDF tests on >= 0.28.0.

  • There are two regressions: one with the release of 0.29.0 and another with the release of 0.31.0.
  • Using hints in 0.28.0 greatly improves performance over equivalent UDFs without hints
  • Using hints in 0.31.0 greatly diminishes performance over equivalent UDFs without hints
  • The rates used for the below charts are an average over several scalar and array UDFs

(attached chart: 2024-02-07, UDF regression since 0.28.0)

@jmao-denver
Contributor Author

No hints, no vectorization

import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v1):
    return v1 + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=(int)why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))

Rows / Sec: 500045.2228398406
Rows / Sec: 467877.1441906515
Rows / Sec: 507475.29067793896

Return type hint, vectorization

import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v) -> np.int32:
    return v + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))

Rows / Sec: 750895.9363673003
Rows / Sec: 750593.4614114742
Rows / Sec: 749491.9379651789

Numpy type hints for input, vectorization

import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v: np.int32) -> np.int32:
    return v + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))

Rows / Sec: 246714.64247366833
Rows / Sec: 246335.93767897843
Rows / Sec: 245319.00670334187

Python built-in type for input, vectorization
import time
import numpy as np
from deephaven import empty_table

row_count = 1_000_000

def why(v: int) -> np.int32:
    return v + 1

source = empty_table(row_count).update(["X = (int)(ii % 250)"])

begin_time = time.perf_counter_ns()
result = source.select('Y=why(X)')
print('Rows / Sec:', row_count / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
Rows / Sec: 345183.85462927143
Rows / Sec: 341366.03654191043
Rows / Sec: 340745.41210793675

@jmao-denver jmao-denver changed the title The auto conversion feature of the Python UDF decorator has a memory leak and performance problem. The auto conversion feature of the Python UDF decorator has a performance problem. Feb 16, 2024
@jmao-denver
Contributor Author

jmao-denver commented Feb 16, 2024

TLDR: auto-conversion of the args and return value of a Python UDF really kills performance.

Auto-conversion disabled in Python. With type hints the call is vectorized; without type hints it is not vectorized and a cast is needed. Formula: "Y = why(X)"

type hints, Rows / Sec: 7198272.270689589. <==  14 times faster
no type hints, Rows / Sec: 496143.91961906385

Auto-conversion disabled in Python, no vectorization at all. Formula: "Y = why(X + 1)"

type hints, Rows / Sec: 722282.9272291749            <==  same performance within the margin of error
no type hints, Rows / Sec: 732129.3625082999

Auto-conversion enabled in Python. With type hints the call is vectorized; without type hints it is not vectorized and a cast is needed. Formula: "Y = why(X)"

type hints, Rows / Sec: 400916.5270778188  <==  ~18% slower; the auto-conversion overhead more than offsets the gain from vectorization
no type hints, Rows / Sec: 488018.12092333037

Auto-conversion enabled in Python, no vectorization at all. Formula: "Y = why(X + 1)"

type hints, Rows / Sec: 239400.20135381163 <== ~50% slower
no type hints, Rows / Sec: 479758.51428106957 

type hints, Rows / Sec: 505862.9620497612  <==  only return-value conversion, ~18% slower
no type hints, Rows / Sec: 611064.2156193163 

type hints, Rows / Sec: 272818.1017999679 <== only input args converted,  ~60% slower
no type hints, Rows / Sec: 474922.4556231638

Auto-conversion disabled in Java. With type hints the call is vectorized; without type hints it is not vectorized and a cast is needed. Formula: "Y = why(X)"

type hints, Rows / Sec: 7229849.004798741  <== similar performance because vectorization always goes through the vectorization decorator in Python
no type hints, Rows / Sec: 784986.622918931 <== better performance because there is no vectorization and no UDF decorator

Auto-conversion disabled in Java, no vectorization. Better performance than disabling it in Python only, because the call doesn't go through the UDF decorator.

type hints, Rows / Sec: 788237.3750420611
no type hints, Rows / Sec: 767679.3823834389
import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 1_000_000
source = empty_table(row_count).update(["X = (int)(ii % 250)"])

def run_test():

    # with type-hints
    def why(v: int) -> np.int32:
        return v

    begin_time = time.perf_counter_ns()
    for i in range(5):
        result = source.select('Y=why(X + 1)')
    print("type hints,", 'Rows / Sec:', row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)
    
    # try to restore the worker to the same state
    result = None
    for i in range(5):
        garbage_collect()
        time.sleep(0.1)   
    time.sleep(5)

    # without type-hints
    def why(v):
        return v

    begin_time = time.perf_counter_ns()
    for i in range(5):
        # result = source.select('Y=why(X)').select("Y = (int)Y")
        result = source.select("Y = (int)why(X + 1)")
    print("no type hints,", 'Rows / Sec:', row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)


    result = None
    for i in range(5):
        garbage_collect()    

run_test()

@chipkent
Member

I'm trying to get my head around the relevant changes that lead to the problem. Am I correct that these are the key files?
https://github.com/deephaven/deephaven-core/blob/main/py/server/deephaven/_udf.py
https://github.com/deephaven/deephaven-core/blob/main/engine/table/src/main/java/io/deephaven/engine/util/PyCallableWrapperJpyImpl.java

@jmao-denver
Contributor Author

jmao-denver commented Feb 18, 2024

Array input: vectorization with type hints, no vectorization without type hints

--- arg/return value  auto-conversion enabled ---
type hints, Rows / Sec: 1231949.3024970691
no type hints, Rows / Sec: 1820989.1977622057
--- no return value auto-conversion ---
type hints, Rows / Sec: 1365070.3849798716
no type hints, Rows / Sec: 1862219.3173830567
--- no arg auto-conversion ---
type hints, Rows / Sec: 2411131.7068459815
no type hints, Rows / Sec: 1854127.122524789
--- no arg or return value auto-conversion ---
type hints, Rows / Sec: 2948846.317343872
no type hints, Rows / Sec: 1840094.2422902663

Array input: no vectorization in either case

--- arg/return value auto-conversion enabled ---
type hints, Rows / Sec: 1034499.0853991724
no type hints, Rows / Sec: 1832391.606097296
--- no return-value auto-conversion ---
type hints, Rows / Sec: 1108208.866524745
no type hints, Rows / Sec: 1777854.1193077555
--- no arg auto-conversion ---
type hints, Rows / Sec: 1754613.7791632663
no type hints, Rows / Sec: 2071235.161504206
--- no arg/return value auto-conversion ---
type hints, Rows / Sec: 2098357.0185048156
no type hints, Rows / Sec: 2104465.867133651
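The no-hints variants above convert the incoming array by hand with np.frombuffer, which wraps the underlying buffer in a single C-level call, whereas per-element conversion pays a Python-level loop over every value. A standalone sketch of that cost difference (pure NumPy, no Deephaven; assumes a little-endian host for the byte decoding):

```python
import numpy as np

# Simulate a primitive Java long[] arriving as a raw buffer (jpy can expose
# primitive arrays via the buffer protocol; plain bytes stand in here).
raw = np.arange(1_000, dtype=np.int64).tobytes()

# Buffer-level conversion: one C-level call, no Python loop per element.
fast = np.frombuffer(raw, dtype=np.int64)

# Per-element conversion: a Python-level loop over every value, roughly
# the kind of work a naive per-cell auto-conversion pays.
slow = np.array([int.from_bytes(raw[i * 8:(i + 1) * 8], "little", signed=True)
                 for i in range(len(raw) // 8)], dtype=np.int64)

assert (fast == slow).all()
```

Both produce the same array; only the first avoids touching each element from Python, which is why the "no arg auto-conversion" rows above roughly double the type-hinted throughput.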

@jmao-denver
Contributor Author

jmao-denver commented Feb 19, 2024

Experiments to speed up the auto-conversion

Scalar input/output, auto-conversion on, removed: null/Optional check

type hints, Rows / Sec: 275570.23424915934
no type hints, Rows / Sec: 504334.2759002709

Scalar input/output, auto-conversion on, removed: null/Optional check, numpy scalar support in input

type hints, Rows / Sec: 449091.78075960197
no type hints, Rows / Sec: 520391.05714783334

Array input, auto-conversion on, removed: null/Optional check

type hints, Rows / Sec: 133654.1541810492
no type hints, Rows / Sec: 179031.90280712221

Array input, auto-conversion on, removed: null/Optional check; lookup function call replaced with a map lookup

type hints, Rows / Sec: 157456.28193470957
no type hints, Rows / Sec: 175074.40249280503
import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 10_000_000
source = empty_table(row_count).update(["X = (int)(ii % 1_000_000)", "Y = ii"]).group_by("X")

def run_test():

    # with type-hints
    def why(v: np.ndarray[np.int64]) -> np.float64:
        return np.average(v)

    begin_time = time.perf_counter_ns()
    for i in range(5):
        result = source.select("Z = why(Y)")
    print("type hints,", 'Rows / Sec:', row_count * 5 / 10 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)
    
    # try to restore the worker to the same state
    result = None
    for i in range(5):
        garbage_collect()
        time.sleep(0.1)   
    time.sleep(5)

    # without type-hints
    def why(v):
        v = np.frombuffer(v)
        return np.average(v)

    begin_time = time.perf_counter_ns()
    for i in range(5):
        result = source.select("Z = (double) why(Y)")
    print("no type hints,", 'Rows / Sec:', row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000))
    print(result.columns[0].data_type)
    
    # try to restore the worker to the same state
    result = None
    for i in range(5):
        garbage_collect()
        time.sleep(0.1)   
    time.sleep(5)

run_test()

Some rough numbers, UDF vs. pre-UDF:

Py UDF decorator on a non-type-hinted UDF: 6-10% slower (extra function calls and arg type check)
Py UDF decorator on a type-hinted UDF with full auto-conversion: 50-60% slower (the above + DH null handling + np scalar support + Java array -> np.array conversion)
Py UDF decorator on a type-hinted UDF with auto-conversion minus the DH null check and np scalar support, plus optimization: 20-26% slower (base cost + Java array -> np.array conversion)
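The 6-10% base cost can be demonstrated without Deephaven: it is the price of one extra Python call frame plus a per-argument pass. A minimal stand-in decorator (illustrative only; `py_udf` is a hypothetical name, not the actual `_udf.py` code) measured with timeit:

```python
import timeit

def udf(v):
    return v + 1

def py_udf(fn):
    # Minimal stand-in for the UDF decorator's base overhead: one extra
    # call frame plus a per-argument pass, with no actual conversion done.
    def wrapper(*args):
        converted = [a for a in args]  # placeholder for the converter loop
        return fn(*converted)
    return wrapper

wrapped = py_udf(udf)

bare_t = timeit.timeit(lambda: udf(1), number=200_000)
deco_t = timeit.timeit(lambda: wrapped(1), number=200_000)
# The wrapped call pays a measurable per-call penalty even though it
# converts nothing; the exact percentage is machine-dependent.
```

This is the floor: any real conversion work (null checks, numpy scalar wrapping, array copies) stacks on top of it, which is how the 50-60% figure for full auto-conversion arises.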

@chipkent
Member

I'm looking at the perf data above.

  1. I'm not totally understanding what is going on. Here it looks like the scalar input case is only processing 500k rows per sec while the array input case is processing 2m rows per sec. What am I missing? The example code is just the array case code, so I can't guess. Maybe it is because the rows per sec is being calculated based on the input table, and the array case is only processing 10 rows?
  2. The array case is only processing 10 rows, which seems extreme. How bad are things if you process a lot of rows with smaller arrays?

I have other specific questions about what is going on. These questions revolve around the performance of built-in python types vs numpy types (e.g. float vs np.float64), conversion times in hand written code vs annotations, etc. I wrote this benchmark to try to dig in a bit more. I'm not sure how to disable vectorization, so the numbers I am getting are not apples to apples.

New benchmarks

import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 1_000_000
source = empty_table(row_count).update(["Y = ii"])

# both type-hints numpy
def why_n_n(v: np.int64) -> np.float64:
    return v+1.2

# both type-hints native
def why_p_p(v: int) -> float:
    return v+1.2

# return type-hints numpy
def why_x_n(v) -> np.float64:
    return v+1.2

# return type-hints native
def why_x_p(v) -> float:
    return v+1.2

# param type-hints numpy
def why_n_x(v: np.int64):
    return v+1.2

# param type-hints native
def why_p_x(v: int):
    return v+1.2

# no type-hints
def why_x_x(v):
    return v+1.2

# no type-hints convert the input
def why_c_x(v):
    v = np.int64(v)
    return v+1.2

# no type-hints convert the output
def why_x_c(v):
    return np.float64(v+1.2)

# no type-hints convert both
def why_c_c(v):
    v = np.int64(v)
    return np.float64(v+1.2)

jobs = [
    ["Z = why_n_n(Y)", "numpy/numpy"],
    ["Z = why_p_p(Y)", "python/python"],
    ["Z = why_x_n(Y)", "none/numpy"],
    ["Z = why_x_p(Y)", "none/python"],
    ["Z = (double) why_n_x(Y)", "numpy/none"],
    ["Z = (double) why_p_x(Y)", "python/none"],
    ["Z = (double) why_x_x(Y)", "none/none"],
    ["Z = (double) why_c_x(Y)", "convert/none"],
    ["Z = (double) why_x_c(Y)", "none/convert"],
    ["Z = (double) why_c_c(Y)", "convert/convert"],
]

def run_test():
    for query, msg in jobs:
        begin_time = time.perf_counter_ns()
        for i in range(5):
            result = source.select(query)
        print(f"{msg} hints: {row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000)} rows/sec ({result.columns[0].data_type})")

        # try to restore the worker to the same state
        result = None
        for i in range(5):
            garbage_collect()
        time.sleep(0.1)
        time.sleep(5)

run_test()

Running this code on the demo system with 0.32.0 produces:

numpy/numpy hints: 68825.7609555498 rows/sec (double)
python/python hints: 99940.31391955742 rows/sec (double)
none/numpy hints: 235329.6939374642 rows/sec (double)
none/python hints: 235613.83023603584 rows/sec (double)
numpy/none hints: 53120.79882927409 rows/sec (double)
python/none hints: 72962.35161478045 rows/sec (double)
none/none hints: 132157.69190853115 rows/sec (double)
convert/none hints: 84414.73315296792 rows/sec (double)
none/convert hints: 123055.80861453459 rows/sec (double)
convert/convert hints: 83034.57429840925 rows/sec (double)

ALL of the rows/sec seem low. I do not know what hardware they are on...

Observation 1: Numpy scalar conversions are ultra slow (not a DH problem)

Numpy scalar types are much slower than python scalar types.

Converting both inputs and outputs to numpy scalars is 31% slower than using the native python types.

numpy/numpy hints: 68825.7609555498 rows/sec (double)
python/python hints: 99940.31391955742 rows/sec (double)

Floating point return conversion seems comparable.

none/numpy hints: 235329.6939374642 rows/sec (double)
none/python hints: 235613.83023603584 rows/sec (double)

Integer inputs are 27% slower.

numpy/none hints: 53120.79882927409 rows/sec (double)
python/none hints: 72962.35161478045 rows/sec (double)

Converting inputs and outputs to numpy scalars by hand is slow.

Here the functions have no type hints and convert by hand. Converting both values is 37% slower than the unconverted case. The float conversion is around 6%, while the bulk of the slowdown is from the integer conversion.

none/none hints: 132157.69190853115 rows/sec (double)
convert/convert hints: 83034.57429840925 rows/sec (double)
convert/none hints: 84414.73315296792 rows/sec (double)
none/convert hints: 123055.80861453459 rows/sec (double)
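Observation 1 can be reproduced without Deephaven at all: arithmetic on a NumPy scalar goes through NumPy's scalar dispatch machinery on every operation, while native int/float arithmetic stays on the fast interpreter path. A quick timeit check (timings are machine-dependent, so only the values are asserted here):

```python
import timeit
import numpy as np

n = 200_000
# Native Python int + float: fast interpreter path.
py_t = timeit.timeit("v + 1.2", setup="v = 7", number=n)
# np.int64 + float: routed through NumPy's scalar dispatch, typically
# several times slower per operation than the native case.
np_t = timeit.timeit("v + 1.2",
                     setup="import numpy as np; v = np.int64(7)",
                     number=n)

# Both compute the same value; only the per-operation cost differs.
assert abs(float(np.int64(7) + 1.2) - 8.2) < 1e-9
```

This is why converting UDF inputs to numpy scalars (whether by annotation or by hand) shows up so prominently in the numbers above, independent of any Deephaven overhead.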

Observation 2: DH wrapper is way too slow (DH problem)

The 0.32.0 annotations are very slow. Using an annotation to convert a Python type is 44% slower than no conversion, even though the result is identical.

none/none hints: 132157.69190853115 rows/sec (double)
python/none hints: 72962.35161478045 rows/sec (double)

Similarly, converting the input to numpy by hand is 36% slower vs 60% slower when using the annotation.

convert/none hints: 84414.73315296792 rows/sec (double)
numpy/none hints: 53120.79882927409 rows/sec (double)

This all indicates that the annotation has too much overhead.

Considerations

  1. Even unannotated cases seem excessively slow on the demo system.
  2. Numpy scalars are incredibly slow. Should we support them? Probably not.
  3. The DH annotation has way too much overhead. It needs to be reengineered for speed, or key parts of what it is doing need to happen in the query language.
  4. Since I don't know how to turn off vectorization, I could not assess the overhead of a return type hint vs having a query string cast. This experiment should be done since it may uncover some other clear performance problems.
  5. None of the above has looked at performance tradeoffs for various ways nulls could be handled. This concept needs to be understood in detail.
  6. None of the above looks at costs associated with input or output arrays.

@jmao-denver
Contributor Author

I'm looking at the perf data above.

  1. I'm not totally understanding what is going on. Here it looks like the scalar input case is only processing 500k rows per sec while the array input case is processing 2m rows per sec. What am I missing? The example code is just the array case code, so I can't guess. Maybe it is because the rows per sec is being calculated based on the input table, and the array case is only processing 10 rows?

It is a simple math mistake. The row count after the group_by is 10 times smaller, so the rate should be computed against the grouped row count.
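Concretely: the benchmark builds 10,000,000 rows but `X = ii % 1_000_000` followed by `group_by("X")` leaves 1,000,000 grouped rows, so the select only processes one-tenth of row_count. The arithmetic of the correction (elapsed_s is a hypothetical stand-in for the measured time):

```python
# Parameters mirror the array benchmark script above.
row_count = 10_000_000
group_count = 1_000_000   # X = ii % 1_000_000 -> 1_000_000 groups after group_by
runs = 5
elapsed_s = 14.2          # hypothetical wall-clock seconds for the 5 runs

# Wrong: rates the select against the pre-group_by row count.
wrong_rate = row_count * runs / elapsed_s

# Right: after group_by, the select processes one row per group.
right_rate = group_count * runs / elapsed_s

# The uncorrected figures are inflated by exactly row_count / group_count = 10x.
assert abs(wrong_rate / right_rate - 10) < 1e-9
```

This is the `row_count * 5 / 10` divisor in the type-hinted branch of the script; the no-hints branch omitted it, which produced the apparent scalar-vs-array discrepancy Chip noticed.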

@jmao-denver
Contributor Author

jmao-denver commented Feb 23, 2024

Summary:

1. The weight of the extra function calls themselves (the Py UDF decorator and its dependencies) costs ~7% in performance (see the difference between Exp. 1 and Exp. 7).
2. Eliminating branching provides a big improvement, though from a very low baseline (compare Exp. 2 and Exp. 3).
3. Exp. 4 probably makes the most sense in terms of functionality and correctness, but it is still significantly slower than the pre-UDF numbers (compare Exp. 1 and Exp. 4).
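The "typed converter" idea behind Experiments 3-6 is to move all branching to decoration time: each parameter gets a converter closure built once, so the per-call path is a single closure invocation with no isinstance/subclass checks left in it. A sketch of that factory pattern (hypothetical names; not the actual `_udf.py` code, though `NULL_LONG` mirrors Deephaven's long null sentinel):

```python
import numpy as np

NULL_LONG = -2**63   # Deephaven's null sentinel for Java long

def make_long_converter(wants_numpy: bool, none_allowed: bool):
    """Build a per-parameter converter ONCE, at decoration time, so the
    per-call path carries no type-inspection branching."""
    if wants_numpy:
        to_np = np.int64
        if none_allowed:
            return lambda v: None if v == NULL_LONG else to_np(v)
        return to_np
    if none_allowed:
        return lambda v: None if v == NULL_LONG else v
    return lambda v: v   # identity: nothing to convert

# Decoration time: the branching runs once per parameter...
conv = make_long_converter(wants_numpy=True, none_allowed=True)

# ...call time: a single closure invocation per argument.
assert conv(NULL_LONG) is None
assert conv(5) == 5 and isinstance(conv(5), np.integer)
```

This is why Exp. 3's hard-coded converter roughly doubles Exp. 2's throughput: the work per call shrinks to one dispatch that was precomputed from the annotation.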

Test code (Chip's, but with only two targeted UDFs):

import time
import numpy as np
from deephaven import empty_table, garbage_collect

row_count = 1_000_000
source = empty_table(row_count).update(["Y = ii"])

# both type-hints numpy
def why_n_n(v: np.int64) -> np.float64:
    return v+1.2

# both type-hints native
def why_p_p(v: int) -> float:
    return v+1.2

# return type-hints numpy
def why_x_n(v) -> np.float64:
    return v+1.2

# return type-hints native
def why_x_p(v) -> float:
    return v+1.2

# param type-hints numpy
def why_n_x(v: np.int64):
    return v+1.2

# param type-hints native
def why_p_x(v: int):
    return v+1.2

# no type-hints
def why_x_x(v):
    return v+1.2

# no type-hints convert the input
def why_c_x(v):
    v = np.int64(v)
    return v+1.2

# no type-hints convert the output
def why_x_c(v):
    return np.float64(v+1.2)

# no type-hints convert both
def why_c_c(v):
    v = np.int64(v)
    return np.float64(v+1.2)

jobs = [
    ["Z = why_n_n(Y + 1)", "numpy/numpy"],
    ["Z = why_p_p(Y + 1)", "python/python"],
    # ["Z = why_x_n(Y)", "none/numpy"],
    # ["Z = why_x_p(Y)", "none/python"],
    # ["Z = (double) why_n_x(Y)", "numpy/none"],
    # ["Z = (double) why_p_x(Y)", "python/none"],
    # ["Z = (double) why_x_x(Y)", "none/none"],
    # ["Z = (double) why_c_x(Y)", "convert/none"],
    # ["Z = (double) why_x_c(Y)", "none/convert"],
    # ["Z = (double) why_c_c(Y)", "convert/convert"],
]

def run_test():
    global source
    for query, msg in jobs:
        begin_time = time.perf_counter_ns()
        for i in range(5):
            result = source.select(query)
        print(f"{msg} hints: {row_count * 5 / ((time.perf_counter_ns() - begin_time) / 1_000_000_000)} rows/sec ({result.columns[0].data_type})")

        # try to restore the worker to the same state
        result = None
        for i in range(5):
            garbage_collect()
            time.sleep(0.1)
        time.sleep(5)
    source = None
run_test()

1. Py UDF decorator bypassed entirely

numpy/numpy hints: 767347.5798621533 rows/sec (double)
python/python hints: 765443.6285388595 rows/sec (double)

2. Py UDF decorator, non-optimized (no typed converter)

numpy/numpy hints: 186339.32544498434 rows/sec (double)
python/python hints: 259044.24304883406 rows/sec (double)

3. Hard-coded long/float conversion for input/output (null check, NumPy type check)

numpy/numpy hints: 284009.91974102164 rows/sec (double)
python/python hints: 430263.8651539863 rows/sec (double)
np_long_type = np.dtype("l")
def _convert_long_arg(param: _ParsedParamAnnotation, arg: int) -> Any:
    """ Convert an integer argument to the type specified by the annotation """
    if arg == NULL_LONG:
        if param.none_allowed:
            return None
        else:
            raise DHError(f"Argument {arg} is not compatible with annotation {param.orig_types}")
    else:
        for t in param.orig_types:
            if issubclass(t, np.generic):
                return np_long_type.type(arg)
        else:
            return arg
    # return arg
def _convert_args(p_sig: _ParsedSignature, args: Tuple[Any, ...]) -> List[Any]:
    """ Convert all arguments to the types specified by the annotations.
    Given that the number of arguments and the number of parameters may not match (in the presence of keyword,
    var-positional, or var-keyword parameters), we have the following rules:
     If the number of arguments is less than the number of parameters, the remaining parameters are left as is.
     If the number of arguments is greater than the number of parameters, the extra arguments are left as is.

    Python's function call mechanism will raise an exception if it can't resolve the parameters with the arguments.
    """
    converted_args = [_convert_long_arg(param, arg) for param, arg in zip(p_sig.params, args)] ####### call the typed convertor
    # converted_args = [_convert_arg(param, arg) for param, arg in zip(p_sig.params, args)]
    converted_args.extend(args[len(converted_args):])  
    return converted_args
def _scalar(x: Any, dtype: DType) -> Any:
    """Converts a Python value to a Java scalar value. It converts the numpy primitive types, string to
    their Python equivalents so that JPY can handle them. For datetime values, it converts them to Java Instant.
    Otherwise, it returns the value as is."""

    # NULL_BOOL will appear in Java as a byte value which causes a cast error. We just let JPY convert it to Java null
    # and the engine has casting logic to handle it.
    # if (dt := _PRIMITIVE_DTYPE_NULL_MAP.get(dtype)) and _is_py_null(x) and dtype not in (bool_, char):
    #     return dt

    return NULL_DOUBLE if _is_py_null(x) else float(x) ######### skip all the branching

    # try:
    #     if hasattr(x, "dtype"):
    #         if x.dtype.char == 'H':  # np.uint16 maps to Java char
    #             return Character(int(x))
    #         elif x.dtype.char in _NUMPY_INT_TYPE_CODES:
    #             return int(x)
    #         elif x.dtype.char in _NUMPY_FLOATING_TYPE_CODES:
    #             return float(x)
    #         elif x.dtype.char == '?':
    #             return bool(x)
    #         elif x.dtype.char == 'U':
    #             return str(x)
    #         elif x.dtype.char == 'O':
    #             return x
    #         elif x.dtype.char == 'M':
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     elif isinstance(x, (datetime.datetime, pd.Timestamp)):
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     return x
    # except:
    #     return x

4. Hard-coded long/float conversion for input/output (null check, no Numpy type check)

numpy/numpy hints: 506900.8422251575 rows/sec (double)
python/python hints: 517587.5559497151 rows/sec (double)
def _convert_long_arg(param: _ParsedParamAnnotation, arg: int) -> Any:
    """ Convert an integer argument to the type specified by the annotation """
    if arg == NULL_LONG:
        if param.none_allowed:
            return None
        else:
            raise DHError(f"Argument {arg} is not compatible with annotation {param.orig_types}")
    else:
        return arg
    # else:
    #     for t in param.orig_types:
    #         if issubclass(t, np.generic):
    #             return np_long_type.type(arg)
    #     else:
    #         return arg
    # return arg

5. Hard-coded long/float conversion for input/output (no null check, no NumPy type check)

numpy/numpy hints: 537996.0974778825 rows/sec (double)
python/python hints: 545296.1069144575 rows/sec (double)
def _convert_long_arg(param: _ParsedParamAnnotation, arg: int) -> Any:
    """ Convert an integer argument to the type specified by the annotation """
    # if arg == NULL_LONG:
    #     if param.none_allowed:
    #         return None
    #     else:
    #         raise DHError(f"Argument {arg} is not compatible with annotation {param.orig_types}")
    # else:
    #     for t in param.orig_types:
    #         if issubclass(t, np.generic):
    #             return np_long_type.type(arg)
    #     else:
    #         return arg
    return arg ###### skip null check/numpy scalar
def _scalar(x: Any, dtype: DType) -> Any:
    """Converts a Python value to a Java scalar value. It converts the numpy primitive types, string to
    their Python equivalents so that JPY can handle them. For datetime values, it converts them to Java Instant.
    Otherwise, it returns the value as is."""

    # NULL_BOOL will appear in Java as a byte value, which causes a cast error. We just let JPY convert it to Java null
    # and rely on the engine's casting logic to handle it.
    # if (dt := _PRIMITIVE_DTYPE_NULL_MAP.get(dtype)) and _is_py_null(x) and dtype not in (bool_, char):
    #     return dt

    return float(x)

    # try:
    #     if hasattr(x, "dtype"):
    #         if x.dtype.char == 'H':  # np.uint16 maps to Java char
    #             return Character(int(x))
    #         elif x.dtype.char in _NUMPY_INT_TYPE_CODES:
    #             return int(x)
    #         elif x.dtype.char in _NUMPY_FLOATING_TYPE_CODES:
    #             return float(x)
    #         elif x.dtype.char == '?':
    #             return bool(x)
    #         elif x.dtype.char == 'U':
    #             return str(x)
    #         elif x.dtype.char == 'O':
    #             return x
    #         elif x.dtype.char == 'M':
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     elif isinstance(x, (datetime.datetime, pd.Timestamp)):
    #             from deephaven.time import to_j_instant
    #             return to_j_instant(x)
    #     return x
    # except:
    #     return x

6. Hard-coded long/float conversion for input/output (no null check, no NumPy type check, _scalar() skipped)

numpy/numpy hints: 545560.2847189392 rows/sec (double)
python/python hints: 547478.740932669 rows/sec (double)
def _py_udf(fn: Callable):
    """A decorator that acts as a transparent translator for Python UDFs used in Deephaven query formulas between
    Python and Java. This decorator is intended for use by the Deephaven query engine and should not be used by
    users.

    It carries out two conversions:
    1. convert Python function return values to Java values.
        For properly annotated functions, including numba vectorized and guvectorized ones, this decorator inspects the
        signature of the function and determines its return type, including supported primitive types and arrays of
        the supported primitive types. It then converts the return value of the function to the corresponding Java value
        of the same type. For unsupported types, the decorator returns the original Python value which appears as
        org.jpy.PyObject in Java.
    2. convert Java function arguments to Python values based on the signature of the function.
    """
    if hasattr(fn, "return_type"):
        return fn
    p_sig = _parse_signature(fn)
    # build a signature string for vectorization by removing NoneType, array char '[', and comma from the encoded types
    # since vectorization only supports UDFs with a single signature and enforces an exact match, any non-compliant
    # signature (e.g. Union with more than 1 non-NoneType) will be rejected by the vectorizer.
    sig_str_vectorization = re.sub(r"[\[N,]", "", p_sig.encoded)
    return_array = p_sig.ret_annotation.has_array
    ret_dtype = dtypes.from_np_dtype(np.dtype(p_sig.ret_annotation.encoded_type[-1]))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        converted_args = _convert_args(p_sig, args)
        # converted_args = args
        # kwargs are not converted because they are not used in the UDFs
        ret = fn(*converted_args, **kwargs)
        if return_array:
            return dtypes.array(ret_dtype, ret)
        elif ret_dtype == dtypes.PyObject:
            return ret
        else:
            return ret ####### skip _scalar() call
            # return _scalar(ret, ret_dtype)

    wrapper.j_name = ret_dtype.j_name
    real_ret_dtype = _BUILDABLE_ARRAY_DTYPE_MAP.get(ret_dtype, dtypes.PyObject) if return_array else ret_dtype

    if hasattr(ret_dtype.j_type, 'jclass'):
        j_class = real_ret_dtype.j_type.jclass
    else:
        j_class = real_ret_dtype.qst_type.clazz()

    wrapper.return_type = j_class
    wrapper.signature = sig_str_vectorization

    return wrapper
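For readers skimming the decorator above: the shape is `functools.wraps` plus metadata attributes (`return_type`, `signature`) that the query engine reads off the wrapper, with the `hasattr` guard making the decorator idempotent. A minimal self-contained sketch of just that pattern (the metadata values here are dummies, not Deephaven's real type codes):

```python
from functools import wraps

def py_udf_sketch(fn):
    # Idempotency guard: a function that already carries engine metadata
    # is returned unchanged, exactly like the hasattr check in _py_udf.
    if hasattr(fn, "return_type"):
        return fn

    @wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)  # argument/return conversion elided

    wrapper.return_type = "J"   # dummy metadata, stands in for a Java class
    wrapper.signature = "J-J"   # dummy encoded signature
    return wrapper

@py_udf_sketch
def plus_one(x: int) -> int:
    return x + 1

print(plus_one(41), plus_one.return_type)
```

Because of `@wraps`, the wrapper keeps `plus_one`'s name and docstring, and decorating it a second time is a no-op.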

7. _convert_args() skipped, _scalar() skipped

numpy/numpy hints: 668722.7697534269 rows/sec (double)
python/python hints: 695122.2607676275 rows/sec (double)
def _py_udf(fn: Callable):
    """A decorator that acts as a transparent translator for Python UDFs used in Deephaven query formulas between
    Python and Java. This decorator is intended for use by the Deephaven query engine and should not be used by
    users.

    It carries out two conversions:
    1. convert Python function return values to Java values.
        For properly annotated functions, including numba vectorized and guvectorized ones, this decorator inspects the
        signature of the function and determines its return type, including supported primitive types and arrays of
        the supported primitive types. It then converts the return value of the function to the corresponding Java value
        of the same type. For unsupported types, the decorator returns the original Python value which appears as
        org.jpy.PyObject in Java.
    2. convert Java function arguments to Python values based on the signature of the function.
    """
    if hasattr(fn, "return_type"):
        return fn
    p_sig = _parse_signature(fn)
    # build a signature string for vectorization by removing NoneType, array char '[', and comma from the encoded types
    # since vectorization only supports UDFs with a single signature and enforces an exact match, any non-compliant
    # signature (e.g. Union with more than 1 non-NoneType) will be rejected by the vectorizer.
    sig_str_vectorization = re.sub(r"[\[N,]", "", p_sig.encoded)
    return_array = p_sig.ret_annotation.has_array
    ret_dtype = dtypes.from_np_dtype(np.dtype(p_sig.ret_annotation.encoded_type[-1]))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        # converted_args = _convert_args(p_sig, args)  ###### skip _convert_args() entirely
        converted_args = args
        # kwargs are not converted because they are not used in the UDFs
        ret = fn(*converted_args, **kwargs)
        if return_array:
            return dtypes.array(ret_dtype, ret)
        elif ret_dtype == dtypes.PyObject:
            return ret
        else:
            return ret ###### skip _scalar() entirely
            # return _scalar(ret, ret_dtype) 

    wrapper.j_name = ret_dtype.j_name
    real_ret_dtype = _BUILDABLE_ARRAY_DTYPE_MAP.get(ret_dtype, dtypes.PyObject) if return_array else ret_dtype

    if hasattr(ret_dtype.j_type, 'jclass'):
        j_class = real_ret_dtype.j_type.jclass
    else:
        j_class = real_ret_dtype.qst_type.clazz()

    wrapper.return_type = j_class
    wrapper.signature = sig_str_vectorization

    return wrapper
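Putting the four experiments side by side, each successive cut can be expressed as a throughput ratio over experiment 4 (the slowest variant shown here). This is plain arithmetic over the numpy-hints rows/sec figures quoted above:

```python
# Relative throughput of experiments 4-7 (numpy/numpy hints column),
# using the rows/sec figures quoted earlier in this comment.
rates = {
    "4: null check kept": 506900.8422251575,
    "5: null check removed": 537996.0974778825,
    "6: _scalar() also skipped": 545560.2847189392,
    "7: _convert_args() also skipped": 668722.7697534269,
}
base = rates["4: null check kept"]
for name, rate in rates.items():
    print(f"{name}: {rate / base:.2%} of experiment 4")
```

The jump from experiment 6 to 7 dwarfs the earlier steps, which points at `_convert_args()` (per-argument inspection on every row) as the dominant cost rather than the null check or `_scalar()`.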
