
Java gatherer #1523

Merged
merged 49 commits into from
Dec 9, 2021

Conversation

jjbrosnan
Contributor

The Java gatherer transfers table data to a Python object. On the Python side, it works via the `frombuffer` method, which both NumPy and Torch provide. This is acceptable, because these are the two most common data containers used with AI.

Speed testing between Chip's v1 and v2 showed mixed results. This code uses his v2, because it showed better speedups more often than v1. The mixed results could have justified including both methods, but that seems unnecessary given that both showed speedups of 5x to 250x over the legacy method.

I have tested for memory leaks and differences between outputs of legacy and new code and have been able to find none.
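For reference, the `frombuffer` pattern this relies on looks like the following. This is a minimal sketch using plain NumPy; the byte buffer here is fabricated from a known array for illustration, standing in for what the Java gatherer actually hands back:

```python
import numpy as np

# Stand-in for the raw byte buffer the Java gatherer would return.
# A real buffer comes from the gatherer; here we build one from a known array.
original = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
buffer = original.tobytes()

# frombuffer reinterprets the bytes without copying; then we reshape
# into the (rows, cols) layout of the source table.
tensor = np.frombuffer(buffer, dtype=np.double)
tensor.shape = (3, 2)
print(tensor)
```

Torch supports the same pattern via `torch.frombuffer`, which is why these two containers cover the common AI use cases without a copy on the Python side.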

@jjbrosnan jjbrosnan added this to the Nov 2021 milestone Nov 3, 2021
@jjbrosnan jjbrosnan self-assigned this Nov 3, 2021
@jjbrosnan
Contributor Author

jjbrosnan commented Nov 3, 2021

The speedup using `learn` can be checked with the following script:

from deephaven.TableTools import emptyTable
from deephaven import learn
import numpy as np
import time

end, start = 0, 0
legacy_elapsed, new_elapsed = 0, 0

test_samples = [1001, 10001, 100001, 500001, 1000001]

def compute_sin(x):
    return np.sin(x)

def table_to_numpy(idx, cols):
    global end, start, legacy_elapsed
    start = time.time()
    
    return_array = np.empty([idx.getSize(), len(cols)], dtype = float)
    row_iter = idx.iterator()
    i = 0
    while row_iter.hasNext():
        key = row_iter.next()
        for j, col in enumerate(cols):
            return_array[i, j] = col.get(key)
        i += 1

    end = time.time()
    legacy_elapsed = end - start

    return np.squeeze(return_array)

def new_table_to_numpy(idx, cols):

    global end, start, new_elapsed

    start = time.time()
    buffer = learn.gatherer.create_2d_tensor(idx, cols)
    tensor = np.frombuffer(buffer, dtype = float)
    tensor.shape = (idx.getSize(), len(cols))
    end = time.time()

    new_elapsed = end - start

    return np.squeeze(tensor)

def numpy_to_table(data, idx):
    return data[idx]

for n_samples in test_samples:

    print(str(n_samples) + " rows.")

    source = emptyTable(n_samples).update("X = (i / (double) n_samples) * 2 * Math.PI", "Y = X")

    result_legacy = learn.learn(
        table = source,
        model_func = compute_sin,
        inputs = [learn.Input("X", table_to_numpy)],
        outputs = [learn.Output("SinX", numpy_to_table)],
        batch_size = n_samples
    )
    print("LEGACY: Time elapsed: " + str(legacy_elapsed) + " seconds.")

    result_v1 = learn.learn(
        table = source,
        model_func = compute_sin,
        inputs = [learn.Input("X", new_table_to_numpy)],
        outputs = [learn.Output("SinX", numpy_to_table)],
        batch_size = n_samples
    )
    speedup = np.round(legacy_elapsed / new_elapsed, 2)
    print("NEW: Time elapsed: " + str(new_elapsed) + " seconds.  This is " + str(speedup) + "x faster.")

A first run shows the following speedups:

1,001 rows: 13.2x
10,001 rows: 36.14x
100,001 rows: 110.48x
500,001 rows: 91.23x
1,000,001 rows: 171.12x

@jjbrosnan
Contributor Author

Here's an example of a Python query with the current code:

# Deephaven imports
from deephaven.TableTools import emptyTable
from deephaven.transferrer import table_to_numpy_2d
from deephaven import learn

# Standard imports
import numpy as np

num_rows = 1000

# Create a table with some data
source = emptyTable(num_rows).update("X = (double)i", "Y = (double)(2 * i)")

my_np_gatherer = lambda idx, cols: table_to_numpy_2d(idx, cols, np.double)

def my_np_scatterer(data, idx):
    return data[idx]

# Define a function that sums rows in a NumPy array
def add_columns(columns):
    results = np.array([])
    for row in columns:
        results = np.append(results, np.sum(row))
    return results

# Apply the add_columns function to source with learn
result = learn.learn(
    table = source,
    model_func = add_columns,
    inputs = [learn.Input(["X", "Y"], my_np_gatherer)],
    outputs = [learn.Output("Z", my_np_scatterer)],
    batch_size = num_rows
)
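As an aside, the `add_columns` function above grows its result with `np.append` inside a loop, which re-allocates on every iteration. The same per-row sum can be computed in one vectorized call; this is a sketch independent of the gatherer itself:

```python
import numpy as np

def add_columns_vectorized(columns):
    # Sum across each row of the 2-D array in a single NumPy call,
    # avoiding the repeated re-allocation of the append-in-a-loop pattern.
    return np.sum(columns, axis=1)

# Rows matching the example table's X and Y columns for the first few rows.
data = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
print(add_columns_vectorized(data))  # -> [0. 3. 6.]
```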

@jjbrosnan
Contributor Author

jjbrosnan commented Nov 8, 2021

In the current version of the transferrer submodule (I'm not married to that name, I just can't think of anything better), the user explicitly sets the NumPy dtype using a lambda function.

I have tested this for all data types and have only found issues with boolean values. I believe this is because the getBoolean method returns a java.lang.Boolean and not the primitive boolean type.

Right now, the following NumPy and Python data types are supported:

Python built-in: int, float, bool (supported by explicit conversion to the corresponding NumPy dtype)
NumPy: bool, byte, float, double, int, short, long (and all of their respective aliases)

After testing with Python built-in types, I got strange errors I didn't fully understand. Thus, I decided that explicit conversion to the corresponding NumPy dtype is appropriate.
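To illustrate what that explicit conversion could look like, here is a hypothetical mapping from Python built-in types to NumPy dtypes (the names `_BUILTIN_TO_NUMPY` and `resolve_dtype` are illustrative, not part of the actual submodule):

```python
import numpy as np

# Hypothetical mapping from Python built-in types to the NumPy dtypes
# used for the explicit conversion described above.
_BUILTIN_TO_NUMPY = {
    int: np.int_,
    float: np.double,
    bool: np.bool_,
}

def resolve_dtype(py_type):
    """Map a Python built-in type to its NumPy equivalent;
    pass NumPy dtypes through unchanged."""
    return _BUILTIN_TO_NUMPY.get(py_type, py_type)

print(resolve_dtype(float) is np.double)  # -> True
```

Converting up front this way means the gatherer only ever sees NumPy dtypes, which sidesteps the strange errors seen with built-in types.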

I have yet to implement a transpose or any other array operation. I think that needs further discussion.

@jjbrosnan
Contributor Author

Currently, deephaven.learn uses an IndexSet as the first argument to the function that gathers data from a Table to a Python object. This PR creates a flat Java array using a RowSequence. I'm not sure if we should make the change in this PR or another one, but deephaven.learn needs to be updated to use a RowSequence, and not an IndexSet. Either that, or convert an IndexSet to a RowSequence under the hood (provided that's possible and not horribly inefficient).

@devinrsmith devinrsmith added the release blocker label (a bug/behavior that puts us below the "good enough" threshold to release) Dec 9, 2021
@chipkent chipkent merged commit ae32bfd into deephaven:main Dec 9, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2021
@jjbrosnan jjbrosnan deleted the java-gatherer branch December 13, 2021 15:45
Labels: core, DocumentationNeeded, java, python, release blocker