
Java gatherer #1523

Merged
merged 49 commits into from
Dec 9, 2021

Conversation

jjbrosnan
Contributor

The Java gatherer transfers table data to a Python object. On the Python side, it works via the `frombuffer` method, which both NumPy and Torch provide. This is acceptable, because these are the two most common data containers used with AI.

Speed testing between Chip's v1 and v2 showed mixed results. This code uses his v2, because it showed better speedups more often than v1. The mixed results could have justified including both methods, but that seems unnecessary given that both showed speedups of 5x to 250x over the legacy method.

I have tested for memory leaks and differences between outputs of legacy and new code and have been able to find none.
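For reference, the `frombuffer` pattern this relies on looks like the following. This is a minimal sketch using plain NumPy; the byte buffer here is fabricated from a known array for illustration, standing in for what the Java gatherer actually hands back:

```python
import numpy as np

# Stand-in for the raw byte buffer the Java gatherer would return.
# A real buffer comes from the gatherer; here we build one from a known array.
original = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
buffer = original.tobytes()

# frombuffer reinterprets the bytes without copying; then we reshape
# into the (rows, cols) layout of the source table.
tensor = np.frombuffer(buffer, dtype=np.double)
tensor.shape = (3, 2)
print(tensor)
```

Torch supports the same pattern via `torch.frombuffer`, which is why these two containers cover the common AI use cases without a copy on the Python side.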

@jjbrosnan jjbrosnan added this to the Nov 2021 milestone Nov 3, 2021
@jjbrosnan jjbrosnan self-assigned this Nov 3, 2021
@jjbrosnan
Contributor Author

jjbrosnan commented Nov 3, 2021

The speedup using `learn` can be checked with the following script:

from deephaven.TableTools import emptyTable
from deephaven import learn
import numpy as np
import time

end, start = 0, 0
legacy_elapsed, new_elapsed = 0, 0

test_samples = [1001, 10001, 100001, 500001, 1000001]

def compute_sin(x):
    return np.sin(x)

def table_to_numpy(idx, cols):
    global end, start, legacy_elapsed
    start = time.time()
    
    return_array = np.empty([idx.getSize(), len(cols)], dtype = float)
    row_iter = idx.iterator()
    i = 0
    while row_iter.hasNext():
        key = row_iter.next()
        for j, col in enumerate(cols):
            return_array[i, j] = col.get(key)
        i += 1

    end = time.time()
    legacy_elapsed = end - start

    return np.squeeze(return_array)

def new_table_to_numpy(idx, cols):

    global end, start, new_elapsed

    start = time.time()
    buffer = learn.gatherer.create_2d_tensor(idx, cols)
    tensor = np.frombuffer(buffer, dtype = float)
    tensor.shape = (idx.getSize(), len(cols))
    end = time.time()

    new_elapsed = end - start

    return np.squeeze(tensor)

def numpy_to_table(data, idx):
    return data[idx]

for n_samples in test_samples:

    print(str(n_samples) + " rows.")

    source = emptyTable(n_samples).update("X = (i / (double) n_samples) * 2 * Math.PI", "Y = X")

    result_legacy = learn.learn(
        table = source,
        model_func = compute_sin,
        inputs = [learn.Input("X", table_to_numpy)],
        outputs = [learn.Output("SinX", numpy_to_table)],
        batch_size = n_samples
    )
    print("LEGACY: Time elapsed: " + str(legacy_elapsed) + " seconds.")

    result_v1 = learn.learn(
        table = source,
        model_func = compute_sin,
        inputs = [learn.Input("X", new_table_to_numpy)],
        outputs = [learn.Output("SinX", numpy_to_table)],
        batch_size = n_samples
    )
    speedup = np.round(legacy_elapsed / new_elapsed, 2)
    print("NEW: Time elapsed: " + str(new_elapsed) + " seconds.  This is " + str(speedup) + "x faster.")

A first run shows the following speedups:

1,001 rows: 13.2x
10,001 rows: 36.14x
100,001 rows: 110.48x
500,001 rows: 91.23x
1,000,001 rows: 171.12x

@jjbrosnan
Contributor Author

Here's an example of a Python query with the current code:

# Deephaven imports
from deephaven.TableTools import emptyTable
from deephaven.transferrer import table_to_numpy_2d
from deephaven import learn

# Standard imports
import numpy as np

num_rows = 1000

# Create a table with some data
source = emptyTable(num_rows).update("X = (double)i", "Y = (double)(2 * i)")

my_np_gatherer = lambda idx, cols: table_to_numpy_2d(idx, cols, np.double)

def my_np_scatterer(data, idx):
    return data[idx]

# Define a function that sums rows in a NumPy array
def add_columns(columns):
    results = np.array([])
    for row in columns:
        results = np.append(results, np.sum(row))
    return results

# Apply the add_columns function to source with learn
result = learn.learn(
    table = source,
    model_func = add_columns,
    inputs = [learn.Input(["X", "Y"], my_np_gatherer)],
    outputs = [learn.Output("Z", my_np_scatterer)],
    batch_size = num_rows
)
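As an aside, the `add_columns` function above grows its result with `np.append` inside a loop, which re-allocates on every iteration. The same per-row sum can be computed in one vectorized call; this is a sketch independent of the gatherer itself:

```python
import numpy as np

def add_columns_vectorized(columns):
    # Sum across each row of the 2-D array in a single NumPy call,
    # avoiding the repeated re-allocation of the append-in-a-loop pattern.
    return np.sum(columns, axis=1)

# Rows matching the example table's X and Y columns for the first few rows.
data = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
print(add_columns_vectorized(data))  # -> [0. 3. 6.]
```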

@jjbrosnan
Contributor Author

jjbrosnan commented Nov 8, 2021

In the current version of the transferrer submodule (I'm not married to that name, I just can't think of anything better), the user explicitly sets the NumPy dtype using a lambda function.

I have tested this for all data types and have only found issues with boolean values. I believe this is because the getBoolean method returns a java.lang.Boolean and not the primitive boolean type.

Right now, the following NumPy and Python data types are supported:

Python built-in: int, float, bool (supported by explicit conversion to the corresponding NumPy dtype)
NumPy: bool, byte, float, double, int, short, long (and all of their respective aliases)

After testing with Python built-in types, I got strange errors I didn't fully understand. Thus, I decided that explicit conversion to the corresponding NumPy dtype is appropriate.
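To illustrate what that explicit conversion could look like, here is a hypothetical mapping from Python built-in types to NumPy dtypes (the names `_BUILTIN_TO_NUMPY` and `resolve_dtype` are illustrative, not part of the actual submodule):

```python
import numpy as np

# Hypothetical mapping from Python built-in types to the NumPy dtypes
# used for the explicit conversion described above.
_BUILTIN_TO_NUMPY = {
    int: np.int_,
    float: np.double,
    bool: np.bool_,
}

def resolve_dtype(py_type):
    """Map a Python built-in type to its NumPy equivalent;
    pass NumPy dtypes through unchanged."""
    return _BUILTIN_TO_NUMPY.get(py_type, py_type)

print(resolve_dtype(float) is np.double)  # -> True
```

Converting up front this way means the gatherer only ever sees NumPy dtypes, which sidesteps the strange errors seen with built-in types.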

I have yet to implement a transpose or any other array operation. I think that needs further discussion.

@jjbrosnan
Contributor Author

Currently, deephaven.learn uses an IndexSet as the first argument to the function that gathers data from a Table to a Python object. This PR creates a flat Java array using a RowSequence. I'm not sure if we should make the change in this PR or another one, but deephaven.learn needs to be updated to use a RowSequence, and not an IndexSet. Either that, or convert an IndexSet to a RowSequence under the hood (provided that's possible and not horribly inefficient).

@devinrsmith devinrsmith added the release blocker label (a bug/behavior that puts us below the "good enough" threshold to release) Dec 9, 2021
@chipkent chipkent merged commit ae32bfd into deephaven:main Dec 9, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2021
@jjbrosnan jjbrosnan deleted the java-gatherer branch December 13, 2021 15:45
Labels: core, DocumentationNeeded, java, python, release blocker