Using v6d to share df across 2 independent processes with 0-copy and without deserialization #1591
-
-
Hi @meta-ks, from my experience, the dataframe should be contiguous, i.e., you can't put lots of rows whose dtype is `object`. /cc @sighingnow
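As a minimal illustration (not vineyard-specific), a column of Python dicts falls back to NumPy's `object` dtype, so its rows are scattered pointers on the heap rather than one contiguous buffer that can be shared zero-copy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # numeric column: one contiguous buffer of float64 values
    "last_price": np.random.rand(5),
    # dict column: stored as dtype 'object', an array of Python object pointers
    "ohlc": [{"open": 1, "high": 2, "low": 3, "close": 4}] * 5,
})
print(df.dtypes)
```

Only the `float64` column is backed by a single shareable memory block; the `object` column must be serialized element by element.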
-
Hi @meta-ks, Thanks for posting the question here. May I know the dtypes/schema of this pandas DataFrame?
-
I have tried to reproduce such dataframes; you can see that putting/getting Arrow tables is quite a bit faster than the same operations on pandas dataframes, and the `to_pandas`/`from_pandas` conversion is time-consuming:
```python
#!/usr/bin/env python3
import functools
import time
import sys

import numpy as np
import pandas as pd
import pyarrow as pa
import vineyard


def timing(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = fn(*args, **kwargs)
        end_time = time.time()
        print("Time elapsed for %s: %.3f s" % (fn, end_time - start_time))
        return result
    return wrapper


def generate_dataframe(dtypes, num_rows=10):
    columns = dict()
    for name, dtype in dtypes.items():
        if dtype == str:
            columns[name] = np.array([str(i) for i in range(num_rows)])
        elif name == 'ohlc':
            columns[name] = np.array([{'open': 1, 'high': 2, 'low': 3, 'close': 4} for i in range(num_rows)])
        elif name == 'depth':
            columns[name] = [{'buy': [{'price': 1, 'quantity': 2} for _ in range(30)]} for i in range(num_rows)]
        else:
            columns[name] = np.random.randint(0, 100, size=num_rows).astype(dtype)
    return pd.DataFrame(columns)


dtypes = {
    'tradable': np.bool_,
    'instrument_token': str,
    'last_price': np.float64,
    'last_trade_quality': np.int64,
    'avg_price': np.float64,
    'volume': np.int64,
    'buy_quantity': np.int64,
    'sell_quantity': np.int64,
    'ohlc': dict,
    'change': np.float64,
    'last_trade_time': np.int64,
    'oi': np.float64,
    'oi_day_high': np.float64,
    'oi_day_low': np.float64,
    'exchange_timestamp': np.int64,
    'depth': dict,
}


@timing
def to_arrow_table(df):
    return pa.Table.from_pandas(df)


@timing
def to_pandas_dataframe(tb):
    return tb.to_pandas()


@timing
def put_vineyard(client, df):
    return client.put(df)


@timing
def get_vineyard(client, object_id):
    return client.get(object_id)


def bench(vineyard_ipc_socket, num_rows):
    df = generate_dataframe(dtypes, num_rows=num_rows)
    print(df.info())

    client = vineyard.connect(vineyard_ipc_socket)

    print('testing put/get pandas dataframe: ')
    object_id = put_vineyard(client, df)
    df2 = get_vineyard(client, object_id)

    print('testing put/get arrow table: ')
    tb = to_arrow_table(df)
    object_id = put_vineyard(client, tb)
    tb2 = get_vineyard(client, object_id)
    df2 = to_pandas_dataframe(tb2)
    print(tb2.schema)


if __name__ == '__main__':
    if len(sys.argv) > 1:
        num_rows = int(sys.argv[1])
    else:
        num_rows = 400000
    if len(sys.argv) > 2:
        vineyard_ipc_socket = sys.argv[2]
    else:
        vineyard_ipc_socket = '/tmp/vineyard.sock'
    bench(vineyard_ipc_socket, num_rows)
```
-
Thanks @sighingnow for the detailed benchmark and answer.
-
Arrow's memory layout is more efficient, while pandas is a bit more flexible. Actually, pandas 2.0 starts to support an Arrow backend: https://pandas.pydata.org/docs/user_guide/pyarrow.html, which would be a game changer IMO.
I think yes.