Using v6d to share df across 2 independent processes with 0-copy and without deserialization #1591
-
-
Hi @meta-ks, from my experience, the dataframe should be contiguous, i.e., you can't put lots of rows whose dtype is `object`. /cc @sighingnow
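As a minimal illustration (not vineyard-specific), a column of Python dicts falls back to NumPy's `object` dtype, so its rows are scattered pointers on the heap rather than one contiguous buffer that can be shared zero-copy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # numeric column: one contiguous buffer of float64 values
    "last_price": np.random.rand(5),
    # dict column: stored as dtype 'object', an array of Python object pointers
    "ohlc": [{"open": 1, "high": 2, "low": 3, "close": 4}] * 5,
})
print(df.dtypes)
```

Only the `float64` column is backed by a single shareable memory block; the `object` column must be serialized element by element.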
-
Hi @meta-ks, Thanks for posting the question here. May I know the dtypes/schema of this pandas DataFrame?
-
I have tried to reproduce such dataframes; you can see that putting/getting Arrow tables is quite a bit faster than the same operations on pandas dataframes, and the `to_pandas`/`from_pandas` conversion is time-consuming:
```python
#!/usr/bin/env python3
import functools
import time
import sys

import numpy as np
import pandas as pd
import pyarrow as pa
import vineyard


def timing(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = fn(*args, **kwargs)
        end_time = time.time()
        print("Time elapsed for %s: %.3f s" % (fn, end_time - start_time))
        return result
    return wrapper


def generate_dataframe(dtypes, num_rows=10):
    columns = dict()
    for name, dtype in dtypes.items():
        if dtype == str:
            columns[name] = np.array([str(i) for i in range(num_rows)])
        elif name == 'ohlc':
            columns[name] = np.array([{'open': 1, 'high': 2, 'low': 3, 'close': 4} for i in range(num_rows)])
        elif name == 'depth':
            columns[name] = [{'buy': [{'price': 1, 'quantity': 2} for _ in range(30)]} for i in range(num_rows)]
        else:
            columns[name] = np.random.randint(0, 100, size=num_rows).astype(dtype)
    return pd.DataFrame(columns)


dtypes = {
    'tradable': np.bool_,
    'instrument_token': str,
    'last_price': np.float64,
    'last_trade_quality': np.int64,
    'avg_price': np.float64,
    'volume': np.int64,
    'buy_quantity': np.int64,
    'sell_quantity': np.int64,
    'ohlc': dict,
    'change': np.float64,
    'last_trade_time': np.int64,
    'oi': np.float64,
    'oi_day_high': np.float64,
    'oi_day_low': np.float64,
    'exchange_timestamp': np.int64,
    'depth': dict,
}


@timing
def to_arrow_table(df):
    return pa.Table.from_pandas(df)


@timing
def to_pandas_dataframe(tb):
    return tb.to_pandas()


@timing
def put_vineyard(client, df):
    return client.put(df)


@timing
def get_vineyard(client, object_id):
    return client.get(object_id)


def bench(vineyard_ipc_socket, num_rows):
    df = generate_dataframe(dtypes, num_rows=num_rows)
    print(df.info())

    client = vineyard.connect(vineyard_ipc_socket)

    print('testing put/get pandas dataframe: ')
    object_id = put_vineyard(client, df)
    df2 = get_vineyard(client, object_id)

    print('testing put/get arrow table: ')
    tb = to_arrow_table(df)
    object_id = put_vineyard(client, tb)
    tb2 = get_vineyard(client, object_id)
    df2 = to_pandas_dataframe(tb2)
    print(tb2.schema)


if __name__ == '__main__':
    if len(sys.argv) > 1:
        num_rows = int(sys.argv[1])
    else:
        num_rows = 400000
    if len(sys.argv) > 2:
        vineyard_ipc_socket = sys.argv[2]
    else:
        vineyard_ipc_socket = '/tmp/vineyard.sock'
    bench(vineyard_ipc_socket, num_rows)
```
-
Thanks @sighingnow for the detailed benchmark and answer.
-
Arrow's memory layout is more efficient, while pandas is a bit more flexible. Actually, pandas 2.0 starts to support an Arrow backend: https://pandas.pydata.org/docs/user_guide/pyarrow.html, which would be a game changer IMO.
I think yes.