WIP Read as arrow #1831

Draft · willdealtry wants to merge 18 commits into master
Conversation

willdealtry
Collaborator

WIP read dataframe as Arrow arrays
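A minimal sketch of the intended usage (the OutputFormat enum and its import path are assumptions here; the API is still WIP):

import pyarrow as pa
from arcticdb import Arctic
from arcticdb import OutputFormat  # hypothetical import path for the WIP enum

ac = Arctic("lmdb://./arctic_db")  # local LMDB store, for illustration
lib = ac.get_library("prices", create_if_missing=True)

# Read a symbol back as a pyarrow.Table instead of a pandas DataFrame
arrow_table = lib.read("Symbol", output_format=OutputFormat.ARROW).data
assert isinstance(arrow_table, pa.Table)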

@wuxianliang

Dear friend, great job! I have saved all stock data in ArcticDB, and I would like to read the data from SSD directly into an Arrow table. Then we can query it with DuckDB, LanceDB, or even KuzuDB in memory. As I understand it, we could eventually use ArcticDB zero-copy like this?

import arcticdb
import duckdb
import lancedb
import kuzudb

......

arrow_table = lib.read("Symbol", output_format=OutputFormat.ARROW).data

duckdb.sql("SELECT * FROM arrow_table")
lancedb.create_table("Symbol", arrow_table, schema=schema)
kuzudb.execute("COPY Symbol FROM arrow_table")
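For the DuckDB step specifically, a runnable sketch (the ArcticDB read is replaced with an in-memory table here, since the ARROW output format is still in progress; DuckDB's replacement scan picks up a pyarrow.Table by its local variable name):

import duckdb
import pyarrow as pa

# Stand-in for lib.read("Symbol", output_format=OutputFormat.ARROW).data
arrow_table = pa.table({
    "price": [10.0, 10.5, 11.0],
    "volume": [100, 250, 175],
})

# DuckDB scans the pyarrow.Table referenced by name in the local scope
print(duckdb.sql("SELECT avg(price) AS avg_price FROM arrow_table").fetchall())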

@willdealtry
Collaborator Author

> Dear friend, great job! I have saved all stock data in ArcticDB, and I would like to read the data from SSD directly into an Arrow table. Then we can query it with DuckDB, LanceDB, or even KuzuDB in memory. As I understand it, we could eventually use ArcticDB zero-copy like this?

Yes, that's exactly right. I'm very pleased to hear that you're excited about this piece of work!

@wuxianliang

wuxianliang commented Nov 1, 2024

Does the read_batch method support reading as Arrow too? Sometimes I want to analyze all symbols in a date range.

symbols = library.list_symbols()
batch_results = library.read_batch(symbols, date_range=date_range, output_format=OutputFormat.ARROW)

So every batch_results[i].data is an Arrow table? Then I let Claude 3.5 Sonnet code the rest.

import pyarrow as pa

# Import path may vary by ArcticDB version; VersionedItem is what read_batch
# returns on success (DataError on failure).
from arcticdb.version_store.library import VersionedItem


def fast_concat_arrow_tables(batch_results):
    """
    Fast concatenation of Arrow tables from batch results.

    Parameters
    ----------
    batch_results : List[Union[VersionedItem, DataError]]
        Results from an ArcticDB read_batch operation with
        output_format=OutputFormat.ARROW.

    Returns
    -------
    pyarrow.Table
        Concatenated table with an added symbol column.
    """
    # 1. Pre-allocate the list with a known size for better memory efficiency
    tables = [None] * len(batch_results)

    # 2. Add a symbol column to each table in one pass
    for i, result in enumerate(batch_results):
        if isinstance(result, VersionedItem):
            # With OutputFormat.ARROW, .data is already a pyarrow.Table,
            # so no conversion step is needed
            table = result.data
            # Create the symbol array once per table
            symbol_array = pa.array([result.symbol] * len(table))
            # Store the table with an appended symbol column
            tables[i] = table.append_column('symbol', symbol_array)

    # 3. Filter out failed reads (None slots) and concatenate all tables
    # at once; pa.concat_tables requires the per-symbol schemas to match
    tables = [t for t in tables if t is not None]
    return pa.concat_tables(tables)
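Once concatenated, the appended symbol column makes per-symbol slicing cheap; a small follow-on sketch (the symbol name is illustrative):

import pyarrow.compute as pc

combined = fast_concat_arrow_tables(batch_results)

# Select one symbol's rows from the combined table
one_symbol = combined.filter(pc.equal(combined["symbol"], "AAPL"))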
