Speedup FrametoArray serializer for ChunkStore #909

BaiBaiHi · 2021-07-13T19:29:40Z

Summary
Speed up the FrametoArray serializer for ChunkStore by removing intermediate DataFrame construction.
Most of the serialization time is spent on two steps:

Constructing intermediate DataFrames
Setting the index

By removing the intermediate DataFrame construction (so we only use numpy arrays and construct the DataFrame at the very end) and constructing the index separately, we can speed the serialization up significantly.
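The difference between the two approaches can be sketched with plain pandas and numpy. This is only an illustration of the idea described above, not the code changed in this PR: the `chunks` layout (a list of dicts mapping column names to arrays) and both function names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for chunked column data: column name -> numpy array.
# This layout is illustrative only and is not arctic's internal format.
chunks = [
    {"A": np.arange(0, 3), "B": np.arange(10, 13)},
    {"A": np.arange(3, 6), "B": np.arange(13, 16)},
]
index_cols = ["A"]


def build_with_intermediate_frames(chunks, index_cols):
    # Slower pattern: one DataFrame per chunk, then concat, then set_index.
    frames = [pd.DataFrame(chunk) for chunk in chunks]
    return pd.concat(frames, ignore_index=True).set_index(index_cols)


def build_from_arrays(chunks, index_cols):
    # Concatenate the raw numpy arrays column by column first.
    columns = {name: np.concatenate([c[name] for c in chunks]) for name in chunks[0]}
    # Build the index separately from its own array(s).
    if len(index_cols) == 1:
        index = pd.Index(columns[index_cols[0]], name=index_cols[0])
    else:
        index = pd.MultiIndex.from_arrays(
            [columns[c] for c in index_cols], names=index_cols
        )
    data = {name: arr for name, arr in columns.items() if name not in index_cols}
    # Construct the DataFrame exactly once, at the very end.
    return pd.DataFrame(data, index=index)


slow = build_with_intermediate_frames(chunks, index_cols)
fast = build_from_arrays(chunks, index_cols)
print(slow.equals(fast))
```

Both functions are meant to produce the same frame; the second avoids per-chunk DataFrame objects and the `set_index` call, which are the two costs named in the summary.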

Performance comparisons

No Index - Series

    import pandas as pd
    from arctic.serialization.numpy_arrays import FrametoArraySerializer
    df = pd.Series(range(100))
    a = FrametoArraySerializer().serialize(df)

Single Chunk:
Multiple Chunks (data in list):

With Index - Series

    df = pd.Series(range(100), index=pd.Index(range(100), name='A'))
    a = FrametoArraySerializer().serialize(df)

Single Chunk:
Multiple Chunks (data in list):

No Index - Multiple columns

    import numpy as np
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=list('ABCD'))
    a = FrametoArraySerializer().serialize(df)

Single chunk (shape: (100, 4))
Multiple chunks (data in list):

With Index - Multiple columns

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=list('ABCD'))
    df = df.set_index(['A'])
    a = FrametoArraySerializer().serialize(df)

Single chunk (shape: (100, 4))
Multiple chunks (data in list):

TomTaylorLondon · 2021-07-16T12:47:33Z

LGTM

Commit a384182: Speedup FrametoArray serializer for ChunkStore by removing intermediate DataFrame construction.

BaiBaiHi force-pushed the speedup-frame-array-serializer branch from 066855e to a384182 on July 14, 2021 15:43

TomTaylorLondon requested a review from bmoscon on July 16, 2021 12:47

bmoscon approved these changes on Jul 16, 2021

bmoscon merged commit 160d261 into man-group:master Jul 16, 2021

BaiBaiHi mentioned this pull request Jul 21, 2021

Bugfix: Fix column subsetting bug in numpy deserializer #910

Merged

BaiBaiHi added a commit to BaiBaiHi/arctic that referenced this pull request Jan 7, 2022

Commit 8118025: Revert "Speedup FrametoArray serializer for ChunkStore by removing intermediate DataFrame construction. (man-group#909)" This reverts commit 160d261.

dunckerr mentioned this pull request Jan 10, 2022

release/1.80.2 #933

Merged
