Speedup FrametoArray serializer for ChunkStore #909

Merged 1 commit into man-group:master on Jul 16, 2021

BaiBaiHi
Contributor

Summary
Speed up the FrametoArray serializer for ChunkStore by removing intermediate DataFrame construction.
Most of the serialization time is spent on the following:

  1. Constructing intermediate DataFrames
  2. Setting the index
    [two images not preserved]

By removing the intermediate DataFrame construction (so we only use numpy arrays and construct the DataFrame at the very end) and constructing the index separately, we can speed the serialization up significantly.
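
The idea can be illustrated with a minimal sketch. It is not the code from this PR: the function names and the chunk layout (a dict mapping column name to numpy array) are assumptions made for illustration only. The first function builds a DataFrame per chunk and sets its index before concatenating; the second concatenates the raw numpy arrays, builds the index separately, and constructs a single DataFrame at the end.

```python
import numpy as np
import pandas as pd


def combine_via_frames(chunks, index_name):
    """Baseline: one DataFrame per chunk, set its index, then concatenate."""
    frames = [pd.DataFrame(chunk).set_index(index_name) for chunk in chunks]
    return pd.concat(frames)


def combine_via_arrays(chunks, index_name):
    """Concatenate numpy arrays per column, then build the DataFrame once."""
    names = list(chunks[0].keys())
    merged = {name: np.concatenate([chunk[name] for chunk in chunks])
              for name in names}
    # Build the index separately from the data columns.
    index = pd.Index(merged.pop(index_name), name=index_name)
    return pd.DataFrame(merged, index=index)


# Two hypothetical chunks, each mapping column name -> numpy array.
chunks = [
    {"A": np.arange(0, 3), "B": np.array([10, 20, 30])},
    {"A": np.arange(3, 6), "B": np.array([40, 50, 60])},
]

slow = combine_via_frames(chunks, "A")
fast = combine_via_arrays(chunks, "A")
print(slow.equals(fast))
```

Both functions return the same frame; the second avoids creating and indexing one DataFrame per chunk.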

Performance comparisons

No Index - Series

    import pandas as pd
    from arctic.serialization.numpy_arrays import FrametoArraySerializer
    df = pd.Series(range(100))
    a = FrametoArraySerializer().serialize(df)
  1. Single chunk: [image not preserved]
  2. Multiple chunks (data in list): [image not preserved]

With Index - Series

    df = pd.Series(range(100), index=pd.Index(range(100), name='A'))
    a = FrametoArraySerializer().serialize(df)
  1. Single chunk: [image not preserved]
  2. Multiple chunks (data in list): [image not preserved]

No Index - Multiple columns

    import numpy as np
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=list('ABCD'))
    a = FrametoArraySerializer().serialize(df)
  1. Single chunk (shape: (100, 4)): [image not preserved]
  2. Multiple chunks (data in list): [image not preserved]

With Index - Multiple columns

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=list('ABCD'))
    df = df.set_index(['A'])
    a = FrametoArraySerializer().serialize(df)
  1. Single chunk (shape: (100, 4)): [image not preserved]
  2. Multiple chunks (data in list): [image not preserved]
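
Comparisons of this kind can be taken with the standard `timeit` module. The sketch below is an assumption about how one might time the two index strategies on the same 100 x 4 input; it does not call arctic, and `best_time`, `with_set_index`, and `with_separate_index` are hypothetical helpers, not part of this PR. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def best_time(fn, repeat=3, number=50):
    """Return the best observed per-call time in seconds for fn()."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number


data = np.random.randint(0, 100, size=(100, 4))
columns = list("ABCD")


def with_set_index():
    # Build the full frame first, then move column A into the index.
    return pd.DataFrame(data, columns=columns).set_index(["A"])


def with_separate_index():
    # Build the index from the raw array and pass it to the constructor.
    index = pd.Index(data[:, 0], name="A")
    return pd.DataFrame(data[:, 1:], columns=columns[1:], index=index)


timings = {
    "set_index": best_time(with_set_index),
    "separate index": best_time(with_separate_index),
}
for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e6:.1f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes; the two functions produce equal frames, so only the construction path differs.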

BaiBaiHi force-pushed the speedup-frame-array-serializer branch from 066855e to a384182 on July 14, 2021 15:43
TomTaylorLondon requested a review from bmoscon on July 16, 2021 12:47

TomTaylorLondon
Contributor

LGTM

bmoscon merged commit 160d261 into man-group:master on Jul 16, 2021
dunckerr added a commit that referenced this pull request Oct 28, 2021
…n when querying symbols (#856)

* fix(list_symbols): Use IXSCAN queries for the versions collection when querying symbols

This is mainly a reversion on #520, but we add another index to avoid a FETCH stage which gives a massive speedup.

* feat(change_log): Adding the latest improvement to the changelog

* Update changes file for release

* Update setup.py

* Update CHANGES.md

* Fix flake8 errors (#875)

* docs: fix simple typo, verififes -> verifies (#877)

There is a small typo in arctic/store/_version_store_utils.py.

Should read `verifies` rather than `verififes`.

* Fix for issue #815 (#881)

* Handle uninitialized cache object

* Fixes #874: Pickle protocol 5 not supported in 3.7 and below

* Fixes #872: Do not spam if not permissioned on cache db

* Pin Pandas to check ck build

* Skip flaky test

* pin numpy as well for 3.7

* Update chunkstore.py

* Speedup FrametoArray serializer for ChunkStore by removing intermediate DataFrame construction. (#909)

* Fix column subsetting bug. (#910)

* circle.ci build config.yml (#917)

circle.ci build running. all tests passed
Co-authored-by: Kerr, Duncan (London) <Duncan.Kerr@man.com>

* master branch simple build (#918)

* do not release master branch build

* fix master build

Co-authored-by: Kerr, Duncan (London) <Duncan.Kerr@man.com>

* fixed build badge for master branch (#919)

Co-authored-by: Kerr, Duncan (London) <Duncan.Kerr@man.com>

* update CHANGES.md for v1.80.0 (#920)

Co-authored-by: Kerr, Duncan (London) <Duncan.Kerr@man.com>

* fix(list_symbols): Use IXSCAN queries for the versions collection when querying symbols

This is mainly a reversion on #520, but we add another index to avoid a FETCH stage which gives a massive speedup.

* feat(change_log): Adding the latest improvement to the changelog

* rebased Rob's changes to master 1.80.0

* fixed circleci typos

Co-authored-by: Bryant Moscon <bmoscon@gmail.com>
Co-authored-by: Tim Gates <tim.gates@iress.com>
Co-authored-by: enricodetoma <enrico.detoma@gmail.com>
Co-authored-by: Shashank Khare <shashank88@gmail.com>
Co-authored-by: Tom Taylor <TomTaylorLondon@users.noreply.github.com>
Co-authored-by: Dela B <37855280+BaiBaiHi@users.noreply.github.com>
Co-authored-by: duncan <duncan.kerr@live.com>
Co-authored-by: Kerr, Duncan (London) <Duncan.Kerr@man.com>
BaiBaiHi added a commit to BaiBaiHi/arctic that referenced this pull request Jan 7, 2022
…termediate DataFrame construction. (man-group#909)"

This reverts commit 160d261.
dunckerr mentioned this pull request on Jan 10, 2022