Speed up augur filter without replacing Pandas #1573

Open
4 of 5 tasks
victorlin opened this issue Aug 9, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@victorlin
Member

victorlin commented Aug 9, 2024

Context

See parent issue for context on how Pandas is used in augur filter and why it is slow.

Some optimizations to the current code may be possible without the full rewrite that #1574 would require.

Progress

Not pursued

victorlin added the enhancement (New feature or request) label Aug 9, 2024
@victorlin
Member Author

Another potential speedup here is to leverage the pyarrow+pandas integration, which should be more mature as of pandas v2. Pandas is pushing further in this direction as well, with pyarrow slated to become a required dependency in v3.

Unfortunately, it's not as simple as setting engine='pyarrow'. I tried briefly with 11743ac. If we want to go down this route, it might be best to convert the metadata TSV to parquet upfront, which would require rewriting some logic (I'm not sure how much). Previous discussion on using parquet for metadata: (1, 2)
