[WIP] switch to parallel queries, and write intermediate dataframes to disk #522

akotlar · 2024-06-07T02:19:53Z

Switch get_annotation_result_from_query to a parallel executor.
Save memory by writing intermediate dataframes to disk and reading the data back in as a memory-mapped file.

WIP. Tests still need to be written, and the schema must be provided explicitly.
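To make the pattern described above concrete, here is a minimal, standard-library-only sketch of it: parallel workers each write an intermediate chunk to disk, the chunks are concatenated into one file, and the result is read back through a memory map. It is not the code in this PR. `fetch_chunk`, `write_chunk`, `run_parallel_query`, and `count_rows_mmap` are hypothetical names, CSV files stand in for the intermediate dataframes, and the `max_threads` parameter is assumed to behave like the one mentioned in the commit list.

```python
import csv
import mmap
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(chunk_id: int) -> list:
    """Hypothetical stand-in for one slice of a query: returns (id, gene) rows."""
    return [(chunk_id * 3 + i, f"GENE{chunk_id}") for i in range(3)]


def write_chunk(chunk_id: int, tmp_dir: str) -> str:
    """Fetch one chunk and write it straight to disk; return the file path."""
    path = os.path.join(tmp_dir, f"chunk_{chunk_id}.csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(fetch_chunk(chunk_id))
    return path


def run_parallel_query(n_chunks: int, tmp_dir: str, max_threads: int = 4) -> str:
    """Write chunks in parallel, then concatenate them into one file on disk."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() returns results in submission order, so chunk order is stable.
        paths = list(pool.map(lambda i: write_chunk(i, tmp_dir), range(n_chunks)))
    final_path = os.path.join(tmp_dir, "combined.csv")
    with open(final_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as chunk:
                # Stream each chunk so only a small buffer is held in memory.
                shutil.copyfileobj(chunk, out)
    return final_path


def count_rows_mmap(path: str) -> int:
    """Count rows through a read-only memory map; pages load on demand."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            while view.readline():
                count += 1
            return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        combined = run_parallel_query(n_chunks=3, tmp_dir=demo_dir, max_threads=2)
        print(count_rows_mmap(combined))
```

The point of the design is that peak memory is bounded by one chunk plus whatever pages the reader touches, rather than by the full result set.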

# Default behavior is to show 1 sample per row, i.e. `melt_samples=True`
query_result_df_array_of_structs_select_fields = get_annotation_result_from_query(
    query_string="cadd:>20",
    index_name=index_name,
    bystro_api_auth=user,
    fields=['refSeq.name2', 'refSeq.name', 'refSeq.exonicAlleleFunction', 'refSeq.siteType'],
    explode_field='refSeq.name2',
    tmp_dir='/mnt/ssd1/bystro/python/python/bystro/examples/tmp',
    output_path='foo'
)
query_result_df_array_of_structs_select_fields.head(n=10)

Here, tmp_dir and output_path are optional. If they are not provided, the intermediate files and the concatenated final table are cleaned up after completion.
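One way to read the cleanup rule for `tmp_dir` and `output_path` is sketched below. This is an assumption about the intended semantics, not the PR's implementation: `write_table` is a hypothetical helper, and plain text files stand in for the intermediate and final tables. Files are deleted only when the function chose their location itself.

```python
import os
import shutil
import tempfile
from typing import Optional


def write_table(tmp_dir: Optional[str] = None, output_path: Optional[str] = None):
    """Return (rows, paths still on disk afterwards). Hypothetical helper."""
    owns_tmp = tmp_dir is None  # nobody asked to keep the intermediates
    work_dir = tempfile.mkdtemp(prefix="query_") if owns_tmp else tmp_dir
    os.makedirs(work_dir, exist_ok=True)

    owns_output = output_path is None  # nobody asked to keep the final table
    final_path = os.path.join(work_dir, "final.txt") if owns_output else output_path

    chunk_paths = []
    try:
        # Stand-in for the intermediate files written by each query worker.
        for i in range(2):
            part_path = os.path.join(work_dir, f"part_{i}.txt")
            with open(part_path, "w") as handle:
                handle.write(f"row {i}\n")
            chunk_paths.append(part_path)

        # Concatenate the intermediates into the final table.
        with open(final_path, "w") as out:
            for part_path in chunk_paths:
                with open(part_path) as part:
                    shutil.copyfileobj(part, out)

        # Load the result before anything is deleted.
        with open(final_path) as handle:
            rows = handle.read().splitlines()
    finally:
        # Remove only what the caller did not ask to keep.
        if owns_output and os.path.exists(final_path):
            os.remove(final_path)
        if owns_tmp:
            shutil.rmtree(work_dir, ignore_errors=True)

    kept = [p for p in chunk_paths + [final_path] if os.path.exists(p)]
    return rows, kept
```

Under this reading, calling `write_table()` with no arguments leaves nothing on disk, while passing both arguments keeps the intermediate files and the final table for later inspection.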

akotlar added 4 commits June 6, 2024 22:17

switch to parallel queries, and write intermediate dataframes to disk

05c7845

log memory usage before/after concatenating; thread through max_threads

08d0cbb

wip

4ae1fb6

wip

0a31bc6
