
Threading issues accessing ADLSGen2 table from Python #915

Closed
craustin opened this issue Nov 4, 2022 · 25 comments · Fixed by #943
Labels
bug Something isn't working

Comments

@craustin

craustin commented Nov 4, 2022

Environment

Delta-rs version: v0.6.2

Binding: Python 3.8

Environment:

  • Cloud provider: Delta table on ADLS Gen2; Process running in container
  • OS: Debian 10
  • Other: pyarrow 10.0.0 (but repros w/ earlier versions too)

Bug

What happened:

I'm following the tutorial here:

from datetime import date
from deltalake import DeltaTable
import pyarrow.dataset as ds
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "..."}
dt = DeltaTable("abfss://my-container@stmystorageaccount.dfs.core.windows.net/data", storage_options=storage_options)
dataset = dt.to_pyarrow_dataset()
df = dataset.to_table(filter=(ds.field("my_date") == date(2022,11,3)) & (ds.field("my_field") == 'some_key')).to_pandas()
thread '<unnamed>' panicked at 'dispatch dropped without returning error', /root/.cargo/registry/src/git.luolix.top-1ecc6299db9ec823/hyper-0.14.20/src/client/conn.rs:329:35
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In [54], line 1
----> 1 df = dataset.to_table(filter=(ds.field("my_date") == date(2022,11,3)) & (ds.field("my_field") == 'some_key')).to_pandas()

File ~/.conda/envs/my_env/lib/python3.8/site-packages/pyarrow/_dataset.pyx:332, in pyarrow._dataset.Dataset.to_table()
File ~/.conda/envs/my_env/lib/python3.8/site-packages/pyarrow/_dataset.pyx:2661, in pyarrow._dataset.Scanner.to_table()
File ~/.conda/envs/my_env/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
PanicException: dispatch dropped without returning error

Sometimes the first line (thread '<unnamed>'...) is printed more than once. Sometimes I get this error instead:

...
File ~/.conda/envs/my_env/lib/python3.8/site-packages/pyarrow/_dataset.pyx:332, in pyarrow._dataset.Dataset.to_table()
File ~/.conda/envs/my_env/lib/python3.8/site-packages/pyarrow/_dataset.pyx:2661, in pyarrow._dataset.Scanner.to_table()
File ~/.conda/envs/my_env/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/.conda/envs/my_env/lib/python3.8/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()
OSError: Tried reading 8369271 bytes starting at position 50076484 from file but only got 4908220

What you expected to happen:
No errors. pd.DataFrame comes back.

How to reproduce it:
Run the above code.

More details:
Possibly related: running the DeltaTable(...) constructor in a Jupyter notebook hangs and never returns, so I'm running in a raw python / ipython interpreter. I don't see the hang there, but I do see the above threading issues.

(delta-rs is awesome by the way. Thank you! 🙏)

@craustin added the bug label Nov 4, 2022
@wjones127
Collaborator

Thanks for reporting this. I think I might have reproduced something similar while working on #893. I'll look into this specific example (the other one was on S3) in #912. My guess is that there are some places where we were panicking (calling .unwrap()) in the filesystem dispatch, and we need to make sure we handle those errors.

@craustin
Author

craustin commented Nov 9, 2022

Let me know of any possible workarounds for the Jupyter hang or the ADLS threading issues (maybe the same root cause). I can file a separate issue for the Jupyter hang if that'd be helpful. I can also try to help out with the fix and tests if you point me in the right direction.

@dennyglee
Collaborator

QQ @craustin: we just released delta-rs Python 0.6.3, which includes #893. I was wondering if the same behavior happens with the latest release? Thanks!

@craustin
Author

Yes, we see the same behavior in v0.6.3.

@dennyglee
Collaborator

Got it, thanks for the update! Will dive into this a bit.

@dennyglee
Collaborator

dennyglee commented Nov 15, 2022

@craustin Quick confirmation: if you ran the below instead of the filtered statement, would the query work as expected?

df = dataset.to_table().to_pandas()

The reason I'm asking is that I'm wondering whether there is something specific to your dataset (e.g. size). At least, a quick test on one of my datasets (528 rows, 30 columns) in ADLSgen2 resulted in the query working as expected.

@craustin
Author

craustin commented Nov 15, 2022

I get the same errors when I remove the filter. When I run df = dataset.to_table().to_pandas() I get the following error (for a different filename each time):

PyDeltaTableError: Generic MicrosoftAzure error: Error performing get request data/part-00000-caeae246-493e-46fb-9df0-724b1521604c.c000.snappy.parquet: response error "request error", after 0 retries: error sending request for url (https://stmystorageaccount.blob.core.windows.net/my-container/data/part-00000-caeae246-493e-46fb-9df0-724b1521604c.c000.snappy.parquet): dispatch task is gone: runtime dropped the dispatch task

There are ~70 transactions in the delta log and ~50 parquet files (snappy format, from Databricks), with ~1M rows and 20 columns.

Also, as I mentioned in the initial issue, I get an indefinite hang (or so it seems) when running from Jupyter. Are you able to run from a notebook? It seems likely to be a related threading issue. I don't see the hang when running from the python or ipython REPLs, but I do get the above errors.

@dennyglee
Collaborator

Got it, thanks for the additional info. I'll set up a larger example to see if I can repro. Any chance you can run this on a smaller dataset in the interim to see if you get the same error?

@craustin
Author

I think it might not be size-related, but rather something wrong with threading, at least in my environment somehow. I just tried a table with 1 transaction, 6 snappy parquet files, 6 rows, and 4 columns, and I still get the same runtime dropped the dispatch task error, with one of the 6 filenames each time.

@dennyglee
Collaborator

Gotcha, thanks for diving in more. Okay, let me see if I can repro your specific environment and go backwards from there.

@craustin
Author

craustin commented Nov 16, 2022

I was able to repro the error (runtime dropped the dispatch task) on my personal laptop (Apple M1 Pro), outside of our work env, with an Azure Tryout subscription. I created a simple storage account and created a Delta table with the code below, and I get the error:

import pandas as pd
from deltalake.writer import write_deltalake
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
df = pd.DataFrame({'x': range(100)})
table_root = "abfss://my-container@stmystorageaccount.dfs.core.windows.net/data"
write_deltalake(table_root, df, partition_by=["x"], storage_options=storage_options)

@dennyglee
Collaborator

Thanks for the additional info. Can you do me a favor and ping me on Delta Users Slack (dennyglee)? I have a few more questions.

Don't worry, I will update this issue with the pertinent info; I just didn't want to flood it with lots of small questions.

@dennyglee
Collaborator

Thanks @craustin, I appreciate the Slack conversation and the updates to the issue as noted above. One possible reason for the dispatch issue appears to be associated with partitions. For example:

works (do not specify partitions)

import pandas as pd
from deltalake.writer import write_deltalake
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
df = pd.DataFrame({'x': range(100)})

# do not specify partitions
table_root = "abfss://my-container@stmystorageaccount.dfs.core.windows.net/data"
write_deltalake(table_root, df, storage_options=storage_options)

fails (specify partitions)

import pandas as pd
from deltalake.writer import write_deltalake
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
df = pd.DataFrame({'x': range(100)})
table_root = "abfss://my-container@stmystorageaccount.dfs.core.windows.net/data"
write_deltalake(table_root, df, partition_by=["x"], storage_options=storage_options)

with my error looking like

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/deltalake/writer.py", line 254, in write_deltalake
    ds.write_dataset(
  File "/opt/homebrew/lib/python3.9/site-packages/pyarrow/dataset.py", line 988, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 2859, in pyarrow._dataset._filesystemdataset_write
deltalake.PyDeltaTableError: Generic MicrosoftAzure error: Error performing put request thingyratings/tables/test0_part/x=15/0-8aa89f84-d4ac-482f-8f55-9947fccd20f7-0.parquet: response error "request error", after 0 retries: error sending request for url (https://lakeything.blob.core.windows.net/root/thingyratings/tables/test0_part/x=15/0-8aa89f84-d4ac-482f-8f55-9947fccd20f7-0.parquet?comp=block&blockid=ICAgICAgICAgICAgICAgICAgIDA%3D): dispatch task is gone: runtime dropped the dispatch task

@dennyglee
Collaborator

dennyglee commented Nov 16, 2022

Quick update: it appears that this may be caused by too many open files via the Azure Rust SDK. For the above scenario, could you try using max_open_files as an (unsatisfactory) workaround?

write_deltalake(table_root, df, partition_by=["x"], storage_options=storage_options, max_open_files=8)

Note: when I write this locally, write_deltalake(table_local, df, partition_by=["x"]) works as expected.
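For reference, a minimal self-contained sketch of that workaround applied to the earlier repro; the account key, container, and table path are placeholders from the snippets above, and max_open_files=8 is just the value tried here, not a tuned recommendation:

import pandas as pd
from deltalake.writer import write_deltalake

# Placeholder credentials and path, copied from the repro snippets above.
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
table_root = "abfss://my-container@stmystorageaccount.dfs.core.windows.net/data"

df = pd.DataFrame({"x": range(100)})

# Workaround: cap how many files the writer keeps open at once,
# which limits concurrent requests to the Azure backend.
write_deltalake(table_root, df, partition_by=["x"], storage_options=storage_options, max_open_files=8)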

@PadenZach
Contributor

This appears to be affecting more than just ADLSGen2, and is also affecting reading. I'm having similar issues with AWS S3.

tbl = DeltaTable(table_uri).to_pyarrow_dataset().to_table()

Results in:

PyDeltaTableError: Generic S3 error: Error performing get request root_prefix/delta_table/parta=A/partb=B/part-00003-3ba999ac-4e93-4ec0-9866-2ffde6d0b0f1.c000.snappy.parquet: response error "request error", after 0 retries: error sending request for url (https://s3.us-east-1.amazonaws.com/my_bucket/root_prefix/delta_table/parta%3A/partb%3DB/part-00003-3ba999ac-4e93-4ec0-9866-2ffde6d0b0f1.c000.snappy.parquet): dispatch task is gone: runtime dropped the dispatch task

Per Denny's suggestion, I looked into reducing max_open_files, but this doesn't exist for the to_table function. Instead, we have fragment_readahead for files and batch_readahead for batches.

I tried two things:
dt1.to_table(fragment_readahead=1, batch_readahead=1)
This resulted in a new error:
PyDeltaTableError: Generic S3 error: Error performing get request <file_path>/part-00082-ce112410-8fc0-4f4e-9435-220c209d7cf3.c000.snappy.parquet: response error "request error", after 0 retries: error sending request for url (https://s3.us-east-1.amazonaws.com/<file_path>/part-00082-ce112410-8fc0-4f4e-9435-220c209d7cf3.c000.snappy.parquet): connection closed before message completed

Wondering if this may just be a speed/performance issue from reducing things so low, I tried increasing batch_readahead again while keeping fragment_readahead at 1.

dt1.to_table(fragment_readahead=1, batch_readahead=8)

This resulted in me being able to successfully read the entire table into memory.
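For completeness, a self-contained sketch of that read-path workaround, assuming a table at a placeholder S3 URI, the 0.6.x Python API, and AWS credentials coming from the environment:

from deltalake import DeltaTable

# Placeholder table URI; substitute your own.
table_uri = "s3://my_bucket/root_prefix/delta_table"

dataset = DeltaTable(table_uri).to_pyarrow_dataset()

# Workaround: limit how many fragments (files) are read ahead concurrently,
# while keeping some batch readahead for throughput.
table = dataset.to_table(fragment_readahead=1, batch_readahead=8)
df = table.to_pandas()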

This also appears to be affecting other dataset operations: I get the original dispatch error when trying to use the deltalake -> pyarrow dataset -> duckdb interoperability. However, it seems rather intermittent, perhaps supporting the idea that this is related to the number of files accessed.

Details:

  • python 3.8
  • deltalake 0.6.3
  • pyarrow 10.x (the issue exists in 9.x too, but the fragment_readahead workaround is not available there)
  • Dataset is about 5 GB of snappy-compressed parquet with fewer than 10 columns. It uses Hive partitioning and has ~100 files (perhaps I should compact it later :) )

Let me know if this may be different enough to merit opening up a new ticket.

Thanks!

@roeap
Collaborator

roeap commented Nov 17, 2022

We do have a PR open that, while named completely differently, should hopefully remedy the situation: #933. The fact that it is applicable not only to Azure is actually a good thing here, since the root cause we identified is not Azure-related... @wjones127, is there anything I can do to help drive that PR home? :)

@wjones127
Collaborator

I'll be able to look at it this weekend, but if you have time to reproduce this issue and see if it fixes it, that would be helpful.

roeap added a commit that referenced this issue Nov 17, 2022
# Description

This PR builds on top of the changes to handling the runtime in #933. In my local tests this fixed #915. Additionally, I added the runtime as a property on the fs handler to avoid re-creating it on every call. In some non-representative tests with a large number of very small partitions, it cut the runtime roughly in half.

cc @wjones127 

@roeap
Collaborator

roeap commented Nov 17, 2022

Reopening to await confirmation from those affected :)

@roeap reopened this Nov 17, 2022
@dennyglee
Collaborator

Let me know if I can help repro / test it as well, eh?!

@roeap
Collaborator

roeap commented Nov 17, 2022

If you could build current main and see if that fixes it, that would be awesome.

@dennyglee
Collaborator

dennyglee commented Nov 17, 2022 via email

@PadenZach
Contributor

On the current main build, I ran a few of the queries that were very often causing errors. No dispatch errors so far! Looking great! 💯

tbl = DeltaTable(table_uri).to_pyarrow_dataset().to_table() runs as expected. More advanced queries via duckdb and datafusion also seem to be working much better in my (very small) query collection.
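For reference, a minimal sketch of the kind of duckdb interop being exercised here; the table URI and view name are placeholders, and registering the pyarrow dataset on a duckdb connection is just one way to wire it up, not necessarily how the original queries were written:

import duckdb
from deltalake import DeltaTable

# Placeholder URI; credentials assumed to come from the environment.
dataset = DeltaTable("s3://my_bucket/root_prefix/delta_table").to_pyarrow_dataset()

con = duckdb.connect()
# Expose the pyarrow dataset to duckdb under a view name and query it with SQL.
con.register("delta_table", dataset)
print(con.execute("SELECT COUNT(*) FROM delta_table").fetchall())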

@dennyglee
Collaborator

Like @PadenZach, I also re-ran my repro steps and it works like a charm :)

@roeap
Collaborator

roeap commented Nov 18, 2022

Just realized that #882 likely has the same root cause; noting it here just to keep track :)

@craustin
Author

Latest main build looks good to me. I don't see any of these "dispatch dropped" errors anymore!
