fs.mkdir() is creating empty files on Azure Blob #137
Here is the output of printing the fs.mkdir() method. If I include the final slash in the path name, then it creates the folder without any empty files. If I remove that final slash, then a folder and an empty file are both created.
Container: dev
So when I create partitions with pyarrow, are we creating the folders before adding the filename somehow? Perhaps because we have incrementing filenames like part-0, part-1, and so on?
"Technically speaking, you can't have an empty folder in Azure Blob Storage, as blob storage has a 2-level hierarchy: blob container and blob. You essentially create the illusion of a folder by prefixing the file name with the folder name you like. For example, assuming you want a style sheet (say site.css) in a css folder, you name the stylesheet file css/site.css."
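As an illustration of the quote above, here is a minimal sketch using the adlfs AzureBlobFileSystem API; the account name, key, and paths are placeholders:

```python
# Minimal sketch: a "folder" in Azure Blob is just a shared name prefix.
# The account name, key, and paths below are placeholders.
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", account_key="...")

# Writing a blob named "dev/css/site.css" makes a "css" folder appear
# under the "dev" container, even though no directory object exists.
with fs.open("dev/css/site.css", "wb") as f:
    f.write(b"body { color: black; }")

print(fs.ls("dev"))  # the "css" prefix is listed as if it were a folder
```
|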
The quote listed here is correct. Azure Storage does not have true directories: Azure treats the entire path as the blob name, not as a file buried within a directory, and it surfaces the "folders" as "BlobPrefixes" purely for convenience. I'm curious, though. If we create a partitioned parquet file with Dask, with pyarrow as the engine, we don't see the behavior you describe. Meaning, it returns a sequentially numbered set of parquet files within the my_parquet.parquet "folder".
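For reference, a hedged sketch of the Dask write path being described; the paths, storage options, and data are illustrative, not the actual code from this thread:

```python
# Sketch of writing a partitioned parquet file with Dask + the pyarrow engine.
# Paths, credentials, and data are illustrative.
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"value": range(4)}), npartitions=2)
ddf.to_parquet(
    "abfs://dev/my_parquet.parquet",
    engine="pyarrow",
    storage_options={"account_name": "myaccount", "account_key": "..."},
)
# Listing the container afterwards shows part.0.parquet, part.1.parquet, ...
# under the my_parquet.parquet "folder", with no extra empty blobs.
```
|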
I believe that Dask is still using the "old" pyarrow method of writing parquet data. I am only seeing these empty files generated when using the pyarrow.dataset write_dataset() method, which uses adlfs. Adding @jorisvandenbossche. The new method writes empty files for each partition, while the older method does not; the same table was used for both examples (see the sketch below).
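A hedged reconstruction of the two calls being compared; the original snippets were not preserved, so the paths and partition column are placeholders, `table` is assumed to be a pyarrow.Table, and `fs` an AzureBlobFileSystem as above:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# New API: goes through fs.mkdir on adlfs and was leaving an
# empty blob behind for each partition directory.
ds.write_dataset(
    table,
    base_dir="dev/dataset-new",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
    filesystem=fs,
)

# Older API: writes the same partitioned layout without stray empty blobs.
pq.write_to_dataset(
    table,
    root_path="dev/dataset-old",
    partition_cols=["date"],
    filesystem=fs,
)
```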
|
Is this something that can be fixed eventually? Right now I am listing the empty blobs and deleting them if they do not match certain file patterns or extensions. |
I’ll have to dig into the behavior of pyarrow ds.write_dataset.
Azure Blob doesn’t support making empty directories like you’d see in a file system. As you indicated, the workaround is to add a trailing slash, but that doesn’t appear to affect the write_dataset operation.
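A minimal illustration of that workaround, with hypothetical paths and the behavior reported at the top of this issue:

```python
# Hypothetical paths; behavior as described earlier in this thread.
fs.mkdir("dev/reports/")   # trailing slash: only the pseudo-folder is created
fs.mkdir("dev/reports2")   # no trailing slash: an empty "reports2" blob also appears
```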
|
@ldacey -- It looks to me like the issue only materializes if the folder being written is a nested folder (i.e. not a container). Is that consistent with your observation?
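In code, the distinction would look something like this (hypothetical paths):

```python
fs.mkdir("dev")              # container level: no stray empty blob
fs.mkdir("dev/nested/child") # nested folder: a stray empty blob appears
```
|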
@ldacey -- I've been working on this, and it works out to be a fairly significant change. I was thinking about a comment above -- that Dask uses the "old" way of writing parquet files. The original approach was intended to align with Dask, such that writing a partitioned dataset would yield a collection of partitioned parquet files below the target path (see the sketch below). Getting to proper folders may end up looking different from that. I'm trying to be sure I'm clear on the differences in expected behaviors.
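A sketch of the Dask-aligned layout being described; the blob names are hypothetical:

```python
# After a partitioned write, the Dask-aligned expectation is roughly:
print(fs.find("dev/my_parquet.parquet"))
# ['dev/my_parquet.parquet/part.0.parquet',
#  'dev/my_parquet.parquet/part.1.parquet']
# A "proper folder" model would also surface the directory prefixes
# themselves, without leaving empty marker blobs next to the data files.
```
|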
Okay, testing with this environment: that seems to have worked, and there are no longer any empty files.
Regarding #186, I was able to get this to work on the dataset I wrote above. But I still ran into the issue of to_table() not completing on a real dataset: a warning was displayed a few times and then it got stuck (this is one of my bigger datasets, but I was only filtering for a few days of data; there are no issues when I run the same code on 0.5.9). Not sure if it is due to the number of fragments/partitions?
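For context, a hedged sketch of the kind of read that stalled; the path, partitioning, and filter are illustrative, and `fs` is an AzureBlobFileSystem as above:

```python
import pyarrow.dataset as ds

# Illustrative path and filter against a hive-partitioned dataset.
dataset = ds.dataset("dev/test-dataset", filesystem=fs, partitioning="hive")
table = dataset.to_table(filter=ds.field("date") >= "2021-02-20")
```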
|
@ldacey #193 includes the fix for the hive partitioning, as well as migrating several async clients to async context managers, and adds error handling for the finalizer, which I believe will resolve your issue above. I've only seen the warning once, so your feedback would be appreciated. The difference between 0.5.x and 0.6.x is that 0.6.x implements async operations for AzureBlobFile objects.
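Not the actual diff in #193, but a minimal sketch of the async-context-manager pattern being described, assuming the azure.storage.blob.aio client API:

```python
from azure.storage.blob.aio import ContainerClient

async def put_blob(container_client: ContainerClient, path: str, data: bytes) -> None:
    # Entering the blob client as an async context manager guarantees its
    # transport is closed when the block exits, instead of relying on a
    # garbage-collection-time finalizer to clean it up.
    async with container_client.get_blob_client(path) as blob_client:
        await blob_client.upload_blob(data, overwrite=True)
```
|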
I installed the master branch and tested it. I had no issues with empty files when writing the example data from above, or when reading a moderately sized dataset into a table. When I tried creating a table from a larger dataset (20,000 file fragments, but filtered for a single partition), I still ran into a warning and it just got stuck. Do you think this is a separate issue, potentially caused by a huge dataset?
|
Thanks for the help @ldacey. It looks like it's related to the finalizer and maybe_sync. It appears maybe_sync is running the finalizer close operation as if it's synchronous. Would you mind trying it again, but when instantiating the AzureBlobFileSystem, setting the parameter asynchronous=True?
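In other words, something like this, with a placeholder account name:

```python
from adlfs import AzureBlobFileSystem

# Placeholder account; the flag under test is `asynchronous`.
fs = AzureBlobFileSystem(account_name="myaccount", asynchronous=True)
```
|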
Sure. Hm, I ran into an immediate error when I turned async on: with asynchronous=False it runs, but with asynchronous=True it fails immediately.
Here are the libraries I am using:
|
What happened:
fs.mkdir() is creating empty files on Azure Blob.
A more detailed example of how this impacts writing pyarrow datasets can be found within the pyarrow issue below. Basically, empty files are being generated for each final partition field.
https://issues.apache.org/jira/projects/ARROW/issues/ARROW-10694
What you expected to happen:
fs.mkdir() should only create the folder and not generate empty files.
Minimal Complete Verifiable Example:
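The original snippet was not preserved here; the following is a hedged reconstruction of what a minimal reproducer might look like, with placeholder credentials, container, and folder name:

```python
# Hedged reconstruction; placeholder account, key, container, and folder name.
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", account_key="...")

fs.mkdir("dev/new-folder")  # expected: just the pseudo-folder
print(fs.ls("dev"))         # observed: an extra 0-byte blob named "new-folder"
```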
Anything else we need to know?:
Environment:
I am using the adlfs master branch right now (pip install git+https://github.com/dask/adlfs.git)