fix: add buffer flushing to filesystem writes (delta-io#1911)
# Description
The current implementation of `ObjectOutputStream` does not invoke flush when writing files to Azure storage, which seems to cause intermittent issues where `write_deltalake` hangs with no progress and no error.

I'm adding a periodic flush to the write process, based on the written buffer size, which can be parameterized via the `storage_options` parameter (I could not find another way without changing the interface). I don't know if this is an acceptable approach (it also requires string values).

Setting `"max_buffer_size": f"{100 * 1024}"` in the `storage_options` passed to `write_deltalake` resolves the issue for me when writing a dataset to Azure, which was otherwise failing constantly.

The default max buffer size is set to 4MB, which looks reasonable and is used by other implementations I've seen (e.g. https://github.com/fsspec/filesystem_spec/blob/3c247f56d4a4b22fc9ffec9ad4882a76ee47237d/fsspec/spec.py#L1577).

# Related Issue(s)
Can help with resolving delta-io#1770

# Documentation
If the approach is accepted, then I need to find the best way of adding this to the docs.

---------

Signed-off-by: Nikolay Ulmasov <ulmasov@hotmail.com>
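For illustration only, the size-based periodic flush described above could look roughly like the sketch below. This is not the code from this commit: `FlushingWriter`, its `flush_count` counter, and the `>=` threshold check are hypothetical, and an in-memory `io.BytesIO` stands in for the Azure output stream. Only the `max_buffer_size` key, its string value, and the 4MB default come from the description.

```python
import io

# 4MB default, as stated in the description.
DEFAULT_MAX_BUFFER_SIZE = 4 * 1024 * 1024


class FlushingWriter:
    """Hypothetical wrapper that flushes the stream once enough bytes accumulate."""

    def __init__(self, stream, storage_options=None):
        options = storage_options or {}
        # storage_options values are strings, so the size is parsed explicitly.
        self.max_buffer_size = int(
            options.get("max_buffer_size", DEFAULT_MAX_BUFFER_SIZE)
        )
        self.stream = stream
        self.buffered = 0      # bytes written since the last flush
        self.flush_count = 0   # kept only to make the behaviour observable

    def write(self, data: bytes) -> int:
        written = self.stream.write(data)
        self.buffered += written
        # Assumption: flush as soon as the threshold is reached.
        if self.buffered >= self.max_buffer_size:
            self.flush()
        return written

    def flush(self) -> None:
        self.stream.flush()
        self.buffered = 0
        self.flush_count += 1


# Usage: a 100 KiB threshold, passed as a string like in the description.
sink = io.BytesIO()
writer = FlushingWriter(sink, {"max_buffer_size": f"{100 * 1024}"})
for _ in range(5):
    writer.write(b"x" * (50 * 1024))  # five 50 KiB chunks
writer.flush()  # final flush for the remaining bytes
```

With 50 KiB chunks and a 100 KiB threshold, every second write triggers a flush, and the explicit call at the end flushes the remainder.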