Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ArrowWriter closes stream at exit #1971

Merged
merged 17 commits into from
Mar 10, 2021

Conversation

albertvillanova
Copy link
Member

Current implementation of ArrowWriter does not properly release the stream resource (by closing it) if its finalize() method is not called and/or an Exception is raised before/during the call to its finalize() method.

Therefore, ArrowWriter should be used as a context manager that properly closes its stream resource at exit.

@@ -336,7 +336,8 @@ def _finalize(self):
from all the nodes if main node or all_process is True.
"""
if self.writer is not None:
self.writer.finalize()
with self.writer as writer:
writer.finalize()
self.writer = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lhoestq , do you think this is the cause of the test error? I am trying to understand why... It happens with DistributedDatasetTest.test_load_dataset_distributed. Maybe you know why...

@lhoestq
Copy link
Member

lhoestq commented Mar 2, 2021

Oh nice thanks for adding the context manager ! All the streams and RecordBatchWriter will be properly closed now. Hopefully this gives a better experience on windows on which it's super important to close stuff.

Not sure about the error, it looks like a process crashed silently.
Let me take a look

@albertvillanova
Copy link
Member Author

albertvillanova commented Mar 2, 2021

Hopefully this gives a better experience on windows on which it's super important to close stuff.

Exactly! On Windows, you got:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process

when trying to access the unclosed stream file, e.g. by with incomplete_dir(self._cache_dir) as tmp_data_dir: shutil.rmtree(tmp_dir)

The reason is: https://docs.python.org/3/library/os.html#os.remove

On Windows, attempting to remove a file that is in use causes an exception to be raised; on Unix, the directory entry is removed but the storage allocated to the file is not made available until the original file is no longer in use.

@lhoestq
Copy link
Member

lhoestq commented Mar 2, 2021

The test passes on my windows. This was probably a circleCI issue. I re-ran the circleCI tests

@albertvillanova
Copy link
Member Author

NICE! It passed!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool thanks :)

I just have one comment:

@@ -336,7 +336,8 @@ def _finalize(self):
from all the nodes if main node or all_process is True.
"""
if self.writer is not None:
self.writer.finalize()
with self.writer as writer:
writer.finalize()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage of the with statement here can make the user think that the writer opens and creates a file here.
This is not the case since finalize just writes down the latest examples and close the file.

It would be prettier to use .close() instead of a with statement here, to remove this ambiguity.

The with statement should be used to define the start and end point of the usage of the writer (file opened until closed).
However in this case the writer is an attribute of the Metric object and so there's no code block with a defined start/end.
In this case I think it's better to use .close() instead. You can even do try/except to make sure that close() is called even if finalize fails.

It would makes things clearer IMO.

Copy link
Member Author

@albertvillanova albertvillanova Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhoestq, I think the main point is in your sentence:

You can even do try/except to make sure that close() is called even if finalize fails.

Exactly! This is why I decided to open this PR: an exception was raised, and the resource was not closed. If an exception is raised within .finalize(), the file will no be closed. So that the solution would be: try... except... finally: close().

And that is exactly the use case for with: use a context manager instead of try... except... finally: close().
See: https://docs.python.org/3/reference/compound_stmts.html#the-with-statement

This allows common try…except…finally usage patterns to be encapsulated for convenient reuse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case maybe you can you add in the writer's docstring that the with can also be used only when calling finalize ?
As a user I'd expect the with statement of a Writer class to be necessarily used from start of writing till closing, but here you can write and then use the with statement to close. Let me know if that makes sense.

That's definitely not a big deal but I'd rather have things said explicitly :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @lhoestq, you are right. One should use the context manager from start of writing until closing. I left some usages of writer without context manager, because the refactoring was not immediate, for a follow-up PR. Maybe I can remove the with in this case, to finish this PR. And then open another PR for the remaining non-trivial cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good !

@lhoestq
Copy link
Member

lhoestq commented Mar 10, 2021

Maybe you can merge master into this branch and check the CI before merging ?

@albertvillanova
Copy link
Member Author

@lhoestq done! ;)

@lhoestq
Copy link
Member

lhoestq commented Mar 10, 2021

Thanks ! merging

@lhoestq lhoestq merged commit 13a5b7d into huggingface:master Mar 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants