-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ArrowWriter closes stream at exit #1971
Conversation
This reverts commit b28967f.
src/datasets/metric.py
Outdated
@@ -336,7 +336,8 @@ def _finalize(self): | |||
from all the nodes if main node or all_process is True. | |||
""" | |||
if self.writer is not None: | |||
self.writer.finalize() | |||
with self.writer as writer: | |||
writer.finalize() | |||
self.writer = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lhoestq , do you think this is the cause of the test error? I am trying to understand why... It happens with DistributedDatasetTest.test_load_dataset_distributed
. Maybe you know why...
Oh nice thanks for adding the context manager ! All the streams and RecordBatchWriter will be properly closed now. Hopefully this gives a better experience on windows on which it's super important to close stuff. Not sure about the error, it looks like a process crashed silently. |
Exactly! On Windows, you got:
when trying to access the unclosed The reason is: https://docs.python.org/3/library/os.html#os.remove
|
The test passes on my windows. This was probably a circleCI issue. I re-ran the circleCI tests |
NICE! It passed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really cool thanks :)
I just have one comment:
src/datasets/metric.py
Outdated
@@ -336,7 +336,8 @@ def _finalize(self): | |||
from all the nodes if main node or all_process is True. | |||
""" | |||
if self.writer is not None: | |||
self.writer.finalize() | |||
with self.writer as writer: | |||
writer.finalize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usage of the with
statement here can make the user think that the writer opens and creates a file here.
This is not the case since finalize just writes down the latest examples and close the file.
It would be prettier to use .close() instead of a with statement here, to remove this ambiguity.
The with statement should be used to define the start and end point of the usage of the writer (file opened until closed).
However in this case the writer is an attribute of the Metric object and so there's no code block with a defined start/end.
In this case I think it's better to use .close() instead. You can even do try/except to make sure that close() is called even if finalize fails.
It would makes things clearer IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhoestq, I think the main point is in your sentence:
You can even do try/except to make sure that close() is called even if finalize fails.
Exactly! This is why I decided to open this PR: an exception was raised, and the resource was not closed. If an exception is raised within .finalize(), the file will no be closed. So that the solution would be: try... except... finally: close().
And that is exactly the use case for with
: use a context manager instead of try... except... finally: close().
See: https://docs.python.org/3/reference/compound_stmts.html#the-with-statement
This allows common try…except…finally usage patterns to be encapsulated for convenient reuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case maybe you can you add in the writer's docstring that the with
can also be used only when calling finalize
?
As a user I'd expect the with
statement of a Writer
class to be necessarily used from start of writing till closing, but here you can write and then use the with
statement to close. Let me know if that makes sense.
That's definitely not a big deal but I'd rather have things said explicitly :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @lhoestq, you are right. One should use the context manager from start of writing until closing. I left some usages of writer without context manager, because the refactoring was not immediate, for a follow-up PR. Maybe I can remove the with
in this case, to finish this PR. And then open another PR for the remaining non-trivial cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good !
This reverts commit 7f287fb.
Maybe you can merge master into this branch and check the CI before merging ? |
@lhoestq done! ;) |
Thanks ! merging |
Current implementation of ArrowWriter does not properly release the
stream
resource (by closing it) if itsfinalize()
method is not called and/or an Exception is raised before/during the call to itsfinalize()
method.Therefore, ArrowWriter should be used as a context manager that properly closes its
stream
resource at exit.