Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ArrowWriter closes stream at exit #1971

Merged
merged 17 commits into from
Mar 10, 2021
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion src/datasets/metric.py
Original file line number Diff line number Diff line change
Expand Up @@ -336,7 +336,8 @@ def _finalize(self):
from all the nodes if main node or all_process is True.
"""
if self.writer is not None:
self.writer.finalize()
with self.writer as writer:
writer.finalize()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage of the with statement here can make the user think that the writer opens and creates a file here.
This is not the case since finalize just writes down the latest examples and close the file.

It would be prettier to use .close() instead of a with statement here, to remove this ambiguity.

The with statement should be used to define the start and end point of the usage of the writer (file opened until closed).
However in this case the writer is an attribute of the Metric object and so there's no code block with a defined start/end.
In this case I think it's better to use .close() instead. You can even do try/except to make sure that close() is called even if finalize fails.

It would makes things clearer IMO.

Copy link
Member Author

@albertvillanova albertvillanova Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhoestq, I think the main point is in your sentence:

You can even do try/except to make sure that close() is called even if finalize fails.

Exactly! This is why I decided to open this PR: an exception was raised, and the resource was not closed. If an exception is raised within .finalize(), the file will no be closed. So that the solution would be: try... except... finally: close().

And that is exactly the use case for with: use a context manager instead of try... except... finally: close().
See: https://docs.python.org/3/reference/compound_stmts.html#the-with-statement

This allows common try…except…finally usage patterns to be encapsulated for convenient reuse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case maybe you can you add in the writer's docstring that the with can also be used only when calling finalize ?
As a user I'd expect the with statement of a Writer class to be necessarily used from start of writing till closing, but here you can write and then use the with statement to close. Let me know if that makes sense.

That's definitely not a big deal but I'd rather have things said explicitly :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @lhoestq, you are right. One should use the context manager from start of writing until closing. I left some usages of writer without context manager, because the refactoring was not immediate, for a follow-up PR. Maybe I can remove the with in this case, to finish this PR. And then open another PR for the remaining non-trivial cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good !

self.writer = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lhoestq , do you think this is the cause of the test error? I am trying to understand why... It happens with DistributedDatasetTest.test_load_dataset_distributed. Maybe you know why...

if self.filelock is not None:
self.filelock.release()
Expand Down