[BEAM-12071] Don't re-use WriteToPandasSink instances across windows #14374
@@ -521,7 +521,7 @@ def expand(self, pcoll):
     return pcoll | fileio.WriteToFiles(
         path=dir,
         file_naming=fileio.default_file_naming(name),
-        sink=_WriteToPandasFileSink(
+        sink=lambda _: _WriteToPandasFileSink(
Based on the ease of making this mistake, is the WriteToFiles API a hazard to other contributors and users?
Yeah I think so. I'm not sure how to address it though. Should we detect and raise when this mode is used with non global windows? I'll file a jira for this.
Run PythonDocker PreCommit
LGTM. Just adding nitpicks where chances are I don't have the context for your decision-making.
for path in glob.glob(pattern):
  with open(path) as fin:
    # TODO(Py3): yield from
    for line in fin:
      yield line.rstrip('\n')
  if delete:
Thanks, good to know pytest has this. We've also used https://docs.python.org/3/library/tempfile.html elsewhere.
In this case I wanted to specifically delete files as we read them so I could confirm there are no other files (L344).
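The read-then-delete idea described above can be sketched roughly as follows. This is a minimal illustration, not the test code from this PR: the helper name `read_and_delete`, the file names, and the use of `tempfile.TemporaryDirectory` are all assumptions made for the example.

```python
import glob
import os
import tempfile


def read_and_delete(pattern):
    """Hypothetical helper: yield every line from the files matching
    `pattern`, removing each file once it has been read."""
    for path in glob.glob(pattern):
        with open(path) as fin:
            lines = [line.rstrip('\n') for line in fin]
        os.remove(path)
        yield from lines


with tempfile.TemporaryDirectory() as tmp:
    # Two made-up output shards standing in for a pipeline's results.
    for name, text in [("out-0.txt", "a\nb\n"), ("out-1.txt", "c\n")]:
        with open(os.path.join(tmp, name), "w") as fout:
            fout.write(text)

    lines = sorted(read_and_delete(os.path.join(tmp, "out-*")))
    # Anything still listed here is a file the test did not expect.
    leftover = os.listdir(tmp)
```

Because every matched file is removed as it is consumed, an empty directory afterwards is evidence that the pipeline wrote no unexpected files.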
…pache#14374) * Add (failing) windowed write test * Dont re-use pandas sink instances across windows
When given a concrete FileSink, WriteToFiles will re-use the same sink across windows:
beam/sdks/python/apache_beam/io/fileio.py, line 461 in e92d184
beam/sdks/python/apache_beam/io/fileio.py, line 625 in e92d184
This leads to all data (aside from one partition) being written to the file for a single window. The fix is to pass in a function that creates a new sink instead.
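The hazard can be reproduced in a toy model that does not use Beam at all. The sketch below is an assumption-laden simplification: `ToySink` and `write_windows` are hypothetical names, the "files" are in-memory buffers, and the factory here receives the window as its argument purely for illustration. It is not how `fileio.WriteToFiles` is implemented; it only shows why a stateful sink that remembers its open file handle misroutes records when one instance serves several windows.

```python
import io


class ToySink:
    """Hypothetical stand-in for a file sink: it remembers the handle
    passed to open() and writes every later record to that handle."""

    def open(self, fh):
        self._fh = fh

    def write(self, record):
        self._fh.write(f"{record}\n")


def write_windows(records, sink_fn):
    """Toy writer: asks sink_fn for a sink the first time a window is
    seen and opens it on that window's own in-memory 'file'."""
    files, sinks = {}, {}
    for window, record in records:
        if window not in sinks:
            files[window] = io.StringIO()
            sinks[window] = sink_fn(window)
            sinks[window].open(files[window])
        sinks[window].write(record)
    return {w: f.getvalue().split() for w, f in files.items()}


records = [("w1", "a"), ("w2", "b"), ("w1", "c"), ("w2", "d")]

# One shared instance: opening it for w2 overwrites the handle for w1.
shared = ToySink()
reused = write_windows(records, lambda _: shared)

# A factory that builds a new sink per call keeps the handles separate.
fresh = write_windows(records, lambda _: ToySink())

print(reused)  # {'w1': ['a'], 'w2': ['b', 'c', 'd']}
print(fresh)   # {'w1': ['a', 'c'], 'w2': ['b', 'd']}
```

In the shared case, only the records written before the second `open()` land in the right place; everything after goes to the most recently opened file, which mirrors the symptom described above.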