-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AIRFLOW-3059] PostgresToGCS Log the number of rows read #3898
Conversation
3ede533
to
61b781c
Compare
@@ -134,6 +134,7 @@ def _write_local_data_files(self, cursor): | |||
""" | |||
schema = list(map(lambda schema_tuple: schema_tuple[0], cursor.description)) | |||
file_no = 0 | |||
row_no = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a dump question, but why we initialize the variable as 1 not 0? I am not very familiiar with this operator but if the cursor doesn't return any row, the log will still print Received 1 rows
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo I guess, this should be 0 indeed, thanks!
61b781c
to
211a6f8
Compare
@@ -155,6 +156,9 @@ def _write_local_data_files(self, cursor): | |||
file_no += 1 | |||
tmp_file_handle = NamedTemporaryFile(delete=True) | |||
tmp_file_handles[self.filename.format(file_no)] = tmp_file_handle | |||
row_no += 1 | |||
|
|||
self.log.info('Received %s rows over %s files', row_no, file_no + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be just file_no
and not file_no + 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line156 already increments the file_no
variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file_no
indicates how many files are being written, to avoid that files grow too big. The row_no
should indicate how many lines are being written. So I understand the confusion. The thing is, if no files are being rolled, file_no
will still be 0. But if no rows are being written it will say, written 0 rows over 1 file
. Let me make this more clear.
To know how many data is being read from Postgres, it is nice to log this to the Airflow log
211a6f8
to
a9061dc
Compare
@kaxil I've refactored the code a bit, let me know what you think. This changes the behaviour though, the main question is, do we want to create a file if there is nothing to write? |
I agree with this logic. Can we add tests to check that the file is not created when there are no rows. |
Make sure you have checked all steps below.
Jira
Description
Tests
Commits
Documentation
Code Quality
git diff upstream/master -u -- "*.py" | flake8 --diff