Support file-like inputs for RecordReader #59
MaxGroot commented Feb 28, 2023
- Peek into the file to find the right adapter
- Add tests for avro
- Add avro to testenv
- Add avro to extras in setup.py
Codecov Report
@@            Coverage Diff             @@
##             main      #59      +/-   ##
==========================================
+ Coverage   79.16%   79.34%   +0.17%
==========================================
  Files          32       32
  Lines        2894     2924      +30
==========================================
+ Hits         2291     2320      +29
- Misses        603      604       +1
I added some feedback
Previously, rdump would assume a RecordStream for stdin input. Also implemented some (but not all) of the code review suggestions. The peeking logic has to be revisited before this is ready for review again.
I adopted the suggested changes. One thing I wasn't sure about is whether this 'peeking' logic should also apply whenever a path is opened. For example, both
Now work, as the recordadapter will peek into the stream, and in the case of a compressed stream, transparently decompress it. However, when you do a
It won't work. While the file pointer is correctly wrapped in a gzip decompressor, the adapter is determined based on the file extension, not on the contents of the file. Therefore, the
It does work, as you manually specify which adapter should be used. I am not sure which API design is desirable here. While I quite like the approach of peeking into files, I understand that it does not scale to all file formats. Having said that, it does sometimes feel inconsistent when the
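To make the trade-off above concrete, here is a minimal sketch of choosing an adapter from the contents of a stream rather than from a file extension. It is not the code from this pull request: `as_peekable`, `sniff`, and the adapter labels are hypothetical names, and only the gzip and Avro container magic bytes are checked.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member
AVRO_MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file


def as_peekable(fp):
    """Wrap fp so that peek() is available without consuming any bytes."""
    return fp if isinstance(fp, io.BufferedReader) else io.BufferedReader(fp)


def sniff(fp):
    """Return (adapter_label, stream), decided by contents instead of file name.

    Hypothetical helper: the labels are illustrative, not flow.record names.
    """
    fp = as_peekable(fp)
    if fp.peek(2)[:2] == GZIP_MAGIC:
        # Decompress transparently, then peek again at the decompressed data.
        fp = as_peekable(gzip.GzipFile(fileobj=fp, mode="rb"))
    if fp.peek(4)[:4] == AVRO_MAGIC:
        return "avro", fp
    return "default", fp
```

Because the check runs on the (possibly decompressed) bytes, a gzip-compressed Avro stream and an uncompressed one would be treated the same way regardless of how the file is named. The cost is one magic check per supported format, which is the scalability concern raised above.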
Some small suggestions and feedback
Small unit test issue, then it looks good to go
Co-authored-by: Yun Zheng Hu <hu@fox-it.com>
flow/record/base.py
Outdated
path = str(path)
if isinstance(path, str):
    return open_path(path, mode, clobber)
elif isinstance(path, Peekable):
I haven't followed all code paths, but since we exit early here for a Peekable and never call open_stream, I figured that if you somehow were to have a Peekable of a (still) compressed stream, we wouldn't do compression stream detection on it. This is probably not a realistic code path though.
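The concern can be illustrated with a small sketch in which every input, already wrapped or not, goes through the same compression check. The `Peekable` class and `open_stream` function below are illustrative stand-ins written for this example; they are assumptions, not the implementations in flow/record/base.py.

```python
import gzip
import io


class Peekable:
    """Illustrative stand-in, not flow.record's class: read-only wrapper with peek()."""

    def __init__(self, fp):
        self._fp = fp
        self._head = b""  # bytes read ahead by peek() but not yet consumed

    def peek(self, size):
        # Read ahead only what is still missing, and remember it for read().
        if len(self._head) < size:
            self._head += self._fp.read(size - len(self._head))
        return self._head[:size]

    def read(self, size=-1):
        if size is None or size < 0:
            data, self._head = self._head + self._fp.read(), b""
            return data
        data, self._head = self._head[:size], self._head[size:]
        if len(data) < size:
            data += self._fp.read(size - len(data))
        return data


def open_stream(fp):
    """Hypothetical dispatcher: a Peekable input still gets the gzip check."""
    if not isinstance(fp, Peekable):
        fp = Peekable(fp)
    if fp.peek(2) == b"\x1f\x8b":  # gzip magic
        return Peekable(gzip.GzipFile(fileobj=fp, mode="rb"))
    return fp
```

Routing the already-wrapped case through the same function avoids a second code path that skips detection, at the price of one extra two-byte peek.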
Also implement latest code review suggestions
Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>
LGTM
Peek into the file to find the right adapter by checking the file magic
---------
Co-authored-by: Max Groot <max.groot@fox-it.com>
Co-authored-by: Yun Zheng Hu <hu@fox-it.com>
Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>