fix write method of file requires byte-like object, not str #1750
sys.stdout requires str while file with the flags 'wb' requires bytes
gensim/scripts/segment_wiki.py
@@ -111,7 +111,7 @@ def segment_and_write_all_articles(file_path, output_file, min_article_character
     if output_file is None:
         outfile = sys.stdout
     else:
-        outfile = smart_open(output_file, 'wb')
+        outfile = smart_open(output_file, 'w')
-1: we should always be writing out bytes, in specific encoding (utf8).
What exactly is the problem/error this is trying to fix?
I tried to segment a wiki and write the results to a file, but I got this error:
Traceback (most recent call last):
File "P:\Python35\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "P:\Python35\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 319, in <module>
workers=args.workers
File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 125, in segment_and_write_all_articles
outfile.write(json.dumps(output_data) + "\n")
File "P:\Python35\lib\gzip.py", line 258, in write
data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
I can convert the output to bytes, but sys.stdout requires str, not bytes, and I'd like to keep this flexible approach to writing.
I'm on Python 3.5; on Python 2 everything works.
> we should always be writing out bytes, in specific encoding (utf8).
Please explain why.
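The traceback above can be reproduced without gensim. This is only a minimal sketch for Python 3: it uses gzip.GzipFile over an in-memory buffer instead of smart_open, and `output_data` is a hypothetical stand-in for the article data the script writes.

```python
import gzip
import io
import json

# Hypothetical stand-in for the data written by segment_wiki.py.
output_data = {"title": "Example", "section_texts": ["text"]}

buffer = io.BytesIO()
error_message = None
with gzip.GzipFile(fileobj=buffer, mode="wb") as outfile:
    try:
        # json.dumps() returns str, but a binary gzip stream accepts only bytes.
        outfile.write(json.dumps(output_data) + "\n")
    except TypeError as err:
        error_message = str(err)

print(error_message)
```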
Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".
I didn't look in detail, but to me that json.dumps() + "\n" looks like a bug. @menshikh-iv shouldn't that be encoded into a bytestring (utf8) before writing to a binary file?
@horpto thanks for pointing this out.
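A minimal sketch of the "unicode inside, bytes on I/O" idea, assuming a hypothetical helper `write_json_line` (not part of gensim) and an in-memory io.BytesIO standing in for the binary output file:

```python
import io
import json


def write_json_line(outfile, output_data):
    """Build the line as text, then encode it explicitly when writing."""
    line = json.dumps(output_data) + "\n"   # unicode inside
    outfile.write(line.encode("utf-8"))     # bytes on I/O


# io.BytesIO behaves like a file opened in 'wb' mode.
sink = io.BytesIO()
write_json_line(sink, {"title": "Anarchism"})
print(sink.getvalue())
```

The same helper would work with any object that accepts bytes, such as a file opened in 'wb' mode.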
As @horpto suggested, sys.stdout is opened in 'w' mode, not 'wb' (on Python 3, with encoding='UTF-8' set explicitly).
I can convert this line to bytes explicitly, but then we may have problems with sys.stdout (or we'd need to handle the two cases separately).
@horpto, can you test your code with 'wb' mode and an explicit conversion of json.dumps() + "\n"?
(I can only test this on Linux, and encoding problems on Windows sometimes behave in non-obvious ways.)
> can you test your code with 'wb' mode and explicit conversion for json.dumps() + "\n"
It's OK and should work fine on both Python 2 and Python 3.
> Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".
I think it's unnecessary when we are writing to a text file, since the file object already does the encoding internally. When we are reading content from a file it makes sense, because files can contain garbage.
It's not useless, it's a Python best practice.
Newlines are messed up on Windows, we always want to have full control over what we write. For this reason, explicit conversions between string and byte are preferred on all I/O boundaries.
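To illustrate the newline point: in text mode Python translates "\n" on write, while binary mode writes exactly the bytes it is given. This is a sketch, not the script's code. The helper `written_bytes` is hypothetical, and newline="\r\n" is passed explicitly so the Windows-style translation shows up on any platform.

```python
import os
import tempfile


def written_bytes(mode, payload, **open_kwargs):
    """Write payload to a temporary file and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, mode, **open_kwargs) as f:
            f.write(payload)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# Text mode: "\n" is translated to the configured line ending.
text_result = written_bytes("w", "line\n", encoding="utf-8", newline="\r\n")
# Binary mode: the encoded bytes are written unchanged.
binary_result = written_bytes("wb", "line\n".encode("utf-8"))

print(text_result, binary_result)
```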
oh, interesting use case.
ok.
Thanks @horpto, nice catch!
if output_file is None:
    sys.stdout.write(json.dumps(output_data) + "\n")
else:
    outfile.write((json.dumps(output_data) + "\n").encode())
Definitely not: always use explicit encoding!
In this case, the output must be utf8.
Also, I'd prefer to write utf8 even to stdout (sys.stdout.buffer), because that's the script's contract -- that's what we tell users we output. It's not a special case.
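One way to treat stdout like any other binary sink, sketched under assumptions: `binary_stdout` and `write_utf8` are hypothetical helper names, and the getattr fallback assumes a Python 2 style sys.stdout that accepts bytes directly.

```python
import io
import sys


def binary_stdout():
    """Return a byte-oriented stdout stream.

    Python 3 exposes the underlying byte stream as sys.stdout.buffer;
    on Python 2, sys.stdout itself accepts byte strings.
    """
    return getattr(sys.stdout, "buffer", sys.stdout)


def write_utf8(stream, text):
    """Encode text as UTF-8 and write it to a binary stream."""
    stream.write(text.encode("utf-8"))


# Demonstrated with an in-memory sink; a real run would pass binary_stdout().
sink = io.BytesIO()
write_utf8(sink, u"\u0159\n")
print(sink.getvalue())
```

With this shape, the file and stdout branches can share a single write path, so stdout stops being a special case.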
>>> help(str.encode)
Help on method_descriptor:
encode(...)
S.encode(encoding='utf-8', errors='strict') -> bytes
The same on Python 2.
No, it's not:
u"ř".encode()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0159' in position 0: ordinal not in range(128)
(and even if it did, we'd still want to be explicit)
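For comparison, a small sketch of the explicit form, which does not depend on the interpreter's default codec. It uses the same character as the example above, written as an escape so the snippet is valid on both Python 2 and Python 3.

```python
# The character from the example above (LATIN SMALL LETTER R WITH CARON).
text = u"\u0159"

# Naming the codec removes any dependence on the interpreter's default.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

print(repr(encoded))
```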