fix write method of file requires byte-like object, not str #1750

horpto · 2017-12-02T12:38:30Z

sys.stdout requires str while file with the flags 'wb' requires bytes

piskvorky · 2017-12-03T07:19:29Z

gensim/scripts/segment_wiki.py

@@ -111,7 +111,7 @@ def segment_and_write_all_articles(file_path, output_file, min_article_character
    if output_file is None:
        outfile = sys.stdout
    else:
-        outfile = smart_open(output_file, 'wb')
+        outfile = smart_open(output_file, 'w')


-1: we should always be writing out bytes, in specific encoding (utf8).

What exactly is the problem/error this is trying to fix?

I tried to segment a wiki dump and write the results to a file, but got this error:

Traceback (most recent call last):
  File "P:\Python35\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "P:\Python35\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 319, in <module>
    workers=args.workers
  File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 125, in segment_and_write_all_articles
    outfile.write(json.dumps(output_data) + "\n")
  File "P:\Python35\lib\gzip.py", line 258, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

I can convert the line to bytes, but sys.stdout requires str, not bytes, and I'd like to keep this flexible approach to writing.

I'm on Python 3.5; on Python 2 everything works fine.
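The failure can be reproduced without a wiki dump. The following is a minimal sketch using only the standard library; since the traceback passes through gzip.py, a gzip stream over an in-memory buffer stands in for the real output file, and the sample record is made up for illustration.

```python
import gzip
import io
import json

# Stand-in for the gzip output file opened in binary mode ('wb').
buffer = io.BytesIO()
# On Python 3, json.dumps returns str (text), not bytes.
line = json.dumps({"title": "example"}) + "\n"

with gzip.GzipFile(fileobj=buffer, mode="wb") as outfile:
    try:
        outfile.write(line)  # text into a binary stream
        error = None
    except TypeError as exc:
        error = exc  # Python 3 rejects str here
    # Encoding explicitly gives the binary stream what it expects.
    outfile.write(line.encode("utf8"))

# Only the encoded write reached the stream.
roundtrip = gzip.decompress(buffer.getvalue()).decode("utf8")
```

On Python 2 the first write goes through because `json.dumps` returns a bytestring there, which matches the observation that only Python 3 is affected.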

> we should always be writing out bytes, in specific encoding (utf8).

Please explain why.

Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".

I didn't look in detail, but to me that json.dumps() + "\n" looks like a bug. @menshikh-iv shouldn't that be encoded into a bytestring (utf8) before writing to a binary file?
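A rough sketch of the "unicode inside, bytes on I/O" rule applied to that line. The helper name `to_utf8_line` is hypothetical (it is not part of gensim), and `io.BytesIO` stands in for a file opened with 'wb'.

```python
import io
import json


def to_utf8_line(output_data):
    """Serialize one record as text, then encode it explicitly at the I/O boundary."""
    text = json.dumps(output_data) + "\n"  # text inside the program
    return text.encode("utf8")             # bytes only when writing out


outfile = io.BytesIO()  # stand-in for smart_open(output_file, 'wb')
outfile.write(to_utf8_line({"title": u"Dvo\u0159\u00e1k"}))

# Reading back reverses the steps: decode first, then parse.
restored = json.loads(outfile.getvalue().decode("utf8"))
```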

@horpto thanks for pointing this out.

As @horpto pointed out, sys.stdout is opened in 'w' mode (not 'wb'); on Python 3 it also has encoding='UTF-8' set explicitly.

I can convert this line to bytes explicitly, but then we may run into problems with sys.stdout (or need to handle the two cases separately).

@horpto can you test your code with 'wb' mode and an explicit conversion of json.dumps() + "\n"? I can only test this on Linux, and encoding problems on Windows sometimes behave in non-obvious ways.

@menshikh-iv

> can you test your code with 'wb' mode and explicit conversion for json.dumps() + "\n"

It's OK and should work fine for python2 and python3.

> Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".

I think it's unnecessary when we are writing to a text file, since the file object already does the encoding internally. When we are reading content from a file it makes sense, because files can contain garbage.

It's not useless, it's a Python best practice.

Newlines are messed up on Windows, we always want to have full control over what we write. For this reason, explicit conversions between string and byte are preferred on all I/O boundaries.
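The newline point can be illustrated on any platform. In this sketch the helper `raw_bytes_after_text_write` is hypothetical, and passing newline="\r\n" is an assumption used to mimic what default text mode does on Windows, where "\n" is translated to the platform line separator on write.

```python
import io
import os
import tempfile


def raw_bytes_after_text_write(text, newline):
    """Write `text` in text mode, then return the bytes that actually landed on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf8", newline=newline) as fout:
            fout.write(text)
        with io.open(path, "rb") as fin:
            return fin.read()
    finally:
        os.remove(path)


# newline="\r\n" imitates the translation text mode applies by default on Windows.
windows_like = raw_bytes_after_text_write(u"a\nb\n", newline="\r\n")
# newline="" disables translation, which is also what writing bytes in 'wb' mode gives.
untranslated = raw_bytes_after_text_write(u"a\nb\n", newline="")
```

With explicit bytes and 'wb' mode, the output is the same on every platform, which is the control the comment above is asking for.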

oh, interesting use case.
ok.

menshikh-iv · 2017-12-05T01:49:26Z

Thanks @horpto, nice catch!

piskvorky · 2017-12-05T09:32:02Z

gensim/scripts/segment_wiki.py

+            if output_file is None:
+                sys.stdout.write(json.dumps(output_data) + "\n")
+            else:
+                outfile.write((json.dumps(output_data) + "\n").encode())


Definitely not: always use explicit encoding!

In this case, the output must be utf8.

Also, I'd prefer to write utf8 even to stdout (sys.stdout.buffer), because that's the script's contract -- that's what we tell users we output. It's not a special case.
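One way this suggestion could look, as a sketch rather than the code that was merged: both destinations are treated as byte streams, so there is no special case at the write call. The helper names `binary_stdout` and `write_record` are hypothetical.

```python
import io
import json
import sys


def binary_stdout():
    """Return the byte-level stdout: sys.stdout.buffer on Python 3, sys.stdout itself on Python 2."""
    return getattr(sys.stdout, "buffer", sys.stdout)


def write_record(stream, output_data):
    """Write one JSON record as a utf8-encoded line to any binary stream."""
    stream.write((json.dumps(output_data) + "\n").encode("utf8"))


# The same call works for a binary file, an in-memory buffer, or binary_stdout().
sink = io.BytesIO()
write_record(sink, {"title": "example"})
```

In the script, `outfile` would then be either `binary_stdout()` or the file opened with 'wb', and the `if output_file is None` branch around the write would no longer be needed.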

>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode(encoding='utf-8', errors='strict') -> bytes

On Python 2 too.

No, it's not:

>>> u"ř".encode()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0159' in position 0: ordinal not in range(128)

(and even if it did, we'd still want to be explicit)
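For contrast, a short sketch of the explicit form, which does not depend on the interpreter's default codec:

```python
# Naming the codec gives the same bytes on Python 2 and Python 3.
text = u"\u0159"  # the character from the example above
encoded = text.encode("utf8")
decoded = encoded.decode("utf8")
```

The character U+0159 becomes the two bytes 0xC5 0x99 in utf8, and decoding with the same codec restores the original text.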

fix write method requires byte-like object, not str

41e8152

sys.stdout requires str while file with the flags 'wb' requires bytes

horpto force-pushed the bugfix/segment-wiki branch from 0268190 to 41e8152 Compare December 2, 2017 12:41

piskvorky requested changes Dec 3, 2017

horpto added 2 commits December 4, 2017 11:45

split 2 cases

7cdb0e5

remove empty line; fix build

a954f15

menshikh-iv merged commit 48249bb into piskvorky:develop Dec 5, 2017

piskvorky reviewed Dec 5, 2017

menshikh-iv added the style checking label Dec 5, 2017

menshikh-iv mentioned this pull request Dec 5, 2017

Add forgotten encode #1763

Merged

menshikh-iv removed the style checking label Dec 5, 2017

horpto deleted the bugfix/segment-wiki branch January 19, 2019 12:05
