fix write method of file requires byte-like object, not str #1750

Merged: 3 commits, Dec 5, 2017

Conversation

horpto
Contributor

@horpto horpto commented Dec 2, 2017

sys.stdout requires str, while a file opened with the 'wb' flag requires bytes.

@horpto horpto force-pushed the bugfix/segment-wiki branch from 0268190 to 41e8152 Compare December 2, 2017 12:41
@@ -111,7 +111,7 @@ def segment_and_write_all_articles(file_path, output_file, min_article_character
     if output_file is None:
         outfile = sys.stdout
     else:
-        outfile = smart_open(output_file, 'wb')
+        outfile = smart_open(output_file, 'w')
Owner

-1: we should always be writing out bytes, in specific encoding (utf8).

What exactly is the problem/error this is trying to fix?

Contributor Author

I tried to segment a wiki and write the results to a file, but got this error:

Traceback (most recent call last):
  File "P:\Python35\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "P:\Python35\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 319, in <module>
    workers=args.workers
  File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 125, in segment_and_write_all_articles
    outfile.write(json.dumps(output_data) + "\n")
  File "P:\Python35\lib\gzip.py", line 258, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

I could convert the output to bytes, but sys.stdout requires str, not bytes, and I'd like to keep this flexible approach to writing.

I'm on Python 3.5; on Python 2 everything works.

> we should always be writing out bytes, in specific encoding (utf8).

Could you explain why?
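
For readers following along, here is a minimal sketch that reproduces the TypeError from the traceback above on Python 3. It uses the standard-library gzip module directly rather than smart_open, and the temporary path and sample record are assumptions made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical output path and record, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "out.json.gz")
line = json.dumps({"title": "example"}) + "\n"  # json.dumps returns str on Python 3

# Writing str to a file opened in binary mode fails on Python 3.
error = None
with gzip.open(path, "wb") as outfile:
    try:
        outfile.write(line)
    except TypeError as exc:
        error = exc

# Encoding explicitly before writing works on both Python 2 and Python 3.
with gzip.open(path, "wb") as outfile:
    outfile.write(line.encode("utf-8"))

with gzip.open(path, "rb") as infile:
    roundtrip = infile.read().decode("utf-8")
```

The first write raises because a binary file object accepts only bytes-like data; the second succeeds because the str is turned into bytes before it reaches the I/O layer.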

Owner

@piskvorky piskvorky Dec 3, 2017

Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".

I didn't look in detail, but to me that json.dumps() + "\n" looks like a bug. @menshikh-iv shouldn't that be encoded into a bytestring (utf8) before writing to a binary file?

@horpto thanks for pointing this out.

Contributor

As @horpto noted, sys.stdout is opened in 'w' mode, not 'wb' (on Python 3, encoding='UTF-8' is added explicitly).

I can convert this line to bytes explicitly, but then we may have problems with sys.stdout (or need to handle the two cases separately).

@horpto, can you test your code with 'wb' mode and an explicit conversion of json.dumps() + "\n"? (I can only test this on Linux, and encoding problems on Windows sometimes behave in non-obvious ways.)

Contributor Author

@menshikh-iv

> can you test your code with 'wb' mode and explicit conversion for json.dumps() + "\n"

It works, and it should be fine on both Python 2 and Python 3.

> Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".

I think it's unnecessary when we are writing to a text file, since the file object already does the encoding internally. When we are reading content from a file it makes sense, because files can contain garbage.

Owner

@piskvorky piskvorky Dec 4, 2017

It's not useless, it's a Python best practice.

Newlines get messed up on Windows, and we always want to have full control over what we write. For this reason, explicit conversions between str and bytes are preferred at all I/O boundaries.
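
A small sketch of the newline concern, assuming Python 3. Here io.TextIOWrapper with newline="\r\n" is used only to emulate, on any platform, the translation that text mode typically applies on Windows; the in-memory buffers stand in for real files.

```python
import io

# Text layer: the wrapper rewrites "\n" on the way out, so the bytes on disk
# differ from the string that was written.
raw_text = io.BytesIO()
text_layer = io.TextIOWrapper(raw_text, encoding="utf-8", newline="\r\n")
text_layer.write("line\n")
text_layer.flush()
text_mode_bytes = raw_text.getvalue()

# Binary layer with explicit encoding: the bytes written are exactly the
# bytes produced by .encode(), with no newline translation.
raw_binary = io.BytesIO()
raw_binary.write("line\n".encode("utf-8"))
binary_mode_bytes = raw_binary.getvalue()
```

With the explicit conversion, the output is byte-for-byte the same regardless of the platform's text-mode defaults.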

Contributor Author

oh, interesting use case.
ok.

@menshikh-iv
Contributor

Thanks @horpto, nice catch!

@menshikh-iv menshikh-iv merged commit 48249bb into piskvorky:develop Dec 5, 2017
    if output_file is None:
        sys.stdout.write(json.dumps(output_data) + "\n")
    else:
        outfile.write((json.dumps(output_data) + "\n").encode())
Owner

Definitely not: always use explicit encoding!

In this case, the output must be utf8.

Owner

@piskvorky piskvorky Dec 5, 2017

Also, I'd prefer to write utf8 even to stdout (sys.stdout.buffer), because that's the script's contract -- that's what we tell users we output. It's not a special case.
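
One way this could look, as a sketch only and not the code that was merged: both destinations are treated as binary streams, and every line is encoded as utf8 before it is written. The helper names get_binary_output and write_json_line are hypothetical, and sys.stdout.buffer exists only on Python 3, hence the fallback.

```python
import io
import json
import sys


def get_binary_output(output_file=None):
    """Return a binary stream: the given file, or stdout's byte layer."""
    if output_file is None:
        # Python 3 exposes the underlying byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout already accepts byte strings.
        return getattr(sys.stdout, "buffer", sys.stdout)
    return open(output_file, "wb")


def write_json_line(binary_out, data):
    """Serialize data as one JSON line and write it as utf8 bytes."""
    payload = (json.dumps(data) + "\n").encode("utf-8")
    binary_out.write(payload)
    return payload


# Usage with an in-memory stream standing in for a file or stdout.
sink = io.BytesIO()
written = write_json_line(sink, {"title": "Anarchism"})
```

Because the write path no longer branches on the destination, stdout and files receive the same bytes.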

Contributor Author

>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode(encoding='utf-8', errors='strict') -> bytes

It's the same on Python 2.

Owner

@piskvorky piskvorky Dec 5, 2017

No, it's not:

u"ř".encode()

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0159' in position 0: ordinal not in range(128)

(and even if it did, we'd still want to be explicit)
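
A short sketch of the difference, written for Python 3. Passing "ascii" explicitly stands in for the Python 2 default that produced the error above; that substitution is an assumption made so the example runs on a current interpreter.

```python
text = u"\u0159"  # the character "ř" from the example above

# Explicit utf8: the result does not depend on any interpreter default.
utf8_bytes = text.encode("utf-8")

# The ascii codec cannot represent this character and raises.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
```

Naming the codec at the call site keeps the output identical across Python versions.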

@menshikh-iv menshikh-iv mentioned this pull request Dec 5, 2017
@horpto horpto deleted the bugfix/segment-wiki branch January 19, 2019 12:05