Always use UTF-8 encoding when parsing TSV file document corpora #325
This is a follow-up for PR #263, which added explicit `encoding="utf-8"` parameters to many `open` calls but missed one case: when reading document corpora from TSV files (optionally gzip compressed). This PR fixes the one remaining case while also simplifying the code slightly.

The problem was reported by Pekka K. / Yle Arkisto. It manifests mostly on Windows, where non-UTF-8 locales (e.g. cp1252) are commonly used.
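
For illustration, here is a minimal sketch of the general technique, not the actual code changed in this PR. The helper names `open_text_utf8` and `read_tsv_documents` are hypothetical, and the `.gz` suffix check is an assumption about how compressed files are detected. The point is that both the plain and the gzip code paths pass `encoding="utf-8"` explicitly, so the result no longer depends on the platform's locale.

```python
import gzip


def open_text_utf8(path):
    """Open a plain or gzip-compressed text file, always decoding as UTF-8.

    Hypothetical helper: detecting compression by the ".gz" suffix is an
    assumption made for this sketch.
    """
    if str(path).endswith(".gz"):
        # gzip.open only honours the encoding argument in text mode ("rt")
        return gzip.open(path, mode="rt", encoding="utf-8")
    return open(path, mode="r", encoding="utf-8")


def read_tsv_documents(path):
    """Yield one tuple of tab-separated fields per non-empty line."""
    with open_text_utf8(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:
                yield tuple(line.split("\t"))
```

Without the explicit argument, `open` falls back to the locale's preferred encoding, which is why the same corpus file can parse correctly on a UTF-8 Linux system and fail or produce mojibake on a cp1252 Windows system.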