Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cross-platform/encoding support for IO operations #109

Closed
BramVanroy opened this issue Aug 29, 2022 · 2 comments · Fixed by #110
Closed

Cross-platform/encoding support for IO operations #109

BramVanroy opened this issue Aug 29, 2022 · 2 comments · Fixed by #110
Labels
bug Something isn't working

Comments

@BramVanroy
Copy link
Contributor

I tend to run my dev code locally on my powerful Windows-based home set-up before pushing to the cluster. In this instance, I am trying to run spring which relies on penman. It makes some calls to penman's load method. I am running their preprocessing to parse AMR data and found calls to penman._load result in encoding issues.

I am aware I can just run on WSL and be done with it, but I'd rather see this useful tool be available cross-platform. Is there any way I can contribute? Which methods are all reliant on encoding? My approach would be to allow for an optional encoding argument, as is common in enc/dec methods, and pass it through the relevant IO functions like open.

For starters I can start with codec if that is okay, e.g., change the _load function to

def _load(source: FileOrFilename,
          model: Model = None,
          encoding: str = None) -> List[Graph]:  # EDITED
    """
    Deserialize a list of PENMAN-encoded graphs from *source*.

    Args:
        source: a filename or file-like object to read from
        model: the model used for interpreting the graph
    Returns:
        a list of Graph objects
    """
    codec = PENMANCodec(model=model)
    if isinstance(source, (str, Path)):
        with open(source, encoding=encoding) as fh:  # EDITED
            return list(codec.iterdecode(fh))
    else:
        assert hasattr(source, 'read')
        return list(codec.iterdecode(source))
@goodmami
Copy link
Owner

Thanks for bringing that up. I don't have a Windows machine to test on, but I recall that Windows has strange, sometimes non-standard default encodings, such as UTF-8 with a byte-order-mark. I agree that allowing an encoding option would be helpful. I think you'd only need to target where text is read from the filesystem, as everything should be native Python (unicode) strings internally, so look for calls to open(). I see the following:

with open(source) as fh:

with open(file, 'w') as fh:

with open(file) as f:

There are a couple other places open() is called, but they seem less urgent (e.g., in setup.py or the unit tests).

If you want to submit a PR with the necessary changes, that would be great.

@goodmami goodmami added the bug Something isn't working label Aug 30, 2022
@BramVanroy
Copy link
Contributor Author

Thank you for the quick response! Indeed, unfortunately Windows still relies on cp1252 encoding (not uf-8-bom), which always leads to small issues like this. I'll look for built-in "open" methods and allow for the option of custom encoding there. I'll send in a PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants