Cross-platform/encoding support for IO operations #109

BramVanroy · 2022-08-29T10:00:10Z

I tend to run my dev code locally on my powerful Windows-based home set-up before pushing to the cluster. In this instance, I am trying to run spring which relies on penman. It makes some calls to penman's load method. I am running their preprocessing to parse AMR data and found calls to penman._load result in encoding issues.

I am aware I can just run on WSL and be done with it, but I'd rather see this useful tool be available cross-platform. Is there any way I can contribute? Which methods are all reliant on encoding? My approach would be to allow for an optional encoding argument, as is common in enc/dec methods, and pass it through the relevant IO functions like open.

For starters I can start with codec if that is okay, e.g., change the _load function to

def _load(source: FileOrFilename,
          model: Model = None,
          encoding: str = None) -> List[Graph]:  # EDITED
    """
    Deserialize a list of PENMAN-encoded graphs from *source*.

    Args:
        source: a filename or file-like object to read from
        model: the model used for interpreting the graph
    Returns:
        a list of Graph objects
    """
    codec = PENMANCodec(model=model)
    if isinstance(source, (str, Path)):
        with open(source, encoding=encoding) as fh:  # EDITED
            return list(codec.iterdecode(fh))
    else:
        assert hasattr(source, 'read')
        return list(codec.iterdecode(source))

The text was updated successfully, but these errors were encountered:

goodmami · 2022-08-30T07:28:48Z

Thanks for bringing that up. I don't have a Windows machine to test on, but I recall that Windows has strange, sometimes non-standard default encodings, such as UTF-8 with a byte-order-mark. I agree that allowing an encoding option would be helpful. I think you'd only need to target where text is read from the filesystem, as everything should be native Python (unicode) strings internally, so look for calls to open(). I see the following:

penman/penman/codec.py

Line 256 in 4191a6d

with open(source) as fh:

penman/penman/codec.py

Line 295 in 4191a6d

with open(file, 'w') as fh:

penman/penman/__main__.py

Line 267 in 4191a6d

with open(file) as f:

There are a couple other places open() is called, but they seem less urgent (e.g., in setup.py or the unit tests).

If you want to submit a PR with the necessary changes, that would be great.

BramVanroy · 2022-08-30T08:28:52Z

Thank you for the quick response! Indeed, unfortunately Windows still relies on cp1252 encoding (not uf-8-bom), which always leads to small issues like this. I'll look for built-in "open" methods and allow for the option of custom encoding there. I'll send in a PR soon.

goodmami added the bug Something isn't working label Aug 30, 2022

BramVanroy mentioned this issue Aug 30, 2022

Encoding generalizations #110

Merged

goodmami closed this as completed in #110 Sep 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cross-platform/encoding support for IO operations #109

Cross-platform/encoding support for IO operations #109

BramVanroy commented Aug 29, 2022

goodmami commented Aug 30, 2022

BramVanroy commented Aug 30, 2022

Cross-platform/encoding support for IO operations #109

Cross-platform/encoding support for IO operations #109

Comments

BramVanroy commented Aug 29, 2022

goodmami commented Aug 30, 2022

BramVanroy commented Aug 30, 2022