How do you specify a different encoding? #118

Open
eight04 opened this issue Jun 11, 2018 · 1 comment · May be fixed by #174

Comments

@eight04
Contributor

eight04 commented Jun 11, 2018

I want to feed some Big5-UAO encoded data. Since there is no encoding parameter (or something like that), I tried using ByteStream:

stream = ByteStream(screen)
stream.select_other_charset("@")
stream.feed(bytes_object)

However, after checking the source code, it seems that this setup is equivalent to:

stream = Stream(screen)
stream.feed(bytes_object.decode("latin-1"))

This approach doesn't work because the bytes of a Big5-UAO encoded string may contain values that look like control characters, such as \x9d, so match_text fails to match the entire string:

pyte/pyte/streams.py

Lines 132 to 135 in 676610b

_special = set([ctrl.ESC, ctrl.CSI_C1, ctrl.NUL, ctrl.DEL, ctrl.OSC_C1])
_special.update(basic)
_text_pattern = re.compile(
    "[^" + "".join(map(re.escape, _special)) + "]+")
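To make the failure concrete, here is a minimal standalone sketch that does not import pyte. The constants are assumed to mirror the names in the excerpt (notably OSC_C1 as "\x9d" and CSI_C1 as "\x9b"), the `basic` set is left out, and the input bytes are an arbitrary illustration rather than a real Big5-UAO string.

```python
import re

# Assumed stand-ins for pyte's control constants; not imported from pyte.
ESC, CSI_C1, OSC_C1, NUL, DEL = "\x1b", "\x9b", "\x9d", "\x00", "\x7f"
special = {ESC, CSI_C1, OSC_C1, NUL, DEL}
text_pattern = re.compile("[^" + "".join(map(re.escape, special)) + "]+")

# Arbitrary bytes containing 0x9d, standing in for Big5-UAO data.
raw = b"ab\x9d\x40cd"
decoded = raw.decode("latin-1")  # each byte becomes the code point of the same value

match = text_pattern.match(decoded)
matched_text = match.group(0)  # the text run stops right before "\x9d"
print(repr(matched_text))
```

In this sketch the text run ends as soon as the byte 0x9d appears, so the rest of the data would be handed to the control-sequence handling instead of being drawn as text.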

I generated a list of Unicode characters whose Big5-UAO encoding contains control characters:
https://gist.github.com/eight04/3de731b7300a6b5036e082f801e2e3e9

How about decoding the bytes into a Unicode string with Big5-UAO before passing it to stream.feed?

We can't. In our use case, we need a special feature called "雙色字" (dual-color characters). It colors a single double-width character with two different colors. For example:

  • Encode "我" into bytes b'\xa7\xda'
  • Insert ANSI escape codes before byte 0 and byte 1: b'\x1b[1;31m\xa7\x1b[32m\xda'
  • This is what it looks like: https://i.imgur.com/j0hwhZM.png

As a result, we can't decode the bytes before the escape code is parsed.
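The steps above can be sketched as follows. This is only an illustration: Python's built-in "big5" codec is used as a stand-in because Big5-UAO is not in the standard library, and the SGR regular expression is a simplified assumption that covers only color sequences.

```python
import re

glyph = "我".encode("big5")          # two bytes: lead byte and trail byte
lead, trail = glyph[:1], glyph[1:]

# Bold red before the first byte, green before the second byte.
colored = b"\x1b[1;31m" + lead + b"\x1b[32m" + trail

# Decoding the whole sequence up front fails, because the lead byte
# is followed by ESC instead of a valid trail byte.
try:
    colored.decode("big5")
    decodes_directly = True
except UnicodeDecodeError:
    decodes_directly = False

# Only after the escape sequences are parsed out do the two bytes
# sit next to each other again and decode to one character.
SGR = re.compile(rb"\x1b\[[0-9;]*m")
plain = SGR.sub(b"", colored).decode("big5")
print(decodes_directly, plain)
```

The point of the sketch is the ordering: the escape codes have to be consumed at the byte level first, and the character can only be reassembled afterwards.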


Maybe we could add a flag to disable C1 controls in the Stream.feed parser?
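A rough sketch of what such a flag could look like, written as standalone code rather than as a patch to pyte. The helper `build_text_pattern` and its `use_c1` parameter are hypothetical names invented for this illustration, and the control character values are assumptions.

```python
import re

C0_SPECIAL = {"\x1b", "\x00", "\x7f"}  # ESC, NUL, DEL
C1_SPECIAL = {"\x9b", "\x9d"}          # assumed 8-bit forms of CSI and OSC


def build_text_pattern(use_c1=True):
    """Hypothetical helper: compile a plain-text pattern, optionally ignoring C1 controls."""
    special = (C0_SPECIAL | C1_SPECIAL) if use_c1 else C0_SPECIAL
    return re.compile("[^" + "".join(map(re.escape, sorted(special))) + "]+")


# Arbitrary two-byte sample whose second byte is 0x9d.
sample = b"\xa7\x9d".decode("latin-1")

with_c1 = build_text_pattern(use_c1=True).match(sample).group(0)
without_c1 = build_text_pattern(use_c1=False).match(sample).group(0)
print(len(with_c1), len(without_c1))
```

With C1 handling enabled the text run is cut after the first byte; with it disabled both bytes stay in the same run, which is what the double-byte data needs.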

@eight04
Contributor Author

eight04 commented Jun 11, 2018

I found another problem: the byte sequence may contain unprintable characters.

wcwidth considers these characters unprintable:
https://github.com/jquast/wcwidth/blob/c71459ea91af86f3bbcdac2c8ed5e7773da2d848/wcwidth/wcwidth.py#L175-L176

When pyte receives an unprintable character, it doesn't draw it on the buffer:

pyte/pyte/screens.py

Lines 522 to 523 in 676610b

else:
    break  # Unprintable character or doesn't advance the cursor.

As a result, the following characters are never drawn:
https://gist.github.com/eight04/dd7511c289d83932d18d17e21734bab3
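As a rough way to see which bytes are affected, the sketch below flags bytes that become Unicode control characters after latin-1 decoding. It does not call wcwidth; the Unicode category "Cc" is used as an approximation of what wcwidth treats as unprintable, and `control_positions` and the sample bytes are made up for this example.

```python
import unicodedata


def control_positions(data: bytes):
    """Return the indexes of bytes that decode (via latin-1) to control characters."""
    text = data.decode("latin-1")
    return [i for i, ch in enumerate(text) if unicodedata.category(ch) == "Cc"]


# Arbitrary sample: two ordinary high bytes followed by a pair starting with 0x9d.
sample = b"\xa7\xda\x9d\x40"
positions = control_positions(sample)
print(positions)
```

Any byte reported here would fall into the "unprintable" branch quoted above and would not reach the buffer.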


We need a flag that allows unprintable bytes to be written to the buffer.

@eight04 eight04 linked a pull request Feb 28, 2024 that will close this issue