[spec] Specify Feature File encoding as UTF-8 #165

Closed
brawer opened this issue Feb 14, 2017 · 14 comments · Fixed by #1133

@brawer
Contributor

brawer commented Feb 14, 2017

The current Feature File Syntax does not say what encoding a feature file should have, besides restricting string literals to ASCII.

Proposal: Change the spec to require that feature files use UTF-8 encoding; allow for arbitrary Unicode string literals; and change section 9.e to allow for Unicode strings in nameid statements.

This would preserve backwards compatibility as long as the current escaping mechanism is left unchanged. For example, the following two blocks would produce exactly the same output. The former is from the current spec, and its escapes would continue to be based on platform encodings even after the suggested specification change. However, font designers could also use the new syntax in the second block, and it would be up to the compiler to convert this to whatever is needed for the OpenType table.

  table name {
     nameid 9 "Joachim M\00fcller-Lanc\00e9";    # Windows (Unicode)
     nameid 9 1 "Joachim M\9fller-Lanc\8e";      # Macintosh (Mac Roman)
  } name;

  table name {
     nameid 9 "Joachim Müller-Lancé";    # Windows (Unicode)
     nameid 9 1 "Joachim Müller-Lancé";  # Macintosh (Mac Roman)
  } name;

In fonttools/fonttools#780 (comment), @twardoch suggested an extension to feaLib. Personally I like Adam’s idea (minus the complexity of arbitrary encodings; simply requiring UTF-8 would be easier). But I’d prefer to not deviate from the published feature file spec, hence filing this issue here.
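
A quick way to check that the two blocks above describe the same strings is to expand the escapes by hand. The sketch below is plain Python, not code from makeotf or feaLib; `unescape_name_string` is a hypothetical helper, and it assumes four hex digits per escape for Windows strings and two hex digits per escape for Macintosh strings, as in the examples above.

```python
import re


def unescape_name_string(text: str, platform_id: int) -> str:
    """Expand feature-file escapes in a name string (hypothetical helper)."""
    if platform_id == 3:
        # Windows: backslash + 4 hex digits is one UTF-16 code unit.
        # Surrogate pairs are not handled in this sketch.
        return re.sub(
            r"\\([0-9a-fA-F]{4})",
            lambda m: chr(int(m.group(1), 16)),
            text,
        )
    if platform_id == 1:
        # Macintosh: backslash + 2 hex digits is one byte, read here as Mac Roman.
        return re.sub(
            r"\\([0-9a-fA-F]{2})",
            lambda m: bytes([int(m.group(1), 16)]).decode("mac_roman"),
            text,
        )
    raise ValueError(f"unsupported platform ID: {platform_id}")


windows = unescape_name_string(r"Joachim M\00fcller-Lanc\00e9", 3)
macintosh = unescape_name_string(r"Joachim M\9fller-Lanc\8e", 1)
print(windows == macintosh == "Joachim Müller-Lancé")
```

Both escaped forms expand to the same Unicode string, which is what the proposed UTF-8 syntax would let authors type directly.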

@twardoch

Good point on the encodings. I do recommend keeping the escaping, but also allowing UTF-8.

I would also clarify that usage of any characters outside ASCII would imply UTF-8 (i.e. the conversion to the target encoding would be up to the compiler), and presence of any escapement would imply that the escapes are in the native encoding. If it helps implementers, mixing of non-ASCII and escapes could be disallowed.
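
The rule above could be sketched roughly as follows. This is only an illustration in Python: `classify_name_string` and its return labels are made up for this example, the input is assumed to be a string literal already decoded from the file, and the sketch takes the strict option of rejecting strings that mix escapes with non-ASCII characters.

```python
import re

# Matches both escape widths used in feature files: \XX and \XXXX.
ESCAPE = re.compile(r"\\[0-9a-fA-F]{2,4}")


def classify_name_string(text: str) -> str:
    """Decide how a name string should be interpreted (hypothetical helper)."""
    has_escape = ESCAPE.search(text) is not None
    has_non_ascii = not text.isascii()
    if has_escape and has_non_ascii:
        # Strict reading: mixing the two styles is an error.
        raise ValueError("string mixes escapes with non-ASCII characters")
    if has_escape:
        return "native-escaped"  # escapes are in the target platform encoding
    if has_non_ascii:
        return "utf-8"  # compiler converts to the target encoding
    return "ascii"


print(classify_name_string(r"Joachim M\9fller-Lanc\8e"))
print(classify_name_string("Joachim Müller-Lancé"))
```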

@mashabow
Contributor

This change explicitly allows us to write comments with non-ASCII characters; it is so useful for me 😄

@khaledhosny
Collaborator

Escaping might be OK for the odd accent in an otherwise pure ASCII string, but it is a PITA if you are trying to add name entries for, say, Arabic or Indic, and it is completely against the general notion of feature files being human readable.

@brawer
Contributor Author

brawer commented Feb 14, 2017

By the way, fonttools.feaLib already implements the proposal (somewhat accidentally). For example, this feature file:

table name {
    nameid 7 "Joachim M\00fcller-Lanc\00e9";  # Windows (Unicode)
    nameid 7 1 "Joachim M\9fller-Lanc\8e";    # Macintosh (MacRoman, English)
    nameid 8 "Joachim Müller-Lancé";          # Windows (Unicode)
    nameid 8 1 "Joachim Müller-Lancé";        # Macintosh (MacRoman, English)

    nameid 17 "Jovica Veljovi\0107";          # Windows (Unicode)
    nameid 17 1 0 18 "Jovica Veljovi\e6";     # Macintosh (MacRoman, Croatian)
    nameid 18 "Jovica Veljović";              # Windows (Unicode)
    nameid 18 1 0 18 "Jovica Veljović";       # Macintosh (MacRoman, Croatian)
} name;

gets compiled to the following TTX:

  <name>
    <namerecord nameID="7" platformID="3" platEncID="1" langID="0x409">
      Joachim Müller-Lancé
    </namerecord>
    <namerecord nameID="7" platformID="1" platEncID="0" langID="0x0" unicode="True">
      Joachim Müller-Lancé
    </namerecord>
    <namerecord nameID="8" platformID="3" platEncID="1" langID="0x409">
      Joachim Müller-Lancé
    </namerecord>
    <namerecord nameID="8" platformID="1" platEncID="0" langID="0x0" unicode="True">
      Joachim Müller-Lancé
    </namerecord>
    <namerecord nameID="17" platformID="3" platEncID="1" langID="0x409">
      Jovica Veljović
    </namerecord>
    <namerecord nameID="17" platformID="1" platEncID="0" langID="0x12" unicode="True">
      Jovica Veljović
    </namerecord>
    <namerecord nameID="18" platformID="3" platEncID="1" langID="0x409">
      Jovica Veljović
    </namerecord>
    <namerecord nameID="18" platformID="1" platEncID="0" langID="0x12" unicode="True">
      Jovica Veljović
    </namerecord>
  </name>
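
The Croatian records in this output depend on the Macintosh language ID selecting a different legacy code page than the English ones. A minimal sketch of that lookup, using Python's built-in `mac_roman` and `mac_croatian` codecs: the `MAC_CODECS` table and `decode_mac_name` are hypothetical and cover only the two cases in this example, not the table feaLib actually uses.

```python
# Hypothetical lookup keyed by (platEncID, langID); illustration only.
MAC_CODECS = {
    (0, 0): "mac_roman",      # Roman script, English
    (0, 18): "mac_croatian",  # Roman script, Croatian
}


def decode_mac_name(raw: bytes, encoding_id: int, language_id: int) -> str:
    """Decode the bytes of a Macintosh name record to a Unicode string."""
    codec = MAC_CODECS[(encoding_id, language_id)]
    return raw.decode(codec)


# Byte 0xE6 is the escape used in 'nameid 17 1 0 18' above.
print(decode_mac_name(b"Jovica Veljovi\xe6", 0, 18))
print(decode_mac_name(b"Joachim M\x9fller-Lanc\x8e", 0, 0))
```

Decoding the same byte with the English entry of the table would give a different character, which is why the language ID matters for escaped Macintosh strings.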

@brawer
Contributor Author

brawer commented Aug 9, 2017

@readroberts, what do you think about this proposal?

@readroberts
Contributor

I favor supporting UTF-8. The feature file syntax was developed before UTF-8 was widely supported, but that is hardly the case any more: now it is hard to find a text editor that doesn't support it. Given UTF-8 support, the spec certainly needs documentation of what constitutes white-space.

@kenlunde

kenlunde commented Aug 9, 2017

UTF-8 support is very useful for comments.

About the whitespace topic of Issue #191, I vote for U+0009 and U+0020 as valid "white space" characters in non-comments, with anything else throwing an error. ✨🙈✨🙉✨🙊✨
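
A rough sketch of what such a check could look like, in Python. Everything here is an assumption for illustration: `find_bad_whitespace` is a made-up name, the input is one line with its line terminator already stripped, `#` starts a comment, and the contents of double-quoted strings are skipped rather than checked.

```python
ALLOWED_WHITESPACE = {"\t", " "}  # U+0009 and U+0020


def find_bad_whitespace(line: str) -> list:
    """Return (column, code point) pairs for disallowed whitespace.

    Only text outside comments and outside string literals is checked.
    """
    bad = []
    in_string = False
    for column, char in enumerate(line):
        if char == '"':
            in_string = not in_string
        elif in_string:
            continue  # string contents are data, not token separators
        elif char == "#":
            break  # rest of the line is a comment
        elif char.isspace() and char not in ALLOWED_WHITESPACE:
            bad.append((column, f"U+{ord(char):04X}"))
    return bad


# A no-break space between tokens is reported; the one in the comment is not.
print(find_bad_whitespace("sub a\u00a0by b; # fine\u00a0here"))
```

A compiler following this suggestion would turn each reported position into an error instead of returning a list.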

@miguelsousa changed the title from "Specify Feature File encoding as UTF-8" to "[spec] Specify Feature File encoding as UTF-8" on Jul 3, 2018

@typemytype
Contributor

Is there already a decision made?

comments with utf-8 is, hmm very, modern :)

@readroberts
Contributor

I think we are all in favor of specifying that the feature file encoding should be UTF-8, for both comments and non-comment text. I think the only outstanding issue is which white-space characters should be allowed. I'd favor not restricting this, and leaving it to the developer to make choices useful to them. What are the reasons to restrict which white-space characters can be used?

@LIXiangChen

Has there been any progress on this matter? I tried the latest version and it still does not support non-ASCII characters. Sad.

@khaledhosny
Collaborator

I gave this a try, and it does not seem hard to allow UTF-8 input for Windows name entries. For Mac entries, makeotf does not convert Unicode to legacy Mac encodings, and expects the input to be in the legacy encoding (escaped characters are not treated as Unicode but as legacy bytes).

So either direct UTF-8 input is disallowed for Mac entries (simplest solution; Mac name IDs are legacy and should not be needed any more), or UTF-8 to legacy Mac encoding conversion is implemented (more work, dubious value). WDYT?
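
For a sense of what the second option involves, here is a minimal Python sketch of converting a Unicode string to a legacy Mac encoding. It is not makeotf code: `encode_for_mac` is a hypothetical helper built on Python's standard codecs, and it assumes the compiler should fail loudly when a character has no legacy equivalent.

```python
def encode_for_mac(text: str, codec: str = "mac_roman") -> bytes:
    """Convert a Unicode name string to legacy Mac bytes, or raise ValueError."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as err:
        bad = text[err.start:err.end]
        raise ValueError(f"{bad!r} cannot be encoded as {codec}") from err


# Yields the same bytes as the escaped form "Joachim M\9fller-Lanc\8e".
print(encode_for_mac("Joachim Müller-Lancé"))

# Arabic text has no Mac Roman representation, so the conversion fails.
try:
    encode_for_mac("خالد")
except ValueError as err:
    print(err)
```

The failure case is where most of the extra work lies: each script needs its own legacy code page, and anything outside it still has to be rejected.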

@twardoch

twardoch commented May 17, 2020

In my view, Unicode strings should be allowed as Unicode strings. E.g. strings that ultimately are UTF-16BE should be expressible as UTF-8 in FEA. But strings that are not Unicode in the targets should be expressible as escaped byte sequences.

@khaledhosny
Collaborator

That would be the 1st option.

To correct my previous comment, AFDKO seems to have conversion tables from Unicode to several legacy Mac encodings, but these are used for the cmap table and not for the name table.

@khaledhosny
Collaborator

#1133


9 participants