UTF-8 ? #55

Closed
drzraf opened this issue Apr 28, 2017 · 8 comments


drzraf commented Apr 28, 2017

From: algorithms/french/stem_ISO_8859_1.sbl

stringdef a^   hex 'E2'  // a-circumflex
stringdef a`   hex 'E0'  // a-grave
stringdef c,   hex 'E7'  // c-cedilla

stringdef e"   hex 'EB'  // e-diaeresis (rare)
stringdef e'   hex 'E9'  // e-acute
stringdef e^   hex 'EA'  // e-circumflex
stringdef e`   hex 'E8'  // e-grave
stringdef i"   hex 'EF'  // i-diaeresis
stringdef i^   hex 'EE'  // i-circumflex
stringdef o^   hex 'F4'  // o-circumflex
stringdef u^   hex 'FB'  // u-circumflex
stringdef u`   hex 'F9'  // u-grave

So far there is no UTF-8 version. Why?

Member

ojwb commented Apr 30, 2017

There is, but it's byte-for-byte identical to the ISO-8859-1 version (because ISO-8859-1 encodings exactly match the first 256 codepoints of Unicode), so rather than keeping two copies in sync in git, the build system generates it.
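
The claim about the first 256 codepoints can be checked directly. Below is a minimal Python sketch, not part of Snowball; the helper name latin1_matches_unicode is made up for illustration. It confirms that every ISO-8859-1 byte decodes to the Unicode codepoint with the same number, which is why hex values such as 'E9' need no change between the two versions of the source.

```python
def latin1_matches_unicode() -> bool:
    """Return True if every ISO-8859-1 byte decodes to the same-numbered codepoint."""
    return all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))


# The stringdef value for e-acute, hex 'E9', read as a Unicode codepoint.
E_ACUTE = chr(0xE9)

if __name__ == "__main__":
    print(latin1_matches_unicode())
    print(E_ACUTE == "\u00e9")
```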

ojwb closed this as completed Apr 30, 2017
Author

drzraf commented May 1, 2017 via email

Member

ojwb commented May 2, 2017 via email

Author

drzraf commented May 2, 2017 via email

Member

ojwb commented Jun 19, 2017

As I said above, the Unicode version of the Snowball source is byte-for-byte identical to the ISO_8859_1 version when there's an ISO_8859_1 version.

If that's really too complicated, you can just get the build system to create all the Unicode versions like so:

for d in algorithms/* ; do make "$d"/stem_Unicode.sbl ; done

Author

drzraf commented Jun 19, 2017

How is "é" (UTF-8 bytes C3 A9) handled, then, given stringdef e' hex 'E9' // e-acute?
If it isn't, how is this going to work for a UTF-8 string?

Member

ojwb commented Jun 19, 2017

It's handled as Unicode codepoint U+00E9 when the snowball compiler is running in "widechars" or UTF-8 mode (which are enabled automatically for some backends - e.g. only "widechars" makes sense for Java, and only UTF-8 for Go).
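
As a rough illustration of that distinction, here is a Python sketch. It is not the Snowball compiler's actual code: encode_stringdef and its mode names are hypothetical, and modelling "widechars" as 16-bit UTF-16 units is an assumption. The point is only that the hex value names a codepoint, and the bytes produced depend on the target encoding.

```python
def encode_stringdef(hex_value: str, mode: str) -> bytes:
    """Encode a stringdef hex value, read as a Unicode codepoint, for a given mode.

    The mode names are illustrative only: "iso_8859_1", "utf8", "widechars".
    """
    ch = chr(int(hex_value, 16))
    if mode == "iso_8859_1":
        return ch.encode("latin-1")    # one byte, same value as the codepoint
    if mode == "utf8":
        return ch.encode("utf-8")      # multi-byte for codepoints above U+007F
    if mode == "widechars":
        return ch.encode("utf-16-be")  # assumed: one 16-bit unit per BMP codepoint
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("iso_8859_1", "utf8", "widechars"):
        print(mode, encode_stringdef("E9", mode).hex())
```

Under these assumptions, hex 'E9' in UTF-8 mode comes out as the two bytes C3 A9, matching the encoding of "é" asked about above.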

Author

drzraf commented Jun 20, 2017

Got it. Sorry for the misleading "c3a9" UTF-8 remark above, and thank you, ojwb, for the clarifications!
This one is right then, as is this one.
But I ended up using this one (carry.js).
