UTF-8 ? #55
Comments
There is, but it's byte-for-byte identical (because ISO-8859-1 encodings exactly match the first 256 codepoints from Unicode), so rather than have two copies in git to keep in sync, the build system handles it.
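To illustrate the point about the first 256 codepoints, here is a minimal Python sketch. It is not part of the Snowball build system, and the variable names are made up for illustration. It only checks that every ISO-8859-1 byte value decodes to the Unicode codepoint with the same number.

```python
# Build one byte string containing every possible byte value, 0x00..0xFF.
all_bytes = bytes(range(256))

# Decode it as ISO-8859-1 (Latin-1); each byte becomes exactly one character.
decoded = all_bytes.decode("iso-8859-1")

# Collect the Unicode codepoint of each decoded character.
codepoints = [ord(ch) for ch in decoded]

# Byte value N maps to codepoint U+00NN for all 256 values.
assert codepoints == list(range(256))
```

So a character given by its numeric value (for example E9) means the same thing whether that number is read as an ISO-8859-1 byte or as a Unicode codepoint.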
Would you want to make these files publicly available so that other
projects (snowball-js => lurn-languages [JSON=>UTF-8]) could take them as a reference?
|
All the files involved are already in the git repo.
|
I meant the corresponding generated UTF-8 files.
E.g. algorithms/french only contains ISO_8859_1 and MS_DOS_Latin_I files
but not the "generated" UTF-8 counterparts.
Don't you think it would make sense?
|
As I said above, the Unicode version of the Snowball source is byte-for-byte identical to the ISO_8859_1 version when there's an ISO_8859_1 version. If that's really too complicated, you can just get the build system to create all the Unicode versions like so:
|
How is "é" (C3A9) considered then given |
It's handled as Unicode codepoint U+00E9 when the snowball compiler is running in "widechars" or UTF-8 mode (which are enabled automatically for some backends - e.g. only "widechars" makes sense for Java, and only UTF-8 for Go).
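As a rough illustration of the codepoint versus encoding distinction, here is a small Python sketch. It does not use the Snowball compiler at all; `codepoint_of` is a hypothetical helper, and UTF-16 is used only as an assumed stand-in for what "widechars" would look like in a Java-style runtime. It shows that "é" is one codepoint, U+00E9, with different byte representations.

```python
def codepoint_of(data: bytes, encoding: str) -> int:
    """Decode exactly one character from `data` and return its codepoint."""
    text = data.decode(encoding)
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


E_ACUTE = "\u00e9"  # the character é

latin1_bytes = E_ACUTE.encode("iso-8859-1")  # single byte: E9
utf8_bytes = E_ACUTE.encode("utf-8")         # two bytes: C3 A9
utf16_bytes = E_ACUTE.encode("utf-16-be")    # one 16-bit unit: 00 E9

# Different bytes on disk or in memory, same codepoint after decoding.
for data, encoding in [
    (latin1_bytes, "iso-8859-1"),
    (utf8_bytes, "utf-8"),
    (utf16_bytes, "utf-16-be"),
]:
    print(encoding, data.hex(), hex(codepoint_of(data, encoding)))
```

The bytes C3 A9 only appear once the text is encoded as UTF-8; the codepoint itself stays U+00E9 in every case.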
From: algorithms/french/stem_ISO_8859_1.sbl
So far there is no UTF-8 version. Why?