deunicode

The deunicode library transliterates Unicode strings such as "Æneid" into pure ASCII ones such as "AEneid". It includes support for emoji. It's compatible with no-std Rust environments.

Deunicode is quite fast and supports on-the-fly conversion without allocations. It uses a compact representation of Unicode data to minimize memory overhead and executable size (about 75K codepoints mapped to 245K ASCII characters, using 450KB of memory, 160KB gzipped).

Examples

use deunicode::deunicode;

assert_eq!(deunicode("Æneid"), "AEneid");
assert_eq!(deunicode("étude"), "etude");
assert_eq!(deunicode("北亰"), "Bei Jing");
assert_eq!(deunicode("ᔕᓇᓇ"), "shanana");
assert_eq!(deunicode("げんまい茶"), "genmaiCha");
assert_eq!(deunicode("🦄☣"), "unicorn biohazard");

When to use it?

It's a better alternative to just stripping all non-ASCII characters or letting them get mangled by some encoding-ignorant system. It's okay for one-way conversions for things like search indexes and tokenization, as a stronger version of Unicode NFKD. It may be used for generating nice identifiers for file names and URLs, which aren't too user-facing.

However, like most "universal" libraries of this kind, it has a one-size-fits-all 1:1 mapping of Unicode code points, which can't handle language-specific exceptions or context-dependent romanization rules. These limitations cause only minor problems for European languages and Korean Hangul, but make a mess of Japanese Kanji.
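
To make the limitation concrete, here is a minimal sketch of a context-free, per-code-point mapping. It is not deunicode's implementation or data: `toy_lookup`, `toy_transliterate`, and the five-entry table are hypothetical names and values invented for illustration, and only the standard library is used.

```rust
/// Hypothetical toy table. This is NOT deunicode's real data or representation.
/// Each code point maps to exactly one ASCII string, with no regard for context.
fn toy_lookup(c: char) -> Option<&'static str> {
    match c {
        'Æ' => Some("AE"),
        'é' => Some("e"),
        // Han characters get a single Mandarin reading plus a word separator.
        '北' => Some("Bei "),
        '東' => Some("Dong "),
        '京' => Some("Jing "),
        _ => None,
    }
}

/// Context-free transliteration: every code point is replaced on its own.
fn toy_transliterate(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut last_from_table = false;
    for c in input.chars() {
        if c.is_ascii() {
            // ASCII passes through unchanged.
            out.push(c);
            last_from_table = false;
        } else {
            // Characters missing from the toy table become a placeholder.
            out.push_str(toy_lookup(c).unwrap_or("[?]"));
            last_from_table = true;
        }
    }
    // Drop the separator left behind by the final mapped character.
    if last_from_table && out.ends_with(' ') {
        out.pop();
    }
    out
}
```

With this table, `toy_transliterate("北京")` gives `"Bei Jing"`, which is reasonable for Chinese. The Japanese city name "東京" gives `"Dong Jing"`, because the mapping cannot know that the same characters should be read differently in Japanese.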

Guarantees and Warnings

Here are some guarantees you have when calling deunicode():

The String returned will be valid ASCII; the decimal representation of every char in the string will be between 0 and 127, inclusive.
Every ASCII character (0x00 - 0x7F) is mapped to itself.
All Unicode characters will translate to printable ASCII characters (\n or characters in the range 0x20 - 0x7E).
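
If you want to check the last guarantee in your own tests, a small standard-library helper is enough. This is a sketch, and `is_printable_ascii_or_newline` is a hypothetical name, not part of the deunicode API. Because ASCII input is mapped to itself, ASCII control characters already present in the input (such as a tab) would pass through and fail this check, so it is only meaningful for the part of the output that came from non-ASCII input.

```rust
/// Returns true if every byte is `\n` or in the printable ASCII range 0x20..=0x7E.
/// Hypothetical helper for tests; it is not provided by deunicode.
fn is_printable_ascii_or_newline(s: &str) -> bool {
    s.bytes().all(|b| b == b'\n' || (0x20..=0x7E).contains(&b))
}
```

For example, the helper accepts `"Bei Jing"` and `"line one\nline two"`, and rejects `"caf\u{e9}"` (non-ASCII bytes) and `"tab\there"` (a control character other than `\n`).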

There are, however, some things you should keep in mind:

Some transliterations do produce \n characters.
Some Unicode characters transliterate to an empty string, either on purpose or because deunicode does not know about the character.
Some Unicode characters are unknown and transliterate to "[?]" (or a custom placeholder, or None if you use a chars iterator).
Many Unicode characters transliterate to multi-character strings. For example, "北" is transliterated as "Bei".
The transliteration is context-free, and not sophisticated enough to produce proper Chinese or Japanese. Han characters used in multiple languages are mapped to a single Mandarin pronunciation, and will be mostly illegible to Japanese readers. Transliteration can't handle cases where a single character has multiple possible pronunciations.
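
These caveats matter when the transliterated text is turned into identifiers for file names or URLs, since the output may contain newlines, placeholders such as "[?]", or nothing at all. The following is a minimal sketch of one possible post-processing step, assuming the input is a string that has already been transliterated to ASCII. `ascii_slug` is a hypothetical helper, not part of deunicode.

```rust
/// Turns already-transliterated ASCII text into a lowercase, dash-separated slug.
/// Hypothetical helper; it is not provided by deunicode.
fn ascii_slug(transliterated: &str) -> String {
    let mut slug = String::with_capacity(transliterated.len());
    let mut pending_dash = false;
    for c in transliterated.chars() {
        if c.is_ascii_alphanumeric() {
            // Emit one dash for any run of separators, but never at the start.
            if pending_dash && !slug.is_empty() {
                slug.push('-');
            }
            pending_dash = false;
            slug.push(c.to_ascii_lowercase());
        } else {
            // Spaces, newlines, and punctuation (including "[?]") all act as separators.
            pending_dash = true;
        }
    }
    slug
}
```

Here `ascii_slug("Bei Jing")` yields `"bei-jing"` and `ascii_slug("[?] Hello\nWorld!")` yields `"hello-world"`. The result can be empty, for example when every input character transliterated to an empty string or a placeholder, so callers should have a fallback identifier ready.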

Unicode data

Text::Unidecode by Sean M. Burke
Unicodey by Cal Henderson
gh emoji
any_ascii

For a detailed explanation of the rationale behind the original dataset, refer to the article written by Burke in 2001.

This is a maintained alternative to the unidecode crate, which started as a Rust port of the Text::Unidecode Perl module.
