CLDR-17582 Cleanup English annotations (#3751)

unicode-org · May 30, 2024 · 195243c
1 parent 473b6d1
Showing 7 changed files with 1,802 additions and 2,282 deletions.
diff --git a/common/annotations/en.xml b/common/annotations/en.xml
diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md
@@ -2620,28 +2620,68 @@ For more information, see version 5.0 or [UTR #51, Unicode Emoji](https://www.un
 <!ATTLIST annotation type (tts) #IMPLIED >
 ```
 
-There are two kinds of annotations: **short names**, and **keywords**.
+There are two kinds of annotations: **short names**, and **search keywords**.
 
-With an attribute `type="tts"`, the value is a **short name**, such as one that can be used for text-to-speech. It should be treated as one of the element values for other purposes.
+With an attribute `type="tts"`, the value is a **short name**, such as one that can be used for text-to-speech. 
+It should be treated as one of the element values for other purposes.
 
-When there is no `type` attribute, the value is a set of **keywords**, delimited by |. Spaces around each element are to be trimmed. The **keywords** are words associated with the character(s) that might be used in searching for the character, or in predictive typing on keyboards. The short name itself can be used as a keyword.
+When there is no `type` attribute, the value is a set of **keywords**, delimited by |. 
+Spaces around each element are to be trimmed. 
+The **keywords** are words associated with the character(s) that might be used in searching for the character, 
+or in predictive typing on keyboards. The short name itself can be used as a keyword.
 
 Here is an example from German:
 
 ```xml
-<annotation cp="👎">schlecht | Hand | Daumen | nach unten</annotation>
+<annotation cp="👎">schlecht | Hand | Daumen | nach | unten</annotation>
 <annotation cp="👎" type="tts">Daumen runter</annotation>
 ```
 
-The `cp` attribute value has two formats: either a single string, or if contained within \[…\] a UnicodeSet. The latter format can contain multiple code points or strings. A code point pr string can occur in multiple annotation element **cp** values, such as the following, which also contains the "thumbs down" character.
+These are intended as search keywords, and not for "triggering" (aka suggesting).
+
+- For triggering, the user is typing out a message and concurrently seeing a few emoji
+  displayed adjacent to the virtual keyboard. Selecting the emoji adds it to the message.
+  For example, you mention your birthday while writing, and an emoji cake pops up.
+  That is typically done with an LLM or similar advanced technology.
+- For searching, the user is looking for an emoji in a search box, 
+  and typing in words that narrow down a displayed set of emoji.
+  For example, you type 'heart', but that has too many hits, so you add 'blue' and get the set of blue hearts.
+
+### Usage Model
+
+The usage model for the search keywords is:
+
+- The user types one or more words in an emoji search field.
+- Each word successively narrows the set of emoji shown in a results box.
+    - heart → 🥰 😘 😻 💌 💘 💝 💖 💗 💓 💞 💕 💟 ❣️ 💔 ❤️‍🔥 ❤️‍🩹 ❤️ 🩷 🧡 💛 💚 💙 🩵 💜 🤎 🖤 🩶 🤍 💋 🫰 🫶 🫀 💏 💑 🏠 🏡 ♥️ 🩺
+    - blue → 🥶 😰 💙 🩵 🫐 👕 👖 📘 🧿 🔵 🟦 🔷 🔹 🏳️‍⚧️
+    - heart blue → 💙 🩵
+- A word with no hits is ignored.
+    - [heart | blue | confabulation] is equivalent to [heart | blue]
+- As the user types a word, each character added to the word narrows the results.
+- Whenever the list is short enough to scan, the user will mouse-click on the right emoji — so it doesn’t have to be narrowed too far.
+    - In the following, the user would just click on 🎉 if that works for them.
+    - celebrate → 🥳 🥂 🎈 🎉 🎊 🪅
+- The order of words doesn’t matter.
+
+Multiword search keywords are typically broken up into separate parts, 
+because that works better with the usage model. So [hand | mouth | omg | open | over] covers the phrase "hand over mouth".
+
+### cp attribute
+
+The `cp` attribute value has two formats: either a single string, or if contained within \[…\] a UnicodeSet. 
+The latter format can contain multiple code points or strings. A code point or string can occur in multiple annotation element **cp** values, such as the following, which also contains the "thumbs down" character.
 
 ```xml
 <annotation cp='[☝✊-✍👆-👐👫-👭💁🖐🖕🖖🙅🙆🙋🙌🙏🤘]'>hand</annotation>
 ```
 
-Both for short names and keywords, values do not have to match between different languages. They should be the most common values that people using _that_ language would associate with those characters. For example, a "black heart" might have the association of "wicked" in English, but not in some other languages.
+Both for short names and keywords, values do not have to match between different languages. 
+They should be the most common values that people using _that_ language would associate with those characters. 
+For example, a "black heart" might have the association of "wicked" in English, but not in some other languages.
 
-The cp value may contain sequences, but does not contain any Emoji or Text Variant (VS15 & VS16) characters. All such characters should be removed before looking up any short names and keywords.
+The cp value may contain sequences, but does not contain any Emoji or Text Variant (VS15 & VS16) characters. 
+All such characters should be removed before looking up any short names and keywords.
 
 ### <a name="SynthesizingNames" href="#SynthesizingNames">Synthesizing Sequence Names</a>
 

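The usage model described in the tr35-general.md hunk above can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not CLDR code. The class `EmojiSearchSketch`, its methods, and the sample keywords are all hypothetical. Prefix matching is one reading of "each character added to the word narrows the results". Returning the unfiltered set when no typed word has hits follows from treating such a query like an empty one. No case folding is attempted.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the search-keyword usage model; not part of CLDR. */
public class EmojiSearchSketch {
    // Sample emoji are written as escapes so the file compiles under any source encoding.
    public static final String RED_HEART = "\u2764";         // U+2764 without VS16
    public static final String BLUE_HEART = "\uD83D\uDC99";  // U+1F499
    public static final String BLUEBERRIES = "\uD83E\uDED0"; // U+1FAD0

    // Maps an emoji (variation selectors removed) to its search keywords.
    private final Map<String, Set<String>> keywordsByEmoji = new LinkedHashMap<>();

    /** Removes VS15 (U+FE0E) and VS16 (U+FE0F), as the spec asks before any lookup. */
    public static String stripVariationSelectors(String s) {
        return s.replace("\uFE0E", "").replace("\uFE0F", "");
    }

    /** Splits an annotation value on '|' and trims the spaces around each element. */
    public static Set<String> parseKeywords(String value) {
        Set<String> result = new LinkedHashSet<>();
        for (String part : value.split("\\|")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Registers the keywords of one annotation element (the form without a type attribute). */
    public void add(String cp, String annotationValue) {
        keywordsByEmoji
                .computeIfAbsent(stripVariationSelectors(cp), k -> new LinkedHashSet<>())
                .addAll(parseKeywords(annotationValue));
    }

    /** Emoji with at least one keyword that starts with the typed word (assumed prefix match). */
    private Set<String> hits(String word) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : keywordsByEmoji.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (keyword.startsWith(word)) {
                    result.add(entry.getKey());
                    break;
                }
            }
        }
        return result;
    }

    /** Intersects the hits of every typed word; words with no hits are ignored. */
    public Set<String> search(String query) {
        Set<String> result = null;
        for (String word : query.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Set<String> wordHits = hits(word);
            if (wordHits.isEmpty()) {
                continue; // a word with no hits does not narrow the results
            }
            if (result == null) {
                result = wordHits;
            } else {
                result.retainAll(wordHits);
            }
        }
        // Assumption: with no usable word, the full set is shown.
        return result == null ? new LinkedHashSet<>(keywordsByEmoji.keySet()) : result;
    }

    /** Builds a tiny index; the keywords are made up, not copied from en.xml. */
    public static EmojiSearchSketch demoIndex() {
        EmojiSearchSketch index = new EmojiSearchSketch();
        index.add("\u2764\uFE0F", "heart | red"); // stored under RED_HEART once VS16 is stripped
        index.add(BLUE_HEART, "blue | heart");
        index.add(BLUEBERRIES, "berry | blue");
        return index;
    }

    public static void main(String[] args) {
        EmojiSearchSketch index = demoIndex();
        System.out.println(index.search("heart"));
        System.out.println(index.search("heart blue"));
        System.out.println(index.search("heart blue confabulation"));
    }
}
```

Because the result is an intersection of per-word hit sets, word order cannot affect it, which matches the last bullet of the usage model. The same property is why splitting multiword keywords into separate parts works well: each part can be matched independently.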
diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/tool/CheckEmojiAnnotations.java b/tools/cldr-code/src/main/java/org/unicode/cldr/tool/CheckEmojiAnnotations.java
@@ -0,0 +1,155 @@
+package org.unicode.cldr.tool;
+
+import com.google.common.base.Joiner;
+import com.google.common.collect.Sets;
+import com.ibm.icu.impl.UnicodeMap;
+import com.ibm.icu.text.UnicodeSet;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Set;
+import java.util.TreeMap;
+import java.util.TreeSet;
+import org.unicode.cldr.util.Annotations;
+import org.unicode.cldr.util.CLDRConfig;
+import org.unicode.cldr.util.CLDRFile;
+import org.unicode.cldr.util.CldrUtility;
+import org.unicode.cldr.util.CodePointEscaper;
+import org.unicode.cldr.util.Emoji;
+import org.unicode.cldr.util.SimpleUnicodeSetFormatter;
+import org.unicode.cldr.util.XPathParts;
+
+public class CheckEmojiAnnotations {
+    private static final Joiner JOIN_BAR = Joiner.on(" | ");
+
+    public static void main(String[] args) {
+        boolean chooseEmoji = true; // false to get the non-emoji
+
+        UnicodeSet rgi = Emoji.getAllRgi();
+        UnicodeSet rgiNoVariant = Emoji.getAllRgiNoES();
+        CLDRFile root = CLDRConfig.getInstance().getAnnotationsFactory().make("en", false);
+        UnicodeSet rootEmoji = new UnicodeSet();
+        for (String path : root) {
+            XPathParts parts = XPathParts.getFrozenInstance(path);
+            String cp = parts.getAttributeValue(-1, "cp");
+            if (cp != null && rgiNoVariant.contains(cp) == chooseEmoji) {
+                rootEmoji.add(cp);
+            }
+        }
+        rootEmoji.freeze();
+
+        UnicodeMap<Annotations> english = Annotations.getData("en");
+        Map<String, UnicodeSet> keywordToEmoji = new TreeMap<>();
+        UnicodeSet allUnclean = new UnicodeSet();
+
+        for (Annotations entry : english.values()) {
+            Set<String> keywords = entry.getKeywords();
+            UnicodeSet emoji = english.getSet(entry);
+            emoji.retainAll(rootEmoji);
+            UnicodeSet emojiRestored = new UnicodeSet();
+            for (String emojiItem : emoji) {
+                emojiRestored.add(Emoji.restoreVariants(emojiItem));
+            }
+            UnicodeSet unclean = new UnicodeSet(emojiRestored).removeAll(rgi);
+            allUnclean.add(unclean);
+
+            emojiRestored = emojiRestored.retainAll(rgi);
+            if (emojiRestored.isEmpty()) {
+                continue;
+            }
+
+            for (String keyword : keywords) {
+                UnicodeSet value = keywordToEmoji.get(keyword);
+                if (value == null) {
+                    keywordToEmoji.put(keyword, value = new UnicodeSet());
+                }
+                value.addAll(emojiRestored);
+            }
+        }
+        CldrUtility.protectCollection(keywordToEmoji);
+
+        int count = 0;
+        System.out.println("### Emoji to Keywords");
+        TreeSet<String> sortedRootEmoji = new TreeSet<>(Emoji.COLLATOR);
+        rootEmoji.addAllTo(sortedRootEmoji);
+        for (String emoji : sortedRootEmoji) {
+            String restored = Emoji.restoreVariants(emoji);
+            Set<String> keywords = english.get(emoji).getKeywords();
+            System.out.println(
+                    ++count + "\t" + restored + "\t" + emoji + "\t" + JOIN_BAR.join(keywords));
+        }
+
+        UnicodeSet toEscape =
+                new UnicodeSet(CodePointEscaper.FORCE_ESCAPE)
+                        .remove(CodePointEscaper.ZWJ.getCodePoint())
+                        .remove(CodePointEscaper.RANGE.getCodePoint())
+                        .freeze();
+        SimpleUnicodeSetFormatter suf = new SimpleUnicodeSetFormatter(null, toEscape);
+
+        allUnclean =
+                allUnclean
+                        .retainAll(rgiNoVariant)
+                        .removeAll(Emoji.SKIN_MODIFIERS)
+                        .removeAll(Emoji.HAIR_MODIFIERS);
+        if (!allUnclean.isEmpty()) {
+            throw new IllegalArgumentException("Missing " + suf.format(allUnclean));
+        }
+
+        System.out.println("### Keywords to Emoji");
+
+        count = 0;
+        for (Entry<String, UnicodeSet> entry : keywordToEmoji.entrySet()) {
+            System.out.println(
+                    ++count + "\t" + entry.getKey() + "\t" + suf.format(entry.getValue()));
+        }
+
+        System.out.println("### Gender Variants");
+
+        for (Set<String> entry : Emoji.getGenderGroups()) {
+            // find common keywords
+            Set<String> common = null;
+            Set<String> cleanEntry = new TreeSet<>();
+            for (String s : entry) {
+                if (!rootEmoji.contains(Emoji.removeVariants(s))) {
+                    continue;
+                }
+                Annotations anno = getAnnotations(english, s);
+                if (anno == null) {
+                    continue;
+                }
+                cleanEntry.add(s);
+                if (common == null) {
+                    System.out.println();
+                    common = new TreeSet<>();
+                    common.addAll(anno.getKeywords());
+                } else {
+                    common.retainAll(anno.getKeywords());
+                }
+            }
+            // now show them
+            if (cleanEntry.size() > 1) {
+                for (String s : cleanEntry) {
+                    Annotations anno = getAnnotations(english, s);
+                    String removed = Emoji.removeVariants(s);
+                    System.out.println(
+                            s
+                                    + "\t"
+                                    + removed
+                                    + "\t"
+                                    + anno.getShortName()
+                                    + "\t"
+                                    + JOIN_BAR.join(common)
+                                    + "\t"
+                                    + JOIN_BAR.join(Sets.difference(anno.getKeywords(), common)));
+                }
+            }
+        }
+    }
+
+    public static Annotations getAnnotations(UnicodeMap<Annotations> english, String s) {
+        Annotations anno = english.get(s);
+        if (anno == null) {
+            anno = english.get(s.replace(Emoji.EMOJI_VARIANT, ""));
+        }
+        return anno;
+    }
+}
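The "Gender Variants" report in `CheckEmojiAnnotations` first intersects the keyword sets of a group, then prints what each member has beyond that common core. The sketch below isolates that step using only `java.util` collections; it swaps in a copying set difference for Guava's `Sets.difference` view. The class name `CommonKeywordsSketch`, its helper methods, and the sample keywords are hypothetical and do not come from the commit or from en.xml.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the common-keyword step; not part of CLDR. */
public class CommonKeywordsSketch {

    /** Returns the keywords shared by every set, or an empty set if there are no sets. */
    public static Set<String> commonKeywords(Collection<Set<String>> keywordSets) {
        Set<String> common = null;
        for (Set<String> keywords : keywordSets) {
            if (common == null) {
                common = new TreeSet<>(keywords); // first set seeds the intersection
            } else {
                common.retainAll(keywords); // later sets can only shrink it
            }
        }
        return common == null ? new TreeSet<>() : common;
    }

    /** Returns the keywords of one emoji that are not in the common core (a copy, not a view). */
    public static Set<String> extraKeywords(Set<String> keywords, Set<String> common) {
        Set<String> extra = new TreeSet<>(keywords);
        extra.removeAll(common);
        return extra;
    }

    public static void main(String[] args) {
        // Made-up keyword sets for two members of one gender group.
        Set<String> man = new TreeSet<>(Arrays.asList("dance", "dancing", "man"));
        Set<String> woman = new TreeSet<>(Arrays.asList("dance", "dancing", "woman"));

        Set<String> common = commonKeywords(Arrays.asList(man, woman));
        System.out.println(String.join(" | ", common));                       // dance | dancing
        System.out.println(String.join(" | ", extraKeywords(man, common)));   // man
        System.out.println(String.join(" | ", extraKeywords(woman, common))); // woman
    }
}
```

Printing the common core next to each member's extras makes it easy to spot keywords that one variant has and its counterpart lacks, which appears to be the purpose of the report.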