Fix the case of "centi-meter" and "100-kilometer" #4418

younies · 2023-12-07T15:09:30Z

robertbastian

Can you add a test that this fixes? I'm not quite sure what the scenario is

robertbastian · 2023-12-07T15:29:03Z

experimental/unitsconversion/src/measureunit.rs

+        identifier_split: &mut std::str::Split<'data, char>,
+        trie: &ZeroTrie<ZeroVec<'data, u8>>,
+    ) -> Option<usize> {
+        let mut part = part.to_string();


Use a Cow to avoid allocation in the general case.
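A minimal sketch of the `Cow` suggestion above, assuming a hypothetical helper `strip_hyphens` that is not part of this PR. It only illustrates the pattern of borrowing in the common case and allocating when the input actually has to change:

```rust
use std::borrow::Cow;

/// Hypothetical helper (not from the PR): removes '-' separators from a part.
/// The input is borrowed unchanged when it contains no hyphen, so the
/// common case performs no heap allocation.
fn strip_hyphens(part: &str) -> Cow<'_, str> {
    if part.contains('-') {
        // Only this branch allocates a new String.
        Cow::Owned(part.replace('-', ""))
    } else {
        Cow::Borrowed(part)
    }
}
```

Callers can treat the result as a `&str` through deref, regardless of which variant was returned.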

The ZeroTrie cursor is available now so you should just use that.

Great, I will do that right away.

In order to add the test cases, could you reply to this comment:
https://github.com/unicode-org/icu4x/pull/4422/files#r1419262946

I need to be able to read the provider.

Using the ZeroTrie cursor means we need to change the algorithm for extracting the power and the SI prefix.

Shall we do that in another PR and test the performance? Who knows which implementation performs better?

You can't allocate memory in hot library code like this. We have clients who are sensitive to it. Also, we have a lot of experiential evidence that memory allocations are one of the biggest single contributors to slow code, so I have no doubt that ZeroTrieCursor will be faster.

Okay, no problem. Actually, I am implementing it right now.

sffc

Two things:

1. Now that ZeroTrie Cursor is available, please rewrite the parsing code to use it and fix the bug at the same time.
2. Please add test cases. I'm not opposed to merging this PR first but it needs test cases in order to be mergeable.

sffc · 2023-12-07T16:23:06Z

experimental/unitsconversion/src/measureunit.rs

+        identifier_split: &mut std::str::Split<'data, char>,
+        trie: &ZeroTrie<ZeroVec<'data, u8>>,
+    ) -> Option<usize> {
+        let mut part = part.to_string();


The ZeroTrie cursor is available now so you should just use that.

sffc · 2023-12-07T18:51:27Z

Please hook up baked data first.

sffc

I don't agree with the PR as proposed because parsing code should not allocate memory, as to_string() does. Also, I would like any bug fixes to be backed by unit tests.

This PR is small enough that I suggest that you write the ZeroTrie code and fix the bug at the same time. There is no need to benchmark which one is faster because memory allocations are off the table.

younies · 2023-12-18T19:39:16Z

experimental/unitsconversion/src/power.rs

+pub fn get_power(part: &str) -> (u8, &str) {
+    use std::collections::BTreeMap;
+    let mut powers = BTreeMap::<Vec<u8>, usize>::new();
+    powers.insert(b"pow1".to_vec(), 1);
+    powers.insert(b"pow2".to_vec(), 2);
+    powers.insert(b"square".to_vec(), 2);
+    powers.insert(b"pow3".to_vec(), 3);
+    powers.insert(b"cubic".to_vec(), 3);
+    powers.insert(b"pow4".to_vec(), 4);
+    powers.insert(b"pow5".to_vec(), 5);
+    powers.insert(b"pow6".to_vec(), 6);
+    powers.insert(b"pow7".to_vec(), 7);
+    powers.insert(b"pow8".to_vec(), 8);
+    powers.insert(b"pow9".to_vec(), 9);
+    powers.insert(b"pow10".to_vec(), 10);
+    powers.insert(b"pow11".to_vec(), 11);
+    powers.insert(b"pow12".to_vec(), 12);
+    powers.insert(b"pow13".to_vec(), 13);
+    powers.insert(b"pow14".to_vec(), 14);
+    powers.insert(b"pow15".to_vec(), 15);
+
+    let trie = ZeroTrieSimpleAscii::try_from(&powers).unwrap();


@sffc How can I construct the trie directly, without an intermediate map?

Also, the current code constructs the map on every function call. How can I build it at compile time?

Shall I use lazy_static?

No, use the const constructor of ZeroTrieSimpleAscii

https://unicode-org.github.io/icu4x/rustdoc/zerotrie/struct.ZeroTrieSimpleAscii.html#method.from_sorted_str_tuples

You need to manually put the strings in the correct order and then select the correct trie length. The error message should tell you whether the length is too long or too short and then you bisect to find the correct length. Alternatively you can use the non-const constructor to find the length.
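As a std-only illustration of the "manually sorted, checked at compile time" idea described above, the sketch below keeps the power keywords from the diff in a `const` table and verifies the ordering in a `const` assertion. It is a stand-in, not the zerotrie API; `POWERS`, `is_strictly_sorted`, and `lookup_power` are hypothetical names:

```rust
/// Power keywords from the diff above, listed manually in ascending byte order.
const POWERS: [(&str, u8); 17] = [
    ("cubic", 3),
    ("pow1", 1),
    ("pow10", 10),
    ("pow11", 11),
    ("pow12", 12),
    ("pow13", 13),
    ("pow14", 14),
    ("pow15", 15),
    ("pow2", 2),
    ("pow3", 3),
    ("pow4", 4),
    ("pow5", 5),
    ("pow6", 6),
    ("pow7", 7),
    ("pow8", 8),
    ("pow9", 9),
    ("square", 2),
];

/// Byte-wise "less than" that is usable in const contexts.
const fn bytes_less(a: &[u8], b: &[u8]) -> bool {
    let mut i = 0;
    while i < a.len() && i < b.len() {
        if a[i] != b[i] {
            return a[i] < b[i];
        }
        i += 1;
    }
    // One key is a prefix of the other: the shorter key sorts first.
    a.len() < b.len()
}

/// Returns true if every key is strictly greater than its predecessor.
const fn is_strictly_sorted(table: &[(&str, u8)]) -> bool {
    let mut i = 1;
    while i < table.len() {
        if !bytes_less(table[i - 1].0.as_bytes(), table[i].0.as_bytes()) {
            return false;
        }
        i += 1;
    }
    true
}

// Compilation fails if someone adds an entry in the wrong position.
const _: () = assert!(is_strictly_sorted(&POWERS), "POWERS must be sorted");

/// Allocation-free lookup by binary search over the sorted table.
fn lookup_power(key: &str) -> Option<u8> {
    POWERS
        .binary_search_by(|(k, _)| k.as_bytes().cmp(key.as_bytes()))
        .ok()
        .map(|i| POWERS[i].1)
}
```

Because the table is a `const`, nothing is built per call, which is the property the const trie constructor is meant to provide.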

I made it easier in #4466

Done, thanks for the info.

I would like it if we could pass the entries unsorted.

younies · 2023-12-18T19:39:38Z

experimental/unitsconversion/src/si_prefix.rs

+fn get_si_prefix_base_ten(part: &str) -> (i8, &str) {
+    let prefixes = vec![
+        ("quetta", 30, 0),
+        ("ronna", 27, 1),
+        ("yotta", 24, 2),
+        ("zetta", 21, 3),
+        ("exa", 18, 4),
+        ("peta", 15, 5),
+        ("tera", 12, 6),
+        ("giga", 9, 7),
+        ("mega", 6, 8),
+        ("kilo", 3, 9),
+        ("hecto", 2, 10),
+        ("deca", 1, 11),
+        ("deci", -1, 12),
+        ("centi", -2, 13),
+        ("milli", -3, 14),
+        ("micro", -6, 15),
+        ("nano", -9, 16),
+        ("pico", -12, 17),
+        ("femto", -15, 18),
+        ("atto", -18, 19),
+        ("zepto", -21, 20),
+        ("yocto", -24, 21),
+        ("ronto", -27, 22),
+        ("quecto", -30, 23),
+    ];
+
+    let prefixes_map = prefixes
+        .iter()
+        .map(|(prefix, _, index)| (prefix.as_bytes().to_vec(), *index))
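The `vec![...]` in the hunk above is rebuilt, and heap-allocated, on every call. A minimal sketch of an allocation-free alternative, assuming a plain `const` slice and `str::strip_prefix` rather than the trie the PR ended up using (`SI_PREFIXES_BASE_TEN` and `split_si_prefix_base_ten` are hypothetical names):

```rust
/// SI prefixes and their base-ten exponents, taken from the diff above.
const SI_PREFIXES_BASE_TEN: &[(&str, i8)] = &[
    ("quetta", 30), ("ronna", 27), ("yotta", 24), ("zetta", 21),
    ("exa", 18), ("peta", 15), ("tera", 12), ("giga", 9),
    ("mega", 6), ("kilo", 3), ("hecto", 2), ("deca", 1),
    ("deci", -1), ("centi", -2), ("milli", -3), ("micro", -6),
    ("nano", -9), ("pico", -12), ("femto", -15), ("atto", -18),
    ("zepto", -21), ("yocto", -24), ("ronto", -27), ("quecto", -30),
];

/// Splits a leading SI prefix off `part`.
/// Returns the exponent and the remainder, or `(0, part)` when no prefix matches.
/// Both the table and the returned slice are borrowed, so nothing is allocated.
fn split_si_prefix_base_ten(part: &str) -> (i8, &str) {
    SI_PREFIXES_BASE_TEN
        .iter()
        .find_map(|&(prefix, power)| part.strip_prefix(prefix).map(|rest| (power, rest)))
        .unwrap_or((0, part))
}
```

A real parser would still have to confirm that the remainder is a known unit, since a unit name can itself begin with the letters of a prefix.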


younies · 2023-12-18T19:39:47Z

experimental/unitsconversion/src/si_prefix.rs

+fn get_si_prefix_base_two(part: &str) -> (i8, &str) {
+    let prefixes = vec![
+        ("yobi", 80),
+        ("zebi", 70),
+        ("exbi", 60),
+        ("pebi", 50),
+        ("tebi", 40),
+        ("gibi", 30),
+        ("mebi", 20),
+        ("kibi", 10),
+    ];
+    let prefixes_map = prefixes
+        .iter()
+        .map(|(prefix, index)| (prefix.as_bytes().to_vec(), *index))
+        .collect::<BTreeMap<Vec<u8>, usize>>();
+    let trie = ZeroTrieSimpleAscii::try_from(&prefixes_map).unwrap();
+    let mut cursor = trie.cursor();
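To illustrate what stepping a cursor one byte at a time buys, here is a std-only sketch over a sorted table. `TableCursor` and `split_binary_prefix` are hypothetical; this is not the zerotrie cursor, only the same idea of narrowing the candidates per byte without building any strings:

```rust
/// Binary prefixes from the diff above, sorted by byte order.
const BINARY_PREFIXES: &[(&str, i8)] = &[
    ("exbi", 60), ("gibi", 30), ("kibi", 10), ("mebi", 20),
    ("pebi", 50), ("tebi", 40), ("yobi", 80), ("zebi", 70),
];

/// Hypothetical cursor: the table entries that still match the bytes seen so far.
struct TableCursor {
    candidates: &'static [(&'static str, i8)],
    depth: usize,
}

impl TableCursor {
    fn new() -> Self {
        Self { candidates: BINARY_PREFIXES, depth: 0 }
    }

    /// Narrows the candidates to the entries whose byte at `depth` equals `byte`.
    fn step(&mut self, byte: u8) {
        let cands = self.candidates;
        let d = self.depth;
        // All candidates share their first `d` bytes, so they are ordered by byte `d`.
        let start = cands.partition_point(|(key, _)| {
            key.as_bytes().get(d).map_or(true, |&b| b < byte)
        });
        let rest = &cands[start..];
        let len = rest.partition_point(|(key, _)| key.as_bytes().get(d) == Some(&byte));
        self.candidates = &rest[..len];
        self.depth += 1;
    }

    /// Returns the value if the bytes consumed so far spell a complete key.
    fn value(&self) -> Option<i8> {
        match self.candidates.first() {
            Some((key, power)) if key.len() == self.depth => Some(*power),
            _ => None,
        }
    }

    fn is_empty(&self) -> bool {
        self.candidates.is_empty()
    }
}

/// Splits a leading binary prefix off `part`, or returns `(0, part)` if none matches.
fn split_binary_prefix(part: &str) -> (i8, &str) {
    let mut cursor = TableCursor::new();
    for (i, &byte) in part.as_bytes().iter().enumerate() {
        cursor.step(byte);
        if cursor.is_empty() {
            break;
        }
        if let Some(power) = cursor.value() {
            // Matched keys are ASCII, so `i + 1` is a valid char boundary.
            return (power, &part[i + 1..]);
        }
    }
    (0, part)
}
```

Returning at the first complete key is enough here because no binary prefix is a prefix of another; a table with overlapping keys would need to keep stepping and remember the longest match.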


younies · 2023-12-18T19:41:18Z

experimental/unitsconversion/tests/units_test.rs

+    // TODO: how to convert from `&ZeroTrie<ZeroVec<'_, u8>>` to &ZeroTrieSimpleAscii<Vec<u8>>?
+    let payload: ZeroTrieSimpleAscii<Vec<u8>> = ZeroTrieSimpleAscii::try_from(
        &icu_unitsconversion::provider::Baked::SINGLETON_UNITS_INFO_V1.units_conversion_trie,
-    );
+    )
+    .unwrap();
+    let parser = MeasureUnitParser::from_payload(&payload);



@sffc: how can I convert from &ZeroTrie<ZeroVec<'_, u8>> to &ZeroTrieSimpleAscii<Vec<u8>>?

Or, more accurately: how can I get the ZeroTrieSimpleAsciiCursor from ZeroTrie<ZeroVec<'_, u8>>?

For now use take_store and from_store

After #4408 we can simplify it

I have done that.

However, take_store returns ZeroVec<u8> and from_store needs Vec<u8>.

Is there a better way than converting the ZeroVec to a Vec?

younies · 2023-12-19T15:34:30Z

experimental/unitsconversion/tests/units_test.rs

+    // // TODO: how to convert from `&ZeroTrie<ZeroVec<'_, u8>>` to &ZeroTrieSimpleAscii<Vec<u8>>?
+    // let store = icu_unitsconversion::provider::Baked::SINGLETON_UNITS_INFO_V1
+    //     .units_conversion_trie
+    //     .take_store();


@Manishearth

sffc

Nice

sffc · 2023-12-28T08:54:30Z

experimental/unitsconversion/src/measureunit.rs

+        if part_without_power.starts_with("-") {
+            return Ok((power, &part_without_power[1..]));
        }


Suggestion (optional): you can avoid the indexing by using split_first, like

if let Some((&b'-', remainder)) = part_without_power.split_first() { return Ok((power, remainder)); }

However, I think this only works if you deal with [u8] instead of str.

You have a lot of indexing here but it is all guarded so it's fine. You'll need to add some clippy suppressions, though.
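A minimal sketch of both index-free variants mentioned in this suggestion, using only std calls. The function names are hypothetical and the surrounding parser logic is omitted:

```rust
/// For `&str` input, `strip_prefix` removes a leading '-' without manual indexing.
fn strip_dash_str(part: &str) -> Option<&str> {
    part.strip_prefix('-')
}

/// For `&[u8]` input, `split_first` yields the first byte and the remainder.
fn strip_dash_bytes(part: &[u8]) -> Option<&[u8]> {
    if let Some((&b'-', remainder)) = part.split_first() {
        Some(remainder)
    } else {
        None
    }
}
```

Neither form can panic, so no clippy suppression for indexing should be needed at these call sites.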

Fix the case of "centi-meter" and "100-kilometer"

bd44037

younies requested a review from sffc December 7, 2023 15:09

younies requested a review from a team as a code owner December 7, 2023 15:09

robertbastian reviewed Dec 7, 2023

sffc reviewed Dec 7, 2023

younies added 3 commits December 12, 2023 18:35

Merge branch 'main' of github.com:unicode-org/icu4x into fix-extracti…

4413060

…ng-units

Merge branch 'main' of github.com:unicode-org/icu4x into fix-extracti…

ddf1121

…ng-units

fix after merge.

49e161d

younies requested a review from sffc December 14, 2023 14:07

younies added 2 commits December 14, 2023 16:32

fix fmt

1deb947

Merge branch 'main' into fix-extracting-units

bec7a3c

sffc requested changes Dec 18, 2023

younies added 3 commits December 18, 2023 20:27

fix the cases of "centi-meter" and "100-kilometer"

7e57267

Merge branch 'main' of github.com:unicode-org/icu4x into fix-extracti…

11f6d4f

…ng-units

try to convert from from &ZeroTrie<ZeroVec<'_, u8>> to &ZeroTrieSim…

ef2444a

…pleAscii<Vec<u8>>

younies requested a review from Manishearth as a code owner December 18, 2023 19:33

younies commented Dec 18, 2023

younies requested a review from sffc December 18, 2023 19:41

younies added 4 commits December 19, 2023 11:59

Merge branch 'main' of github.com:unicode-org/icu4x into fix-extracti…

2499082

…ng-units

Use const TRIE for powers

69a9327

Make constants TRIEs

7e42334

silent the test

7eee337

younies commented Dec 19, 2023

Use ZeroVec instead of Vec

e46cc25

sffc previously approved these changes Dec 28, 2023

Update experimental/unitsconversion/src/measureunit.rs

a975af3

Co-authored-by: Shane F. Carr <shane@unicode.org>

younies dismissed sffc’s stale review via a975af3 December 29, 2023 00:13

Merge branch 'main' of github.com:unicode-org/icu4x into fix-extracti…

28a750c

…ng-units

sffc previously approved these changes Dec 29, 2023

younies added 5 commits January 3, 2024 17:05

Merge branch 'main' of github.com:unicode-org/icu4x into fix-extracti…

8eff0c7

…ng-units

Merge branch 'main' of github.com:unicode-org/icu4x into fix-extracti…

4a3458a

…ng-units

add more test cases

b4b1c72

fix fmt

89a3a8d

fix clippy

3c0c3f7

younies dismissed sffc’s stale review via 3c0c3f7 January 4, 2024 16:00

younies added 7 commits January 4, 2024 17:22

fix all the cases

14d9162

fix clippy

25fbd09

fix clippy

07982db

fix clippy

e1e91a1

fix fmt

2754835

add tests for the non parsable units

92577dc

update bakkeddata

e9232c9

younies requested review from robertbastian and sffc January 4, 2024 16:54

younies added 2 commits January 4, 2024 18:06

cargo make download-repo-sources

61c9de3

cargo make testdata

8e2a1e6

sffc approved these changes Jan 5, 2024

younies merged commit e804871 into unicode-org:main Jan 5, 2024
29 checks passed

younies deleted the fix-extracting-units branch January 5, 2024 12:55
