Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the case of "centi-meter" and "100-kilometer" #4418

Merged
merged 30 commits into from
Jan 5, 2024

Conversation

younies
Copy link
Member

@younies younies commented Dec 7, 2023

Fixes: #4461

@younies younies requested a review from sffc December 7, 2023 15:09
@younies younies requested a review from a team as a code owner December 7, 2023 15:09
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test that this fixes? I'm not quite sure what the scenario is

identifier_split: &mut std::str::Split<'data, char>,
trie: &ZeroTrie<ZeroVec<'data, u8>>,
) -> Option<usize> {
let mut part = part.to_string();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a Cow to avoid allocation in the general case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ZeroTrie cursor is available now so you should just use that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I will do that right away.

In order to add the test cases, could you reply to this comment:
https://github.com/unicode-org/icu4x/pull/4422/files#r1419262946

I need to be able to read the provider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use the ZeroTrie cursor, this means that we need to change the algorithm in getting the power and the si prefix.

Shall we do it in another PR and test the performance. who knows which implementation is better in performance ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't allocate memory in hot library code like this. We have clients who are sensitive to it. Also, we have a lot of experiential evidence that memory allocations are one of the biggest single contributors to slow code, so I have no doubt that ZeroTrieCursor will be faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, no problem. actually, I am implementing it right now.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. Now that ZeroTrie Cursor is available, please rewrite the parsing code to use it and fix the bug at the same time.
  2. Please add test cases. I'm not opposed to merging this PR first but it needs test cases in order to be mergeable.

identifier_split: &mut std::str::Split<'data, char>,
trie: &ZeroTrie<ZeroVec<'data, u8>>,
) -> Option<usize> {
let mut part = part.to_string();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ZeroTrie cursor is available now so you should just use that.

@sffc
Copy link
Member

sffc commented Dec 7, 2023

Please hook up baked data first.

@younies younies requested a review from sffc December 14, 2023 14:07
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with the proposed PR because you should not allocate memory, like to_string(), when parsing. Also, I would like any bug fixes to be backed by unit tests.

This PR is small enough that I suggest that you write the ZeroTrie code and fix the bug at the same time. There is no need to benchmark which one is faster because memory allocations are off the table.

Comment on lines 13 to 34
pub fn get_power(part: &str) -> (u8, &str) {
use std::collections::BTreeMap;
let mut powers = BTreeMap::<Vec<u8>, usize>::new();
powers.insert(b"pow1".to_vec(), 1);
powers.insert(b"pow2".to_vec(), 2);
powers.insert(b"square".to_vec(), 2);
powers.insert(b"pow3".to_vec(), 3);
powers.insert(b"cubic".to_vec(), 3);
powers.insert(b"pow4".to_vec(), 4);
powers.insert(b"pow5".to_vec(), 5);
powers.insert(b"pow6".to_vec(), 6);
powers.insert(b"pow7".to_vec(), 7);
powers.insert(b"pow8".to_vec(), 8);
powers.insert(b"pow9".to_vec(), 9);
powers.insert(b"pow10".to_vec(), 10);
powers.insert(b"pow11".to_vec(), 11);
powers.insert(b"pow12".to_vec(), 12);
powers.insert(b"pow13".to_vec(), 13);
powers.insert(b"pow14".to_vec(), 14);
powers.insert(b"pow15".to_vec(), 15);

let trie = ZeroTrieSimpleAscii::try_from(&powers).unwrap();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sffc How to construct the try directly without the need of an intermediate map?

Also, the current way, will construct the map at each function call, how to build it at the compile time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I use lazy_static?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, use the const constructor of ZeroTrieSimpleAscii

https://unicode-org.github.io/icu4x/rustdoc/zerotrie/struct.ZeroTrieSimpleAscii.html#method.from_sorted_str_tuples

You need to manually put the strings in the correct order and then select the correct trie length. The error message should tell you whether the length is too long or too short and then you bisect to find the correct length. Alternatively you can use the non-const constructor to find the length.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it easier in #4466

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks for the info

I would like if we can give it unsorted.

Comment on lines 18 to 48
fn get_si_prefix_base_ten(part: &str) -> (i8, &str) {
let prefixes = vec![
("quetta", 30, 0),
("ronna", 27, 1),
("yotta", 24, 2),
("zetta", 21, 3),
("exa", 18, 4),
("peta", 15, 5),
("tera", 12, 6),
("giga", 9, 7),
("mega", 6, 8),
("kilo", 3, 9),
("hecto", 2, 10),
("deca", 1, 11),
("deci", -1, 12),
("centi", -2, 13),
("milli", -3, 14),
("micro", -6, 15),
("nano", -9, 16),
("pico", -12, 17),
("femto", -15, 18),
("atto", -18, 19),
("zepto", -21, 20),
("yocto", -24, 21),
("ronto", -27, 22),
("quecto", -30, 23),
];

let prefixes_map = prefixes
.iter()
.map(|(prefix, _, index)| (prefix.as_bytes().to_vec(), *index))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 74 to 90
fn get_si_prefix_base_two(part: &str) -> (i8, &str) {
let prefixes = vec![
("yobi", 80),
("zebi", 70),
("exbi", 60),
("pebi", 50),
("tebi", 40),
("gibi", 30),
("mebi", 20),
("kibi", 10),
];
let prefixes_map = prefixes
.iter()
.map(|(prefix, index)| (prefix.as_bytes().to_vec(), *index))
.collect::<BTreeMap<Vec<u8>, usize>>();
let trie = ZeroTrieSimpleAscii::try_from(&prefixes_map).unwrap();
let mut cursor = trie.cursor();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 80 to 86
// TODO: how to convert from `&ZeroTrie<ZeroVec<'_, u8>>` to &ZeroTrieSimpleAscii<Vec<u8>>?
let payload: ZeroTrieSimpleAscii<Vec<u8>> = ZeroTrieSimpleAscii::try_from(
&icu_unitsconversion::provider::Baked::SINGLETON_UNITS_INFO_V1.units_conversion_trie,
);
)
.unwrap();
let parser = MeasureUnitParser::from_payload(&payload);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sffc : how to convert from &ZeroTrie<ZeroVec<'_, u8>> to &ZeroTrieSimpleAscii<Vec>?

Or more accurate question: how to get the ZeroTrieSimpleAsciiCursor from ZeroTrie<ZeroVec<'_, u8>>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now use take_store and from_store

After #4408 we can simplify it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done that.

However, take_store returns ZeroVec<u8> and from_store need Vec<u8>\

is there a better way than converting ZeroVec to Vec?

@younies younies requested a review from sffc December 18, 2023 19:41
Comment on lines 80 to 83
// // TODO: how to convert from `&ZeroTrie<ZeroVec<'_, u8>>` to &ZeroTrieSimpleAscii<Vec<u8>>?
// let store = icu_unitsconversion::provider::Baked::SINGLETON_UNITS_INFO_V1
// .units_conversion_trie
// .take_store();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sffc
sffc previously approved these changes Dec 28, 2023
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

experimental/unitsconversion/src/measureunit.rs Outdated Show resolved Hide resolved
Comment on lines 57 to 59
if part_without_power.starts_with("-") {
return Ok((power, &part_without_power[1..]));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (optional): you can avoid the indexing by using split_first, like

if let Some((&'-', remainder)) = part_without_power.split_first() {
    return Ok((power, remainder));
}

However, I think this only works if you deal with [u8] instead of str.

You have a lot of indexing here but it is all guarded so it's fine. You'll need to add some clippy suppressions, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Co-authored-by: Shane F. Carr <shane@unicode.org>
sffc
sffc previously approved these changes Dec 29, 2023
@younies younies merged commit e804871 into unicode-org:main Jan 5, 2024
29 checks passed
@younies younies deleted the fix-extracting-units branch January 5, 2024 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parser needs to be fixed in order to parse all the CLDR units.
4 participants