Fix the case of "centi-meter" and "100-kilometer" #4418
Can you add a test that this fixes? I'm not quite sure what the scenario is
    identifier_split: &mut std::str::Split<'data, char>,
    trie: &ZeroTrie<ZeroVec<'data, u8>>,
) -> Option<usize> {
    let mut part = part.to_string();
Use a `Cow` to avoid allocation in the general case.
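A minimal sketch of the `Cow` suggestion, using only the standard library. It assumes the allocation is needed only on a rare path where a following segment has to be glued onto the part. The helper name `with_next_segment` and its signature are hypothetical and are not taken from the PR.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not from the PR): returns `part` unchanged and
/// borrowed when there is nothing to append, and allocates exactly once
/// when a following segment has to be joined with a dash.
fn with_next_segment<'a>(part: &'a str, next: Option<&str>) -> Cow<'a, str> {
    match next {
        // Common case: no allocation, the caller keeps using the original slice.
        None => Cow::Borrowed(part),
        // Rare case: build the joined string with a single, pre-sized allocation.
        Some(next) => {
            let mut owned = String::with_capacity(part.len() + 1 + next.len());
            owned.push_str(part);
            owned.push('-');
            owned.push_str(next);
            Cow::Owned(owned)
        }
    }
}
```

Callers can treat the result as a `&str` through deref in both cases, so only the rare path pays for a `String`.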
The ZeroTrie cursor is available now so you should just use that.
Great, I will do that right away.
In order to add the test cases, could you reply to this comment:
https://github.com/unicode-org/icu4x/pull/4422/files#r1419262946
I need to be able to read the provider.
Using the ZeroTrie cursor means changing the algorithm that extracts the power and the SI prefix.
Shall we do that in a separate PR and measure the performance? It is not obvious which implementation is faster.
You can't allocate memory in hot library code like this. We have clients who are sensitive to it. Also, we have a lot of experiential evidence that memory allocations are one of the biggest single contributors to slow code, so I have no doubt that ZeroTrieCursor will be faster.
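To illustrate what cursor-style parsing buys, here is a std-only sketch that walks the input one byte at a time over a sorted static table and remembers the longest matching prefix, without touching the heap. It is not the `ZeroTrieCursor` API: the table `PREFIXES` and the function `longest_prefix` are hypothetical stand-ins that only mimic the stepping behaviour.

```rust
/// Hypothetical stand-in for a trie: keys sorted in byte order.
/// This is NOT the zerotrie API, only an illustration of byte-wise stepping.
const PREFIXES: &[(&str, i8)] = &[
    ("centi", -2),
    ("deci", -1),
    ("kilo", 3),
    ("mega", 6),
    ("milli", -3),
];

/// Returns the value of the longest key that prefixes `input`, together
/// with the unconsumed remainder. Performs no heap allocation.
fn longest_prefix(input: &str) -> Option<(i8, &str)> {
    // Keys that are still compatible with every byte consumed so far.
    let mut candidates = PREFIXES;
    let mut best = None;
    for (depth, &b) in input.as_bytes().iter().enumerate() {
        // Narrow the sorted range to keys whose byte at `depth` equals `b`.
        // Keys that are too short sort first and are skipped by both bounds.
        let start = candidates
            .partition_point(|(k, _)| k.as_bytes().get(depth).map_or(true, |&kb| kb < b));
        let end = candidates
            .partition_point(|(k, _)| k.as_bytes().get(depth).map_or(true, |&kb| kb <= b));
        candidates = &candidates[start..end];
        // No key continues with this byte: stop, like a cursor hitting a dead end.
        let Some(&(key, value)) = candidates.first() else { break };
        // The shortest surviving key sorts first; if it ends here, record the match.
        if key.len() == depth + 1 {
            best = Some((value, &input[depth + 1..]));
        }
    }
    best
}
```

With this table, `longest_prefix("kilometer")` returns `Some((3, "meter"))`, while `longest_prefix("meter")` returns `None` because `mega` stops matching at the third byte.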
Okay, no problem. Actually, I am implementing it right now.
Two things:
- Now that ZeroTrie Cursor is available, please rewrite the parsing code to use it and fix the bug at the same time.
- Please add test cases. I'm not opposed to merging this PR first but it needs test cases in order to be mergeable.
Please hook up baked data first.
I don't agree with the proposed PR because you should not allocate memory when parsing, for example with `to_string()`. Also, I would like any bug fixes to be backed by unit tests.
This PR is small enough that I suggest that you write the ZeroTrie code and fix the bug at the same time. There is no need to benchmark which one is faster because memory allocations are off the table.
pub fn get_power(part: &str) -> (u8, &str) {
    use std::collections::BTreeMap;
    let mut powers = BTreeMap::<Vec<u8>, usize>::new();
    powers.insert(b"pow1".to_vec(), 1);
    powers.insert(b"pow2".to_vec(), 2);
    powers.insert(b"square".to_vec(), 2);
    powers.insert(b"pow3".to_vec(), 3);
    powers.insert(b"cubic".to_vec(), 3);
    powers.insert(b"pow4".to_vec(), 4);
    powers.insert(b"pow5".to_vec(), 5);
    powers.insert(b"pow6".to_vec(), 6);
    powers.insert(b"pow7".to_vec(), 7);
    powers.insert(b"pow8".to_vec(), 8);
    powers.insert(b"pow9".to_vec(), 9);
    powers.insert(b"pow10".to_vec(), 10);
    powers.insert(b"pow11".to_vec(), 11);
    powers.insert(b"pow12".to_vec(), 12);
    powers.insert(b"pow13".to_vec(), 13);
    powers.insert(b"pow14".to_vec(), 14);
    powers.insert(b"pow15".to_vec(), 15);

    let trie = ZeroTrieSimpleAscii::try_from(&powers).unwrap();
@sffc How can I construct the trie directly, without an intermediate map?
Also, the current code rebuilds the map on every function call. How can it be built at compile time?
Shall I use `lazy_static`?
No, use the const constructor of `ZeroTrieSimpleAscii`.
You need to manually put the strings in the correct order and then select the correct trie length. The error message should tell you whether the length is too long or too short and then you bisect to find the correct length. Alternatively you can use the non-const constructor to find the length.
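The idea of a compile-time table that must be written in sorted order, with the build failing when it is not, can be sketched with the standard library alone. This is an analogy under stated assumptions, not the `ZeroTrieSimpleAscii` const constructor: the table `POWERS` (a shortened list) and the helpers `less`, `is_strictly_sorted`, and `power_of` are hypothetical.

```rust
/// Hypothetical const table, a stand-in for a const-constructed trie.
/// Entries must be listed in byte order, as with a sorted-input constructor.
const POWERS: [(&str, u8); 4] = [("cubic", 3), ("pow2", 2), ("pow3", 3), ("square", 2)];

/// Byte-wise "a < b", written as a `const fn` so it can run at compile time.
const fn less(a: &[u8], b: &[u8]) -> bool {
    let mut i = 0;
    while i < a.len() && i < b.len() {
        if a[i] != b[i] {
            return a[i] < b[i];
        }
        i += 1;
    }
    a.len() < b.len()
}

/// Checks that every key is strictly greater than the one before it.
const fn is_strictly_sorted(table: &[(&str, u8)]) -> bool {
    let mut i = 1;
    while i < table.len() {
        if !less(table[i - 1].0.as_bytes(), table[i].0.as_bytes()) {
            return false;
        }
        i += 1;
    }
    true
}

// Evaluated by the compiler: a mis-ordered table becomes a build error
// instead of a runtime surprise.
const _: () = assert!(is_strictly_sorted(&POWERS), "POWERS must be sorted by key");

/// Exact-match lookup over the sorted table; nothing is built per call.
fn power_of(name: &str) -> Option<u8> {
    POWERS
        .binary_search_by(|(key, _)| key.cmp(&name))
        .ok()
        .map(|index| POWERS[index].1)
}
```

Because the table is a `const`, nothing is constructed at call time, which addresses the concern about rebuilding the map on every call.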
I made it easier in #4466
Done, thanks for the info.
It would be nice if we could pass the strings unsorted.
fn get_si_prefix_base_ten(part: &str) -> (i8, &str) {
    let prefixes = vec![
        ("quetta", 30, 0),
        ("ronna", 27, 1),
        ("yotta", 24, 2),
        ("zetta", 21, 3),
        ("exa", 18, 4),
        ("peta", 15, 5),
        ("tera", 12, 6),
        ("giga", 9, 7),
        ("mega", 6, 8),
        ("kilo", 3, 9),
        ("hecto", 2, 10),
        ("deca", 1, 11),
        ("deci", -1, 12),
        ("centi", -2, 13),
        ("milli", -3, 14),
        ("micro", -6, 15),
        ("nano", -9, 16),
        ("pico", -12, 17),
        ("femto", -15, 18),
        ("atto", -18, 19),
        ("zepto", -21, 20),
        ("yocto", -24, 21),
        ("ronto", -27, 22),
        ("quecto", -30, 23),
    ];

    let prefixes_map = prefixes
        .iter()
        .map(|(prefix, _, index)| (prefix.as_bytes().to_vec(), *index))
Ditto
done
fn get_si_prefix_base_two(part: &str) -> (i8, &str) {
    let prefixes = vec![
        ("yobi", 80),
        ("zebi", 70),
        ("exbi", 60),
        ("pebi", 50),
        ("tebi", 40),
        ("gibi", 30),
        ("mebi", 20),
        ("kibi", 10),
    ];
    let prefixes_map = prefixes
        .iter()
        .map(|(prefix, index)| (prefix.as_bytes().to_vec(), *index))
        .collect::<BTreeMap<Vec<u8>, usize>>();
    let trie = ZeroTrieSimpleAscii::try_from(&prefixes_map).unwrap();
    let mut cursor = trie.cursor();
ditto
done
// TODO: how to convert from `&ZeroTrie<ZeroVec<'_, u8>>` to `&ZeroTrieSimpleAscii<Vec<u8>>`?
let payload: ZeroTrieSimpleAscii<Vec<u8>> = ZeroTrieSimpleAscii::try_from(
    &icu_unitsconversion::provider::Baked::SINGLETON_UNITS_INFO_V1.units_conversion_trie,
)
.unwrap();
let parser = MeasureUnitParser::from_payload(&payload);
@sffc: how do I convert from `&ZeroTrie<ZeroVec<'_, u8>>` to `&ZeroTrieSimpleAscii<Vec<u8>>`?
Or, more precisely: how do I get the `ZeroTrieSimpleAsciiCursor` from `ZeroTrie<ZeroVec<'_, u8>>`?
For now, use `take_store` and `from_store`.
After #4408 we can simplify it.
I have done that.
However, `take_store` returns `ZeroVec<u8>` and `from_store` needs `Vec<u8>`.
Is there a better way than converting the `ZeroVec` to a `Vec`?
// // TODO: how to convert from `&ZeroTrie<ZeroVec<'_, u8>>` to &ZeroTrieSimpleAscii<Vec<u8>>?
// let store = icu_unitsconversion::provider::Baked::SINGLETON_UNITS_INFO_V1
//     .units_conversion_trie
//     .take_store();
Nice
if part_without_power.starts_with("-") {
    return Ok((power, &part_without_power[1..]));
}
Suggestion (optional): you can avoid the indexing by using `split_first`, like
    if let Some((&b'-', remainder)) = part_without_power.split_first() {
        return Ok((power, remainder));
    }
However, I think this only works if you deal with `[u8]` instead of `str`.
You have a lot of indexing here but it is all guarded so it's fine. You'll need to add some clippy suppressions, though.
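A small std-only sketch of the two index-free options, for comparison. `split_first` is available on slices such as `[u8]`, while `strip_prefix` gives the same effect on `str`. The function names `strip_dash_bytes` and `strip_dash_str` are hypothetical and only serve the illustration.

```rust
/// Byte-slice variant: match on the first element, no indexing involved.
fn strip_dash_bytes(part: &[u8]) -> &[u8] {
    if let Some((&b'-', remainder)) = part.split_first() {
        return remainder;
    }
    part
}

/// String variant: `str::strip_prefix` removes a leading '-' when present
/// and returns `None` otherwise, so the original slice is used as a fallback.
fn strip_dash_str(part: &str) -> &str {
    part.strip_prefix('-').unwrap_or(part)
}
```

Neither version can panic on an empty input, which is the property the guarded indexing was protecting.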
done
Co-authored-by: Shane F. Carr <shane@unicode.org>
Fixes: #4461