Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement MeasureUnit #4360

Merged
merged 56 commits into from
Dec 7, 2023
Merged
Show file tree
Hide file tree
Changes from 39 commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
ebcc3fe
Use ZeroTrie and remove dimension
younies Nov 17, 2023
aefef1a
Remove unneeded data in the data provider
younies Nov 17, 2023
c2edbd6
fix ci-job-msrv-features-3
younies Nov 17, 2023
5ea721a
Merge branch 'main' of github.com:unicode-org/icu4x into units-trie
younies Nov 22, 2023
dd9000a
use ZeroTrieSimpleAscii instead
younies Nov 22, 2023
63c44ce
Merge branch 'main' of github.com:unicode-org/icu4x into units-trie
younies Nov 23, 2023
e282232
Implement MeasureUnit
younies Nov 23, 2023
5861684
fix error
younies Nov 23, 2023
a2e5600
Implement measureunit
younies Nov 23, 2023
427b008
small fix
younies Nov 24, 2023
3440254
fix the analyzer
younies Nov 24, 2023
82bcd7c
Ready for the new data model
younies Nov 24, 2023
534fa73
silent test cases temporary
younies Nov 24, 2023
55ea5e1
add small output
younies Nov 24, 2023
27b65a0
process the bases units
younies Nov 24, 2023
2d2bc47
Fix test cases
younies Nov 24, 2023
77e459d
fix clippy
younies Nov 24, 2023
8dd3c86
use strip_prefix
younies Nov 24, 2023
dff30b7
fix get_unit_id
younies Nov 24, 2023
6427f4e
fix clippy
younies Nov 24, 2023
28fe962
Add comments
younies Nov 24, 2023
4b0a395
add comments
younies Nov 24, 2023
42ebe7f
Create a struct for `SiPrefix`
younies Nov 28, 2023
64ba8e5
use optional si prefix
younies Nov 28, 2023
f13d1cb
add "ronto" and "quecto"
younies Nov 28, 2023
a4b923c
return option si prefix
younies Nov 28, 2023
e839685
Merge branch 'main' of github.com:unicode-org/icu4x into units-trie
younies Nov 28, 2023
754d17e
fix clippy
younies Nov 28, 2023
dbd3e9c
Merge branch 'main' of github.com:unicode-org/icu4x into units-trie
younies Nov 29, 2023
01f4009
add more powers and si prefixes
younies Nov 29, 2023
e01e211
Fix the search for the id
younies Nov 29, 2023
bc6fcb8
SiPrefix without i8 instead of u8.
younies Nov 30, 2023
193ec7a
Instead of making the SiPrefix field optional, we can establish the d…
younies Nov 30, 2023
1002837
add ronna and quetta
younies Nov 30, 2023
7b19d2d
Update experimental/unitsconversion/src/measureunit.rs
younies Nov 30, 2023
d320904
fix
younies Nov 30, 2023
9fcf51a
reduce the created vectors.
younies Nov 30, 2023
bfff7dd
fix naming
younies Nov 30, 2023
d0d1a5d
Merge branch 'main' of github.com:unicode-org/icu4x into units-trie
younies Nov 30, 2023
1931b07
use split instead
younies Nov 30, 2023
9df4bac
improve
younies Nov 30, 2023
7c0916f
Merge branch 'main' of github.com:unicode-org/icu4x into units-trie
younies Nov 30, 2023
705355b
improve
younies Nov 30, 2023
5c6f1ee
small fix
younies Nov 30, 2023
c288a8a
improve
younies Nov 30, 2023
894bb60
fix clippy
younies Nov 30, 2023
bb144d4
fix comments
younies Dec 1, 2023
4ed7efb
move the static function
younies Dec 1, 2023
03ee92f
rename functions
younies Dec 1, 2023
6cc4f86
fix comment
younies Dec 1, 2023
0eff919
add todo
younies Dec 1, 2023
0acd58f
Merge branch 'main' into units-trie
younies Dec 1, 2023
fe3382d
use smallvec
younies Dec 1, 2023
e4fc883
Merge branch 'main' into units-trie
younies Dec 1, 2023
20d2013
Merge branch 'main' into units-trie
younies Dec 1, 2023
70c9f31
Merge branch 'main' into units-trie
younies Dec 4, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions experimental/unitsconversion/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,18 @@

extern crate alloc;

pub mod measureunit;
pub mod provider;

/// Represents the possible errors that can occur during the measurement unit operations.
pub enum ConversionError {
/// The unit is not valid.
/// This can happen if the unit id is not following the CLDR specification.
/// For example, `meter` is a valid unit id, but `metre` is not.
InvalidUnit,

/// The conversion is not valid.
/// This can happen if the units are not compatible.
/// For example, `meter` and `foot` are compatible, but `meter` and `second` are not.
InvalidConversion,
}
269 changes: 269 additions & 0 deletions experimental/unitsconversion/src/measureunit.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,269 @@
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

use zerotrie::ZeroTrie;
use zerovec::ZeroVec;

use crate::{
provider::{Base, MeasureUnitItem, SiPrefix},
ConversionError,
};

// TODO(#4369): split this struct to two structs: MeasureUnitParser for parsing the identifier and MeasureUnit to represent the unit.
#[zerovec::make_varule(MeasureUnitULE)]
#[derive(Clone, Debug, PartialOrd, Ord, PartialEq, Eq, Default)]
#[cfg_attr(
feature = "datagen",
derive(databake::Bake),
databake(path = icu_unitsconversion::provider),
)]
#[cfg_attr(
feature = "datagen",
derive(serde::Serialize),
zerovec::derive(Serialize)
)]
#[cfg_attr(
feature = "serde",
derive(serde::Deserialize),
zerovec::derive(Deserialize)
)]
#[zerovec::derive(Debug)]
pub struct MeasureUnit<'data> {
/// Contains the processed units.
#[cfg_attr(feature = "serde", serde(borrow))]
pub contained_units: ZeroVec<'data, MeasureUnitItem>,
}

impl MeasureUnit<'_> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this runtime code, or just datagen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have also:

let meter = MeasureUnit::try_from_id("meter");

So, I want to use MeasureUnit in the runtime and in the datagen

// TODO: consider returning Option<(u8, &str)> instead of (1, part) for the case when the power is not found.
// TODO: complete all the cases for the powers.
// TODO: consider using a trie for the powers.
/// Get the power of the unit.
/// NOTE:
/// if the power is found, the function will return (power, part without the power).
/// if the power is not found, the function will return (1, part).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: These docs seem out of date

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

fn get_power(part: &str) -> (u8, &str) {
if let Some(part) = part.strip_prefix("square-") {
(2, part)
} else if let Some(part) = part.strip_prefix("pow2-") {
(2, part)
} else if let Some(part) = part.strip_prefix("cubic-") {
(3, part)
} else if let Some(part) = part.strip_prefix("pow3-") {
(3, part)
} else if let Some(part) = part.strip_prefix("pow4-") {
(4, part)
} else if let Some(part) = part.strip_prefix("pow5-") {
(5, part)
} else if let Some(part) = part.strip_prefix("pow6-") {
(6, part)
} else if let Some(part) = part.strip_prefix("pow7-") {
(7, part)
} else if let Some(part) = part.strip_prefix("pow8-") {
(8, part)
} else if let Some(part) = part.strip_prefix("pow9-") {
(9, part)
} else {
(1, part)
}
}

// TODO: complete all the cases for the prefixes.
// TODO: consider using a trie for the prefixes.
/// Get the SI prefix.
/// NOTE:
/// if the prefix is found, the function will return (power, base, part without the prefix).
/// if the prefix is not found, the function will return (0, Base::NotExist, part).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Docs seem out of date

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

fn get_si_prefix(part: &str) -> (SiPrefix, &str) {
let (si_prefix_base_10, part) = Self::get_si_prefix_base_10(part);
if si_prefix_base_10 != 0 {
return (
SiPrefix {
power: si_prefix_base_10,
base: Base::Decimal,
},
part,
);
}

let (si_prefix_base_2, part) = Self::get_si_prefix_base_two(part);
if si_prefix_base_2 != 0 {
return (
SiPrefix {
power: si_prefix_base_2,
base: Base::Binary,
},
part,
);
}

(
SiPrefix {
power: 0,
base: Base::Decimal,
},
part,
)
}

// TODO: consider returning Option<(i8, &str)> instead of (0, part) for the case when the prefix is not found.
// TODO: consider using a trie for the prefixes.
// TODO: complete all the cases for the prefixes.
/// Get the SI prefix for base 10.
/// NOTE:
/// if the prefix is found, the function will return (power, part without the prefix).
/// if the prefix is not found, the function will return (0, part).
fn get_si_prefix_base_10(part: &str) -> (i8, &str) {
if let Some(part) = part.strip_prefix("quetta") {
(30, part)
} else if let Some(part) = part.strip_prefix("ronna") {
(27, part)
} else if let Some(part) = part.strip_prefix("yotta") {
(24, part)
} else if let Some(part) = part.strip_prefix("zetta") {
(21, part)
} else if let Some(part) = part.strip_prefix("exa") {
(18, part)
} else if let Some(part) = part.strip_prefix("peta") {
(15, part)
} else if let Some(part) = part.strip_prefix("tera") {
(12, part)
} else if let Some(part) = part.strip_prefix("giga") {
(9, part)
} else if let Some(part) = part.strip_prefix("mega") {
(6, part)
} else if let Some(part) = part.strip_prefix("kilo") {
(3, part)
} else if let Some(part) = part.strip_prefix("hecto") {
(2, part)
} else if let Some(part) = part.strip_prefix("deca") {
(1, part)
} else if let Some(part) = part.strip_prefix("deci") {
(-1, part)
} else if let Some(part) = part.strip_prefix("centi") {
(-2, part)
} else if let Some(part) = part.strip_prefix("milli") {
(-3, part)
} else if let Some(part) = part.strip_prefix("micro") {
(-6, part)
} else if let Some(part) = part.strip_prefix("nano") {
(-9, part)
} else if let Some(part) = part.strip_prefix("pico") {
(-12, part)
} else if let Some(part) = part.strip_prefix("femto") {
(-15, part)
} else if let Some(part) = part.strip_prefix("atto") {
(-18, part)
} else if let Some(part) = part.strip_prefix("zepto") {
(-21, part)
} else if let Some(part) = part.strip_prefix("yocto") {
(-24, part)
} else if let Some(part) = part.strip_prefix("ronto") {
(-27, part)
} else if let Some(part) = part.strip_prefix("quecto") {
(-30, part)
} else {
robertbastian marked this conversation as resolved.
Show resolved Hide resolved
(0, part)
}
}

// TODO: consider returning Option<(i8, &str)> instead of (0, part) for the case when the prefix is not found.
// TODO: consider using a trie for the prefixes.
// TODO: complete all the cases for the prefixes.
/// Get the SI prefix for base 2.
/// NOTE:
/// if the prefix is found, the function will return (power, part without the prefix).
/// if the prefix is not found, the function will return (0, part).
fn get_si_prefix_base_two(part: &str) -> (i8, &str) {
if let Some(part) = part.strip_prefix("kibi") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: here you're searching from 0 to +inf, which I think makes sense. In the decimal units you do +inf -> 0 -> -inf. It probably makes sense to start at 0 and do +1 -> -1 -> +2 -> -2 ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be confusing:

"deca" , "deci" , "hecta", "centi", "kilo", "milli" .... etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, I see the pattern. Just a nit though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

(10, part)
} else if let Some(part) = part.strip_prefix("mebi") {
(20, part)
} else if let Some(part) = part.strip_prefix("gibi") {
(30, part)
} else if let Some(part) = part.strip_prefix("tebi") {
(40, part)
} else if let Some(part) = part.strip_prefix("pebi") {
(50, part)
} else if let Some(part) = part.strip_prefix("exbi") {
(60, part)
} else if let Some(part) = part.strip_prefix("zebi") {
(70, part)
} else if let Some(part) = part.strip_prefix("yobi") {
(80, part)
} else {
(0, part)
}
}

// TODO: consider using a sufficient trie search for finding the unit id.
/// Get the unit id.
/// NOTE:
/// if the unit id is found, the function will return (unit id, part without the unit id and without `-` at the beginning of the remaining part if it exists).
/// if the unit id is not found, the function will return None.
fn get_unit_id<'data>(
part: &'data str,
trie: &ZeroTrie<ZeroVec<'data, u8>>,
) -> Option<(usize, &'data str)> {
// TODO(#4379): this is inefficient way to search for an item in a trie.
// we must implement a way to search for a prefix in a trie.
let mut result = None;
for (index, _) in part.char_indices() {
let identifier = &part[..=index];
robertbastian marked this conversation as resolved.
Show resolved Hide resolved
if let Some(value) = trie.get(identifier.as_bytes()) {
result = Some((value, &part[identifier.len()..]));
}
}
robertbastian marked this conversation as resolved.
Show resolved Hide resolved

result
}

/// Process a part of an identifier.
/// For example, if the whole identifier is: "square-kilometer-per-second",
/// this function will be called for "square-kilometer" with sign (1) and "second" with sign (-1).
fn analyze_identifier_part(
identifier_part: &str,
sign: i8,
result: &mut Vec<MeasureUnitItem>,
trie: &ZeroTrie<ZeroVec<'_, u8>>,
sffc marked this conversation as resolved.
Show resolved Hide resolved
) -> Result<(), ConversionError> {
let mut identifier = identifier_part;
while !identifier.is_empty() {
let (power, identifier_after_power) = Self::get_power(identifier);
let (si_prefix, identifier_after_si) = Self::get_si_prefix(identifier_after_power);
let (unit_id, identifier_after_unit) =
match Self::get_unit_id(identifier_after_si, trie) {
Some((unit_id, identifier_after_unit)) => (unit_id, identifier_after_unit),
None => return Err(ConversionError::InvalidUnit),
};
robertbastian marked this conversation as resolved.
Show resolved Hide resolved

result.push(MeasureUnitItem {
power: power as i8 * sign,
si_prefix,
unit_id: unit_id as u16,
});

identifier = identifier_after_unit.get(1..).unwrap_or("");
robertbastian marked this conversation as resolved.
Show resolved Hide resolved
}

Ok(())
}

// TODO: add test cases for this function.
/// Process an identifier.
pub fn try_from_identifier<'data>(
identifier: &'data str,
trie: &ZeroTrie<ZeroVec<'data, u8>>,
) -> Result<Vec<MeasureUnitItem>, ConversionError> {
let (num_part, den_part) = identifier
.split_once("-per-")
.or_else(|| identifier.split_once("per-"))
.unwrap_or((identifier, ""));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized we actually only need one search:

Suggested change
let (num_part, den_part) = identifier
.split_once("-per-")
.or_else(|| identifier.split_once("per-"))
.unwrap_or((identifier, ""));
let (num_part, den_part) = identifier
.split_once("per-")
.map(|(n,d)| (n.strip_suffix("-").unwrap_or(n), d))
.unwrap_or((identifier, ""));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, done, but I feel we over complicated it


let mut measure_unit_items = Vec::<MeasureUnitItem>::new();
Self::analyze_identifier_part(num_part, 1, &mut measure_unit_items, trie)?;
Self::analyze_identifier_part(den_part, -1, &mut measure_unit_items, trie)?;
Ok(measure_unit_items)
}
}
66 changes: 62 additions & 4 deletions experimental/unitsconversion/src/provider.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,6 @@
//!
//! Read more about data providers: [`icu_provider`]

use alloc::borrow::Cow;
use icu_provider::prelude::*;
use zerotrie::ZeroTrie;
use zerovec::{VarZeroVec, ZeroVec};
Expand Down Expand Up @@ -37,7 +36,7 @@ pub const KEYS: &[DataKey] = &[UnitsInfoV1Marker::KEY];
pub struct UnitsInfoV1<'data> {
/// Maps from unit name (e.g. foot) to it is conversion information.
#[cfg_attr(feature = "serde", serde(borrow))]
pub units_conversion_map: ZeroTrie<ZeroVec<'data, u8>>,
pub units_conversion_trie: ZeroTrie<ZeroVec<'data, u8>>,

/// Contains the conversion information, such as the conversion rate and the base unit.
/// For example, the conversion information for the unit `foot` is `1 foot = 0.3048 meter`.
Expand Down Expand Up @@ -66,9 +65,9 @@ pub struct UnitsInfoV1<'data> {
)]
#[zerovec::derive(Debug)]
pub struct ConversionInfo<'data> {
/// Contains the base unit which the unit is converted to.
/// Contains the base unit (after parsing) which what the unit is converted to.
#[cfg_attr(feature = "serde", serde(borrow))]
pub base_unit: Cow<'data, str>,
pub basic_units: ZeroVec<'data, MeasureUnitItem>,

/// Represents the numerator of the conversion factor.
#[cfg_attr(feature = "serde", serde(borrow))]
Expand Down Expand Up @@ -129,3 +128,62 @@ pub enum Exactness {
Exact = 0,
Approximate = 1,
}

#[zerovec::make_ule(BaseULE)]
#[cfg_attr(
feature = "datagen",
derive(serde::Serialize, databake::Bake),
databake(path = icu_unitsconversion::provider),
)]
#[cfg_attr(feature = "serde", derive(serde::Deserialize))]
#[derive(Copy, Clone, Debug, PartialOrd, Ord, PartialEq, Eq, Default)]
#[repr(u8)]
pub enum Base {
/// The base of the si prefix is 10.
#[default]
Decimal = 0,

/// The base of the si prefix is 2.
Binary = 1,
}

/// Represents an Item of a MeasureUnit.
/// For example, the MeasureUnit `kilometer-per-square-second` contains two items:
/// 1. `kilometer` with power 1 and prefix 3 with base 10.
/// 2. `second` with power -2 and prefix `NotExist`.
#[zerovec::make_ule(MeasureUnitItemULE)]
#[derive(Copy, Clone, Debug, PartialOrd, Ord, PartialEq, Eq, Default)]
#[cfg_attr(
feature = "datagen",
derive(serde::Serialize, databake::Bake),
databake(path = icu_unitsconversion::provider),
)]
#[cfg_attr(feature = "serde", derive(serde::Deserialize))]
pub struct MeasureUnitItem {
/// The power of the unit.
pub power: i8,

/// The si base of the unit.
pub si_prefix: SiPrefix,

/// The id of the unit.
pub unit_id: u16,
}

// TODO: Consider reducing the size of this struct while implementing the ULE.
/// Represents the SI prefix.
#[zerovec::make_ule(SiPrefixULE)]
#[derive(Copy, Clone, Debug, PartialOrd, Ord, PartialEq, Eq, Default)]
#[cfg_attr(
feature = "datagen",
derive(serde::Serialize, databake::Bake),
databake(path = icu_unitsconversion::provider),
)]
#[cfg_attr(feature = "serde", derive(serde::Deserialize))]
pub struct SiPrefix {
/// The absolute value of the power of the si prefix.
pub power: i8,

/// The base of the si prefix.
pub base: Base,
}
Loading