-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trace Normalizer: add service + env tag normalization #106
Merged
Merged
Changes from all commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
ab26e7b
add service tag + env tag normalization + unit tests
thedavl 38d0c62
Update normalizer.rs
thedavl 4058d5c
lint
thedavl 2eb2940
remove .unwrap() outside of unit tests
thedavl 0ae59a7
Merge branch 'main' into david.lee/trace-normalize-service-name
thedavl 4236bd5
change fallback_service to take an &str
thedavl aca86be
modify fallback service
thedavl 263e106
lint
thedavl 659c309
remove more .unwrap()
thedavl 0e35bbd
add mountain of tag normalization unit tests
thedavl 12c80c2
address PR comments
thedavl 7c3a31d
modify comment wording
thedavl 924c6d3
remove let ... else
thedavl be94c7a
Merge branch 'main' into david.lee/trace-normalize-service-name
thedavl 8b1f91a
Update trace-normalization/src/normalize_utils.rs
thedavl 8be8691
address PR comments
thedavl 912f026
remove .peekable() in is_normalized_ascii_tag func
thedavl 0c8de51
Update normalize_utils.rs
thedavl File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,10 +3,17 @@ | |
// developed at Datadog (https://www.datadoghq.com/). Copyright 2023-Present | ||
// Datadog, Inc. | ||
|
||
// DEFAULT_SERVICE_NAME is the default name we assign a service if it's missing and we have no reasonable fallback | ||
const DEFAULT_SERVICE_NAME: &str = "unnamed-service"; | ||
|
||
// MAX_NAME_LEN the maximum length a name can have | ||
pub(crate) const MAX_NAME_LEN: usize = 100; | ||
// MAX_SERVICE_LEN the maximum length a service can have | ||
const MAX_SERVICE_LEN: usize = 100; | ||
// MAX_SERVICE_LEN the maximum length a tag can have | ||
const MAX_TAG_LEN: usize = 200; | ||
|
||
// TruncateUTF8 truncates the given string to make sure it uses less than limit bytes. | ||
// truncate_utf8 truncates the given string to make sure it uses less than limit bytes. | ||
// If the last character is a utf8 character that would be split, it removes it | ||
// entirely to make sure the resulting string is not broken. | ||
pub(crate) fn truncate_utf8(s: &str, limit: usize) -> &str { | ||
|
@@ -25,23 +32,132 @@ pub(crate) fn truncate_utf8(s: &str, limit: usize) -> &str { | |
s | ||
} | ||
|
||
// NormalizeService returns a span service or an error describing why normalization failed. | ||
// TODO: Implement this in a future PR | ||
// pub fn normalize_service(svc: String, lang: String) -> (String, Option<errors::NormalizeErrors>) { | ||
// if svc == "" { | ||
// return (fallback_service(lang), errors::NormalizeErrors::ErrorEmpty); | ||
// } | ||
// if svc.len() > MAX_SERVICE_LEN { | ||
// return (truncate_utf8(svc, MAX_SERVICE_LEN), errors::NormalizeErrors::ErrorTooLong.into()); | ||
// } | ||
// TODO: implement tag normalization | ||
// let s: String = normalize_tag(svc); | ||
// if s == "" { | ||
// return (fallbackService(lang), errors::NormalizeErrors::ErrorInvalid) | ||
// } | ||
// return (s, err) | ||
// (svc, None) | ||
// } | ||
// fallback_service returns the fallback service name for a service | ||
// belonging to language lang. | ||
// In the go agent implementation, if a lang was specified in TagStats | ||
// (extracted from the payload header) the fallback_service name would be "unnamed-{lang}-service". | ||
pub(crate) fn fallback_service() -> String { | ||
DEFAULT_SERVICE_NAME.to_string() | ||
} | ||
|
||
// normalize_service normalizes a span service | ||
pub(crate) fn normalize_service(svc: &str) -> anyhow::Result<String> { | ||
anyhow::ensure!(!svc.is_empty(), "Normalizer Error: Empty service name."); | ||
|
||
let truncated_service = truncate_utf8(svc, MAX_SERVICE_LEN); | ||
|
||
normalize_tag(truncated_service) | ||
} | ||
|
||
// normalize_tag applies some normalization to ensure the tags match the backend requirements. | ||
pub(crate) fn normalize_tag(tag: &str) -> anyhow::Result<String> { | ||
// Fast path: Check if the tag is valid and only contains ASCII characters, | ||
// if yes return it as-is right away. For most use-cases this reduces CPU usage. | ||
if is_normalized_ascii_tag(tag) { | ||
return Ok(tag.to_string()); | ||
} | ||
|
||
anyhow::ensure!(!tag.is_empty(), "Normalizer Error: Empty tag name."); | ||
|
||
// given a dummy value | ||
let mut last_char: char = 'a'; | ||
|
||
let mut result = String::with_capacity(tag.len()); | ||
|
||
for cur_char in tag.chars() { | ||
if result.len() == MAX_TAG_LEN { | ||
break; | ||
} | ||
if cur_char.is_lowercase() { | ||
result.push(cur_char); | ||
last_char = cur_char; | ||
continue; | ||
} | ||
if cur_char.is_uppercase() { | ||
let mut iter = cur_char.to_lowercase(); | ||
if let Some(c) = iter.next() { | ||
result.push(c); | ||
last_char = c; | ||
} | ||
continue; | ||
} | ||
if cur_char.is_alphabetic() { | ||
result.push(cur_char); | ||
last_char = cur_char; | ||
continue; | ||
} | ||
if cur_char == ':' { | ||
result.push(cur_char); | ||
last_char = cur_char; | ||
continue; | ||
} | ||
if !result.is_empty() | ||
&& (cur_char.is_ascii_digit() || cur_char == '.' || cur_char == '/' || cur_char == '-') | ||
{ | ||
result.push(cur_char); | ||
last_char = cur_char; | ||
continue; | ||
} | ||
if !result.is_empty() && last_char != '_' { | ||
result.push('_'); | ||
last_char = '_'; | ||
} | ||
} | ||
|
||
if last_char == '_' { | ||
result.remove(result.len() - 1); | ||
} | ||
|
||
Ok(result.to_string()) | ||
} | ||
|
||
pub(crate) fn is_normalized_ascii_tag(tag: &str) -> bool { | ||
if tag.is_empty() { | ||
return true; | ||
} | ||
if tag.len() > MAX_TAG_LEN { | ||
return false; | ||
} | ||
|
||
let mut tag_iter = tag.chars(); | ||
|
||
match tag_iter.next() { | ||
Some(c) => { | ||
if !is_valid_ascii_start_char(c) { | ||
return false; | ||
} | ||
} | ||
None => return false, | ||
} | ||
|
||
while let Some(cur_char) = tag_iter.next() { | ||
if is_valid_ascii_tag_char(cur_char) { | ||
continue; | ||
} | ||
if cur_char == '_' { | ||
// an underscore is only okay if followed by a valid non-underscore character | ||
match tag_iter.next() { | ||
Some(c) => { | ||
if !is_valid_ascii_tag_char(c) { | ||
return false; | ||
} | ||
} | ||
None => return false, | ||
}; | ||
} else { | ||
return false; | ||
} | ||
} | ||
true | ||
} | ||
|
||
pub(crate) fn is_valid_ascii_start_char(c: char) -> bool { | ||
('a'..='z').contains(&c) || c == ':' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same deal here, also, are only lower case characters valid as a start character? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, thats correct (corresponding function in the go agent normalizer) |
||
} | ||
|
||
pub(crate) fn is_valid_ascii_tag_char(c: char) -> bool { | ||
is_valid_ascii_start_char(c) || ('0'..='9').contains(&c) || c == '.' || c == '/' || c == '-' | ||
} | ||
|
||
// normalize_name normalizes a span name or an error describing why normalization failed. | ||
pub(crate) fn normalize_name(name: &str) -> anyhow::Result<String> { | ||
|
@@ -56,58 +172,6 @@ pub(crate) fn normalize_name(name: &str) -> anyhow::Result<String> { | |
normalize_metric_names(truncated_name) | ||
} | ||
|
||
// TODO: Implement this in a future PR | ||
// NormalizeTag applies some normalization to ensure the tags match the backend requirements. | ||
// pub fn normalize_tag(v: String) -> String { | ||
// Fast path: Check if the tag is valid and only contains ASCII characters, | ||
// if yes return it as-is right away. For most use-cases this reduces CPU usage. | ||
// if is_normalized_ascii_tag(v.clone()) { | ||
// return v; | ||
// } | ||
|
||
// if v.is_empty() { | ||
// return "".to_string(); | ||
// } | ||
|
||
// "".to_string() | ||
// } | ||
|
||
// pub fn is_normalized_ascii_tag(tag: String) -> bool { | ||
// if tag.is_empty() { | ||
// return true; | ||
// } | ||
// if tag.len() > MAX_TAG_LEN { | ||
// return false; | ||
// } | ||
// if !is_valid_ascii_start_char(tag.chars().next().unwrap()) { | ||
// return false; | ||
// } | ||
// for mut i in 0..tag.len() { | ||
// let b: char = tag.chars().nth(i).unwrap(); | ||
// if is_valid_ascii_tag_char(b) { | ||
// continue; | ||
// } | ||
// if b == '_' { | ||
// // an underscore is only okay if followed by a valid non-underscore character | ||
// i+=1; | ||
// if i == tag.len() || !is_valid_ascii_tag_char(tag.chars().nth(i).unwrap()) { | ||
// return false; | ||
// } | ||
// } else { | ||
// return false; | ||
// } | ||
// } | ||
// true | ||
// } | ||
|
||
// pub fn is_valid_ascii_start_char(c: char) -> bool { | ||
// ('a'..='z').contains(&c) || c == ':' | ||
// } | ||
|
||
// pub fn is_valid_ascii_tag_char(c: char) -> bool { | ||
// is_valid_ascii_start_char(c) || ('0'..='9').contains(&c) || c == '.' || c == '/' || c == '-' | ||
// } | ||
|
||
pub(crate) fn normalize_metric_names(name: &str) -> anyhow::Result<String> { | ||
let mut result = String::with_capacity(name.len()); | ||
|
||
|
@@ -190,4 +254,80 @@ mod tests { | |
} | ||
} | ||
} | ||
|
||
#[duplicate_item( | ||
test_name input expected expected_err; | ||
[test_normalize_empty_service] [""] [normalize_utils::DEFAULT_SERVICE_NAME] ["Normalizer Error: Empty service name."]; | ||
[test_normalize_valid_service] ["good"] ["good"] [""]; | ||
[test_normalize_long_service] ["Too$Long$.".repeat(20).as_str()] ["too_long_.".repeat(10)] [""]; | ||
[test_normalize_dash_service] ["bad&service"] ["bad_service"] [""]; | ||
)] | ||
#[test] | ||
fn test_name() { | ||
match normalize_utils::normalize_service(input) { | ||
Ok(val) => { | ||
assert_eq!(expected_err, ""); | ||
assert_eq!(val, expected) | ||
} | ||
Err(err) => { | ||
assert_eq!(format!("{err}"), expected_err); | ||
} | ||
} | ||
} | ||
#[duplicate_item( | ||
thedavl marked this conversation as resolved.
Show resolved
Hide resolved
|
||
test_name input expected expected_err; | ||
[test_normalize_tag_1] ["#test_starting_hash"] ["test_starting_hash"] [""]; | ||
[test_normalize_tag_2] ["TestCAPSandSuch"] ["testcapsandsuch"] [""]; | ||
[test_normalize_tag_3] ["Test Conversion Of Weird !@#$%^&**() Characters"] ["test_conversion_of_weird_characters"] [""]; | ||
[test_normalize_tag_4] ["$#weird_starting"] ["weird_starting"] [""]; | ||
[test_normalize_tag_5] ["allowed:c0l0ns"] ["allowed:c0l0ns"] [""]; | ||
[test_normalize_tag_6] ["1love"] ["love"] [""]; | ||
[test_normalize_tag_7] ["ünicöde"] ["ünicöde"] [""]; | ||
[test_normalize_tag_8] ["ünicöde:metäl"] ["ünicöde:metäl"] [""]; | ||
[test_normalize_tag_9] ["Data🐨dog🐶 繋がっ⛰てて"] ["data_dog_繋がっ_てて"] [""]; | ||
[test_normalize_tag_10] [" spaces "] ["spaces"] [""]; | ||
[test_normalize_tag_11] [" #hashtag!@#spaces #__<># "] ["hashtag_spaces"] [""]; | ||
[test_normalize_tag_12] [":testing"] [":testing"] [""]; | ||
[test_normalize_tag_13] ["_foo"] ["foo"] [""]; | ||
[test_normalize_tag_14] [":::test"] [":::test"] [""]; | ||
[test_normalize_tag_15] ["contiguous_____underscores"] ["contiguous_underscores"] [""]; | ||
[test_normalize_tag_16] ["foo_"] ["foo"] [""]; | ||
[test_normalize_tag_17] ["\u{017F}odd_\u{017F}case\u{017F}"] ["\u{017F}odd_\u{017F}case\u{017F}"] [""]; // edge-case | ||
[test_normalize_tag_18] [""] [""] [""]; | ||
[test_normalize_tag_19] [" "] [""] [""]; | ||
[test_normalize_tag_20] ["ok"] ["ok"] [""]; | ||
[test_normalize_tag_21] ["™Ö™Ö™™Ö™"] ["ö_ö_ö"] [""]; | ||
[test_normalize_tag_22] ["AlsO:ök"] ["also:ök"] [""]; | ||
[test_normalize_tag_23] [":still_ok"] [":still_ok"] [""]; | ||
[test_normalize_tag_24] ["___trim"] ["trim"] [""]; | ||
[test_normalize_tag_25] ["12.:trim@"] [":trim"] [""]; | ||
[test_normalize_tag_26] ["12.:trim@@"] [":trim"] [""]; | ||
[test_normalize_tag_27] ["fun:ky__tag/1"] ["fun:ky_tag/1"] [""]; | ||
[test_normalize_tag_28] ["fun:ky@tag/2"] ["fun:ky_tag/2"] [""]; | ||
[test_normalize_tag_29] ["fun:ky@@@tag/3"] ["fun:ky_tag/3"] [""]; | ||
[test_normalize_tag_30] ["tag:1/2.3"] ["tag:1/2.3"] [""]; | ||
[test_normalize_tag_31] ["---fun:k####y_ta@#g/1_@@#"]["fun:k_y_ta_g/1"] [""]; | ||
[test_normalize_tag_32] ["AlsO:œ#@ö))œk"] ["also:œ_ö_œk"] [""]; | ||
[test_normalize_tag_33] ["a".repeat(888).as_str()] ["a".repeat(200)] [""]; | ||
[test_normalize_tag_34] [("a".to_owned() + &"🐶".repeat(799)).as_str()] ["a"] [""]; | ||
[test_normalize_tag_35] [("a".to_string() + &char::REPLACEMENT_CHARACTER.to_string()).as_str()] ["a"] [""]; | ||
[test_normalize_tag_36] [("a".to_string() + &char::REPLACEMENT_CHARACTER.to_string() + &char::REPLACEMENT_CHARACTER.to_string()).as_str()] ["a"] [""]; | ||
[test_normalize_tag_37] [("a".to_string() + &char::REPLACEMENT_CHARACTER.to_string() + &char::REPLACEMENT_CHARACTER.to_string() + "b").as_str()] ["a_b"] [""]; | ||
[test_normalize_tag_38] | ||
["A00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 000000000000"] | ||
["a00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000_0"] | ||
[""]; | ||
)] | ||
#[test] | ||
fn test_name() { | ||
match normalize_utils::normalize_tag(input) { | ||
Ok(normalized_tag) => { | ||
assert_eq!(expected_err, ""); | ||
assert_eq!(normalized_tag, expected) | ||
} | ||
Err(err) => { | ||
assert_eq!(format!("{err}"), expected_err); | ||
} | ||
} | ||
} | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to think about here, is that is_alphabetic uses the unicode definition of alphabetic characters, so includes characters from all alphabets, not just ascii. Whereas this helper function used elsewhere is only checking for ascii alpha characters.
https://github.com/DataDog/libdatadog/blob/main/trace-normalization/src/normalize_utils.rs#L162
Our documentation only mentions alphanumerics:
What does the go trace agent do here? We should try to be consistent in which method we use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to mirror exactly what the go tag normalizer does, which:
isNormalizedASCIITag
as a fast path (here)Here is the go tag normalizer
The helper function you linked is used in
normalize_metric_names
, who's go counterpart normMetricNameParse also just uses the ASCII check