-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Match-on-metadata #6035
Conversation
Also moves construction to the constructor
I believe if you do |
@@ -249,7 +249,18 @@ | |||
} | |||
else | |||
{ | |||
itemsToRemove = FindItemsUsingMatchOnMetadata(group, child, bucket, matchOnMetadata, matchingOptions); | |||
ImmutableList<string> metadataList = matchOnMetadata.ToImmutableList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep this as a separate method to have ExecuteRemove read nicer by only contain top level intent and delegating more nitty gritty implementation details to helper methods. Like the other if branch.
StringComparer comparer = options == MatchOnMetadataOptions.CaseSensitive ? StringComparer.Ordinal : StringComparer.OrdinalIgnoreCase; | ||
children = new Dictionary<string, MetadataSet<P, I>>(comparer); | ||
normalize = options == MatchOnMetadataOptions.PathLike ? p => FileUtilities.NormalizePathForComparisonNoThrow(p, Environment.CurrentDirectory) : p => p; | ||
foreach (ItemSpec<P, I>.ItemExpressionFragment frag in itemSpec.Fragments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this type check, isn't ItemSpec.Fragments a list of the base class ItemSpecFragment? (List<ItemSpecFragment>
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. It might be casting it automatically when it sees that, but I'm not sure. I don't think we should care about other kinds of fragments, since I don't think they can have metadata, so I think this is ok, but that was a lot of "I think"s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as whether this is actually a safe, I should have mentioned that the type is checked before the MetadataSet is constructed, so it should always be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat! I have left comments - mostly nits, but the logic in Contains
appears to be incorrect, please add a test case exercising all code paths there.
@@ -14,6 +14,7 @@ class RemoveOperation : LazyItemOperation | |||
{ | |||
readonly ImmutableList<string> _matchOnMetadata; | |||
readonly MatchOnMetadataOptions _matchOnMetadataOptions; | |||
private MetadataSet<P, I> metadataSet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Naming convention, field name is missing the leading underscore.
{ | ||
if (metadataSet == null) | ||
{ | ||
metadataSet = new MetadataSet<P, I>(_matchOnMetadataOptions, _matchOnMetadata, _itemSpec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This lazy initialization of metadataSet
is not thread safe. Is it guaranteed to always run only on one thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can put it in the constructor—shouldn't affect perf, assuming operations are always executed.
src/Build/Evaluation/ItemSpec.cs
Outdated
@@ -504,4 +459,78 @@ public GlobFragment(string textFragment, string projectDirectory) | |||
&& TextFragment[2] == '*' | |||
&& FileUtilities.IsAnySlash(TextFragment[3]); | |||
} | |||
|
|||
internal class MetadataSet<P, I> where P : class, IProperty where I : class, IItem, IMetadataTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super-nit: The class can be sealed
.
src/Build/Evaluation/ItemSpec.cs
Outdated
private Dictionary<string, MetadataSet<P, I>> children; | ||
Func<string, string> normalize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Names missing leading underscores, both fields can be readonly
.
children = new Dictionary<string, MetadataSet<P, I>>(comparer); | ||
} | ||
|
||
// Relies on IEnumerable returning the metadata in a reasonable order. Reasonable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is whether the matching IEnumerable<>
is passed to the constructor and to Contains
. It appears to be true although I wonder if there's a way to enforce it in the code.
src/Build/Evaluation/ItemSpec.cs
Outdated
if (current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child)) | ||
{ | ||
current = child; | ||
} | ||
else | ||
{ | ||
current.children.Add(normalizedString, new MetadataSet<P, I>(comparer)); | ||
current = current.children[normalizedString]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Redundant dictionary lookup.
if (current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child)) | |
{ | |
current = child; | |
} | |
else | |
{ | |
current.children.Add(normalizedString, new MetadataSet<P, I>(comparer)); | |
current = current.children[normalizedString]; | |
} | |
if (!current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child)) | |
{ | |
child = new MetadataSet<P, I>(comparer); | |
current.children.Add(normalizedString, child); | |
} | |
current = child; |
src/Build/Evaluation/ItemSpec.cs
Outdated
internal bool Contains(IEnumerable<string> metadata) | ||
{ | ||
bool nonEmptyFound = false; | ||
MetadataSet<P, I> curr = this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super-nit: current
to match the name used in Add
?
@@ -249,7 +249,18 @@ private void ExecuteRemove(ProjectItemGroupTaskItemInstance child, ItemBucket bu | |||
} | |||
else | |||
{ | |||
itemsToRemove = FindItemsUsingMatchOnMetadata(group, child, bucket, matchOnMetadata, matchingOptions); | |||
ImmutableList<string> metadataList = matchOnMetadata.ToImmutableList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the conversion to ImmutableList<>
? You should be able to make the MetadataSet<,>
constructor take IEnumerable<>
.
src/Build/Evaluation/ItemSpec.cs
Outdated
{ | ||
if (!String.IsNullOrEmpty(m)) | ||
{ | ||
nonEmptyFound = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this return false
? The previous code was checking if all metadata is defined and their value matches. I suggest adding test coverage for the case where m
is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #4997:
if a metadata name doesn't exist in both items, then it's considered a match. Special case: if none of the metadata exist in both items, then it should not match.
But you're right—the code I'm replacing checks for string.Empty and automatically returns false. Given our conversation in the PR review meeting, I'll change to that. Thanks!
@@ -1988,63 +1988,6 @@ public void KeepWithItemReferenceOnNonmatchingMetadata() | |||
items.ElementAt(3).GetMetadataValue("d").ShouldBe("d"); | |||
} | |||
|
|||
[Fact] | |||
public void FailWithMatchingMultipleMetadata() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't delete the tests, instead replace the exception assertion with an assertion on I2's contents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new test implies you can match with multiple data now though. If we want to keep that, the FailWithMatchingMultipleMetadata
test is actually invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to "RemoveWithMatchingMultipleMetdata." I did enable matching multiple.
.HelpKeyword.ShouldBe("MSBuild.OM_MatchOnMetadataIsRestrictedToOnlyOneReferencedItem"); | ||
Project project = ObjectModelHelpers.CreateInMemoryProject(content); | ||
IEnumerable<ProjectItem> items = project.ItemsIgnoringCondition.Where(i => i.ItemType.Equals("I2")); | ||
items.Count().ShouldBe(3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For both of the tests, assert that the right items are left in I2.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
src/Build/Evaluation/ItemSpec.cs
Outdated
|
||
internal sealed class MetadataSet<P, I> where P : class, IProperty where I : class, IItem, IMetadataTable | ||
{ | ||
private readonly Dictionary<string, MetadataSet<P, I>> _children; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use this convoluted data structure instead of dumping all metadata values from the Itemspec into a flat set of normalized metadata values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not 100% sure I agree with the data structure, but I want to clarify my understanding from the PR review today.
A flat set of normalized metadata values would prevent us from being able to determine exactly which metadata values from which items were matched.
We'd need to do this to enable the logic in a test like RemoveWithMultipleItemReferenceOnMatchingMetadata
.
<I3 Remove='@(I1);@(I2)' MatchOnMetadata='M1' />
This remove attribute effectively translates to:
Remove items in I3 where:
- I1 OR I2 match on the following metadata:
- M1
I believe this also enables something like:
<I3 Remove='@(I1);@(I2)' MatchOnMetadata='M1;M2' />
Where this remove attribute translates to:
Remove items in I3 where:
- I1 OR I2 match on the following metadata:
- M1 AND M2
@Forgind is this understanding correct? If so I think there should be a test for multiple items and multiple metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote detailed comment for MetadataTrie class. Will push soon.
ProjectFileErrorUtilities.VerifyThrowInvalidProjectFile( | ||
_matchOnMetadata.IsEmpty || _itemSpec.Fragments.All(f => f is ItemSpec<ProjectProperty, ProjectItem>.ItemExpressionFragment), | ||
new BuildEventFileInfo(string.Empty), | ||
"OM_MatchOnMetadataIsRestrictedToOnlyOneReferencedItem"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't forget to change the error message to reflect the relaxed validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep returning no-match if all the metadata are empty on both items?
I can theoretically see a use case for "Match on everything that has no metadata." Maybe we can use a special keyword/symbol here to represent that? Might be best to file that as an issue and see if that gets traction.
@@ -1988,63 +1988,6 @@ public void KeepWithItemReferenceOnNonmatchingMetadata() | |||
items.ElementAt(3).GetMetadataValue("d").ShouldBe("d"); | |||
} | |||
|
|||
[Fact] | |||
public void FailWithMatchingMultipleMetadata() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new test implies you can match with multiple data now though. If we want to keep that, the FailWithMatchingMultipleMetadata
test is actually invalid.
ExecuteTask(task, lookup); | ||
ICollection<ProjectItemInstance> items = lookup.GetItems("I2"); | ||
items.Count().ShouldBe(3); | ||
items.ElementAt(0).EvaluatedInclude.ShouldBe("a2"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So matching the names of metadata is case insensitive, but their values are not. The values being case sensitive makes sense. I remember us having a discussion on the topic in a linux issue.
Just thinking out loud.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MatchOnMetadata allows three options: case sensitive, case insensitive, and pathlike. The third is most similar to what you were thinking, I think, with defaulting to case insensitive regardless of OS.
Because mono not up-to-date ;(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for getting back to this after so many days.
Playing devil's advocate, imagine that instead of the Trie, we use a simple HashSet<string>
where the keys are a concatenation of the respective metadata value with a delimiter & proper escaping to make sure that matching doesn't have false positives. What would be the advantages and disadvantages of this approach?
Using your example from the code comment, the HashSet would contain something like:
"eating|talking|sleeping"
"sleeping|eating|talking"
"talking|sleeping|eating"
so it wouldn't match the Forgind item with the "signature" of "sleeping|talking|eating"
.
/// Item Child has metadata GoodAt="sleeping" BadAt="eating" OkAt="talking" | ||
/// Item Adolescent has metadata GoodAt="talking" BadAt="sleeping" OkAt="eating" | ||
/// Specifying these three metadata: | ||
/// Item Forgind with metadata GoodAt="sleeping" BadAt="talking" OkAt="eating" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄
You clearly get the advantage (disadvantage? 😋) of a simpler data structure, but the logic of how to construct the key is more complicated, which partially balances it out. Time to hash a particular string is officially constant but in practice scales according to the length of the string (nlg(n) and assumes the string is short, if I remember correctly), so I don't think the total hashing time would be less. We would win on lookups in the dictionary, but that isn't really relevant. We would also have to factor in time-to-construct the string, which would also always create a new string unless there happens to be exactly one metadatum, and we put in an optimization to not add a character to the end of that. (The argument against an optimization is that we would have to find out that we don't need the extra character, which would either require counting the length of an enumerable (fast, but still theoretically O(n)) or creating something like a StringBuilder, adding it to that StringBuilder, and having a check that the SB has only had one thing added to it before stringifying it. Admittedly, that part should be better with StringTools, so this argument is pretty weak.) The biggest disadvantage in my mind is with reusing strings, since even if we concatenated them all in one step, we'd still make a new string for each item, and if there are several metadata or the metadata are large (noting that these are metadata values, not metadata keys), it could end up consuming a lot of memory. This is a case in which StringTools would struggle because each of the metadata values, once concatenated to others, could be unique, in which case we'd make a large number of unique large strings. Even matching on two metadata, if the values are large, say, "description" and "path" (with long paths enabled), these could be hundreds or even thousands of characters. With a metadata trie, we don't make any new strings. With a single hash set, we would. |
Thank you for the detailed analysis!
Actually we do if |
Fixes AB#1261123 (feedback ticket)
Context
Checking metadata for matches was taking (#fragments in item) * (#fragments in item to remove) * (#metadata). For many users, this was extremely slow.
Changes Made
Put fragments into a custom MetadataSet object before checking for a match. This takes it from O(n * r * m) to O(m * (n + r)). Since m is normally small, this is a substantial improvement.
Testing
Time on a single operation went from ~180 seconds to ~150 ms for the penultimate version. It also passes @cdmihai's extensive unit tests.*
Notes
Is there any reason to keep the error for referencing multiple items? As a user, I would expect not-match-on-metadata to ignore metadata and match-on-metadata to ignore the item.
Is it ok to rely on IEnumerable returning element in a consistent order?
Should we keep returning no-match if all the metadata are empty on both items? It makes more sense to me to have that be a match, and it would simplify the code slightly. If it makes more sense to others, now would be the time to do it, since the code isn't that old.
* Except one on paths. Not sure why that's failing, since it's passing in VS, just not from the command line. Will look into it more.