Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Match-on-metadata #6035

Merged
merged 17 commits into from
Feb 24, 2021
Merged

Optimize Match-on-metadata #6035

merged 17 commits into from
Feb 24, 2021

Conversation

Forgind
Copy link
Member

@Forgind Forgind commented Jan 14, 2021

Fixes AB#1261123 (feedback ticket)

Context

Checking metadata for matches was taking (#fragments in item) * (#fragments in item to remove) * (#metadata). For many users, this was extremely slow.

Changes Made

Put fragments into a custom MetadataSet object before checking for a match. This takes it from O(n * r * m) to O(m * (n + r)). Since m is normally small, this is a substantial improvement.

Testing

Time on a single operation went from ~180 seconds to ~150 ms for the penultimate version. It also passes @cdmihai's extensive unit tests.*

Notes

Is there any reason to keep the error for referencing multiple items? As a user, I would expect not-match-on-metadata to ignore metadata and match-on-metadata to ignore the item.
Is it ok to rely on IEnumerable returning element in a consistent order?
Should we keep returning no-match if all the metadata are empty on both items? It makes more sense to me to have that be a match, and it would simplify the code slightly. If it makes more sense to others, now would be the time to do it, since the code isn't that old.

* Except one on paths. Not sure why that's failing, since it's passing in VS, just not from the command line. Will look into it more.

@benvillalobos
Copy link
Member

I believe if you do AB/<feedback ticket number> then a tool that kevin pilch made will auto-link to the ticket

@@ -249,7 +249,18 @@
}
else
{
itemsToRemove = FindItemsUsingMatchOnMetadata(group, child, bucket, matchOnMetadata, matchingOptions);
ImmutableList<string> metadataList = matchOnMetadata.ToImmutableList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep this as a separate method to have ExecuteRemove read nicer by only contain top level intent and delegating more nitty gritty implementation details to helper methods. Like the other if branch.

StringComparer comparer = options == MatchOnMetadataOptions.CaseSensitive ? StringComparer.Ordinal : StringComparer.OrdinalIgnoreCase;
children = new Dictionary<string, MetadataSet<P, I>>(comparer);
normalize = options == MatchOnMetadataOptions.PathLike ? p => FileUtilities.NormalizePathForComparisonNoThrow(p, Environment.CurrentDirectory) : p => p;
foreach (ItemSpec<P, I>.ItemExpressionFragment frag in itemSpec.Fragments)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this type check, isn't ItemSpec.Fragments a list of the base class ItemSpecFragment? (List<ItemSpecFragment>)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. It might be casting it automatically when it sees that, but I'm not sure. I don't think we should care about other kinds of fragments, since I don't think they can have metadata, so I think this is ok, but that was a lot of "I think"s.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as whether this is actually a safe, I should have mentioned that the type is checked before the MetadataSet is constructed, so it should always be correct.

Copy link
Member

@ladipro ladipro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! I have left comments - mostly nits, but the logic in Contains appears to be incorrect, please add a test case exercising all code paths there.

@@ -14,6 +14,7 @@ class RemoveOperation : LazyItemOperation
{
readonly ImmutableList<string> _matchOnMetadata;
readonly MatchOnMetadataOptions _matchOnMetadataOptions;
private MetadataSet<P, I> metadataSet;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Naming convention, field name is missing the leading underscore.

{
if (metadataSet == null)
{
metadataSet = new MetadataSet<P, I>(_matchOnMetadataOptions, _matchOnMetadata, _itemSpec);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This lazy initialization of metadataSet is not thread safe. Is it guaranteed to always run only on one thread?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can put it in the constructor—shouldn't affect perf, assuming operations are always executed.

@@ -504,4 +459,78 @@ public GlobFragment(string textFragment, string projectDirectory)
&& TextFragment[2] == '*'
&& FileUtilities.IsAnySlash(TextFragment[3]);
}

internal class MetadataSet<P, I> where P : class, IProperty where I : class, IItem, IMetadataTable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit: The class can be sealed.

Comment on lines 465 to 466
private Dictionary<string, MetadataSet<P, I>> children;
Func<string, string> normalize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Names missing leading underscores, both fields can be readonly.

children = new Dictionary<string, MetadataSet<P, I>>(comparer);
}

// Relies on IEnumerable returning the metadata in a reasonable order. Reasonable?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is whether the matching IEnumerable<> is passed to the constructor and to Contains. It appears to be true although I wonder if there's a way to enforce it in the code.

Comment on lines 494 to 502
if (current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child))
{
current = child;
}
else
{
current.children.Add(normalizedString, new MetadataSet<P, I>(comparer));
current = current.children[normalizedString];
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Redundant dictionary lookup.

Suggested change
if (current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child))
{
current = child;
}
else
{
current.children.Add(normalizedString, new MetadataSet<P, I>(comparer));
current = current.children[normalizedString];
}
if (!current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child))
{
child = new MetadataSet<P, I>(comparer);
current.children.Add(normalizedString, child);
}
current = child;

internal bool Contains(IEnumerable<string> metadata)
{
bool nonEmptyFound = false;
MetadataSet<P, I> curr = this;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit: current to match the name used in Add?

@@ -249,7 +249,18 @@ private void ExecuteRemove(ProjectItemGroupTaskItemInstance child, ItemBucket bu
}
else
{
itemsToRemove = FindItemsUsingMatchOnMetadata(group, child, bucket, matchOnMetadata, matchingOptions);
ImmutableList<string> metadataList = matchOnMetadata.ToImmutableList();
Copy link
Member

@ladipro ladipro Jan 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the conversion to ImmutableList<>? You should be able to make the MetadataSet<,> constructor take IEnumerable<>.

{
if (!String.IsNullOrEmpty(m))
{
nonEmptyFound = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this return false? The previous code was checking if all metadata is defined and their value matches. I suggest adding test coverage for the case where m is empty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #4997:
if a metadata name doesn't exist in both items, then it's considered a match. Special case: if none of the metadata exist in both items, then it should not match.

But you're right—the code I'm replacing checks for string.Empty and automatically returns false. Given our conversation in the PR review meeting, I'll change to that. Thanks!

@@ -1988,63 +1988,6 @@ public void KeepWithItemReferenceOnNonmatchingMetadata()
items.ElementAt(3).GetMetadataValue("d").ShouldBe("d");
}

[Fact]
public void FailWithMatchingMultipleMetadata()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't delete the tests, instead replace the exception assertion with an assertion on I2's contents

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test implies you can match with multiple data now though. If we want to keep that, the FailWithMatchingMultipleMetadata test is actually invalid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to "RemoveWithMatchingMultipleMetdata." I did enable matching multiple.

.HelpKeyword.ShouldBe("MSBuild.OM_MatchOnMetadataIsRestrictedToOnlyOneReferencedItem");
Project project = ObjectModelHelpers.CreateInMemoryProject(content);
IEnumerable<ProjectItem> items = project.ItemsIgnoringCondition.Where(i => i.ItemType.Equals("I2"));
items.Count().ShouldBe(3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both of the tests, assert that the right items are left in I2.

@cdmihai
Copy link
Contributor

cdmihai commented Jan 21, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).


internal sealed class MetadataSet<P, I> where P : class, IProperty where I : class, IItem, IMetadataTable
{
private readonly Dictionary<string, MetadataSet<P, I>> _children;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use this convoluted data structure instead of dumping all metadata values from the Itemspec into a flat set of normalized metadata values?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not 100% sure I agree with the data structure, but I want to clarify my understanding from the PR review today.

A flat set of normalized metadata values would prevent us from being able to determine exactly which metadata values from which items were matched.

We'd need to do this to enable the logic in a test like RemoveWithMultipleItemReferenceOnMatchingMetadata.

<I3 Remove='@(I1);@(I2)' MatchOnMetadata='M1' />

This remove attribute effectively translates to:
Remove items in I3 where:

  • I1 OR I2 match on the following metadata:
    • M1

I believe this also enables something like:

<I3 Remove='@(I1);@(I2)' MatchOnMetadata='M1;M2' />

Where this remove attribute translates to:
Remove items in I3 where:

  • I1 OR I2 match on the following metadata:
    • M1 AND M2

@Forgind is this understanding correct? If so I think there should be a test for multiple items and multiple metadata.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote detailed comment for MetadataTrie class. Will push soon.

ProjectFileErrorUtilities.VerifyThrowInvalidProjectFile(
_matchOnMetadata.IsEmpty || _itemSpec.Fragments.All(f => f is ItemSpec<ProjectProperty, ProjectItem>.ItemExpressionFragment),
new BuildEventFileInfo(string.Empty),
"OM_MatchOnMetadataIsRestrictedToOnlyOneReferencedItem");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't forget to change the error message to reflect the relaxed validation.

Copy link
Member

@benvillalobos benvillalobos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep returning no-match if all the metadata are empty on both items?

I can theoretically see a use case for "Match on everything that has no metadata." Maybe we can use a special keyword/symbol here to represent that? Might be best to file that as an issue and see if that gets traction.

@@ -1988,63 +1988,6 @@ public void KeepWithItemReferenceOnNonmatchingMetadata()
items.ElementAt(3).GetMetadataValue("d").ShouldBe("d");
}

[Fact]
public void FailWithMatchingMultipleMetadata()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test implies you can match with multiple data now though. If we want to keep that, the FailWithMatchingMultipleMetadata test is actually invalid.

ExecuteTask(task, lookup);
ICollection<ProjectItemInstance> items = lookup.GetItems("I2");
items.Count().ShouldBe(3);
items.ElementAt(0).EvaluatedInclude.ShouldBe("a2");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So matching the names of metadata is case insensitive, but their values are not. The values being case sensitive makes sense. I remember us having a discussion on the topic in a linux issue.

Just thinking out loud.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MatchOnMetadata allows three options: case sensitive, case insensitive, and pathlike. The third is most similar to what you were thinking, I think, with defaulting to case insensitive regardless of OS.

Copy link
Member

@ladipro ladipro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for getting back to this after so many days.

Playing devil's advocate, imagine that instead of the Trie, we use a simple HashSet<string> where the keys are a concatenation of the respective metadata value with a delimiter & proper escaping to make sure that matching doesn't have false positives. What would be the advantages and disadvantages of this approach?

Using your example from the code comment, the HashSet would contain something like:

"eating|talking|sleeping"
"sleeping|eating|talking"
"talking|sleeping|eating"

so it wouldn't match the Forgind item with the "signature" of "sleeping|talking|eating".

/// Item Child has metadata GoodAt="sleeping" BadAt="eating" OkAt="talking"
/// Item Adolescent has metadata GoodAt="talking" BadAt="sleeping" OkAt="eating"
/// Specifying these three metadata:
/// Item Forgind with metadata GoodAt="sleeping" BadAt="talking" OkAt="eating"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

@Forgind
Copy link
Member Author

Forgind commented Jan 29, 2021

You clearly get the advantage (disadvantage? 😋) of a simpler data structure, but the logic of how to construct the key is more complicated, which partially balances it out. Time to hash a particular string is officially constant but in practice scales according to the length of the string (nlg(n) and assumes the string is short, if I remember correctly), so I don't think the total hashing time would be less. We would win on lookups in the dictionary, but that isn't really relevant. We would also have to factor in time-to-construct the string, which would also always create a new string unless there happens to be exactly one metadatum, and we put in an optimization to not add a character to the end of that. (The argument against an optimization is that we would have to find out that we don't need the extra character, which would either require counting the length of an enumerable (fast, but still theoretically O(n)) or creating something like a StringBuilder, adding it to that StringBuilder, and having a check that the SB has only had one thing added to it before stringifying it. Admittedly, that part should be better with StringTools, so this argument is pretty weak.)

The biggest disadvantage in my mind is with reusing strings, since even if we concatenated them all in one step, we'd still make a new string for each item, and if there are several metadata or the metadata are large (noting that these are metadata values, not metadata keys), it could end up consuming a lot of memory. This is a case in which StringTools would struggle because each of the metadata values, once concatenated to others, could be unique, in which case we'd make a large number of unique large strings. Even matching on two metadata, if the values are large, say, "description" and "path" (with long paths enabled), these could be hundreds or even thousands of characters. With a metadata trie, we don't make any new strings. With a single hash set, we would.

@ladipro
Copy link
Member

ladipro commented Feb 1, 2021

Thank you for the detailed analysis!

With a metadata trie, we don't make any new strings.

Actually we do if MatchOnMetadataOptions.PathLike is used, though I agree that usually the footprint of the Trie would be smaller than that of a flat set. As long as the Trie is not significantly less efficient in the worst case (all values distinct and path-normalized) I think that a slightly more complex datastructure is justified. Thanks again for taking care of this one!

@Forgind Forgind added the merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now. label Feb 1, 2021
@ladipro ladipro merged commit 30bbc5a into dotnet:master Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants