[Outlining] Remove overlapping sequences #7146

Open
wants to merge 10 commits into
base: main

ashleynh
Collaborator

While determining whether repeat sequences of instructions are candidates for outlining, remove sequences that overlap, giving weight to sequences that are longer and appear more frequently.

@ashleynh force-pushed the intervals branch 3 times, most recently from 1a589b6 to 0d961b8 (January 17, 2025 21:03)
@ashleynh changed the title from "[*WIP* - Outlining] Remove overlapping sequences" to "[Outlining] Remove overlapping sequences" (Jan 17, 2025)
@ashleynh requested a review from tlively (January 18, 2025 00:05)
@ashleynh marked this pull request as ready for review (January 18, 2025 00:05)
Comment on lines 36 to 38
bool operator<(const Interval& other) const {
  return start < other.start && weight < other.weight;
}
Member

This should take end into account as well. Otherwise the std::set<Interval> returned by IntervalProcessor::getOverlaps() will not be able to hold two intervals that differ only in their ends.

Collaborator Author

done
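A minimal sketch of the reviewer's point about `std::set`, not the PR's code. The `Interval` struct and both comparators here are hypothetical stand-ins. `IgnoresEnd` orders by `start` and `weight` only, so two intervals that differ only in `end` count as equivalent and the set keeps just one of them.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <tuple>

// Hypothetical stand-in for the PR's Interval.
struct Interval {
  unsigned start;
  unsigned end;
  unsigned weight;
};

// Orders by start and weight only; `end` is ignored.
struct IgnoresEnd {
  bool operator()(const Interval& a, const Interval& b) const {
    return std::tie(a.start, a.weight) < std::tie(b.start, b.weight);
  }
};

// Orders lexicographically by every field, including `end`.
struct ComparesAll {
  bool operator()(const Interval& a, const Interval& b) const {
    return std::tie(a.start, a.end, a.weight) <
           std::tie(b.start, b.end, b.weight);
  }
};

// Inserts two intervals that differ only in `end` and reports how many
// elements the set ends up holding.
template<typename Compare> std::size_t countDistinct() {
  std::set<Interval, Compare> s;
  s.insert(Interval{0, 4, 2});
  s.insert(Interval{0, 8, 2});
  return s.size();
}

// Self-check: the comparator that ignores `end` loses one interval.
inline void checkSetSketch() {
  assert(countDistinct<IgnoresEnd>() == 1);
  assert(countDistinct<ComparesAll>() == 2);
}
```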


std::set<Interval>
IntervalProcessor::getOverlaps(std::vector<Interval>& intervals) {
std::sort(intervals.begin(), intervals.end(), [](Interval a, Interval b) {
Member

Suggested change
- std::sort(intervals.begin(), intervals.end(), [](Interval a, Interval b) {
+ std::sort(intervals.begin(), intervals.end(), [](const Interval& a, const Interval& b) {

Just to avoid copying intervals around unnecessarily.

Collaborator Author

done

});

std::set<Interval> overlaps;
auto& firstInterval = intervals[0];
Member

There should be an early return if the input vector is empty to avoid UB here.

Collaborator Author

done

for (auto startIdx : substring.StartIndices) {
  auto interval =
    Interval(startIdx,
             startIdx + substring.Length - 1,
Member

I'm surprised we're using intervals inclusive of their ends. Would this work without the - 1 as well?

Collaborator Author

yes, done
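For illustration, here is a sketch of what dropping the `- 1` implies, assuming intervals are treated as half-open `[start, end)`. The `Interval` struct and the helpers `makeInterval` and `overlaps` are hypothetical names, not taken from the PR. With half-open intervals, an occurrence that ends at index 4 and one that starts at index 4 are adjacent rather than overlapping.

```cpp
#include <cassert>

// Hypothetical stand-in for the PR's Interval, read as half-open [start, end).
struct Interval {
  unsigned start;
  unsigned end;
  unsigned weight;
};

// Builds the half-open interval for an occurrence of `length` items that
// begins at `startIdx`. No `- 1` is needed because `end` is exclusive.
inline Interval makeInterval(unsigned startIdx, unsigned length, unsigned weight) {
  return Interval{startIdx, startIdx + length, weight};
}

// Two half-open intervals overlap when each starts before the other ends.
inline bool overlaps(const Interval& a, const Interval& b) {
  return a.start < b.end && b.start < a.end;
}

// Self-check: adjacent intervals do not overlap, intersecting ones do.
inline void checkHalfOpenSketch() {
  assert(!overlaps(makeInterval(0, 4, 2), makeInterval(4, 4, 2)));
  assert(overlaps(makeInterval(0, 4, 2), makeInterval(3, 2, 2)));
}
```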

auto interval =
  Interval(startIdx,
           startIdx + substring.Length - 1,
           substring.Length * substring.StartIndices.size());
Member

Probably worth a comment about why we are using this weight.

Collaborator Author

done
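As a small worked example of the weight in the snippet above (`Length` times the number of start indices): one plausible reading is that it counts the items covered across all occurrences of a substring, which would favor longer and more frequent sequences. `RepeatSubstring` and `weightOf` below are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the repeat-substring record in the PR.
struct RepeatSubstring {
  unsigned Length;
  std::vector<unsigned> StartIndices;
};

// Weight = substring length multiplied by its number of occurrences.
inline unsigned weightOf(const RepeatSubstring& s) {
  return s.Length * static_cast<unsigned>(s.StartIndices.size());
}

// Self-check: a length-4 substring seen twice and a length-2 substring seen
// four times end up with the same weight.
inline void checkWeightSketch() {
  RepeatSubstring longer{4, {0, 4}};
  RepeatSubstring shorter{2, {1, 5, 8, 10}};
  assert(weightOf(longer) == 8);
  assert(weightOf(shorter) == 8);
}
```

Note that equal weights are possible, so a tie-breaking rule may still be needed.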

Comment on lines 170 to 174
std::set<Interval> overlaps = IntervalProcessor::getOverlaps(intervals);
std::set<unsigned> doNotInclude;
for (auto& interval : overlaps) {
  doNotInclude.insert(intervalMap[interval]);
}
Member

We could simplify the code here and get away without any map or set lookups if IntervalProcessor returned a sequence of kept indices in its input vector rather than a set of removed intervals. With a sequence of kept indices, we could directly construct the list of kept substrings.

Collaborator Author

great idea, thanks!
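The following sketch illustrates the "return kept indices" shape suggested above. It is an illustrative greedy pass, not the PR's implementation. `keptIndicesSketch` is a hypothetical name, and the sketch assumes half-open intervals. It sorts indices by `start`, keeps the heavier interval of each overlapping pair, and returns indices into the input vector so the caller needs no map or set lookups. It also returns early on empty input so that element 0 is never read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the PR's Interval, read as half-open [start, end).
struct Interval {
  unsigned start;
  unsigned end;
  unsigned weight;
};

// Returns indices into `intervals` of a subset in which no two intervals
// overlap. When two intervals overlap, the heavier one is kept.
inline std::vector<int> keptIndicesSketch(const std::vector<Interval>& intervals) {
  if (intervals.empty()) {
    return {}; // Nothing to process; avoid reading element 0.
  }
  // Sort indices instead of intervals so results refer to the input order.
  std::vector<int> order(intervals.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
    return intervals[a].start < intervals[b].start;
  });

  std::vector<int> kept;
  int former = order[0];
  for (std::size_t i = 1; i < order.size(); i++) {
    int latter = order[i];
    if (intervals[latter].start >= intervals[former].end) {
      // No overlap: the former interval is final.
      kept.push_back(former);
      former = latter;
    } else if (intervals[latter].weight > intervals[former].weight) {
      // Overlap: continue with the heavier interval.
      former = latter;
    }
  }
  kept.push_back(former);
  return kept;
}

// Self-check covering empty input, adjacent intervals, and one overlap.
inline void checkKeptIndicesSketch() {
  assert(keptIndicesSketch({}).empty());
  std::vector<int> both{0, 1};
  assert(keptIndicesSketch({{0, 4, 2}, {4, 8, 2}}) == both);
  std::vector<int> heavier{1, 2};
  assert(keptIndicesSketch({{0, 4, 1}, {2, 6, 5}, {6, 8, 1}}) == heavier);
}
```

A caller holding a parallel vector of substring indices could then walk the returned indices and build the list of kept substrings directly.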

@@ -1006,3 +1006,57 @@
(loop (nop))
)
)

;; Test that no attempt is made to outline overlapping repeat substrings
Member

It would be good to add comments about what the overlapping substrings are.

Comment on lines 1013 to 1016
(drop (i32.add
  (i32.const 0)
  (i32.const 1)
))
Member

We could make the test more concise and easy to read by just using constants and drops, unless I'm missing some reason why this wouldn't work.

(drop (i32.const 0))
(drop (i32.const 1))
(drop (i32.const 2))
(drop (i32.const 3))
(drop (i32.const 0))
(drop (i32.const 1))
(drop (i32.const 2))
(drop (i32.const 3))
(drop (i32.const 1))
(drop (i32.const 2))
(drop (i32.const 1))
(drop (i32.const 2))

Collaborator Author

good call, thanks

};

struct IntervalProcessor {
static std::set<Interval> getOverlaps(std::vector<Interval>&);
Collaborator Author
@ashleynh Jan 23, 2025

add gTests for edge cases

Collaborator Author

done

Member
@tlively left a comment

It would still be good to include gtest unit tests for the various kinds of overlaps.

for (Index i = 0; i < substrings.size(); i++) {
  auto substring = substrings[i];
  for (auto startIdx : substring.StartIndices) {
    // TODO: This weight was picked with an assumption
Member

What assumption?

Collaborator Author

done

@@ -280,6 +280,8 @@ struct Outlining : public Pass {
DBG(printHashString(stringify.hashString, stringify.exprs));
// Remove substrings that are substrings of longer repeat substrings.
substrings = StringifyProcessor::dedupe(substrings);
// Remove substrings with overlapping indices
Member

Suggested change
- // Remove substrings with overlapping indices
+ // Remove substrings with overlapping indices.

Collaborator Author

done

Comment on lines +153 to +154
std::vector<Interval> intervals;
std::vector<int> substringIdxs;
Member

It would be good to have a comment saying how these two vectors relate to each other.

Comment on lines 179 to 182
if (substringsIncluded.find(substringIdx) != substringsIncluded.end()) {
continue;
}
substringsIncluded.insert(substringIdx);
Member

Suggested change
- if (substringsIncluded.find(substringIdx) != substringsIncluded.end()) {
-   continue;
- }
- substringsIncluded.insert(substringIdx);
+ if (!substringsIncluded.insert(substringIdx).second) {
+   continue;
+ }

Collaborator Author

done
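A short sketch of the idiom in the suggestion above. `std::set::insert` returns a `std::pair<iterator, bool>`, and the `bool` is `false` when the value was already present, so one call both tests for membership and records the value. `firstOccurrences` is a hypothetical helper written only for this illustration.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Returns the input values with duplicates dropped, keeping first-seen order.
inline std::vector<unsigned> firstOccurrences(const std::vector<unsigned>& idxs) {
  std::set<unsigned> seen;
  std::vector<unsigned> out;
  for (unsigned idx : idxs) {
    // insert() yields {iterator, inserted}; inserted is false for repeats.
    if (!seen.insert(idx).second) {
      continue;
    }
    out.push_back(idx);
  }
  return out;
}

// Self-check: repeated indices are skipped without a separate find() call.
inline void checkInsertSketch() {
  std::vector<unsigned> expected{2, 0, 1};
  assert(firstOccurrences({2, 0, 2, 1, 0}) == expected);
}
```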

};

struct IntervalProcessor {
// TODO: Given a vector of Interval, returns a vector of the indices, mapping
Member

What's the TODO here?

Collaborator Author

it was for me to review the comment before submitting

;; CHECK-NEXT: (i32.add
;; CHECK-NEXT: (i32.const 0)
;; CHECK-NEXT: (i32.const 1)
;; CHECK-NEXT: (i32.sub
Collaborator Author

This test changed because the order in which the outlined functions are created changed. This is possible now because removeOverlaps() in hash-stringify-walker waits to add a substring to the result vector until every interval for that repeat substring has been seen. For our purposes, the order of the substrings does not matter, because we create an OutliningSequence to represent each substring and sort those by idx (line 373 in Outlining.cpp).

Comment on lines 184 to 186
if (seenCount[substringIdx] == substring.StartIndices.size() &&
    substringsIncluded.insert(substringIdx).second) {
  result.push_back(substring);
Member

So we're only considering a substring for outlining at all if all of its occurrences survive overlap filtering? Could we keep the substring in consideration and just remove the particular occurrence of it that had the overlap instead?

Collaborator Author

done
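One way the per-occurrence idea could look, as a sketch under stated assumptions rather than the PR's code. Interval `i` is assumed to belong to substring `substringIdxs[i]` and to start at `starts[i]` (two parallel vectors), and `keptIntervals` holds the interval indices that survived overlap filtering. The names `RepeatSubstring` and `keepSurvivingOccurrences` are hypothetical. Dropping a substring that is left with fewer than two occurrences is also an assumption of this sketch, on the grounds that a single occurrence is no longer a repeat.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the repeat-substring record in the PR.
struct RepeatSubstring {
  unsigned Length;
  std::vector<unsigned> StartIndices;
};

// Rebuilds each substring from only the occurrences whose intervals survived.
inline std::vector<RepeatSubstring> keepSurvivingOccurrences(
  const std::vector<RepeatSubstring>& substrings,
  const std::vector<unsigned>& starts,   // starts[i]: start of interval i
  const std::vector<int>& substringIdxs, // substringIdxs[i]: owner of interval i
  const std::vector<int>& keptIntervals) {
  std::vector<std::vector<unsigned>> surviving(substrings.size());
  for (int i : keptIntervals) {
    surviving[substringIdxs[i]].push_back(starts[i]);
  }
  std::vector<RepeatSubstring> result;
  for (std::size_t s = 0; s < substrings.size(); s++) {
    if (surviving[s].size() < 2) {
      continue; // Assumption: fewer than two occurrences is not a repeat.
    }
    result.push_back({substrings[s].Length, surviving[s]});
  }
  return result;
}

// Self-check: the second substring loses two occurrences but is still kept.
inline void checkSurvivingSketch() {
  std::vector<RepeatSubstring> substrings{{4, {0, 4}}, {2, {1, 5, 8, 10}}};
  std::vector<unsigned> starts{0, 4, 1, 5, 8, 10};
  std::vector<int> substringIdxs{0, 0, 1, 1, 1, 1};
  std::vector<int> kept{0, 1, 4, 5};
  auto result = keepSurvivingOccurrences(substrings, starts, substringIdxs, kept);
  std::vector<unsigned> expected{8, 10};
  assert(result.size() == 2);
  assert(result[1].StartIndices == expected);
}
```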

@@ -36,26 +37,24 @@ IntervalProcessor::filterOverlaps(std::vector<Interval>& intervals) {

  std::sort(
    intIntervals.begin(), intIntervals.end(), [](const auto& a, const auto& b) {
-     return a.first.start < b.first.end;
+     return a.first.start < b.first.start;
Member

No need for the lambda here if you fix operator< to be a total order (meaning that for any pair of intervals a and b, exactly one of a < b, b < a, or a == b is true).
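To make the total-order remark concrete, here is a sketch with a hypothetical `Interval` that compares all fields lexicographically via `std::tie`. The `conjunctionLess` helper reproduces the `&&` form from the earlier revision purely for comparison. For the pair below, neither `conjunctionLess(a, b)` nor `conjunctionLess(b, a)` holds even though `a != b`, whereas the lexicographic `operator<` orders them and lets `std::sort` run without a lambda.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the PR's Interval with a lexicographic order.
struct Interval {
  unsigned start;
  unsigned end;
  unsigned weight;

  bool operator<(const Interval& o) const {
    return std::tie(start, end, weight) < std::tie(o.start, o.end, o.weight);
  }
  bool operator==(const Interval& o) const {
    return std::tie(start, end, weight) == std::tie(o.start, o.end, o.weight);
  }
};

// The `&&` comparison from the earlier revision, kept only for contrast.
inline bool conjunctionLess(const Interval& a, const Interval& b) {
  return a.start < b.start && a.weight < b.weight;
}

// Sorts with the default operator<; no lambda is required.
inline std::vector<Interval> sortedCopy(std::vector<Interval> v) {
  std::sort(v.begin(), v.end());
  return v;
}

// Self-check: the `&&` form leaves a and b unordered; operator< does not.
inline void checkOrderSketch() {
  Interval a{0, 4, 5};
  Interval b{2, 6, 1};
  assert(!conjunctionLess(a, b) && !conjunctionLess(b, a) && !(a == b));
  assert(a < b && !(b < a));
  std::vector<Interval> sorted = sortedCopy({b, a});
  assert(sorted.front() == a);
}
```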

});

  std::vector<int> result;
- auto& firstInterval = intIntervals[0];
+ auto& formerInterval = intIntervals[0];
Member

Making this a reference means that when you do formerInterval = latterInterval below, it writes to the first element in intIntervals, which is a little odd. Intervals should be small enough that copying them is cheap, so let's just make this a non-reference. Alternatively, to avoid copying intervals, you could make this an index.

Collaborator Author

done
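The reference behavior described above can be reproduced in isolation. In this sketch, which uses plain `int` elements and hypothetical function names, assigning through a reference to `v[0]` overwrites the first element, while assigning to a copy leaves the vector untouched.

```cpp
#include <cassert>
#include <vector>

// `former` is a reference, so the assignment writes into v[0].
// Requires v.size() >= 2.
inline int firstAfterReferenceAssign(std::vector<int> v) {
  int& former = v[0];
  former = v[1];
  return v[0];
}

// `former` is a copy, so the assignment changes only the local variable.
// Requires v.size() >= 2.
inline int firstAfterCopyAssign(std::vector<int> v) {
  int former = v[0];
  former = v[1];
  (void)former; // Silence unused-value warnings; only v[0] matters here.
  return v[0];
}

// Self-check: only the reference version modifies the first element.
inline void checkReferenceSketch() {
  assert(firstAfterReferenceAssign({10, 20}) == 20);
  assert(firstAfterCopyAssign({10, 20}) == 10);
}
```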

firstInterval = nextInterval;
} else {
result.push_back(firstInterval.second);
if (latterInterval.first.weight > formerInterval.first.weight) {
Member

If the weights are equal, perhaps you can choose to keep the interval with the nearest end to reduce its potential to overlap with subsequent intervals.
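A sketch of the tie-breaking rule proposed above, assuming the two intervals passed in are already known to overlap. `preferLatter` is a hypothetical helper, not a function from the PR. The heavier interval wins, and on equal weight the interval that ends sooner wins, since it leaves more room for later intervals.

```cpp
#include <cassert>

// Hypothetical stand-in for the PR's Interval.
struct Interval {
  unsigned start;
  unsigned end;
  unsigned weight;
};

// Returns true if `latter` should replace `former` when the two overlap.
inline bool preferLatter(const Interval& former, const Interval& latter) {
  if (latter.weight != former.weight) {
    return latter.weight > former.weight;
  }
  // Equal weights: prefer the interval with the nearer end.
  return latter.end < former.end;
}

// Self-check: weight decides first, then the nearer end breaks ties.
inline void checkTieBreakSketch() {
  assert(preferLatter(Interval{0, 10, 2}, Interval{2, 5, 2}));
  assert(!preferLatter(Interval{0, 4, 2}, Interval{2, 9, 2}));
  assert(preferLatter(Interval{0, 4, 1}, Interval{2, 9, 3}));
}
```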

- // back to the original input vector, of non-overlapping indices, ie, the
- // intervals that overlap have already been removed.
+ // Given a vector of Interval, returns a vector of the indices that, mapping
+ // back to the original input vector, do not overlap with each other, ie: the
Member

Suggested change
- // back to the original input vector, do not overlap with each other, ie: the
+ // back to the original input vector, do not overlap with each other, i.e. the

Collaborator Author

done

std::vector<Interval> intervals;
intervals.emplace_back(Interval{0, 4, 2});
intervals.emplace_back(Interval{4, 8, 2});
ASSERT_EQ(IntervalProcessor::filterOverlaps(intervals).size(), 2u);
Member

It would be better to test the precise results rather than just the size of the results. You can still do it with a single ASSERT_EQ:

std::vector<int> expected{0, 1};
ASSERT_EQ(IntervalProcessor::filterOverlaps(intervals), expected);

ASSERT_EQ(IntervalProcessor::filterOverlaps(intervals).size(), 2u);
}

TEST(IntervalsTest, TestOverlapFound) {
Member

It would be good to add tests for different kinds of overlaps, different input orders, different weights. There are a lot of interesting combinations!


struct IntervalProcessor {
// Given a vector of Interval, returns a vector of the indices that, mapping
// back to the original input vector, do not overlap with each other, ie: the
Collaborator Author

adjust punctuation around ie to , i.e.,

Collaborator Author

done
