Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a source generator and end-to-end compile benchmarks #4124

Merged
merged 22 commits into from
Aug 13, 2024

Conversation

chandlerc
Copy link
Contributor

@chandlerc chandlerc commented Jul 11, 2024

The big addition here is a very, very rough and very early skeleton of a source code generator framework. This builds upon the lexers identifier synthesis logic, improving on its framework and wiring it up with the most rudimentary of source file generation. This is just enough to roughly replicate my "big API file" source code benchmarks.

The source generation works very hard to both vary the structure and content of the source as much as possible while ensuring the same total amount of each construct is in use, from bytes in identifiers to line breaks, parameters, etc. This lets us generate randomly structure inputs that should consistently take the exact same amount of total work to compile.

The complex identifier synthesis logic from the lexer's benchmark is moved over here and the lexer uses APIs in the source generator for identifiers. The other source synthesis in the lexer's benchmark isn't yet moved over, but should likely be slowly absorbed here as it can be refactored into a more principled and re-usable form. Some bits may stay of course if they're just too lexer-specific.

Next, this adds a simple end-to-end compile benchmark for the driver that directly and much more clearly reproduces all the measurements I've done manually up until now. It should also be easy to extend to more patterns over time as we add support to the source generator to produce those patterns.

Last but not least, I've added a tiny CLI to the source generator so that you can generate source code manually. This is especially nice for generating demo source code to actually run through the driver or look at in an editor. The CLI can also generate C++ source code which lets us do some minimal comparative benchmarking between Carbon and C++/Clang.

There are huge number of TODOs in the source generation framework. This is going to be a large ongoing effort I suspect.

There are also a bunch of rough edges I've left to try and get this out for review sooner. I've left TODOs for refactorings that really need to be done here, but hoping these can maybe be follow-ups. If not, please flag and I'll try to layer them on here.

Sample compile benchmark output, nicely showing where we are w.r.t. our goal speeds (2x behind on lex and check, 5x on parse) at least on a recent AMD server CPU:

------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations      Lines
------------------------------------------------------------------------------------------------------
BM_CompileAPIFileDenseDecls<Phase::Lex>/256           29420 ns        29419 ns        22860 6.62847M/s
BM_CompileAPIFileDenseDecls<Phase::Lex>/1024         146130 ns       146128 ns         4840 6.69959M/s
BM_CompileAPIFileDenseDecls<Phase::Lex>/4096         601584 ns       601577 ns         1020 6.69573M/s
BM_CompileAPIFileDenseDecls<Phase::Lex>/16384       2547578 ns      2547313 ns          280   6.404M/s
BM_CompileAPIFileDenseDecls<Phase::Lex>/65536      10816591 ns     10816389 ns           80 6.05193M/s
BM_CompileAPIFileDenseDecls<Phase::Lex>/262144     52191320 ns     52189828 ns           20 5.02261M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/256        101706 ns       101698 ns         6900 1.91745M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024       512161 ns       512162 ns         1380  1.9115M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096      2078426 ns      2078430 ns          340   1.938M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384     8795786 ns      8795583 ns          100 1.85468M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536    35073596 ns     35072973 ns           20 1.86639M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144  151100688 ns    151097370 ns           20 1.73483M/s
BM_CompileAPIFileDenseDecls<Phase::Check>/256        957059 ns       957049 ns          740 203.751k/s
BM_CompileAPIFileDenseDecls<Phase::Check>/1024      1956134 ns      1955985 ns          360 500.515k/s
BM_CompileAPIFileDenseDecls<Phase::Check>/4096      5797864 ns      5797417 ns          120 694.792k/s
BM_CompileAPIFileDenseDecls<Phase::Check>/16384    21219608 ns     21217584 ns           40 768.843k/s
BM_CompileAPIFileDenseDecls<Phase::Check>/65536    96311116 ns     96302334 ns           20 679.734k/s
BM_CompileAPIFileDenseDecls<Phase::Check>/262144  371637963 ns    371609964 ns           20 705.387k/s

Lest someone think this is bad, the fact that we're already within 2x of our rather audacious goals makes me quite happy. =D

The big addition here is a very, very rough and very early skeleton of
a source code generator framework. This builds upon the lexers
identifier synthesis logic, improving on its framework and wiring it up
with the most rudimentary of source file generation. This is just enough
to roughly replicate my "big API file" source code benchmarks.

The source generation works *very* hard to both vary the structure and
content of the source as much as possible while ensuring the same
*total* amount of each construct is in use, from bytes in identifiers to
line breaks, parameters, etc. This lets us generate randomly structure
inputs that should consistently take the exact same amount of total work
to compile.

The complex identifier synthesis logic from the lexer's benchmark is
moved over here and the lexer uses APIs in the source generator for
identifiers. The other source synthesis in the lexer's benchmark isn't
yet moved over, but should likely be slowly absorbed here as it can be
refactored into a more principled and re-usable form. Some bits may stay
of course if they're just too lexer-specific.

Next, this adds a simple end-to-end compile benchmark for the driver that
directly and much more clearly reproduces all the measurements I've
done manually up until now. It should also be easy to extend to more
patterns over time as we add support to the source generator to produce
those patterns.

Last but not least, I've added a tiny CLI to the source generator so
that you can generate source code manually. This is especially nice for
generating demo source code to actually run through the driver or look
at in an editor. The CLI can also generate C++ source code which lets us
do some minimal comparative benchmarking between Carbon and C++/Clang.

There are huge number of TODOs in the source generation framework. This
is going to be a large ongoing effort I suspect.

There are also a bunch of rough edges I've left to try and get this out
for review sooner. I've left TODOs for refactorings that really need to
be done here, but hoping these can maybe be follow-ups. If not, please
flag and I'll try to layer them on here.
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding more comments to this code? I'm seeing a lot that's not commented, and I think it'd be easier to review if there were more explanation of what the entities are.

toolchain/lex/tokenized_buffer_benchmark.cpp Outdated Show resolved Hide resolved
toolchain/driver/compile_benchmark_test.sh Show resolved Hide resolved
toolchain/driver/compile_benchmark.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Show resolved Hide resolved
testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Show resolved Hide resolved
testing/base/source_gen_test.cpp Show resolved Hide resolved
testing/base/source_gen.h Outdated Show resolved Hide resolved
@chandlerc
Copy link
Contributor Author

What do you think about adding more comments to this code? I'm seeing a lot that's not commented, and I think it'd be easier to review if there were more explanation of what the entities are.

I just forgot to go back to that item in my own TODOs before sending this out for review, mostly in trying to get it out sooner. Sorry about that. I'll make a pass through.

Copy link
Contributor Author

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got to ever comment, but let me know if I missed anything.

testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
testing/base/source_gen_test.cpp Show resolved Hide resolved
testing/base/source_gen.h Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer_benchmark.cpp Outdated Show resolved Hide resolved
toolchain/driver/compile_benchmark_test.sh Show resolved Hide resolved
testing/base/source_gen_test.cpp Show resolved Hide resolved
testing/base/source_gen_test.cpp Show resolved Hide resolved
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take some time to walk through the source_gen files and add comments? While I see public members in source_gen.h have comments, none of the private members do (including public members of the private UniqueIdPopper), and most static/constexpr entities in the cpp file don't. Not sure if you missed this because I didn't comment on the file specifically.

common/benchmark_main.h Outdated Show resolved Hide resolved
testing/base/source_gen.cpp Outdated Show resolved Hide resolved
chandlerc and others added 2 commits July 12, 2024 22:46
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
@chandlerc
Copy link
Contributor Author

Can you take some time to walk through the source_gen files and add comments? While I see public members in source_gen.h have comments, none of the private members do (including public members of the private UniqueIdPopper), and most static/constexpr entities in the cpp file don't. Not sure if you missed this because I didn't comment on the file specifically.

No, I think I just only looked at a few places rather than the whole file, sorry about that.

Some of the functions don't seem to need much in the way of comments, the function name and param names would just be repeated in a sentence I think... But a bunch definitely needed comments. There was a bunch of subtle stuff, etc... Anyways, everything that jumped out at me as needing a comment has one now I think, and I'm much more sure I actually went all the way through the file rather than just glancing at one area in the file and forgetting about the rest. If there are still things that would really benefit from a comment, it may be useful to understand a bit better what needs clarification at this point as I hopefully didn't miss anything for a third time.

I also spotted a few places where one comment you made applies more broadly and tried to update based on it, and cleaned up a few things that adding comments made me think better about.

Copy link
Contributor

@zygoloid zygoloid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonmeow asked me to take a look at this.

#include "common/init_llvm.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"

static bool after_main = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this name confusing -- I expected it to indicated whether main had completed. Maybe entered_main?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to re-work this and all the other things in this code, but this is copied directly from the unittest version, and there is a TODO to try to refactor and share the code.

It seems a bit weird to only fix it here when it's copied, but also seems weird to make this PR larger to fix in both places...

My preference is just to address comments here in the PR that factors things together, but lemme know what you'd prefer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to defer handling the comments in this file to another PR and deal with them at the same time as we improve the other file. Maybe for now just repeating the TODO in this file would help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +24 to +26
CARBON_CHECK(after_main)
<< "Must not query the executable path until after `main` is entered!";
return exe_path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, do we need after_main at all? Would it be correct to instead CHECK that !exe_path.is_empty() here, or do we need to cope with FindExecutablePath producing an empty string? (If we do need it, an optional<StringRef> would seem clearer to me.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines +35 to +37
std::string exe_path_storage = Carbon::FindExecutablePath(orig_argv[0]);
exe_path = exe_path_storage;
after_main = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we could also have problems if GetBenchmarkExePath is called after main returns. If we need to defend against calls before main, do we also need to defend against calls after main?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

testing/base/source_gen.cpp Show resolved Hide resolved
Comment on lines 59 to 60
// Blank line, comment line, and class open line.
double avg = 3.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems inconsistent to include a blank line and comment in the value returned by this function but not the values returned by the other two. A comment would be useful to explain which things are covered.

It might be nice if the number returned by EstimateAvgFooLines was the expected number of lines added by GenerateFoo below -- so include the comment but not the blank line. (For that to hold, we'd also want to rename this function to EstimateAvgClassDefLines.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and done for all of these (I think). This also corrects the estimation by not including a blank line for the first class.

testing/base/source_gen.h Outdated Show resolved Hide resolved
Comment on lines 63 to 64
// Check that uniform id length results in exact coverage of each possible
// length for an easy case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to also test the case with a remainder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

}

// Largely covered by `Ids`, but need to check for uniqueness specifically.
TEST(SourceGenTest, UniqueIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to test the length distribution and rounding behavior here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what test you would like for that? (And it seems like that should be in the non-unique test?)

I can copy and paste the distribution from the implementation into the test, and check that it matches, but that seemed like it would add a lot of fragility without much added value.

I could test small N-counts, but as you point out the behavior there isn't necessarily "good" or something that seems like we should hold as a thing that doesn't regress.

I've added a test that with a reasonably large N-count we get at least one identifier in each length bucket.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think I meant a non-unique test :)

One simple thing you could maybe test is that identifiers of length 4 are the most common length, which I think ought to hold for N >= 4 or so (with the adjustment I suggested earlier), and shouldn't be too much of a change detector test since I don't think we'd expect that property to change as we tune the distribution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to add something like this...

std::string source =
gen.GenAPIFileDenseDecls(1000, SourceGen::DenseDeclParams{});
// Should be within 10% of the requested line count.
EXPECT_THAT(source, Contains('\n').Times(AllOf(Ge(900), Le(1100))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to allow 10% each way? Given that we're forcing the sample distribution to exactly match the specified distribution (up to rounding various things to an integer), I think we should always get a lot closer than this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't seem especially important to be that close, and for small files, it is tricky. I was worried about being fragile. For example, the current logic doesn't end up within 1%. (979 lines is one that I just got.)

I've switched it to 5% -- could get to 1% with a larger file size, but curious what you think here?

testing/base/source_gen_test.cpp Outdated Show resolved Hide resolved
chandlerc and others added 3 commits July 23, 2024 22:11
Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by comment. Note I'm assuming zygoloid will finish the review, so I've marked my comments resolved without looking closely.

testing/base/source_gen.h Outdated Show resolved Hide resolved
github-merge-queue bot pushed a commit that referenced this pull request Jul 31, 2024
Move subtree sizes over to TreeAndSubtrees, using the different
structure to represent the additional parse work that occurs, as well as
making it clear which functions require the extra information. My intent
is to make it hard to use this by accident.

The subtree size is still tracked during Parse::Tree construction. I
think a lot of that can be cleaned up, although we use it during
placeholder assignment so it may take some work. I wanted to see what
people thought about this before taking action on such a change.

I'm using a 1m line source file generated by #4124 for testing. Command
is `time bazel-bin/toolchain/install/prefix_root/bin/carbon compile
--phase=check --dump-mem-usage ~/tmp/data.carbon`

At head, what I'm seeing is:

```
...
parse_tree_.node_impls_:
  used_bytes:      61516116
  reserved_bytes:  61516116
...
Total:
  used_bytes:      447814230
  reserved_bytes:  551663894
...
1.43s user 0.14s system 99% cpu 1.565 total
```

With `Tree::Verify` disabled completely, it looks like:
```
parse_tree_.node_impls_:
  used_bytes:      41010744
  reserved_bytes:  41010744
...
Total:
  used_bytes:      427308858
  reserved_bytes:  531158522
...
1.20s user 0.13s system 99% cpu 1.332 total
```

Re-enabling just the basic verification (what is now `Tree::Verify`),
I'm seeing maybe 0.05s slower, but that's within noise for my system. I
do see variability in my timing results, and overall I think this is a
0.2s +/- 0.1s improvement versus the earlier (always testing `Extract`
code) implementation. That's opt; debug builds will be unaffected,
because the same checking occurs as before.

Note, the subtree size is a third of the node representation, which is
why I'm showing the decrease in memory usage here.
Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Copy link
Contributor Author

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's been slow, I keep being interrupted working on this. =/

I had somewhat rushed to get it out for review because folks seemed to really want it even if in a rough shape. But it seems like that's not actually working well. I've suggested fixing forward for a lot of things here, but maybe that approach just isn't appealing? If so, maybe good to chat about that to figure out what makes sense.

As a concrete case -- some of the things with several comments are really pre-existing issues that are showing up again. If fixing forward doesn't make sense, then I'll need to work on fixing those in separate PRs and rebase this on top of them. Want to understand what folks would prefer as a strategy here.

Also, I still have to do the 'id' -> 'identifier' renaming. I think I got everything else, but that one I'll just need to spend some time going through and renaming. It makes sense, but figured I'd send out the other changes and ask the questions above first as its already been a while.

#include "common/init_llvm.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"

static bool after_main = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to re-work this and all the other things in this code, but this is copied directly from the unittest version, and there is a TODO to try to refactor and share the code.

It seems a bit weird to only fix it here when it's copied, but also seems weird to make this PR larger to fix in both places...

My preference is just to address comments here in the PR that factors things together, but lemme know what you'd prefer.

Comment on lines +24 to +26
CARBON_CHECK(after_main)
<< "Must not query the executable path until after `main` is entered!";
return exe_path;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines +35 to +37
std::string exe_path_storage = Carbon::FindExecutablePath(orig_argv[0]);
exe_path = exe_path_storage;
after_main = true;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

testing/base/source_gen.cpp Show resolved Hide resolved
Comment on lines 59 to 60
// Blank line, comment line, and class open line.
double avg = 3.0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and done for all of these (I think). This also corrects the estimation by not including a blank line for the first class.

testing/base/source_gen.cpp Show resolved Hide resolved
testing/base/source_gen.cpp Outdated Show resolved Hide resolved
Comment on lines 63 to 64
// Check that uniform id length results in exact coverage of each possible
// length for an easy case.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

}

// Largely covered by `Ids`, but need to check for uniqueness specifically.
TEST(SourceGenTest, UniqueIds) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what test you would like for that? (And it seems like that should be in the non-unique test?)

I can copy and paste the distribution from the implementation into the test, and check that it matches, but that seemed like it would add a lot of fragility without much added value.

I could test small N-counts, but as you point out the behavior there isn't necessarily "good" or something that seems like we should hold as a thing that doesn't regress.

I've added a test that with a reasonably large N-count we get at least one identifier in each length bucket.

std::string source =
gen.GenAPIFileDenseDecls(1000, SourceGen::DenseDeclParams{});
// Should be within 10% of the requested line count.
EXPECT_THAT(source, Contains('\n').Times(AllOf(Ge(900), Le(1100))));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't seem especially important to be that close, and for small files, it is tricky. I was worried about being fragile. For example, the current logic doesn't end up within 1%. (979 lines is one that I just got.)

I've switched it to 5% -- could get to 1% with a larger file size, but curious what you think here?

#include "common/init_llvm.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"

static bool after_main = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to defer handling the comments in this file to another PR and deal with them at the same time as we improve the other file. Maybe for now just repeating the TODO in this file would help.

testing/base/source_gen.cpp Show resolved Hide resolved
Comment on lines +426 to +431
int length_count = (number / count_sum) * scale;
if (number_rem > 0) {
int rem_adjustment = std::min(scale, number_rem);
length_count += rem_adjustment;
number_rem -= rem_adjustment;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that

Suggested change
int length_count = (number / count_sum) * scale;
if (number_rem > 0) {
int rem_adjustment = std::min(scale, number_rem);
length_count += rem_adjustment;
number_rem -= rem_adjustment;
}
int length_count = (scale * number) / count_sum;
if (number_rem > 0) {
length_count++;
number_rem--;
}

would give a better approximation of the distribution (with an error of at most one in each bucket's size, rather than at most IdLengthCounts[i]).

With the code as-is, if you ask for 120 identifiers with your heuristic distribution, then count_sum is something like 350, number / count_sum is 0, and you'll end up with 40 of size 1, 40 of size 2, and 40 of size 3, despite size 4 being the largest in the distribution. Whereas with the modified version, you should get around 14 of size 1, 2, and 3, and around 18 of size 4. (Ballpark numbers, I didn't actually work out the correct count_sum figure.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I tried to reply to the original form of this comment here:

#4124 (comment)

I re-checked, and the issue still seems to be there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't see your previous comment for Github Reasons.

Hm, yeah, I see the problem. It's not easy to predict the amount of rounding error in advance with my approach. We could do it in two passes, but yeah, probably doesn't matter enough to bother given we primarily expect to generate large examples. This is good enough for an initial implementation.

Comment on lines 70 to 95
// Check that uniform id length results in exact coverage of each possible
// length for an easy case, both without and with a remainder.
ids = gen.GetShuffledIds(100, /*min_length=*/10, /*max_length=*/19,
/*uniform=*/true);
EXPECT_THAT(ids, Contains(SizeIs(10)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(11)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(12)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(13)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(14)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(15)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(16)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(17)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(18)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(19)).Times(10));
ids = gen.GetShuffledIds(97, /*min_length=*/10, /*max_length=*/19,
/*uniform=*/true);
EXPECT_THAT(ids, Contains(SizeIs(10)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(11)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(12)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(13)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(14)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(15)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(16)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(17)).Times(9));
EXPECT_THAT(ids, Contains(SizeIs(18)).Times(9));
EXPECT_THAT(ids, Contains(SizeIs(19)).Times(9));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these tests for the uniform distribution in the right test function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I... think so?

There are two axes: unique vs. non-unique, and uniform distribution vs. non-uniform distribution.

I have a function split out for the uniqueness tests. I guess can split the two distributions into separate functions if useful?

}

// Largely covered by `Ids`, but need to check for uniqueness specifically.
TEST(SourceGenTest, UniqueIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think I meant a non-unique test :)

One simple thing you could maybe test is that identifiers of length 4 are the most common length, which I think ought to hold for N >= 4 or so (with the adjustment I suggested earlier), and shouldn't be too much of a change detector test since I don't think we'd expect that property to change as we tune the distribution.

Copy link
Contributor Author

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to do the great renaming, but a quick update with some of the other fixes.

testing/base/source_gen.cpp Show resolved Hide resolved
Comment on lines 70 to 95
// Check that uniform id length results in exact coverage of each possible
// length for an easy case, both without and with a remainder.
ids = gen.GetShuffledIds(100, /*min_length=*/10, /*max_length=*/19,
/*uniform=*/true);
EXPECT_THAT(ids, Contains(SizeIs(10)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(11)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(12)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(13)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(14)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(15)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(16)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(17)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(18)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(19)).Times(10));
ids = gen.GetShuffledIds(97, /*min_length=*/10, /*max_length=*/19,
/*uniform=*/true);
EXPECT_THAT(ids, Contains(SizeIs(10)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(11)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(12)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(13)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(14)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(15)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(16)).Times(10));
EXPECT_THAT(ids, Contains(SizeIs(17)).Times(9));
EXPECT_THAT(ids, Contains(SizeIs(18)).Times(9));
EXPECT_THAT(ids, Contains(SizeIs(19)).Times(9));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I... think so?

There are two axes: unique vs. non-unique, and uniform distribution vs. non-uniform distribution.

I have a function split out for the uniqueness tests. I guess can split the two distributions into separate functions if useful?

}

// Largely covered by `Ids`, but need to check for uniqueness specifically.
TEST(SourceGenTest, UniqueIds) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to add something like this...

@chandlerc
Copy link
Contributor Author

Ok, all the renaming of ID stuff is I think done. At least, my regex-ing in source_gen* seems to be clean.

@chandlerc chandlerc requested a review from jonmeow August 10, 2024 00:39
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Note, I'm trying to focus on the delta since zygoloid's last review. Feel free to merge, just a few small spelling questions.

template <typename AppendIds>
auto SourceGen::GetIdsImpl(int number, int min_length, int max_length,
bool uniform, AppendIds append_ids)
template <typename AppendFunc>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but I think we're generally using Fn instead of Func as an abbreviation, consistent with Carbon.

But, and sorry if I'm missing context here, but had you considered something like llvm::function_ref<int, int, llvm::SmallVector<llvm::StringRef>&>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on naming, but even better call on using function_ref, switched to that.

But using SmallVectorImpl to avoid encoding the small size (as above).

llvm::SmallVector<llvm::StringRef> idents =
GetIdentifiersImpl(number, min_length, max_length, uniform,
[this](int length, int length_count,
llvm::SmallVectorImpl<llvm::StringRef>& dest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why SmallVectorImpl instead of SmallVector? (actually, I see this in a few spots)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the idiom of passing a small vector reference for appending without encoding a specific small size.

Copy link
Contributor Author

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL!

template <typename AppendIds>
auto SourceGen::GetIdsImpl(int number, int min_length, int max_length,
bool uniform, AppendIds append_ids)
template <typename AppendFunc>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on naming, but even better call on using function_ref, switched to that.

But using SmallVectorImpl to avoid encoding the small size (as above).

llvm::SmallVector<llvm::StringRef> idents =
GetIdentifiersImpl(number, min_length, max_length, uniform,
[this](int length, int length_count,
llvm::SmallVectorImpl<llvm::StringRef>& dest) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the idiom of passing a small vector reference for appending without encoding a specific small size.

@jonmeow jonmeow enabled auto-merge August 13, 2024 19:28
@jonmeow jonmeow added this pull request to the merge queue Aug 13, 2024
Merged via the queue into carbon-language:trunk with commit a9c815c Aug 13, 2024
7 checks passed
brymer-meneses pushed a commit to brymer-meneses/carbon-lang that referenced this pull request Aug 15, 2024
…4174)

Move subtree sizes over to TreeAndSubtrees, using the different
structure to represent the additional parse work that occurs, as well as
making it clear which functions require the extra information. My intent
is to make it hard to use this by accident.

The subtree size is still tracked during Parse::Tree construction. I
think a lot of that can be cleaned up, although we use it during
placeholder assignment so it may take some work. I wanted to see what
people thought about this before taking action on such a change.

I'm using a 1m line source file generated by carbon-language#4124 for testing. Command
is `time bazel-bin/toolchain/install/prefix_root/bin/carbon compile
--phase=check --dump-mem-usage ~/tmp/data.carbon`

At head, what I'm seeing is:

```
...
parse_tree_.node_impls_:
  used_bytes:      61516116
  reserved_bytes:  61516116
...
Total:
  used_bytes:      447814230
  reserved_bytes:  551663894
...
1.43s user 0.14s system 99% cpu 1.565 total
```

With `Tree::Verify` disabled completely, it looks like:
```
parse_tree_.node_impls_:
  used_bytes:      41010744
  reserved_bytes:  41010744
...
Total:
  used_bytes:      427308858
  reserved_bytes:  531158522
...
1.20s user 0.13s system 99% cpu 1.332 total
```

Re-enabling just the basic verification (what is now `Tree::Verify`),
I'm seeing maybe 0.05s slower, but that's within noise for my system. I
do see variability in my timing results, and overall I think this is a
0.2s +/- 0.1s improvement versus the earlier (always testing `Extract`
code) implementation. That's opt; debug builds will be unaffected,
because the same checking occurs as before.

Note, the subtree size is a third of the node representation, which is
why I'm showing the decrease in memory usage here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants