Optimize scan with a lookup table to reduce branching #59106
base: main
Conversation
It's probably faster to actually check the common case of spaces, tabs, and newlines, but the current draft always does the lookup. @typescript-bot perf test this
@DanielRosenwasser Here they are: tsc, tsserver, and startup comparison reports (baseline..pr).
@typescript-bot perf test this
@DanielRosenwasser Here they are: tsc, tsserver, and startup comparison reports (baseline..pr).
@typescript-bot perf test this
@DanielRosenwasser Here they are: tsc, tsserver, and startup comparison reports (baseline..pr).
I'm actually not sure if that's much better or not, it could just be a better run. @typescript-bot pack this
Hey @DanielRosenwasser, I've packed this into an installable tgz. You can install it for testing by referencing it in your package.json and then running npm install. There is also a playground for this build and an npm module you can use via …
@typescript-bot pack this
@typescript-bot test top400
Hey @DanielRosenwasser, I've packed this into an installable tgz. You can install it for testing by referencing it in your package.json and then running npm install. There is also a playground for this build and an npm module you can use via …
@DanielRosenwasser Here they are: tsc, tsserver, and startup comparison reports (baseline..pr).
@DanielRosenwasser Here are the results of running the top 400 repos with tsc. Everything looks good!
@dragomirtitian maybe this balances out the parse-time hit from #58928. 😄
27392e7 to 33ae58d (Compare)
@typescript-bot perf test this
@DanielRosenwasser Here they are: tsc, tsserver, and startup comparison reports (baseline..pr).
}

const tokenCategoryLookup: TokenCategory[] = [];
const tokenCategoryLookupUncommon = new Map<CharacterCodes, TokenCategory>();
How does map lookup performance and memory usage compare to branching ifs with comparison operators?
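(For reference, a minimal sketch of the two shapes being compared here; the enum values and helper names are illustrative stand-ins, not the PR's actual code.)

// Branching with comparison operators: a handful of very predictable compares.
const enum CharacterCodes { tab = 0x09, lineFeed = 0x0A, carriageReturn = 0x0D, space = 0x20 }

function isCommonTriviaBranching(ch: number): boolean {
    return ch === CharacterCodes.space
        || ch === CharacterCodes.tab
        || ch === CharacterCodes.carriageReturn
        || ch === CharacterCodes.lineFeed;
}

// Map lookup: one hashed get regardless of the character, but it pays the
// Map overhead on every call and keeps the table alive in memory.
const triviaLookup = new Map<number, true>([
    [CharacterCodes.space, true],
    [CharacterCodes.tab, true],
    [CharacterCodes.carriageReturn, true],
    [CharacterCodes.lineFeed, true],
]);

function isCommonTriviaMap(ch: number): boolean {
    return triviaLookup.get(ch) === true;
}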
[CharacterCodes.replacementCharacter, TokenCategory.RecognizedMisc],
]
) {
if (key < tokenCategoryLookup.length) {
Why do this in a branch? The uncommon lookups could just be passed to the Map constructor.
Can you be more specific? In the first two commits, I split this out from a single map into an array plus a map to avoid the Map lookup.
I mean that, since we know ahead of time which characters belong in the Map, we can just construct the map with those character codes rather than looping and comparing. It's not that important, and if it makes more sense to have them visually grouped together, that's perfectly fine.
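(As an illustration of that, a self-contained sketch with stand-in enum values; the real CharacterCodes and TokenCategory live in the scanner.)

// Stand-in declarations so the sketch runs on its own; values are illustrative.
const enum CharacterCodes { byteOrderMark = 0xFEFF, replacementCharacter = 0xFFFD }
const enum TokenCategory { RecognizedMisc = 1 << 8 }

// Rather than routing every entry through a "key < tokenCategoryLookup.length"
// branch, the non-ASCII entries can be handed straight to the Map constructor.
const tokenCategoryLookupUncommon = new Map<CharacterCodes, TokenCategory>([
    [CharacterCodes.byteOrderMark, TokenCategory.RecognizedMisc],
    [CharacterCodes.replacementCharacter, TokenCategory.RecognizedMisc],
]);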
}

if (!(tokenCategory & TokenCategory.RecognizedMisc)) Debug.fail(`Unhandled token category ${tokenCategory}`);
switch (ch) {
I wonder if the switch here is optimized in any way or if it is doing case-by-case comparisons against each character code. If the latter, maybe it would be worthwhile to analyze what the most common character codes in JS are and change the order of the cases so that the more frequent cases come earlier in the list. I expect the perf improvement here actually comes from handling digit and identifier characters earlier than the switch, since those are the most frequent characters in a given file. If that's the case, some of the complexity and memory overhead could be reduced by just optimizing for those two cases.
I think it is typically case-by-case comparisons. You're absolutely right that doing the checks earlier for common classes helps; however, I think that checking for all the identifier starts isn't practical (or requires the same sort of fast-path/slow-path separation where we check for ranges), so that's why the token categorization is helpful here. By dismissing all the other types of categories, we can immediately launch into the fast path.
Assuming CharacterCodes is a const enum, the VM should be able to optimize the switch case quite heavily as a branch table, but that's a question for V8.
If the majority of identifier/keyword characters are ASCII characters, then it's fairly trivial to check for those major cases without an array or Map using a simple comparison:
if (
    (ch | 32) >= CharacterCodes.a && (ch | 32) <= CharacterCodes.z ||
    ch === CharacterCodes._ ||
    ch === CharacterCodes.$
) { /* [A-Za-z_$] */ }
If the array/map lookups are faster than this for the identifier case, then by all means keep it as is. I only point this out as you explicitly avoid hitting the table for line terminators and spaces. If that was an important optimization, then it might make sense to do the same kind of optimization for the most frequent (and easy to optimize for) characters in a source file, e.g., ASCII letters.
Assuming CharacterCodes is a const enum, the VM should be able to optimize the switch case quite heavily as a branch table, but that's a question for V8.

We've generally found that not to be the case, which is why we've had to implement our own jump tables for functions like forEachChild. I'm not sure what heuristics V8 uses for switch optimization, if any.
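(For readers unfamiliar with that pattern, a simplified sketch of a hand-rolled jump table; the kinds and visitors below are made up for illustration and are not the real forEachChild table.)

// One indexed load into a table of per-kind handlers replaces a long chain of
// case-by-case comparisons over the node's kind.
const enum SyntaxKind { Identifier = 0, StringLiteral = 1, BinaryExpression = 2 }
interface SketchNode { kind: SyntaxKind; }
type ChildVisitor = (node: SketchNode) => void;

const visitEachChildTable: ChildVisitor[] = [];
visitEachChildTable[SyntaxKind.Identifier] = () => { /* identifiers have no children */ };
visitEachChildTable[SyntaxKind.StringLiteral] = () => { /* string literals have no children */ };
visitEachChildTable[SyntaxKind.BinaryExpression] = _node => { /* would visit left, operatorToken, and right here */ };

function forEachChildSketch(node: SketchNode): void {
    const visit = visitEachChildTable[node.kind];
    if (visit) visit(node);
}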
If the majority of identifier/keyword characters are ASCII characters, then it's fairly trivial to check for those major cases without an array or Map using a simple comparison:
Fair, and this is what @ahejlsberg also suggested. The same way whitespace is prioritized, the ASCII range is also very likely and worth prioritizing as a quicker up-front check instead of hitting the lookup table. It would mean some code duplication, but maybe that's fine.
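(A rough sketch of what such a combined up-front ASCII check could look like before falling back to the tables; all names and values here are stand-ins rather than the PR's code.)

// Stand-in character codes so the sketch is self-contained.
const enum CharacterCodes { $ = 0x24, _0 = 0x30, _9 = 0x39, _ = 0x5F, a = 0x61, z = 0x7A }

function asciiFastPath(ch: number): "identifierStart" | "digit" | "useLookupTable" {
    // [A-Za-z_$]: the `| 32` trick folds uppercase into the lowercase range check.
    if (((ch | 32) >= CharacterCodes.a && (ch | 32) <= CharacterCodes.z)
        || ch === CharacterCodes._
        || ch === CharacterCodes.$) {
        return "identifierStart";
    }
    // [0-9]: digits, the other class called out above as worth handling early.
    if (ch >= CharacterCodes._0 && ch <= CharacterCodes._9) {
        return "digit";
    }
    // Everything else still goes through the category lookup tables.
    return "useLookupTable";
}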
Assuming CharacterCodes is a const enum, the VM should be able to optimize the switch case quite heavily as a branch table, but that's a question for V8.

We've generally found that not to be the case, which is why we've had to implement our own jump tables for functions like forEachChild. I'm not sure what heuristics V8 uses for switch optimization, if any.
Especially in the case where these character codes are so sparse, I am very doubtful that V8 would do such an optimization; at best I'd guess we'd end up with something like a binary search.
if (tokenCategory & TokenCategory.SimpleToken) {
    pos++;
    return token = tokenCategory & TokenCategory.SimpleTokenMask;
A comment here would be good for future readers, as it isn't obvious at first glance how a TokenCategory becomes a token.
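(Something along these lines could make the encoding explicit; the flag and kind values below are illustrative stand-ins, not the PR's actual constants.)

// For a simple single-character token, the category value carries the SyntaxKind
// itself in its low 8 bits, so no extra branching is needed to pick the token.
const enum SyntaxKind { OpenBraceToken = 19 }  // stand-in value for the sketch
const enum TokenCategory {
    SimpleTokenMask = 0xFF,   // low 8 bits: the SyntaxKind of the token
    SimpleToken = 1 << 8,     // flag bit: "this category is itself the token"
}

const openBraceCategory = TokenCategory.SimpleToken | SyntaxKind.OpenBraceToken;

// This is what `return token = tokenCategory & TokenCategory.SimpleTokenMask` relies on:
const recovered = openBraceCategory & TokenCategory.SimpleTokenMask;  // === SyntaxKind.OpenBraceToken (19)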
if (ch === CharacterCodes._0) {
if (pos + 2 < end && (charCodeUnchecked(pos + 1) === CharacterCodes.X || charCodeUnchecked(pos + 1) === CharacterCodes.x)) {
pos += 2;
tokenValue = scanMinimumNumberOfHexDigits(1, /*canHaveSeparators*/ true);
if (!tokenValue) {
error(Diagnostics.Hexadecimal_digit_expected);
tokenValue = "0";
}
else {
pos++;
tokenValue = "0x" + tokenValue;
tokenFlags |= TokenFlags.HexSpecifier;
return token = checkBigIntSuffix();
}
else if (pos + 2 < end && (charCodeUnchecked(pos + 1) === CharacterCodes.B || charCodeUnchecked(pos + 1) === CharacterCodes.b)) {
pos += 2;
tokenValue = scanBinaryOrOctalDigits(/* base */ 2);
if (!tokenValue) {
error(Diagnostics.Binary_digit_expected);
tokenValue = "0";
}
return token = SyntaxKind.NewLineTrivia;
}
case CharacterCodes.tab:
case CharacterCodes.verticalTab:
case CharacterCodes.formFeed:
case CharacterCodes.space:
case CharacterCodes.nonBreakingSpace:
case CharacterCodes.ogham:
case CharacterCodes.enQuad:
case CharacterCodes.emQuad:
case CharacterCodes.enSpace:
case CharacterCodes.emSpace:
case CharacterCodes.threePerEmSpace:
case CharacterCodes.fourPerEmSpace:
case CharacterCodes.sixPerEmSpace:
case CharacterCodes.figureSpace:
case CharacterCodes.punctuationSpace:
case CharacterCodes.thinSpace:
case CharacterCodes.hairSpace:
case CharacterCodes.zeroWidthSpace:
case CharacterCodes.narrowNoBreakSpace:
case CharacterCodes.mathematicalSpace:
case CharacterCodes.ideographicSpace:
case CharacterCodes.byteOrderMark:
if (skipTrivia) {
pos++;
continue;
tokenValue = "0b" + tokenValue;
tokenFlags |= TokenFlags.BinarySpecifier;
return token = checkBigIntSuffix();
}
else {
while (pos < end && isWhiteSpaceSingleLine(charCodeUnchecked(pos))) {
pos++;
else if (pos + 2 < end && (charCodeUnchecked(pos + 1) === CharacterCodes.O || charCodeUnchecked(pos + 1) === CharacterCodes.o)) {
pos += 2;
tokenValue = scanBinaryOrOctalDigits(/* base */ 8);
if (!tokenValue) {
error(Diagnostics.Octal_digit_expected);
tokenValue = "0";
}
return token = SyntaxKind.WhitespaceTrivia;
tokenValue = "0o" + tokenValue;
tokenFlags |= TokenFlags.OctalSpecifier;
return token = checkBigIntSuffix();
}
}
Suggested change:
if (ch === CharacterCodes._0 && pos + 2 < end) {
    switch (charCodeUnchecked(pos + 1)) {
        case CharacterCodes.X:
        case CharacterCodes.x:
            pos += 2;
            tokenValue = scanMinimumNumberOfHexDigits(1, /*canHaveSeparators*/ true);
            if (!tokenValue) {
                error(Diagnostics.Hexadecimal_digit_expected);
                tokenValue = "0";
            }
            tokenValue = "0x" + tokenValue;
            tokenFlags |= TokenFlags.HexSpecifier;
            return token = checkBigIntSuffix();
        case CharacterCodes.B:
        case CharacterCodes.b:
            pos += 2;
            tokenValue = scanBinaryOrOctalDigits(/* base */ 2);
            if (!tokenValue) {
                error(Diagnostics.Binary_digit_expected);
                tokenValue = "0";
            }
            tokenValue = "0b" + tokenValue;
            tokenFlags |= TokenFlags.BinarySpecifier;
            return token = checkBigIntSuffix();
        case CharacterCodes.O:
        case CharacterCodes.o:
            pos += 2;
            tokenValue = scanBinaryOrOctalDigits(/* base */ 8);
            if (!tokenValue) {
                error(Diagnostics.Octal_digit_expected);
                tokenValue = "0";
            }
            tokenValue = "0o" + tokenValue;
            tokenFlags |= TokenFlags.OctalSpecifier;
            return token = checkBigIntSuffix();
    }
It's hard to parse out what the suggestion is; can you explain it?
Ah, that's fair, I don't know why GitHub insists on showing the old code in the diff rather than diffing against your PR version of the code.
The suggestion is to promote the pos + 2 < end check to the same if that checks for CharacterCodes._0, since all branches beneath it use that condition, and then replace the nested if statements with a single switch (charCodeUnchecked(pos + 1)) with grouped case statements for each of the radix markers.
}

if (!(tokenCategory & TokenCategory.RecognizedMisc)) Debug.fail(`Unhandled token category ${tokenCategory}`);
switch (ch) {
Assuming CharacterCodes is a const enum, the VM should be able to optimize the switch case quite heavily as a branch table, but that's a question for V8.
#59244 tries to just check for identifiers up-front, and moves checks for …
This change introduces two lookup tables to the lexical scanner/tokenizer. One is a simple array-based lookup for characters in the ASCII range, and the other is a Map for wider codepoints. These tables assign categories to codepoints to indicate how they should be handled. For example, all line breaks, single-line whitespace, and single-character tokens can each be handled generically in their own respective fashion. Doing so allows us to perform fewer individual checks to jump into or skip over a given branch.

One additional optimization concerns the way simple single-character tokens are handled: the token itself is encoded in the lower 8 bits of the category value. This allows us to avoid any branching for such cases.

This all comes at the expense of some memory overhead, along with some overhead from fetching from the table. To avoid hitting the table too often, spaces, tabs, carriage returns, and line feeds are currently handled separately.
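(To make the description concrete, a rough, self-contained sketch of the dispatch shape it describes; every name, flag, and value here is an illustrative stand-in rather than the PR's exact code.)

// Stand-in enums; values are illustrative only.
const enum CharacterCodes { tab = 0x09, lineFeed = 0x0A, carriageReturn = 0x0D, space = 0x20, openBrace = 0x7B, maxAsciiCharacter = 0x7F }
const enum SyntaxKind { Unknown = 0, OpenBraceToken = 19 }
const enum TokenCategory {
    SimpleTokenMask = 0xFF,   // low 8 bits hold the SyntaxKind for single-character tokens
    SimpleToken = 1 << 8,
}

// ASCII range: plain array indexing; wider codepoints: a Map.
const tokenCategoryLookup: number[] = [];
const tokenCategoryLookupUncommon = new Map<number, number>();
tokenCategoryLookup[CharacterCodes.openBrace] = TokenCategory.SimpleToken | SyntaxKind.OpenBraceToken;

function scanSketch(text: string, start: number): number {  // returns a SyntaxKind
    let pos = start;
    while (pos < text.length) {
        const ch = text.charCodeAt(pos);
        // Spaces, tabs, carriage returns, and line feeds are handled up front so
        // the most common characters never pay for a table fetch at all.
        if (ch === CharacterCodes.space || ch === CharacterCodes.tab ||
            ch === CharacterCodes.carriageReturn || ch === CharacterCodes.lineFeed) {
            pos++;
            continue;
        }
        const category = ch <= CharacterCodes.maxAsciiCharacter
            ? tokenCategoryLookup[ch]
            : tokenCategoryLookupUncommon.get(ch);
        if (category !== undefined && (category & TokenCategory.SimpleToken)) {
            pos++;
            return category & TokenCategory.SimpleTokenMask;  // the token kind is the low 8 bits
        }
        // Identifiers, numbers, strings, etc. would keep their existing handling here.
        return SyntaxKind.Unknown;
    }
    return SyntaxKind.Unknown;
}

For example, scanSketch("  {", 0) skips the two leading spaces without touching either table and then returns 19 (the stand-in OpenBraceToken) straight from the masked category value.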