Champion(s): champion name(s)
Author(s): Bradley Meck Farias (Socket)
Stage: -1
JS RegExp requires the complete input in order to perform a match, while common JS idioms include streaming and async chunked I/O. This means that for streaming inputs that arrive in chunks, any input that does not yet produce a match may need to be buffered and concatenated. Even when a match is found, it is often unsafe to slice the consumed input away, for reasons such as lookbehind needing earlier context.
The goal of this proposal is to address the scenarios described below.
The standard workflow for a stream of data from JS I/O can be imagined as follows, showing the problem that arises with the current RegExp methods once chunking is involved:
const pattern = /ab/g
async function* stream() {
yield 'never needed but kept all the same.'
yield 'a'
yield 'b'
}
for await (const chunk of stream()) {
chunk.matchAll(pattern) // never matches: 'a' and 'b' arrive in separate chunks
}
// but if we buffer...
let buffer = ''
for await (const chunk of stream()) {
buffer += chunk
pattern.exec(buffer) // matches
}
This leads to a few direct effects:
- Bloated memory usage, since chunks cannot be garbage collected
- Underutilized CPU while I/O is idle
- Extra ticks spent merely appending strings to a buffer, to no effect
- Loss of pattern matching state on each search (see the sketch after this list)
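A minimal sketch of that state loss, using standard exec semantics: when a global RegExp fails to match, its lastIndex is reset to 0, so every search over the growing buffer restarts from the beginning:
const pattern = /ab/g
let buffer = ''
for (const chunk of ['x', 'xa', 'b']) {
  buffer += chunk
  const match = pattern.exec(buffer)
  // on failure, exec resets pattern.lastIndex to 0, so the next call
  // re-scans the entire buffer from the beginning
  console.log(match?.[0], pattern.lastIndex)
}
// logs: undefined 0, undefined 0, 'ab' 4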
This also has some indirect effects:
const fire_surrogates = '🔥'.split('') // two lone surrogate halves
const emoji_pattern = /\p{Emoji_Presentation}/gu
let text_presentation_str = ''
for await (const chunk of fire_surrogates) {
  text_presentation_str += chunk.replaceAll(emoji_pattern, '�') // never matches: each chunk is a lone surrogate
}
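For contrast, once the surrogate halves are rejoined the same pattern matches; only the chunk boundary defeated it:
const emoji_pattern = /\p{Emoji_Presentation}/gu
// the rejoined surrogate pair matches and is replaced: returns '�'
'🔥'.replaceAll(emoji_pattern, '�')
// a lone surrogate half never matches: returns the input unchanged
'\uD83D'.replaceAll(emoji_pattern, '�')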
See the Scunthorpe problem (https://en.wikipedia.org/wiki/Scunthorpe_problem):
const DIRTY_WORDS = ['shit'] // illustrative word list for this example
const pattern = new RegExp(String.raw`\b(${DIRTY_WORDS.join('|')})\b`, 'gi')
async function* stream(str, size) {
while (str.length) {
yield str.slice(0, size)
str = str.slice(size)
}
}
let reject = false;
for await (const chunk of stream('shitakemushrooms.example.com', 4)) {
if (pattern.test(chunk)) {
reject = true // misclassifies based on the first chunk
break
}
}
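For contrast, testing the complete string shows there is no real match; the chunked scan produced a false positive:
pattern.lastIndex = 0 // reset state left over from the chunked scan
pattern.test('shitakemushrooms.example.com') // false: 'shit' is not bounded by \b within the full string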
Often for long-lived parses it is desirable to save progress in a way that minimizes re-processing. Existing means of doing this are fragile at best and do not work with complex patterns.
const progress_pattern = /(?=$|bc?)/g
const chunks = 'abc'.split(progress_pattern) // ['a', 'bc']
// can skip processing for 'a' of 'abc' for future chunks
// this approach currently isn't possible for complex patterns
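A sketch of why: with lookbehind, context can reach across any checkpoint boundary, so dropping an already-processed prefix changes the result:
const lookbehind_pattern = /(?<=a)b/g
const full = 'ab'
const remainder = full.slice(1) // drop the 'already processed' prefix 'a'
const all = [...full.matchAll(lookbehind_pattern)] // 1 match: 'b' at index 1 follows 'a'
const rest = [...remainder.matchAll(lookbehind_pattern)] // 0 matches: the lookbehind context was discarded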
Because there is no way to know how far back a lookbehind may reach, the entire input must be kept buffered to safely account for it.
const pattern = /(?<=a)ba/g
let chunks = ['ab', 'a', 'ba']
let buffer = ''
for (const chunk of chunks) {
  // lookbehind means buffering everything even after a match is found;
  // 'aba' cannot be dropped since its last 'a' is needed as lookbehind context
  buffer += chunk
  // state is lost due to the new string reference, possibly causing
  // repetitive searching from the start of the string
  const matches = buffer.matchAll(pattern)
}
const pattern = /ab+(?=b)/g
let chunks = ['ab', 'bb']
let buffer = ''
for (const chunk of chunks) {
  buffer += chunk
  const match = pattern.exec(buffer)
  if (pattern.lastIndex === buffer.length) {
    // the match may be unfinished: more input could extend it or satisfy the lookahead
  } else {
    // still cannot slice the buffer, again due to lookbehind
  }
}
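Putting these pieces together, the kind of user-space bookkeeping required today looks roughly like the following sketch. The chunkedMatches helper and its retention heuristic are hypothetical, and are only safe for a global pattern with no lookbehind, no empty matches, and a known maximum match length:
// hypothetical helper sketch: assumes a global pattern with no lookbehind,
// no empty matches, and matches no longer than maxMatchLength
async function* chunkedMatches(chunks, pattern, maxMatchLength) {
  let buffer = ''
  for await (const chunk of chunks) {
    buffer += chunk
    let lastEnd = 0
    let match
    while ((match = pattern.exec(buffer)) !== null) {
      if (pattern.lastIndex === buffer.length) {
        // a match touching the end of the buffer might still grow once
        // more input arrives, so hold it back and retry next chunk
        pattern.lastIndex = match.index
        break
      }
      lastEnd = pattern.lastIndex
      yield match[0]
    }
    // retain a window large enough for any match spanning the boundary
    const keepFrom = Math.max(lastEnd, buffer.length - maxMatchLength)
    if (keepFrom > 0) {
      buffer = buffer.slice(keepFrom)
      pattern.lastIndex = Math.max(0, pattern.lastIndex - keepFrom)
    }
  }
  // final flush: emit whatever still matches in the retained tail
  pattern.lastIndex = 0
  for (const m of buffer.matchAll(pattern)) yield m[0]
}
Even under those narrow assumptions the helper has to juggle lastIndex adjustments, a retention window, and a final flush by hand; lookbehind or unbounded matches break it entirely, which is exactly the fragility described above.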
Streaming resource utilization: Data is often streamed to JS in chunks from websockets, HTTP messages, file reading, etc. This generally supports forward progress while waiting on I/O, spreading out CPU and memory pressure during gaps. RegExp should be able to participate effectively in such processes.
Streaming data boundaries: When processing chunked data, RegExp should allow for handling patterns around data boundaries in a way that prevents false positives or false negatives on incomplete data.
Text manipulation patterns: When performing matching, it can be valuable to know whether a chunk of data may be used in a future match and, if so, which parts. This can aid various text manipulation and viewing use cases by allowing complete text to be chunked and processed asynchronously or in parallel.
- Java: Matcher.hitEnd
  https://docs.oracle.com/javase/9/docs/api/java/util/regex/Matcher.html#hitEnd--
  Uses a matcher result separated from the pattern. Lacks info on where in the chunk to resume the next attempt for things like lookbehind. Requires another class, such as Scanner, to mitigate bookkeeping.
- Python: the regex module with partial=True
  https://github.com/mrabarnett/mrab-regex#added-partial-matches-hg-issue-102
  Uses a matcher result separated from the pattern. Lacks info on where in the chunk to resume the next attempt for things like lookbehind. Requires manual bookkeeping to buffer the string properly.
- C++: boost/regex partial matches
  https://www.boost.org/doc/libs/1_81_0/libs/regex/doc/html/boost_regex/partial_matches.html
  Uses a matcher result separated from the pattern. Lacks info on where in the chunk to resume the next attempt for things like lookbehind. Requires manual bookkeeping to buffer the string properly.
- PCRE2: multi-segment support
  https://pcre.org/current/doc/html/pcre2partial.html#SEC6
  Manages all state on the pattern itself. Lacks info on where in the chunk to resume the next attempt for things like lookbehind. Requires manual bookkeeping to buffer results properly.
Frequently asked questions:
Q: Why does this need to be built-in, instead of being implemented in JavaScript?
A: We could encourage people to continue doing this in user-space. However, that would significantly increase the load time of web pages. Additionally, web browsers already have a built-in RegExp engine, which is higher quality.
Q: Is it really necessary to create such a high-level built-in construct, rather than using lower-level primitives?
A: Low-level constructs and manual bookkeeping are the general source of these issues and do not allow for the same fidelity of results or robust handling of edge cases.