Add support for quoted boundary for multipart request parsing. #924
Is there any advantage in splitting this into two regex operations? The original fix suggested was:
This makes sure the quotes are either both present or both absent. The original pattern would still match if only one quote appeared, at either the front or the back of the boundary.
I don't believe double quotes are valid boundary characters, so that shouldn't be a problem; see Appendix A of https://www.ietf.org/rfc/rfc2046.txt. Not a big concern, so I'll leave this as just a comment 👍
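For reference, the boundary grammar in Appendix A of RFC 2046 allows 1 to 70 characters drawn from digits, letters, and `'()+_,-./:=?`, plus a space anywhere except the final position. A minimal sketch of a check against that grammar in base R follows; `is_valid_boundary` is a hypothetical helper for illustration, not something this PR adds.

```r
# Hypothetical helper: TRUE when `x` fits the RFC 2046 boundary grammar
# (1 to 70 characters from bchars, with no space in the final position).
is_valid_boundary <- function(x) {
  # bcharsnospace from RFC 2046 Appendix A; the hyphen is kept last so it
  # is read as a literal character rather than a range operator.
  bcharsnospace <- "0-9A-Za-z'()+_,./:=?-"
  pattern <- paste0("^[ ", bcharsnospace, "]{0,69}[", bcharsnospace, "]$")
  grepl(pattern, x, perl = TRUE)
}

is_valid_boundary("----WebKitFormBoundaryMYdShB9nBc32BUhQ")    # unquoted boundary
is_valid_boundary("\"----WebKitFormBoundaryMYdShB9nBc32BUhQ")  # stray leading quote
```

A double quote is not in the allowed set, so a boundary that still carries a quote after parsing fails this check.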
Yeah, it seems the fix I proposed in #876 was just a straight-up gsub. Still, you can't be too cautious: nothing currently prevents a client from using a double quote in the boundary, as the pattern does not check the boundary characters or the boundary length.
> bench::mark(iterations = 1000000,
+ stri_replace_first_regex(boundary, '^"(.*)"$', "$1"),
+ gsub("^\"|\"$", "", boundary),
+ stri_replace_first_regex(boundary, '^"([^"]+)"$', "$1"),
+ sub('^"(.*)"$', "\\1", boundary)
+ )
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list>
1 "stri_rep… 3.81µs 4.42µs 210046. 0B 7.35 999965 35 4.76s <chr> <Rprofmem> <bench_tm>
2 "gsub(\"^… 4.08µs 4.78µs 194017. 0B 8.34 999957 43 5.15s <chr> <Rprofmem> <bench_tm>
3 "stri_rep… 4.82µs 6.31µs 151239. 0B 3.02 999980 20 6.61s <chr> <Rprofmem> <bench_tm>
4 "sub(\"^\… 5.67µs 6.28µs 146406. 0B 6.73 999954 46 6.83s <chr> <Rprofmem> <bench_tm>
# ℹ 1 more variable: gc <list>
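Beyond speed, the benchmarked patterns differ in what they accept. A small base R sketch of the behavioural difference between the `gsub` alternation and the anchored pair pattern; the sample boundaries are made up for illustration:

```r
# Made-up inputs: quoted, leading quote only, trailing quote only, unquoted.
boundaries <- c('"abc123"', '"abc123', 'abc123"', 'abc123')

# Strips a leading and/or a trailing quote independently of each other.
strip_any <- gsub('^"|"$', "", boundaries)

# Strips the quotes only when they appear as a matched pair.
strip_pair <- sub('^"(.*)"$', "\\1", boundaries)

print(data.frame(input = boundaries, strip_any, strip_pair))
```

The `gsub` form normalises all four inputs to `abc123`, while the pair pattern leaves the one-sided quotes in place, so a malformed boundary stays visibly malformed.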
Slightly less work for the regex parser
Addressing #924 (comment) by comparing a single call, two calls, and the webutils implementation:
content_type <- "multipart/form-data; boundary=\"----WebKitFormBoundaryMYdShB9nBc32BUhQ\""
bench::mark(
stri_match_first_regex(content_type, "boundary=\"?([^; \"]{2,})\"?", case_insensitive = TRUE)[,2],
{
boundary <- stri_match_first_regex(content_type, "boundary=([^; ]{2,})", case_insensitive = TRUE)[,2]
boundary <- stri_replace_first_regex(boundary, '^"(.*)"$', "$1")
},
{
m <- regexpr('boundary=[^; ]{2,}', content_type, ignore.case = TRUE)
boundary <- sub('boundary=','',regmatches(content_type, m)[[1]])
sub('^"(.*)"$', "\\1", boundary)
}
)
# A tibble: 3 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 "stri_match_… 8.77µs 10µs 90181. 0B 18.0 9998 2 111ms <chr> <Rprofmem> <bench_tm> <tibble>
2 "{ boundary … 12.01µs 12.5µs 77089. 0B 15.4 9998 2 130ms <chr> <Rprofmem> <bench_tm> <tibble>
3 "{ m <- rege… 19.76µs 20.7µs 45794. 0B 18.3 9996 4 218ms <chr> <Rprofmem> <bench_tm> <tibble>
The single call is 2.5µs faster at the median than the two-call version (10µs vs 12.5µs). So we should use the single statement, since the two-call version always performs both calls.
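To make the behaviour of the single pattern easier to inspect, here is a base R sketch that uses the same regex with `regexec()` in place of `stri_match_first_regex()`. The function name `extract_boundary` and the sample headers are assumptions for illustration only.

```r
# Hypothetical helper: pull the boundary parameter out of a Content-Type
# header with one pattern, whether or not the value is quoted.
extract_boundary <- function(content_type) {
  pattern <- 'boundary="?([^; "]{2,})"?'
  m <- regmatches(
    content_type,
    regexec(pattern, content_type, ignore.case = TRUE)
  )[[1]]
  # m[1] is the full match and m[2] the capture group; no match gives length 0.
  if (length(m) < 2) NA_character_ else m[2]
}

extract_boundary("multipart/form-data; boundary=\"----WebKitFormBoundaryMYdShB9nBc32BUhQ\"")
extract_boundary("multipart/form-data; boundary=abc123; charset=utf-8")
extract_boundary("multipart/form-data; boundary=\"abc123")
```

Because both quotes are optional and independent in this pattern, the third call also returns `abc123`, so the single statement trades the both-or-neither quote check discussed earlier for speed.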
Does this also fix #915? I can't tell from the issue.
It should. The C# HttpClient surrounds the boundary in double quotes.
Fixes #876
Fixes #915
See #915, #876, thanks @slodge for finding the problem and @MJSchut for pointing out the issue.
Per RFC2046
The issue will be present in jeroen/webutils too.
PR task list:
devtools::document()