Add support for quoted boundary for multipart request parsing. #924

meztez · 2023-11-17T03:57:28Z

Fixes #876
Fixes #915

See #915, #876, thanks @slodge for finding the problem and @MJSchut for pointing out the issue.

Per RFC2046

The issue will be present in jeroen/webutils too.

WARNING TO IMPLEMENTORS: The grammar for parameters on the Content-
type field is such that it is often necessary to enclose the boundary
parameter values in quotes on the Content-type line. This is not
always necessary, but never hurts. Implementors should be sure to
study the grammar carefully in order to avoid producing invalid
Content-type fields. Thus, a typical "multipart" Content-Type header
field might look like this:
 Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08j34c0p
But the following is not valid:
 Content-Type: multipart/mixed; boundary=gc0pJq0M:08jU534c0p
(because of the colon) and must instead be represented as
 Content-Type: multipart/mixed; boundary="gc0pJq0M:08jU534c0p"
This Content-Type value indicates that the content consists of one or
more parts, each with a structure that is syntactically identical to
an RFC 822 message, except that the header area is allowed to be
completely empty, and that the parts are each preceded by the line

PR task list:

Update NEWS
Add tests
Update documentation with devtools::document()

slodge · 2023-11-17T05:55:04Z

Is there any advantage in splitting this into two regex operations? The original fix suggested was:

slodge@f0e37b8

meztez · 2023-11-17T06:08:11Z

making sure quotes are either there or not, original pattern would still match if only one quote appears either at the front or back of the boundary

slodge · 2023-11-17T06:18:29Z

I don't believe doublequotes are valid boundary characters - so that shouldn't be a problem:

from Appendix A of https://www.ietf.org/rfc/rfc2046.txt

Not a big problem/concern - so will leave just as comment 👍

meztez · 2023-11-17T06:42:27Z

yeah, seems the fix I proposed in #876 was just a straight up gsub. Still, can't be too cautious, nothing is currently preventing a client from using one as the pattern does not check the boundary character or the boundary length.

R/parse-body.R

meztez · 2023-11-20T15:22:11Z

> bench::mark(iterations = 1000000,
+   stri_replace_first_regex(boundary, '^"(.*)"$', "$1"),
+   gsub("^\"|\"$", "", boundary),
+   stri_replace_first_regex(boundary, '^"([^"]+)"$', "$1"),
+   sub('^"(.*)"$', "\\1", boundary)
+ )
# A tibble: 4 × 13
  expression    min median `itr/sec` mem_alloc `gc/sec`  n_itr  n_gc total_time result memory     time      
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl>  <int> <dbl>   <bch:tm> <list> <list>     <list>    
1 "stri_rep… 3.81µs 4.42µs   210046.        0B     7.35 999965    35      4.76s <chr>  <Rprofmem> <bench_tm>
2 "gsub(\"^… 4.08µs 4.78µs   194017.        0B     8.34 999957    43      5.15s <chr>  <Rprofmem> <bench_tm>
3 "stri_rep… 4.82µs 6.31µs   151239.        0B     3.02 999980    20      6.61s <chr>  <Rprofmem> <bench_tm>
4 "sub(\"^\… 5.67µs 6.28µs   146406.        0B     6.73 999954    46      6.83s <chr>  <Rprofmem> <bench_tm>
# ℹ 1 more variable: gc <list>

R/parse-body.R

Slighly less works for the regex parser

schloerke · 2023-11-20T15:26:54Z

Addressing #924 (comment) by comparing a single call, two calls, and webutils implementation...

content_type <- "multipart/form-data; boundary=\"----WebKitFormBoundaryMYdShB9nBc32BUhQ\""; 
bench::mark(
  stri_match_first_regex(content_type, "boundary=\"?([^; \"]{2,})\"?", case_insensitive = TRUE)[,2],
  {
    boundary <- stri_match_first_regex(content_type, "boundary=([^; ]{2,})", case_insensitive = TRUE)[,2]
    boundary <- stri_replace_first_regex(boundary, '^"(.*)"$', "$1")
  },
  {
    m <- regexpr('boundary=[^; ]{2,}', content_type, ignore.case = TRUE)
    boundary <- sub('boundary=','',regmatches(content_type, m)[[1]])
    sub('^"(.*)"$', "\\1", boundary)
  }
)
# A tibble: 3 × 13
  expression        min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc
  <bch:expr>    <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>
1 "stri_match_…  8.77µs   10µs    90181.        0B     18.0  9998     2      111ms <chr>  <Rprofmem> <bench_tm> <tibble>
2 "{ boundary … 12.01µs 12.5µs    77089.        0B     15.4  9998     2      130ms <chr>  <Rprofmem> <bench_tm> <tibble>
3 "{ m <- rege… 19.76µs 20.7µs    45794.        0B     18.3  9996     4      218ms <chr>  <Rprofmem> <bench_tm> <tibble>

There is a 2.5µs speed increase. So, we should use the single statement as both calls are being done.

R/parse-body.R

schloerke · 2023-11-20T15:33:16Z

Does this also fix #915? I can't tell from the issue

meztez · 2023-11-20T15:34:53Z

Does this also fix #915? I can't tell from the issue

It should. The C# httpclient surrounds the boundary in dquotes.

Add support for quoted boundary for multipart request parsing.

ef2aae7

meztez mentioned this pull request Nov 17, 2023

Quoted multipart boundary jeroen/webutils#6

Closed

schloerke added 2 commits November 17, 2023 11:05

Merge branch 'main' into quoted_boundary

b73ae10

Merge branch 'main' into quoted_boundary

5b356a1

schloerke reviewed Nov 20, 2023

View reviewed changes

R/parse-body.R Outdated Show resolved Hide resolved

Update regex to require an end quote

4181d23

meztez commented Nov 20, 2023

View reviewed changes

R/parse-body.R Outdated Show resolved Hide resolved

Update R/parse-body.R

14a0807

Slighly less works for the regex parser

schloerke reviewed Nov 20, 2023

View reviewed changes

R/parse-body.R Outdated Show resolved Hide resolved

Using single call for regex

e734d83

schloerke merged commit 72d68c6 into rstudio:main Nov 20, 2023
11 of 12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for quoted boundary for multipart request parsing. #924

Add support for quoted boundary for multipart request parsing. #924

meztez commented Nov 17, 2023 •

edited by schloerke

Loading

slodge commented Nov 17, 2023

meztez commented Nov 17, 2023

slodge commented Nov 17, 2023

meztez commented Nov 17, 2023

meztez commented Nov 20, 2023 •

edited

Loading

schloerke commented Nov 20, 2023

schloerke commented Nov 20, 2023

meztez commented Nov 20, 2023

Add support for quoted boundary for multipart request parsing. #924

Add support for quoted boundary for multipart request parsing. #924

Conversation

meztez commented Nov 17, 2023 • edited by schloerke Loading

slodge commented Nov 17, 2023

meztez commented Nov 17, 2023

slodge commented Nov 17, 2023

meztez commented Nov 17, 2023

meztez commented Nov 20, 2023 • edited Loading

schloerke commented Nov 20, 2023

schloerke commented Nov 20, 2023

meztez commented Nov 20, 2023

meztez commented Nov 17, 2023 •

edited by schloerke

Loading

meztez commented Nov 20, 2023 •

edited

Loading