Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for quoted boundary for multipart request parsing. #924

Merged
merged 6 commits into from
Nov 20, 2023

Conversation

meztez
Copy link
Collaborator

@meztez meztez commented Nov 17, 2023

Fixes #876
Fixes #915

See #915, #876, thanks @slodge for finding the problem and @MJSchut for pointing out the issue.

Per RFC2046

The issue will be present in jeroen/webutils too.

WARNING TO IMPLEMENTORS: The grammar for parameters on the Content-
type field is such that it is often necessary to enclose the boundary
parameter values in quotes on the Content-type line. This is not
always necessary, but never hurts. Implementors should be sure to
study the grammar carefully in order to avoid producing invalid
Content-type fields. Thus, a typical "multipart" Content-Type header
field might look like this:

 Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08j34c0p

But the following is not valid:

 Content-Type: multipart/mixed; boundary=gc0pJq0M:08jU534c0p

(because of the colon) and must instead be represented as

 Content-Type: multipart/mixed; boundary="gc0pJq0M:08jU534c0p"

This Content-Type value indicates that the content consists of one or
more parts, each with a structure that is syntactically identical to
an RFC 822 message, except that the header area is allowed to be
completely empty, and that the parts are each preceded by the line

PR task list:

  • Update NEWS
  • Add tests
  • Update documentation with devtools::document()

@slodge
Copy link

slodge commented Nov 17, 2023

Is there any advantage in splitting this into two regex operations? The original fix suggested was:

slodge@f0e37b8

@meztez
Copy link
Collaborator Author

meztez commented Nov 17, 2023

making sure quotes are either there or not, original pattern would still match if only one quote appears either at the front or back of the boundary

@slodge
Copy link

slodge commented Nov 17, 2023

I don't believe doublequotes are valid boundary characters - so that shouldn't be a problem:

image

from Appendix A of https://www.ietf.org/rfc/rfc2046.txt

Not a big problem/concern - so will leave just as comment 👍

@meztez
Copy link
Collaborator Author

meztez commented Nov 17, 2023

yeah, seems the fix I proposed in #876 was just a straight up gsub. Still, can't be too cautious, nothing is currently preventing a client from using one as the pattern does not check the boundary character or the boundary length.

R/parse-body.R Outdated Show resolved Hide resolved
@meztez
Copy link
Collaborator Author

meztez commented Nov 20, 2023

> bench::mark(iterations = 1000000,
+   stri_replace_first_regex(boundary, '^"(.*)"$', "$1"),
+   gsub("^\"|\"$", "", boundary),
+   stri_replace_first_regex(boundary, '^"([^"]+)"$', "$1"),
+   sub('^"(.*)"$', "\\1", boundary)
+ )
# A tibble: 4 × 13
  expression    min median `itr/sec` mem_alloc `gc/sec`  n_itr  n_gc total_time result memory     time      
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl>  <int> <dbl>   <bch:tm> <list> <list>     <list>    
1 "stri_rep… 3.81µs 4.42µs   210046.        0B     7.35 999965    35      4.76s <chr>  <Rprofmem> <bench_tm>
2 "gsub(\"^… 4.08µs 4.78µs   194017.        0B     8.34 999957    43      5.15s <chr>  <Rprofmem> <bench_tm>
3 "stri_rep4.82µs 6.31µs   151239.        0B     3.02 999980    20      6.61s <chr>  <Rprofmem> <bench_tm>
4 "sub(\"^\… 5.67µs 6.28µs   146406.        0B     6.73 999954    46      6.83s <chr>  <Rprofmem> <bench_tm>
# ℹ 1 more variable: gc <list>

R/parse-body.R Outdated Show resolved Hide resolved
Slighly less works for the regex parser
@schloerke
Copy link
Collaborator

Addressing #924 (comment) by comparing a single call, two calls, and webutils implementation...

content_type <- "multipart/form-data; boundary=\"----WebKitFormBoundaryMYdShB9nBc32BUhQ\""; 
bench::mark(
  stri_match_first_regex(content_type, "boundary=\"?([^; \"]{2,})\"?", case_insensitive = TRUE)[,2],
  {
    boundary <- stri_match_first_regex(content_type, "boundary=([^; ]{2,})", case_insensitive = TRUE)[,2]
    boundary <- stri_replace_first_regex(boundary, '^"(.*)"$', "$1")
  },
  {
    m <- regexpr('boundary=[^; ]{2,}', content_type, ignore.case = TRUE)
    boundary <- sub('boundary=','',regmatches(content_type, m)[[1]])
    sub('^"(.*)"$', "\\1", boundary)
  }
)
# A tibble: 3 × 13
  expression        min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc
  <bch:expr>    <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>
1 "stri_match_…  8.77µs   10µs    90181.        0B     18.0  9998     2      111ms <chr>  <Rprofmem> <bench_tm> <tibble>
2 "{ boundary12.01µs 12.5µs    77089.        0B     15.4  9998     2      130ms <chr>  <Rprofmem> <bench_tm> <tibble>
3 "{ m <- rege… 19.76µs 20.7µs    45794.        0B     18.3  9996     4      218ms <chr>  <Rprofmem> <bench_tm> <tibble>

There is a 2.5µs speed increase. So, we should use the single statement as both calls are being done.

R/parse-body.R Outdated Show resolved Hide resolved
@schloerke
Copy link
Collaborator

Does this also fix #915? I can't tell from the issue

@meztez
Copy link
Collaborator Author

meztez commented Nov 20, 2023

Does this also fix #915? I can't tell from the issue

It should. The C# httpclient surrounds the boundary in dquotes.

@schloerke schloerke merged commit 72d68c6 into rstudio:main Nov 20, 2023
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Calling POST from C# code does not process body correctly multipart boundary support surrounding quotes
3 participants