-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for feather files in POST #77
Conversation
For some reason the installation of But before making any changes, i'd like to know you opinion on generalizing this to more (all?) binary formats. Either we move out all the code related to feather, so that the raw binary body is returned to be handles by the function written in the endpoint. We can return the name of the handle and the name of the file, as both are available in the multiparse object and the tmpfile location of the binary file (as set by The other option is to include the handling of files here, either based on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really promising! Thanks for taking the time to put it together.
As pointed out in #75, I think there's potential here to have much broader applicability. My preference would be to attach a files
list off of the incoming request that could provide pointers to the relevant files that were uploaded and are now stored on the server's disk. Then the caller could just decide themselves that they want to read in the file as a feather file, for instance. And plumber doesn't have to be in the business of providing special parsers for every file type someone might want to upload.
So you could do something like
@post /upload
function(req){
cat("Files uploaded: ", names(req$files))
feather::read_feather(files[[1]])
...
}
We'd also need to figure out how to limit the filetypes and file sizes that we want people uploading. I'd hope the server would start off with an empty list of supported file extensions which could be added to in $run()
, perhaps.
R/post-body.R
Outdated
boundary <- "-----------------" | ||
parsed_binary <- webutils::parse_multipart(body, boundary) | ||
file_name <- parsed_binary[[1]][[1]] | ||
file_data <- parsed_binary[[1]][[2]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I understanding correctly that we could iterate here to support more than one file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure yet. The list indexes used here are not the best way of accessing the parsed content. This looks better (myfile
is a name assigned by webutils::parse_multipart()
)
parsed_binary$myfile$filename
parsed_binary$myfile$value
For readability I've now implemented it as named lists, and have yet to figure out yet how to parse multiple files.
R/post-body.R
Outdated
file_data <- parsed_binary[[1]][[2]] | ||
|
||
tmpfile <- tempfile() | ||
writeBin(file_data, tmpfile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be neat if we could keep this off disk using textConnections, but I know some packages struggle with connections and prefer files, so maybe landing this stuff on disk is the more generic solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my purpose I would prefer to be able to land the files on disk. If we use textConnections, the data would not be persistent at all, and not be available after a restart of the service.
I can imagine use cases where persistence would not be an issue. Maybe this is something to make a separate ticket of, to add as an option later.
DESCRIPTION
Outdated
@@ -18,7 +18,9 @@ Imports: | |||
R6 (>= 2.0.0), | |||
stringi (>= 0.3.0), | |||
jsonlite (>= 0.9.16), | |||
httpuv (>= 1.2.3) | |||
httpuv (>= 1.2.3), | |||
feather, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd hope to keep feather off of imports
and in suggests
. See how I handle the htmlwidgets
package in this PR: https://github.com/trestletech/plumber/pull/82/files
We could just suggest and then, if a user goes to use a feature that requires a particular package and doesn't have it, we'd abort at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! I had not looked carefully enough at how you are handling this. Your solution is elegant. Also, perhaps I should focus on a more generic solution to POST binary files in plumber and leave the handling of feather or other formats to the API code.
R/post-body.R
Outdated
# does not work when the boundary is set to the content type as per the | ||
# webutils docs. | ||
boundary <- "-----------------" | ||
parsed_binary <- webutils::parse_multipart(body, boundary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unfamiliar with the specs here but this feels pretty hacky. I wonder if this is something that we should just fix in webutils directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing wrong with webutils. It was just poorly implemented on my part. The full boundary is part of the req$CONTENT_TYPE so we can fetch each current boundary like so:
boundary <- strsplit(req$CONTENT_TYPE, split = "=")[[1]][2]
Hi Jeff, I've pushed the update based on your review. I could not implement everything yet, perhaps I can set up tickets to tackle them in the future. Things changed:
Things no changed (yet)
I think this will be an answer to #75 and #126 that is workable. But by all means send this back for changes if you see a need for further improvements before merging. |
Travis fails with the message:
Crayon is outside of the scope of the changes I have made. Should I merge with upstream and commit again? |
any example about upload the file? @post /upload |
HI @yangbenfa , I included an example of binary uploads in the pull request. You can find it in the folder plumber/inst/examples/13-binary-upload/. Please note the following two things:
The examples look as follows: # To upload and use the content of the file in a function:
# curl -X POST http://localhost:9080/upload -F 'myfile=@test.feather'
#* @post /upload
function(name, data) {
# work with binary files after upload (in this example a feather file)
content <- read_feather(data)
return(content)
}
# To upload and use the properties of the file in a function:
# curl -X POST http://localhost:9080/inspect -F 'myfile=@test.feather'
#* @post /inspect
function(name, data) {
file_info <- data.frame(
filename = name,
mtime = file.info(data)$mtime,
ctime = file.info(data)$ctime
)
return(file_info)
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry this took so long for me to get back to, Frans. I had somehow missed that the ball was in my court!)
This is getting close. There are a few missing features (like limits on how big of a file we'd try to bring in -- since it does get loaded into memory -- and testing), but I like the approach.
How much time do you have to devote to this? With the Thanksgiving week next week I'm hoping to have some time to devote to Plumber and to finishing up the release. I might be able to build a few more commits on top of this to get it to a point where I'm happy with it, but if you were willing/able to do it yourself I'd be happy to go that route.
One thing we could consider is using some sort of "feature flag" in which we could merge this so we can keep iterating on it but require that users opt-in (perhaps via some R option) in order to get this behavior. That way we can get it in on master and keep working on it there and it's available for users who want it but without putting anyone else at risk while we iron out the kinks.
if (!is.null(req$CONTENT_TYPE) && grepl("multipart/form-data; boundary=", | ||
req$CONTENT_TYPE, fixed = TRUE)) { | ||
|
||
boundary <- strsplit(req$CONTENT_TYPE, split = "=")[[1]][2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this vulnerable to having an =
sign in the body of the message?
return(list()) | ||
} | ||
parsed_binary <- webutils::parse_multipart(body, boundary) | ||
file_name <- parsed_binary$myfile$filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time finding where myfile
comes from? Is that baked in to parse_multipart
's return value?
file_name <- parsed_binary$myfile$filename | ||
file_data <- parsed_binary$myfile$value | ||
|
||
tmpfile <- tempfile() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. It's unfortunate that we're not able to clean up these temp files when we're done. But if we added an on.exit
here the file would be gone before we could access the file downstream. Maybe I need to add a hook to the response object so that we can clean up any associated resources when it's done.
if (is.null(body) || length(body) == 0 || body == "") { | ||
return(list()) | ||
} | ||
parsed_binary <- webutils::parse_multipart(body, boundary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there limits built in to any of these on how big of a file we'd attempt to parse? Anything to prevent a client from sending a 10GB file and running the server out of RAM?
tmpfile <- tempfile() | ||
writeBin(file_data, tmpfile) | ||
|
||
ret <- NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this any different than ret <-list(name = file_name, data=tmpfile)
?
It surprised me that you could assign to a field on a NULL objec.t
Oh and the one other high-level concern I had was around the hard-coding (I think) of the file name. I'd love to make that more generic before we bring it in. |
Any updates on this feature? |
|
#550 adds support for any type of file, this should cover this use case and I think this one can be closed. Documentations to follow |
Hi Jeff,
This was not an open request in the issues, but was something that came up in a plumber use case: the need to be able to send feather (binary) files. I would much appreciate your review, as I think this may not be the most elegant solution. But it works, provided you upload the file as multipart/form-data:
It then uses
webutils
from Jeroen Ooms to parse the multipart content into its different components and writes out a temporary file to read it back in using feather.What I have not been able to do (yet) is write a test for this. I've tried to copy an example multipart upload out of the debug session (by transforming rawToChar()) but for some reason I cannot get it right. Any pointers there are welcome.