Added support for feather files in POST #77

Closed · wants to merge 10 commits

Conversation


@FvD FvD commented Mar 1, 2017

Hi Jeff,

This was not an open request in the issues, but was something that came up in a plumber use case: the need to be able to send feather (binary) files. I would much appreciate your review, as I think this may not be the most elegant solution. But it works, provided you upload the file as multipart/form-data:

 curl -X POST http://localhost:9080/ -F 'myfile=@mydata.feather'   

It then uses webutils from Jeroen Ooms to parse the multipart content into its different components and writes out a temporary file to read it back in using feather.

What I have not been able to do (yet) is write a test for this. I've tried to copy an example multipart upload out of a debug session (transforming it with rawToChar()), but for some reason I cannot get it right. Any pointers are welcome.


FvD commented Mar 1, 2017

For some reason the installation of feather failed in Travis CI. I'll have a look why.

But before making any changes, I'd like to know your opinion on generalizing this to more (all?) binary formats. One option is to move all the feather-specific code out, so that the raw binary body is returned and handled by the function written in the endpoint. We can return the name of the handle and the name of the file, as both are available in the multipart object, along with the tmpfile location of the binary file (as set by tempfile()).

The other option is to handle files here, either based on Content-Type (which the user then has to include in the POST request) or on the file extension (which has to be correct, otherwise it will fail, though this is seldom an issue). The trouble is that there are many different formats, and why should plumber dictate how to handle XLSX or ODS files, for instance? It may be much better to leave all that to the code in the endpoint.


@trestletech trestletech left a comment


This looks really promising! Thanks for taking the time to put it together.

As pointed out in #75, I think there's potential here to have much broader applicability. My preference would be to attach a files list off of the incoming request that could provide pointers to the relevant files that were uploaded and are now stored on the server's disk. Then the caller could just decide themselves that they want to read in the file as a feather file, for instance. And plumber doesn't have to be in the business of providing special parsers for every file type someone might want to upload.

So you could do something like

@post /upload
function(req){
  cat("Files uploaded: ", names(req$files))
  feather::read_feather(req$files[[1]])
  ...
}

We'd also need to figure out how to limit the filetypes and file sizes that we want people uploading. I'd hope the server would start off with an empty list of supported file extensions which could be added to in $run(), perhaps.
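One way the size limit mentioned above could work is a check against the request's Content-Length header before the body is parsed. This is only a sketch under assumptions: the helper name `check_upload_size` and the 10 MB default are illustrative, not plumber API; the Rook-style request environment does expose a CONTENT_LENGTH variable that could feed it.

```r
# Hypothetical sketch (not part of this PR): reject oversized uploads
# before parsing the multipart body. The helper name and the 10 MB
# default limit are illustrative assumptions, not plumber API.
check_upload_size <- function(content_length, max_bytes = 10 * 1024^2) {
  size <- suppressWarnings(as.numeric(content_length))
  if (length(size) != 1 || is.na(size) || size <= 0) {
    stop("Missing or invalid Content-Length header")
  }
  if (size > max_bytes) {
    stop(sprintf("Upload of %.0f bytes exceeds the %.0f byte limit",
                 size, max_bytes))
  }
  invisible(TRUE)
}
```

An endpoint filter could call this with `req$CONTENT_LENGTH` and return a 413 response instead of parsing the body when it errors.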

R/post-body.R Outdated
boundary <- "-----------------"
parsed_binary <- webutils::parse_multipart(body, boundary)
file_name <- parsed_binary[[1]][[1]]
file_data <- parsed_binary[[1]][[2]]
Contributor

Am I understanding correctly that we could iterate here to support more than one file?

Contributor Author

I'm not sure yet. The numeric list indexes used here are not the best way of accessing the parsed content. This looks better (myfile is the form-field name from the upload, which webutils::parse_multipart() uses to name the list element):

parsed_binary$myfile$filename
parsed_binary$myfile$value

For readability I've now implemented it with named lists, and I have yet to figure out how to parse multiple files.

R/post-body.R Outdated
file_data <- parsed_binary[[1]][[2]]

tmpfile <- tempfile()
writeBin(file_data, tmpfile)
Contributor

Would be neat if we could keep this off disk using textConnections, but I know some packages struggle with connections and prefer files, so maybe landing this stuff on disk is the more generic solution.

Contributor Author

For my purposes I would prefer to land the files on disk. If we used textConnections, the data would not be persistent at all, and would not be available after a restart of the service.

I can imagine use cases where persistence would not be an issue. Maybe this is something to make a separate ticket of, to add as an option later.

DESCRIPTION Outdated
@@ -18,7 +18,9 @@ Imports:
     R6 (>= 2.0.0),
     stringi (>= 0.3.0),
     jsonlite (>= 0.9.16),
-    httpuv (>= 1.2.3)
+    httpuv (>= 1.2.3),
+    feather,
Contributor

I'd hope to keep feather out of Imports and in Suggests. See how I handle the htmlwidgets package in this PR: https://github.com/trestletech/plumber/pull/82/files

We could just suggest and then, if a user goes to use a feature that requires a particular package and doesn't have it, we'd abort at runtime.

Contributor Author

Absolutely! I had not looked carefully enough at how you are handling this. Your solution is elegant. Also, perhaps I should focus on a more generic solution to POST binary files in plumber and leave the handling of feather or other formats to the API code.

R/post-body.R Outdated
# does not work when the boundary is set to the content type as per the
# webutils docs.
boundary <- "-----------------"
parsed_binary <- webutils::parse_multipart(body, boundary)
Contributor

I'm unfamiliar with the specs here but this feels pretty hacky. I wonder if this is something that we should just fix in webutils directly?

Contributor Author

There is nothing wrong with webutils; it was just poorly implemented on my part. The full boundary is part of req$CONTENT_TYPE, so we can fetch the current boundary like so:

boundary <- strsplit(req$CONTENT_TYPE, split = "=")[[1]][2]
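For instance, run against a typical header value this extracts the boundary token (the header below is a made-up illustration, not output from the PR):

```r
# Illustration of the strsplit approach on a made-up Content-Type header.
ct <- "multipart/form-data; boundary=------------------------1234abcd"
boundary <- strsplit(ct, split = "=")[[1]][2]
boundary  # "------------------------1234abcd"
```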


FvD commented Aug 1, 2017

Hi Jeff,

I've pushed the update based on your review. I could not implement everything yet; perhaps I can set up tickets to tackle the rest in the future.

Things changed:

  • Made a generic solution for binary files
  • Made a reference to the boundary in CONTENT_TYPE for use in webutils::parse_multipart()
  • Moved additional packages from Imports to Suggests, so as not to clutter Imports, and warn the user when a suggested package is required
  • Added a specific use case of feather files to an example

Things not changed (yet):

  • [ ] Accept and parse multiple files
  • [ ] Leave data in memory until the user decides to write to disk
  • [ ] Specify supported (accepted?) extensions in the call to $run()

I think this is a workable answer to #75 and #126. But by all means send this back if you see a need for further changes before merging.


FvD commented Aug 1, 2017

Travis fails with the message:

checking package dependencies ... ERROR
Namespace dependency not required: ‘crayon’

Crayon is outside of the scope of the changes I have made. Should I merge with upstream and commit again?

@korterling

Any example of how to upload a file?

@post /upload
function(req){
cat("Files uploaded: ", names(req$files))
}
curl -X POST http://localhost:1371/upload -F 'req=./test.txt'


FvD commented Aug 11, 2017

Hi @yangbenfa, I included an example of binary uploads in the pull request. You can find it in the folder plumber/inst/examples/13-binary-upload/. Please note the following two things:

  1. For it to work you have to install from the feather branch in my repository - this pull request is still under review and has not been merged yet.
  2. I have only tested this with binary files. I'm not sure if a text file would work, but I would love to know whether it does. If it does not work, please let me know.

The examples look as follows:

# To upload and use the content of the file in a function:
# curl -X POST http://localhost:9080/upload -F 'myfile=@test.feather'

#* @post /upload
function(name, data) {
  # work with binary files after upload (in this example a feather file)
  content <- feather::read_feather(data)
  return(content)
}

# To upload and use the properties of the file in a function:
# curl -X POST http://localhost:9080/inspect -F 'myfile=@test.feather'

#* @post /inspect
function(name, data) {
  file_info <- data.frame(
    filename = name,
    mtime = file.info(data)$mtime,
    ctime = file.info(data)$ctime
    )
  return(file_info)
}


@trestletech trestletech left a comment


(Sorry this took so long for me to get back to, Frans. I had somehow missed that the ball was in my court!)

This is getting close. There are a few missing features (like limits on how big of a file we'd try to bring in -- since it does get loaded into memory -- and testing), but I like the approach.

How much time do you have to devote to this? With the Thanksgiving week next week I'm hoping to have some time to devote to Plumber and to finishing up the release. I might be able to build a few more commits on top of this to get it to a point where I'm happy with it, but if you were willing/able to do it yourself I'd be happy to go that route.

One thing we could consider is using some sort of "feature flag" in which we could merge this so we can keep iterating on it but require that users opt-in (perhaps via some R option) in order to get this behavior. That way we can get it in on master and keep working on it there and it's available for users who want it but without putting anyone else at risk while we iron out the kinks.

if (!is.null(req$CONTENT_TYPE) && grepl("multipart/form-data; boundary=",
req$CONTENT_TYPE, fixed = TRUE)) {

boundary <- strsplit(req$CONTENT_TYPE, split = "=")[[1]][2]
Contributor

Is this vulnerable to having an = sign in the body of the message?
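The concern is real in one case: RFC 2046 permits "=" inside the boundary token itself, and splitting on every "=" truncates such a boundary. A hedged sketch of the difference, with a made-up header (stripping the known prefix instead of splitting keeps the boundary intact):

```r
# Made-up Content-Type header whose boundary token contains "=".
ct <- "multipart/form-data; boundary=----abc==xyz"

# Splitting on every "=" keeps only the piece before the first
# embedded "=", truncating the boundary:
strsplit(ct, split = "=")[[1]][2]               # "----abc"

# Stripping the known prefix preserves the full boundary token:
sub("^multipart/form-data; boundary=", "", ct)  # "----abc==xyz"
```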

return(list())
}
parsed_binary <- webutils::parse_multipart(body, boundary)
file_name <- parsed_binary$myfile$filename
Contributor

I'm having a hard time finding where myfile comes from. Is that baked into parse_multipart's return value?

file_name <- parsed_binary$myfile$filename
file_data <- parsed_binary$myfile$value

tmpfile <- tempfile()
Contributor

Hm. It's unfortunate that we're not able to clean up these temp files when we're done. But if we added an on.exit here the file would be gone before we could access the file downstream. Maybe I need to add a hook to the response object so that we can clean up any associated resources when it's done.
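The hook idea could look something like the registry below. This is a sketch only: plumber had no response-cleanup hook at the time of this review, and every name here (`.upload_registry`, `register_tempfile`, `cleanup_tempfiles`) is hypothetical, not plumber API.

```r
# Hypothetical sketch of deferred temp-file cleanup: record each temp
# file as it is created, then delete the batch once a hook fires after
# the response has been sent. None of these names are plumber API.
.upload_registry <- new.env(parent = emptyenv())
.upload_registry$files <- character(0)

register_tempfile <- function(path) {
  .upload_registry$files <- c(.upload_registry$files, path)
  path
}

cleanup_tempfiles <- function() {
  existing <- .upload_registry$files[file.exists(.upload_registry$files)]
  unlink(existing)
  .upload_registry$files <- character(0)
  invisible(length(existing))
}
```

The body parser would call `writeBin(file_data, register_tempfile(tmpfile))`, and a response-done hook could call `cleanup_tempfiles()` once the endpoint has finished with the files.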

if (is.null(body) || length(body) == 0 || body == "") {
return(list())
}
parsed_binary <- webutils::parse_multipart(body, boundary)
Contributor

Are there limits built in to any of these on how big of a file we'd attempt to parse? Anything to prevent a client from sending a 10GB file and running the server out of RAM?

tmpfile <- tempfile()
writeBin(file_data, tmpfile)

ret <- NULL
Contributor

Is this any different from ret <- list(name = file_name, data = tmpfile)?

It surprised me that you could assign to a field on a NULL object.

@trestletech
Contributor

Oh and the one other high-level concern I had was around the hard-coding (I think) of the file name. I'd love to make that more generic before we bring it in.
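One way the hard-coded field name could be avoided is to iterate over every part that webutils::parse_multipart() returns, rather than reaching for `$myfile`. A hedged sketch; the helper name `parts_to_files` and the returned list shape are illustrative assumptions, not code from this PR:

```r
# Hypothetical generalization: write each uploaded part to its own temp
# file, keyed by the form-field name, instead of hard-coding "myfile".
# Assumes each part is a list with $filename and $value (raw vector),
# as webutils::parse_multipart() produces.
parts_to_files <- function(parsed) {
  lapply(parsed, function(part) {
    tmp <- tempfile()
    writeBin(part$value, tmp)
    list(filename = part$filename, data = tmp)
  })
}
```

An endpoint could then loop over the result and decide per field how to read each file back in.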

@eduardszoecs

Any updates on this feature?


CLAassistant commented Sep 4, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Collaborator

meztez commented Jun 27, 2020

#550 adds support for any type of file, which should cover this use case, so I think this one can be closed. Documentation to follow.

@schloerke schloerke added the Fixed in PR A PR has implemented a fix for this Issue label Jul 31, 2020
@schloerke schloerke added this to the v0.5.0 - Next CRAN release milestone Jul 31, 2020
7 participants