[WIP] feat: add support for chunked uploads #851

hugomrdias · 2018-09-03T16:22:50Z

This is still a work in progress to add support for chunked uploads (ipfs.add) and fix multiple issues related to adding big files.

Tests are filtered here https://github.com/ipfs/js-ipfs-api/blob/90c40363fbcd55d29307e51f4feabb8be867ded8/test/add-experimental.spec.js#L38-L46 to make review easy, just run ipfs daemon with ipfs/js-ipfs#1540

features/fixes in this PR together with ipfs/js-ipfs#1540:

big data add non-chunked (this will either break with browser memory or hit the maxBytes config in the daemon, see next)
really big data add chunked (theoretically the limit is daemon disk space or maybe request timeouts)
streaming progress reporting
error handling and reporting
add multiple files with wrapWithDirectory
improved browser support, handles File's directly from the input

const files = document.getElementById('file').files;
        this.ipfsApi
            .add([...files], {
                wrapWithDirectory: true,
                experimental: true,
                progress: prog => console.log(`received back: ${prog}`)
                chunkSize: 10 * 1024 * 1024
            })
            .then(console.log)
            .catch(console.error);

jsdoc for top level api and more

Notes:

trailers https://stackoverflow.com/questions/13371367/do-any-browsers-support-trailers-sent-in-chunked-encoding-responses

Needs:

[WIP] feat: support chunked add requests ipfs/js-ipfs#1540

Todo:

validate this example works after https://github.com/ipfs/js-ipfs-api/tree/master/examples/upload-file-via-browser
what to do with progress ? add another handler ?
multiple files return from the daemon only first hash
~~concurrent upload chunks~~ new PR for this
check uuid impl (maybe change to uuid v5 or nano-id)
avoid preflight as much has possible
callbackify top level
try handling non chunked
fix multipart boundary handling for non chunked

Uploading big files crush page in Chrome #654
upload-file-via-browser can only upload file size less than 165MB #788
Adding 50Mb+ Files to a browser js-ipfs crashes Chrome ipfs/js-ipfs#952
https://github.com/ipfs/js-ipfs-api/issues/842
https://github.com/ipfs/js-ipfs-api/issues/797 - make sure this PR fixes this /cc @lidel
Unable to upload to IPFS API: passing multiple files to ipfs.files.add sometimes fails ipfs/ipfs-companion#480 - will this PR fix this ? /cc @lidel

lidel

Thank you for working on this @hugomrdias!

My understanding is that

Upload payload is split into small parts (chunkSize = 256000)
(we probably want to make it a parameter)
Each part is sent as a a sequence of HTTP POST requests that have
- a unique identifier for entire upload session (uuid? – see below)
- a sequential counter within upload session (a chunk index)
API backend needs to support additional HTTP headers to perform re-assembly of entire payload from chunks and passing it to the regular ipfs.files.add call in a transparent manner
- PR for js-ipfs: [WIP] feat: support chunked add requests ipfs/js-ipfs#1540
- PR for go-ipfs: (TODO)
The goal is to hide chunking behind ordinary ipfs.files.add but for now /api/v0/add-chunked is used as the PoC endpoint in js-ipfs

Please let me know if I missed anything or you have a different vision for it.

Some early feedback from my end

👍 for moving forward: I am very hopeful this will address various bugs and limitations coming from buffering entire thing in memory in web browser contexts.
(we really need to solve big uploads from the browser.. without crashing it 🙃)
for progress reporting we could so various things, from the top of my head options are:
- A) add another endpoint or a header and periodically send asynchronous request to fetch upload progress updates for specific "group identifier" (unique identifier of ongoing upload session)
- B) do the same thing files.add in go-ipfs does right now (streaming status information while processing large requests), but make it aware of total upload size
- C) do nothing extra, and report progress based on how many of chunks were uploaded (add ability to control chunk size via param to control progress reporting resolution vs performance)
On backends:
- js-ipfs: will comment in [WIP] feat: support chunked add requests ipfs/js-ipfs#1540
- go-ipfs: as soon we have a working PoC against js-ipfs we should write a small spec and include go-ipfs in the loop to plan adding support there as well
  - Note to self: increased number of HTTP calls might produce a bigger surface for http: invalid Read on closed Body to occur when used with go-ipfs.
API endpoint: Is the plan to use a separate endpoint or eventually merge support into /api/v0/add? (I assume the latter)
Added some inline comments with usual bikeshed :))

lidel · 2018-09-03T21:59:56Z

src/add2/add2.js

+    .then(res => res.json())
+}
+
+function createName () {


Ipfs-Chunk-Name

I may be missing something here, but is there a reason why we can't use UUID v5 here? JavaScript's random numbers are usually weak, so v5 sounds like a safer option:

RFC 4122 advises that "distributed applications generating UUIDs at a variety of hosts must be willing to rely on the random number source at all hosts. If this is not feasible, the namespace variant should be used."

Fast generation of RFC4122 UUIDs: require('uuid/v5').
If use of UUID here is fine, we may consider renaming the field to -Uuid or even -Chunk-Group-Uuid to remove ambiguity.

lidel · 2018-09-03T22:03:06Z

src/add2/add2.js

+      'Content-Range': `bytes ${start}-${end}/${size}`,
+      'Ipfs-Chunk-Name': name,
+      'Ipfs-Chunk-Id': id,
+      'Ipfs-Chunk-Boundary': boundary


Should we use X- prefix for all custom headers?
We already have X-Ipfs-Path and I wonder if we should follow that convention.

lidel · 2018-09-03T22:05:24Z

src/add2/add2.js

+      'Content-Type': 'application/octet-stream',
+      'Content-Range': `bytes ${start}-${end}/${size}`,
+      'Ipfs-Chunk-Name': name,
+      'Ipfs-Chunk-Id': id,


Id probably should be renamed to Index (its index everywhere else)

hugomrdias · 2018-09-04T11:21:27Z

@lidel your understanding is correct :), updated the PR with some of your feedback

regarding the uuid i had looked into it, for now i want to keep the poor man's version should be safe for now, it goes over math.random a couple of times. (i have a note to go back to this)

final integration will use the normal add api, only with one change a new option called chunkSize, if this option is set to a number we go through the chunked codepath.

about progress im still trying to add directly without files if i succeed this should work the same as right now, if not, one solution i though was adding a new handler uploadProgress.

the current progress handler would still work as-is but only in the last request and it would mean adding to ipfs progress only and uploadProgress would mean upload only progress. With this we wouldn't actually break anything relying on the progress handler the user would only see 0% for a long time (uploading) and on the last request it would update correctly as data goes in ipfs (adding) to improve on this the developer will have the new updloadProgress. Does this make sense ?

lidel · 2018-09-04T12:42:10Z

@hugomrdias thanks!

My view is that we should do our best to to make it work without changing current progress API.
Details of the chunked upload should be abstracted away in best-effort fashion and hidden behind existing progress reporter.

What if we detect presence of chunkSize parameter, and switch logic used for progress reporting behind the scenes?

For upload split into N chunks:

uploading chunks 1 to (N-1) would show "upload only progress"
(initially we could just return % based on the number of uploaded chunks, more resolution can be added later)
uploading the last chunk N could show real "add progress" but only when it is bigger than "upload progress"

The end result would be a best-effort progress reporting that works with existing API and that is not stuck at 0% until the last chunk and behaves in expected manner (% always grows).

hugomrdias · 2018-09-04T17:57:05Z

lidel

🚲 🏡

Some feedback on how to make this feature more dev friendly:

Reducing number of headers

I just noticed we already have kinda unrelated header with chunk in it: X-Chunked-Output.
We probably should follow that naming convention somehow, or avoid use of chunk in names, maybe rename it with upload. Which got me thinking about the number of new headers this PR introduces and how we can make it simpler.

What if we replace three custom headers with only one + repurpose Content-Range?
I think you already started refactoring in that direction, but just for the record:

X-Chunked-Input: <upload-group-uuid>
Content-Range: <unit> <range-start>-<range-end>/<total-size>

X-Chunked-Input + Content-Range (or X- version with the same semantics) seem to pass all the info we need for re-assembly chunks in js-ipfs.

Chunked upload should also work without `js-ipfs-api`

A good indicator of being client-agnostic will be a demo of this type of upload with curl.
Assuming we have chunks on disk, we should be able to do something like below:

$ curl -X POST http://127.0.0.1:5001/api/v0/add-chunked  \
-H 'Content-Type: application/octet-stream' \
-H 'Content-Disposition: "file; filename="BIG-FILE.avi"' \
-H 'X-Chunked-Input: 87d0d13b-07de-4df1-b274-9b26a07a6f2a' \
-H 'Content-Range: bytes 0-1048576/2097152' \                                               
--data-binary @/tmp/chunk1

$ curl -X POST http://127.0.0.1:5001/api/v0/add-chunked  \
-H 'Content-Type: application/octet-stream' \
-H 'Content-Disposition: "file; filename="BIG-FILE.avi"' \
-H 'X-Chunked-Input: 87d0d13b-07de-4df1-b274-9b26a07a6f2a' \
-H 'Content-Range: bytes 1048577-2097152/2097152' \                                               
--data-binary @/tmp/chunk2

..and get CID of BIG-FILE.avi in response for the last request.

How retry/resume will look like?

What happens when n-th chunk fails due to temporary network problem?

js-ipfs-api: Fail entire upload? Retry failed chunk a few times before throwing an error with the offset of the end of last successful chunk?
Perhaps js-ipfs-api could internally retry chunk three times before returning an error.

js-ipfs: should there be a way for resuming failed upload from last successful chunk?

hugomrdias · 2018-09-10T16:24:13Z

@lidel the first two topics should be adressed in the last commit

about the resumeable stuff, its mostly:

having good errors for failed chunks, http-api should retry those
extra GET endpoint to return uploaded chunks with this response http-api should able to figure out the missing chunks and only upload those
one thing missing is how too identify a upload session to resume the current uuid is not enough, need to do more research for this

so, lets leave the resume feature to a follow up PR

hugomrdias · 2018-09-11T16:31:34Z

the jsdoc should create some nice docs with documentation.js

npx documentation serve ./js-ipfs-api/src/add2/add2.js -w -f html
run this cmd outside of the repo's folder to get the latest documentation.js, aegir still uses an old one.

also should give code completion to anyone using editors with jsdoc support

this can bubble up to the top level public api with minimal changes to this file

vasco-santos

All in all, it is looking good to me. Added some minor notes

src/files/add-experimental.js

vasco-santos · 2018-09-18T09:28:29Z

src/files/add-experimental.js

+ * @typedef {Object} AddResult
+ * @property {string} path
+ * @property {string} hash
+ * @property {number} size


Add descriptions for the properties here as well

src/files/add-pull-stream.js

src/utils/multipart-experimental.js

src/utils/send-stream-experimental.js

vasco-santos · 2018-09-18T10:12:01Z

src/utils/send-stream-experimental.js

+    this.index = 0
+    this.rangeStart = 0
+    this.rangeEnd = 0
+    this.rangeTotal = 0


What about using an object to handle all the range related properties?

src/utils/send-stream-experimental.js

vasco-santos · 2018-09-18T10:15:56Z

test/add-experimental.spec.js

+    // end runs
+  })
+
+  it.skip('files.add pins by default', (done) => {


What is blocking this?

Doesn't run connected to a js daemon, not related to this PR

vasco-santos · 2018-09-18T10:16:01Z

test/add-experimental.spec.js

+    })
+  })
+
+  it.skip('files.add with pin=false', (done) => {


What is blocking this?

Doesn't run connected to a js daemon, not related to this PR

alanshaw · 2018-09-27T11:42:55Z

@Stebalien could we get your thoughts on adding this to go-ipfs?

This PR is adding a feature to the HTTP add endpoint that will allow big files to be uploaded to IPFS by making multiple requests.

@lidel kindly put together a good summary of the proposed process:

Upload payload is split into small parts (chunkSize = 256000)

Each part is sent as a a sequence of HTTP POST requests that have

a unique identifier for entire upload session (uuid? – see below)

a sequential counter within upload session (a chunk index)

API backend needs to support additional HTTP headers to perform re-assembly of entire payload from chunks and passing it to the regular ipfs.files.add call in a transparent manner

PR for js-ipfs: [WIP] feat: support chunked add requests ipfs/js-ipfs#1540

PR for go-ipfs: (TODO)

Reasons for doing this:

It's not possible to stream a HTTP upload request (in Firefox) without buffering the entire payload into memory first
Has potential to allow resume for failed upload requests

alanshaw · 2018-09-27T11:59:14Z

@hugomrdias @lidel I think that regardless of what happens with this PR we need to switch to using the streaming fetch API. Firefox is notably the only browser that hasn't shipped the streams API yet but it sounds like this might happen soon. I think we can conditionally opt out of it for Firefox for the time being.

Switching to using the streaming fetch API will solve the buffering issue without any changes to the HTTP API and depending on priorities for go-ipfs we might be able to ship this before chunked uploads.

It's also worth noting that, streaming fetch will be way more efficient then multiple HTTP requests for chunked uploading.

hugomrdias · 2018-09-27T12:01:17Z

That's only for response bodies not request bodies, this is the only way currently available to us. I didn't find any indication that request bodies will get streams soon on any browser

alanshaw · 2018-09-28T09:56:42Z

That's only for response bodies not request bodies, this is the only way currently available to us. I didn't find any indication that request bodies will get streams soon on any browser

You're absolutely right - my bad. Thanks for clarifying!

lidel · 2019-07-10T12:06:56Z

Sounds like worth mentioning here, in case concepts from this PR are revisited in the future:

a proposal of open protocol for resumable file uploads:
https://tus.io / https://tus.io/protocols/resumable-upload.html

hugomrdias · 2019-07-11T15:05:08Z

Sounds like worth mentioning here, in case concepts from this PR are revisited in the future:

a proposal of open protocol for resumable file uploads:
tus.io / tus.io/protocols/resumable-upload.html

yep i based the impl in tus

feat: add support for chunked uploads

775213a

ghost assigned hugomrdias Sep 3, 2018

ghost added the in progress label Sep 3, 2018

hugomrdias mentioned this pull request Sep 3, 2018

[WIP] feat: support chunked add requests ipfs/js-ipfs#1540

Closed

3 tasks

fix: fix sendChunkRequest

01a575e

lidel reviewed Sep 3, 2018

View reviewed changes

alanshaw changed the title ~~WIP feat: add support for chunked uploads~~ [WIP] feat: add support for chunked uploads Sep 4, 2018

fix: cleanup code, headers, chunkSize options

695b23e

feat: support File directly from input elements

f9df326

kziemianek mentioned this pull request Sep 5, 2018

Support large file uploads to IPFS embarklabs/embark#774

Closed

hugomrdias added 2 commits September 6, 2018 16:35

fix: error handling and stream backpressure

52a6117

feat: refactor to send to SendStream, non-chunked, add*Stream methods

0da5783

lidel reviewed Sep 10, 2018

View reviewed changes

feat: improved header and curl test files

d58e8cb

hugomrdias added 2 commits September 11, 2018 10:19

fix: use nanoid for uuid

300af44

feat: callbackify and top level jsdoc and more

3cd19e7

hugomrdias added 3 commits September 12, 2018 11:03

fix: integration with old add

4191525

feat: tests, concurrency, simplification

0a4f008

fix: remove concurrency for now

90c4036

hugomrdias changed the title ~~[WIP] feat: add support for chunked uploads~~ feat: add support for chunked uploads Sep 17, 2018

hugomrdias requested review from alanshaw, daviddias and vasco-santos September 17, 2018 12:49

vasco-santos suggested changes Sep 18, 2018

View reviewed changes

fix: feedback changes

d596295

hugomrdias added 2 commits September 18, 2018 16:27

fix: add some property descriptions

7489447

fix: add jsdoc to pushFile

a3bf0a6

lidel mentioned this pull request Sep 20, 2018

Improving Uploads of Big Files via Web Browsers ipfs/in-web-browsers#115

Open

4 tasks

alanshaw changed the title ~~feat: add support for chunked uploads~~ [WIP] feat: add support for chunked uploads Nov 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] feat: add support for chunked uploads #851

[WIP] feat: add support for chunked uploads #851

hugomrdias commented Sep 3, 2018 •

edited

Loading

lidel left a comment •

edited

Loading

lidel Sep 3, 2018

lidel Sep 3, 2018

lidel Sep 3, 2018

hugomrdias commented Sep 4, 2018 •

edited

Loading

lidel commented Sep 4, 2018

hugomrdias commented Sep 4, 2018

lidel left a comment •

edited

Loading

hugomrdias commented Sep 10, 2018

hugomrdias commented Sep 11, 2018 •

edited

Loading

vasco-santos left a comment

vasco-santos Sep 18, 2018

vasco-santos Sep 18, 2018

vasco-santos Sep 18, 2018

hugomrdias Sep 18, 2018

vasco-santos Sep 18, 2018

hugomrdias Sep 18, 2018

alanshaw commented Sep 27, 2018 •

edited

Loading

alanshaw commented Sep 27, 2018

hugomrdias commented Sep 27, 2018 •

edited

Loading

alanshaw commented Sep 28, 2018

lidel commented Jul 10, 2019

hugomrdias commented Jul 11, 2019

[WIP] feat: add support for chunked uploads #851

Are you sure you want to change the base?

[WIP] feat: add support for chunked uploads #851

Conversation

hugomrdias commented Sep 3, 2018 • edited Loading

lidel left a comment • edited Loading

Choose a reason for hiding this comment

My understanding is that

Some early feedback from my end

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hugomrdias commented Sep 4, 2018 • edited Loading

lidel commented Sep 4, 2018

hugomrdias commented Sep 4, 2018

lidel left a comment • edited Loading

Choose a reason for hiding this comment

Reducing number of headers

Chunked upload should also work without js-ipfs-api

How retry/resume will look like?

hugomrdias commented Sep 10, 2018

hugomrdias commented Sep 11, 2018 • edited Loading

vasco-santos left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alanshaw commented Sep 27, 2018 • edited Loading

alanshaw commented Sep 27, 2018

hugomrdias commented Sep 27, 2018 • edited Loading

alanshaw commented Sep 28, 2018

lidel commented Jul 10, 2019

hugomrdias commented Jul 11, 2019

hugomrdias commented Sep 3, 2018 •

edited

Loading

lidel left a comment •

edited

Loading

hugomrdias commented Sep 4, 2018 •

edited

Loading

lidel left a comment •

edited

Loading

Chunked upload should also work without `js-ipfs-api`

hugomrdias commented Sep 11, 2018 •

edited

Loading

alanshaw commented Sep 27, 2018 •

edited

Loading

hugomrdias commented Sep 27, 2018 •

edited

Loading