This repository has been archived by the owner on Mar 10, 2020. It is now read-only.

[WIP] feat: add support for chunked uploads #851

Open
wants to merge 15 commits into master

Conversation

hugomrdias
Contributor

@hugomrdias hugomrdias commented Sep 3, 2018

This is still a work in progress to add support for chunked uploads (ipfs.add) and fix multiple issues related to adding big files.

Tests are filtered here https://github.com/ipfs/js-ipfs-api/blob/90c40363fbcd55d29307e51f4feabb8be867ded8/test/add-experimental.spec.js#L38-L46 to make review easy; just run the ipfs daemon with ipfs/js-ipfs#1540.

features/fixes in this PR together with ipfs/js-ipfs#1540:

  • big data add non-chunked (this will either break with browser memory or hit the maxBytes config in the daemon, see next)
  • really big data add chunked (theoretically the limit is daemon disk space or maybe request timeouts)
  • streaming progress reporting
  • error handling and reporting
  • add multiple files with wrapWithDirectory
  • improved browser support, handles File objects directly from the input
const files = document.getElementById('file').files;

this.ipfsApi
  .add([...files], {
    wrapWithDirectory: true,
    experimental: true,
    progress: prog => console.log(`received back: ${prog}`),
    chunkSize: 10 * 1024 * 1024
  })
  .then(console.log)
  .catch(console.error);
  • jsdoc for top level api and more

Notes:

Needs:

Todo:

Related:

Contributor

@lidel lidel left a comment

Thank you for working on this @hugomrdias!

My understanding is that

  • Upload payload is split into small parts (chunkSize = 256000)
    (we probably want to make it a parameter)
  • The parts are sent as a sequence of HTTP POST requests that each have
    • a unique identifier for entire upload session (uuid? – see below)
    • a sequential counter within upload session (a chunk index)
  • The API backend needs to support additional HTTP headers to perform re-assembly of the entire payload from chunks and pass it to the regular ipfs.files.add call in a transparent manner
  • The goal is to hide chunking behind ordinary ipfs.files.add but for now /api/v0/add-chunked is used as the PoC endpoint in js-ipfs
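A rough sketch of that flow as I read it (not the actual PR code; the header names are taken from the diff below, error handling and multipart framing are omitted):

const CHUNK_SIZE = 256000

async function addChunked (buffer, sessionName) {
  const size = buffer.length
  let res
  for (let index = 0, start = 0; start < size; index++, start += CHUNK_SIZE) {
    const end = Math.min(start + CHUNK_SIZE, size)
    res = await fetch('http://127.0.0.1:5001/api/v0/add-chunked', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/octet-stream',
        'Content-Range': `bytes ${start}-${end - 1}/${size}`,
        'Ipfs-Chunk-Name': sessionName,  // unique identifier for the upload session
        'Ipfs-Chunk-Id': String(index)   // sequential chunk counter within the session
      },
      body: buffer.slice(start, end)
    })
  }
  // the daemon replies with the ipfs.files.add result once the last chunk arrives
  return res.json()
}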

Please let me know if I missed anything or you have a different vision for it.

Some early feedback from my end

  • 👍 for moving forward: I am very hopeful this will address various bugs and limitations coming from buffering entire thing in memory in web browser contexts.
    (we really need to solve big uploads from the browser.. without crashing it 🙃)
  • for progress reporting we could do various things; off the top of my head the options are:
    • A) add another endpoint or a header and periodically send asynchronous request to fetch upload progress updates for specific "group identifier" (unique identifier of ongoing upload session)
    • B) do the same thing files.add in go-ipfs does right now (streaming status information while processing large requests), but make it aware of total upload size
    • C) do nothing extra, and report progress based on how many of chunks were uploaded (add ability to control chunk size via param to control progress reporting resolution vs performance)
  • On backends:
  • API endpoint: Is the plan to use a separate endpoint or eventually merge support into /api/v0/add? (I assume the latter)
  • Added some inline comments with usual bikeshed :))

src/add2/add2.js Outdated
.then(res => res.json())
}

function createName () {
Contributor

Ipfs-Chunk-Name

I may be missing something here, but is there a reason why we can't use UUID v5 here? JavaScript's random numbers are usually weak, so v5 sounds like a safer option:

RFC 4122 advises that "distributed applications generating UUIDs at a variety of hosts must be willing to rely on the random number source at all hosts. If this is not feasible, the namespace variant should be used."

Fast generation of RFC4122 UUIDs: require('uuid/v5').
If use of UUID here is fine, we may consider renaming the field to -Uuid or even -Chunk-Group-Uuid to remove ambiguity.
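A minimal sketch with the uuid package (the name used below is just a placeholder to make the id unique per session):

const uuidv5 = require('uuid/v5')

// v5 UUIDs are namespaced and deterministic, so the name should include
// something unique per upload session
const groupUuid = uuidv5(`chunked-upload-${Date.now()}`, uuidv5.URL)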

src/add2/add2.js Outdated
'Content-Range': `bytes ${start}-${end}/${size}`,
'Ipfs-Chunk-Name': name,
'Ipfs-Chunk-Id': id,
'Ipfs-Chunk-Boundary': boundary
Contributor

Should we use X- prefix for all custom headers?
We already have X-Ipfs-Path and I wonder if we should follow that convention.

src/add2/add2.js Outdated
'Content-Type': 'application/octet-stream',
'Content-Range': `bytes ${start}-${end}/${size}`,
'Ipfs-Chunk-Name': name,
'Ipfs-Chunk-Id': id,
Contributor

Id should probably be renamed to Index (it's index everywhere else)

@alanshaw alanshaw changed the title WIP feat: add support for chunked uploads [WIP] feat: add support for chunked uploads Sep 4, 2018
@hugomrdias
Contributor Author

hugomrdias commented Sep 4, 2018

@lidel your understanding is correct :), updated the PR with some of your feedback

Regarding the uuid, I looked into it; for now I want to keep the poor man's version, which should be safe enough since it goes over Math.random a couple of times. (I have a note to go back to this.)

The final integration will use the normal add API with only one change: a new option called chunkSize. If this option is set to a number, we go through the chunked code path.

About progress: I'm still trying to add directly without files. If I succeed, this should work the same as right now; if not, one solution I thought of was adding a new handler, uploadProgress.

The current progress handler would still work as-is, but only on the last request, and it would mean add-to-ipfs progress only, while uploadProgress would mean upload-only progress. With this we wouldn't actually break anything relying on the progress handler: the user would only see 0% for a long time (uploading), and on the last request it would update correctly as data goes into ipfs (adding). To improve on this, the developer would have the new uploadProgress. Does this make sense?
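Roughly, usage could end up looking like this (uploadProgress is only an idea at this point, not an implemented option):

ipfs.add(files, {
  chunkSize: 10 * 1024 * 1024,                             // opts into the chunked code path
  uploadProgress: sent => console.log(`uploaded ${sent}`), // upload-only progress (proposed)
  progress: added => console.log(`added ${added}`)         // existing handler: data added to ipfs
})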

@lidel
Contributor

lidel commented Sep 4, 2018

@hugomrdias thanks!

My view is that we should do our best to make it work without changing the current progress API.
Details of the chunked upload should be abstracted away in a best-effort fashion and hidden behind the existing progress reporter.

What if we detect the presence of the chunkSize parameter and switch the logic used for progress reporting behind the scenes?

For upload split into N chunks:

  • uploading chunks 1 to (N-1) would show "upload only progress"
    (initially we could just return % based on the number of uploaded chunks, more resolution can be added later)
  • uploading the last chunk N could show real "add progress" but only when it is bigger than "upload progress"

The end result would be best-effort progress reporting that works with the existing API, is not stuck at 0% until the last chunk, and behaves in the expected manner (% always grows).
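Something along these lines (purely illustrative, the names are not from the PR):

// report chunk-count progress while uploading, then only let the real add
// progress through once it exceeds what was already reported
function makeProgressAdapter (totalChunks, totalBytes, userProgress) {
  let reported = 0
  return {
    // chunks 1..N-1: approximate progress from the number of uploaded chunks
    onChunkUploaded (chunkIndex) {
      const approx = Math.floor(((chunkIndex + 1) / totalChunks) * totalBytes)
      reported = Math.max(reported, approx)
      userProgress(reported)
    },
    // last chunk: real add progress, reported only once it exceeds what we already showed
    onAddProgress (bytesAdded) {
      if (bytesAdded > reported) {
        reported = bytesAdded
        userProgress(reported)
      }
    }
  }
}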

@hugomrdias
Contributor Author

ipfs-chunked-add

Contributor

@lidel lidel left a comment

🚲 🏡

Some feedback on how to make this feature more dev friendly:

Reducing number of headers

I just noticed we already have a kinda unrelated header with chunk in it: X-Chunked-Output.
We should probably follow that naming convention somehow, or avoid the use of chunk in names (maybe rename with upload). Which got me thinking about the number of new headers this PR introduces and how we can make it simpler.

What if we replace the three custom headers with only one and repurpose Content-Range?
I think you already started refactoring in that direction, but just for the record:

X-Chunked-Input: <upload-group-uuid>
Content-Range: <unit> <range-start>-<range-end>/<total-size>

X-Chunked-Input + Content-Range (or an X- version with the same semantics) seem to pass all the info we need for re-assembling chunks in js-ipfs.

Chunked upload should also work without js-ipfs-api

A good indicator of being client-agnostic will be a demo of this type of upload with curl.
Assuming we have chunks on disk, we should be able to do something like below:

$ curl -X POST http://127.0.0.1:5001/api/v0/add-chunked \
-H 'Content-Type: application/octet-stream' \
-H 'Content-Disposition: file; filename="BIG-FILE.avi"' \
-H 'X-Chunked-Input: 87d0d13b-07de-4df1-b274-9b26a07a6f2a' \
-H 'Content-Range: bytes 0-1048575/2097152' \
--data-binary @/tmp/chunk1

$ curl -X POST http://127.0.0.1:5001/api/v0/add-chunked \
-H 'Content-Type: application/octet-stream' \
-H 'Content-Disposition: file; filename="BIG-FILE.avi"' \
-H 'X-Chunked-Input: 87d0d13b-07de-4df1-b274-9b26a07a6f2a' \
-H 'Content-Range: bytes 1048576-2097151/2097152' \
--data-binary @/tmp/chunk2

...and get the CID of BIG-FILE.avi in the response to the last request.

How will retry/resume look?

What happens when the n-th chunk fails due to a temporary network problem?

js-ipfs-api: Fail the entire upload? Retry the failed chunk a few times before throwing an error with the offset of the end of the last successful chunk?
Perhaps js-ipfs-api could internally retry a chunk three times before returning an error.
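For the js-ipfs-api side, something like this could be enough (sendChunk is a placeholder for the per-chunk POST):

async function sendChunkWithRetry (sendChunk, chunk, retries = 3) {
  let lastError
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await sendChunk(chunk)
    } catch (err) {
      lastError = err // e.g. a transient network failure, try again
    }
  }
  throw lastError
}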

js-ipfs: should there be a way to resume a failed upload from the last successful chunk?

@hugomrdias
Contributor Author

@lidel the first two topics should be addressed in the last commit.

About the resumable stuff, it's mostly:

  • having good errors for failed chunks; http-api should retry those
  • an extra GET endpoint to return the uploaded chunks; with this response http-api should be able to figure out the missing chunks and only upload those (sketch below)
  • one thing missing is how to identify an upload session to resume; the current uuid is not enough, need to do more research on this
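For the record, the missing-chunk logic from the second bullet could look something like this (the GET endpoint and its response shape are hypothetical, nothing like this exists yet):

async function missingChunks (sessionId, totalChunks) {
  // hypothetical endpoint and response shape, e.g. { uploaded: [0, 1, 3] }
  const res = await fetch(`http://127.0.0.1:5001/api/v0/add-chunked?session=${sessionId}`)
  const { uploaded } = await res.json()
  const have = new Set(uploaded)
  const missing = []
  for (let i = 0; i < totalChunks; i++) {
    if (!have.has(i)) missing.push(i)
  }
  return missing // only these indexes need to be re-uploaded
}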

So, let's leave the resume feature to a follow-up PR.

@hugomrdias
Contributor Author

hugomrdias commented Sep 11, 2018

the jsdoc should create some nice docs with documentation.js

npx documentation serve ./js-ipfs-api/src/add2/add2.js -w -f html
Run this cmd outside of the repo's folder to get the latest documentation.js; aegir still uses an old one.

(screenshot: api-docs)

It should also give code completion to anyone using editors with jsdoc support:
(screenshot: api-completion)
This can bubble up to the top-level public API with minimal changes to this file.

@hugomrdias hugomrdias changed the title [WIP] feat: add support for chunked uploads feat: add support for chunked uploads Sep 17, 2018
Contributor

@vasco-santos vasco-santos left a comment

All in all, it is looking good to me. Added some minor notes

src/files/add-experimental.js (resolved)
* @typedef {Object} AddResult
* @property {string} path
* @property {string} hash
* @property {number} size
Contributor

Add descriptions for the properties here as well
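For example, something like this (the descriptions are illustrative, not from the codebase):

* @typedef {Object} AddResult
* @property {string} path - path of the file/directory inside the added (wrapping) directory
* @property {string} hash - CID (hash) of the added content
* @property {number} size - size of the added content in bytes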

src/files/add-pull-stream.js (outdated, resolved)
src/utils/multipart-experimental.js (resolved)
src/utils/multipart-experimental.js (resolved)
src/utils/send-stream-experimental.js (resolved)
this.index = 0
this.rangeStart = 0
this.rangeEnd = 0
this.rangeTotal = 0
Contributor

What about using an object to handle all the range-related properties?
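For example (illustrative only, based on the fields above):

this.index = 0
this.range = {
  start: 0,
  end: 0,
  total: 0
}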

src/utils/send-stream-experimental.js (outdated, resolved)
// end runs
})

it.skip('files.add pins by default', (done) => {
Contributor

What is blocking this?

Contributor Author

Doesn't run connected to a js daemon, not related to this PR

})
})

it.skip('files.add with pin=false', (done) => {
Contributor

What is blocking this?

Contributor Author

Doesn't run connected to a js daemon, not related to this PR

@alanshaw
Contributor

alanshaw commented Sep 27, 2018

@Stebalien could we get your thoughts on adding this to go-ipfs?

This PR is adding a feature to the HTTP add endpoint that will allow big files to be uploaded to IPFS by making multiple requests.

@lidel kindly put together a good summary of the proposed process:

  • Upload payload is split into small parts (chunkSize = 256000)
  • The parts are sent as a sequence of HTTP POST requests that each have
    • a unique identifier for entire upload session (uuid? – see below)
    • a sequential counter within upload session (a chunk index)
  • The API backend needs to support additional HTTP headers to perform re-assembly of the entire payload from chunks and pass it to the regular ipfs.files.add call in a transparent manner

Reasons for doing this:

  1. It's not possible to stream an HTTP upload request (in Firefox) without buffering the entire payload into memory first
  2. It has the potential to allow resuming failed upload requests

@alanshaw
Contributor

@hugomrdias @lidel I think that regardless of what happens with this PR we need to switch to using the streaming fetch API. Firefox is notably the only browser that hasn't shipped the streams API yet but it sounds like this might happen soon. I think we can conditionally opt out of it for Firefox for the time being.

Switching to using the streaming fetch API will solve the buffering issue without any changes to the HTTP API and depending on priorities for go-ipfs we might be able to ship this before chunked uploads.

It's also worth noting that streaming fetch will be way more efficient than multiple HTTP requests for chunked uploading.

@hugomrdias
Contributor Author

hugomrdias commented Sep 27, 2018

That's only for response bodies, not request bodies, so this is the only way currently available to us. I didn't find any indication that request bodies will get streams soon in any browser.

@alanshaw
Contributor

That's only for response bodies, not request bodies, so this is the only way currently available to us. I didn't find any indication that request bodies will get streams soon in any browser.

You're absolutely right - my bad. Thanks for clarifying!

@lidel
Contributor

lidel commented Jul 10, 2019

Sounds worth mentioning here, in case concepts from this PR are revisited in the future:

@hugomrdias
Contributor Author

Sounds worth mentioning here, in case concepts from this PR are revisited in the future:

Yep, I based the impl on tus.

@alanshaw alanshaw changed the title feat: add support for chunked uploads [WIP] feat: add support for chunked uploads Nov 5, 2019