feat: non-bufferring multipart body encoder #3151

Gozala · 2020-07-07T23:34:57Z

Context:

This PR aims to improve the performance of ipfs.add in ipfs-http-client in the browser by addressing the findings from Embracing web native FormData / File where possible #3029.

An alternative approach can be found in fix: send blobs when running ipfs-http-client in the browser #3184

Status

Implementation is complete; some tests still need to be updated.

Overview

Normalization

Before

normaliseInput used to normalize arbitrary input taken by ipfs.add into AsyncIterable<FileObject> where FileObject is:

type FileObject = {
  path: string,
  content?: AsyncIterable<ArrayBufferView|ArrayBuffer>,
  mtime?: number | [number, number] | Date | { secs: number, nsecs?: number },
  mode: string | number
}

There was an (implicit) invariant that a FileObject without content represents a directory.

However, representing content as AsyncIterable<ArrayBufferView|ArrayBuffer> is what led to buffering in the browser, as fetch there still does not support a streaming request body.

After

This patch changes normaliseInput to produce a different output: AsyncIterable<ExtendedFile|FileStream|Directory> where

Directory is just like FileObject and does not have content.
ExtendedFile represents a FileObject with a known size
- It is a subclass of File
  - Polyfill of File is used in node
    - Polyfill of Blob is used in node
- It is created only from inputs of known sizes like strings, buffers, blobs, byte arrays etc...
- It adds mtime, mode and path properties (assumed by ipfs-unixfs-importer).
- It adds a content getter which returns an AsyncIterable<Uint8Array> of its parts, which makes it compatible with the FileObject interface.
FileStream is just like a FileObject that does have content.
- It is created only when input is of unknown size like AsyncIterable<*>
- It is different from ExtendedFile because multipartRequest can't add it to the FormData without buffering its body, while it can do that with ExtendedFile.
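
As a rough illustration of the split above, the sketch below sorts an input entry into one of the three shapes based on whether its size is known up front. The names isKnownSize and classify are hypothetical and not taken from this PR, and the check covers only some of the known-size inputs listed above.

```javascript
// Hypothetical helpers for illustration only, not the PR's normaliseInput.

// True when the size of `content` can be known without reading a stream.
function isKnownSize (content) {
  return typeof content === 'string' ||
    content instanceof String ||
    content instanceof ArrayBuffer ||
    ArrayBuffer.isView(content) ||
    // Blob is only checked where the environment defines it.
    (typeof Blob !== 'undefined' && content instanceof Blob)
}

// Maps an entry to the name of the shape it would be normalised to.
function classify ({ content }) {
  if (content == null) return 'Directory' // no content means a directory
  return isKnownSize(content) ? 'ExtendedFile' : 'FileStream'
}
```

Under these assumptions, classify({ path: '/foo', content: 'hello' }) falls into the ExtendedFile case, while an entry whose content is an async iterable falls into the FileStream case.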

Multipart Encoder

A new FormDataEncoder class was added that can encode AsyncIterable<Part> into AsyncIterable<BlobPart> representing the body of the multipart request, where Part is:

type Part = {
  name: string,
  content: void|Blob|AsyncIterable<ArrayBufferView|ArrayBuffer>,
  filename?: string,
  headers?: Record<string, string>
}
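
A minimal sketch of how such an encoder could work, assuming a caller-supplied boundary string. encodeFormData and partHeader are illustrative names, not the FormDataEncoder API from this PR, and escaping the filename with encodeURIComponent is an assumption.

```javascript
// Illustrative only: builds the header block that precedes one part's content.
function partHeader (part, boundary) {
  const disposition = part.filename == null
    ? `form-data; name="${part.name}"`
    : `form-data; name="${part.name}"; filename="${encodeURIComponent(part.filename)}"`
  const lines = [`--${boundary}`, `Content-Disposition: ${disposition}`]
  for (const [key, value] of Object.entries(part.headers || {})) {
    lines.push(`${key}: ${value}`)
  }
  // A blank line separates the headers from the content.
  return lines.join('\r\n') + '\r\n\r\n'
}

// Turns parts into the pieces of a multipart body without reading any Blob.
async function * encodeFormData (parts, boundary) {
  for await (const part of parts) {
    yield partHeader(part, boundary)
    const { content } = part
    if (content != null) {
      if (typeof Blob !== 'undefined' && content instanceof Blob) {
        // Known size: pass the blob through untouched.
        yield content
      } else {
        // Unknown size: forward chunks as they arrive.
        yield * content
      }
    }
    // Every part is terminated by CRLF.
    yield '\r\n'
  }
  yield `--${boundary}--\r\n`
}
```

The point of yielding the Blob itself rather than its bytes is that the consumer can place it into the request body without the encoder ever reading it.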

The to-stream module has been replaced by to-body, which turns AsyncIterable<BlobPart> into a readable stream on node and into a Blob in the browser.

With the above pieces in place, multipartRequest now

Normalizes input into AsyncIterable<ExtendedFile|FileStream|Directory>
Turns it into an intermediate representation of AsyncIterable<Part> (and ensures that the ExtendedFile itself is passed as content instead of its content, to avoid buffering)
Turns it into a multipart request body encoded as AsyncIterable<BlobPart> via FormDataEncoder.
Turns that into a request body via toBody (which produces a readable stream in node and a blob in the browser).
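
For the browser case, the last step might look roughly like the sketch below. toBlobBody is a hypothetical name rather than the to-body module from this PR, and it assumes a global Blob (browsers, or Node.js 18 and later).

```javascript
// Illustrative sketch of the browser branch only: collect the encoded pieces
// and wrap them in a single Blob that fetch can send as a request body.
async function toBlobBody (parts) {
  const collected = []
  for await (const part of parts) {
    // Strings, byte chunks and Blobs are all valid BlobPart values.
    collected.push(part)
  }
  return new Blob(collected)
}
```

Pieces that are already Blobs are handed to the Blob constructor as they are, whereas chunks coming from a stream all have to be collected first, which corresponds to the buffering fallback described above.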

Result

ipfs.add can continue using normaliseInput, as the changes to it should be API (backwards) compatible.
ipfs-http-client on node should continue using streams. The only thing that changed there is that some inputs are turned into Blobs instead of AsyncIterators, but during form data encoding everything gets flattened anyway.
ipfs-http-client in the browser will not buffer as long as the input passed in isn't a stream, and will fall back to buffering otherwise. E.g.
- ipfs.add([ 'hello', await (await fetch(url)).blob(), { path: '/foo/bar', content: droppedFile } ]) will not incur buffering
- ipfs.add([ 'hello', { path: '/foo', content: droppedFile.stream() }, await (await fetch(url)).blob() ]) will only buffer the contents of droppedFile and use the other pieces as is.

I am not super happy with the complexity of all this, nor with the fact that a user can accidentally fall off the happy path and incur buffering, but I do not believe there is a better option without changing the API.

attempt to fix #3029

Gozala · 2020-07-08T01:19:10Z

Reminder to myself to include tests discussed in #3138 here

Gozala · 2020-07-09T05:43:26Z

@hugomrdias there is one issue that I'm not sure how to resolve. It appears that electron-renderer chooses to load blob.js over blob.browser.js, which is causing problems. Is there a way to make it pick up the browser overrides? Otherwise the only other thing I can think of is to do a runtime check.
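
A runtime check along those lines could be as small as the sketch below. pickBlob and its arguments are hypothetical and not part of this PR; the idea is only to prefer a native Blob when the environment provides one and to fall back to the polyfill otherwise.

```javascript
// Hypothetical helper: returns the native Blob if the given scope has one,
// otherwise the supplied polyfill class.
function pickBlob (polyfill, scope = globalThis) {
  return typeof scope.Blob === 'function' ? scope.Blob : polyfill
}
```

Passing the scope explicitly keeps the check easy to exercise in tests without touching real globals.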

Gozala · 2020-07-09T06:56:23Z

All tests except the example one (which also fails on master) are passing now. I think this is ready for review.

achingbrain · 2020-07-10T07:51:42Z

The test was failing due to a temporary infrastructure problem. All good now.

lidel

If merged, this is a great performance win for browser devs and users, but also for IPFS Desktop users (ipfs/ipfs-webui#1529).

Side concerns:

I am worried about added complexity and potential regression in the future.
Are we able to add tests/benchmarks that safeguard browser-related improvements?
As noted during our review call, the problematic metadata is only supported by js-ipfs; we may want to look into tweaking the HTTP API separately from this PR, to fix it before go-ipfs implements it.

packages/ipfs-core-utils/src/files/file.node.js

packages/ipfs-core-utils/test/files/blob.spec.js

packages/ipfs-core-utils/src/files/normalise-input.js

Gozala · 2020-07-11T00:40:16Z

I am worried about added complexity and potential regression in the future.

That worries me as well. I am also not happy with the increased complexity. The only other way I can imagine going about this (that would not involve API changes) is to have a normalise-input.browser.js. In fact I have tried that approach, but it had other major problems:

It is where most of the complexity lies, and two different implementations are very likely to diverge unintentionally.
Normalized input is consumed by the actual ipfs.add (ipfs-unixfs-importer) and by the http client.
- ipfs-unixfs-importer expects file content to be AsyncIterable<ArrayBufferView>. We could push some complexity from here and teach it how to handle blobs, but that seemed to spread the complexity instead of reducing it.
- This also required diverging the ipfs.add implementation of the http client, because in node we'd get normalized files in one form and in the browser in another.

I think there is an opportunity to simplify this approach a bit by using our own custom types instead of Blob and File, although the trade-off would be more custom code in the browser, where Blob and File fit perfectly well.

Are we able to add tests/benchmarks that safeguard browser-related improvements?

I was trying to come up with some approach here, e.g.

We can have an endpoint on the echo server that generates as many gigs of data as the client asks for
We can have a browser test that does ipfs.add(await (await fetch('gen-data?size=3gb')).blob())

However, I do not think we have a way to tell whether the browser did any buffering. The only thing I could come up with is to have the echo server generate a fragment of data and then stop writing until the corresponding put occurs on the other endpoint. However, that is really complex and we would need to jump through some hoops. There is also no guarantee that the browser doesn't read, say, two chunks at a time.

I think a better strategy is to test that when we put in blobs (and the like), what we get on the other end is blobs (not objects with async iterable content). That is a lot easier to test and does not break when browser behavior changes (e.g. how much it fetches before it starts the upload).
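
A check of that kind might be sketched as below. isStreamBacked is a hypothetical helper, not one of the tests in this PR; it assumes that blob-based results are instances of Blob and that stream-based results expose an async iterable content property.

```javascript
// Hypothetical test helper: true when a normalised entry carries its data as
// an async iterable rather than as a Blob (or has no data at all).
function isStreamBacked (entry) {
  // Blob-based entries may also expose a content getter, so check them first.
  if (typeof Blob !== 'undefined' && entry instanceof Blob) return false
  const content = entry == null ? undefined : entry.content
  return content != null && typeof content[Symbol.asyncIterator] === 'function'
}
```

A test could then feed blobs, strings and buffers through the normalisation step and assert that isStreamBacked is false for every result, without depending on how the browser schedules its reads.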

Gozala · 2020-07-14T01:53:25Z

Added more tests to ensure that the result of normaliseInput does not use streams / async iterators unless absolutely necessary, which hopefully addresses some of @lidel's concerns:

I am worried about added complexity and potential regression in the future.
Are we able to add tests/benchmarks that safeguard browser-related improvements?

There is a caveat: this will not catch every regression, e.g. if for some reason normaliseInput were to read the input blobs and use that to produce the output files, but I think such a regression is highly unlikely.

Gozala · 2020-07-14T07:14:07Z

Tests are failing now due to #3169

Gozala · 2020-07-15T18:57:33Z

I had a conversation with @achingbrain earlier today and we have decided:

It would be best to factor out the DOM File and Blob polyfills into a separate library.
Attempt to converge the factored-out Blob with fetch-blob. But that could happen in the factored-out library.

I think it might also make sense to factor out the newly introduced FileStream. However, mtime is a bit unusual, so it may be too specific to js-ipfs.

Gozala · 2020-07-20T08:54:10Z

Externalized File and Blob implementations.

achingbrain · 2020-07-24T08:01:39Z

I've merged #3184 instead of this PR. I hope that it has taken on some of the good ideas from this PR.

It bums me out a little, because you've clearly spent a lot of time and effort on this, but ultimately I think requiring people to use non-standard Blob/FormData/etc implementations to use our HTTP API is a step too far, and taking on the long-term maintenance burden of those custom implementations is not something we should be doing given the available dev capacity.

Gozala marked this pull request as draft July 7, 2020 23:35

feat: bufferring free multipart body encoder

3e7baf7

Gozala force-pushed the blobity-blob branch from 913ff46 to 3e7baf7 July 7, 2020 23:37

Gozala added 2 commits July 7, 2020 17:48

fix: add support for String instances

9686628

fix: browser module paths overrides

1596609

Gozala requested a review from hugomrdias July 8, 2020 01:17

Gozala changed the title ~~feat: bufferring free multipart body encoder~~ feat: non-bufferring multipart body encoder Jul 8, 2020

Gozala added 6 commits July 8, 2020 10:38

fix: multipartRequest so body does not emit blobs

e402018

Merge branch 'master' into blobity-blob

58b8d2c

fix: encode filename once

c1c05d0

fix: add \r\n after each part of form-data

567b738

chore: write blob tests

c9fc232

fix: incorrect header used for nsecs

39464aa

Gozala added 2 commits July 8, 2020 22:50

fix: use native blobs in elector renderer

908d99e

fix: prefer native File over polyfill (in elector)

bfe012f

Gozala marked this pull request as ready for review July 9, 2020 06:54

Gozala assigned lidel Jul 9, 2020

Gozala requested a review from lidel July 9, 2020 17:47

Gozala assigned Gozala and unassigned lidel Jul 9, 2020

lidel reviewed Jul 10, 2020

packages/ipfs-core-utils/src/files/file.node.js (outdated)

achingbrain reviewed Jul 10, 2020

packages/ipfs-core-utils/test/files/blob.spec.js (outdated)

achingbrain reviewed Jul 10, 2020

packages/ipfs-core-utils/src/files/normalise-input.js (outdated)

fix: error in number of arguments that were passed

0dbd5af

Gozala added 5 commits July 13, 2020 18:28

fix: ensure that FileStream content is valid

ad9d617

fix: preserve file metadata

09c86b9

fix: ensure Iterable<Bytes> instead of assuming

205fde7

fix: properly handle null input.

9af1bf1

chore: test that streams aren't used unnecessarily

35d6eb3

Gozala added 4 commits July 13, 2020 19:55

fix: file api compatiblity

3bdc52b

chore: add file API tests

ee74c82

Merge remote-tracking branch 'upstream/master' into blobity-blob

d44352a

chore: remove unnecessary browser entry

5d8ff81

Gozala mentioned this pull request Jul 14, 2020

Test Failure: should add buffer bigger than Hapi default max bytes (1024 * 1024) #3169

Closed

chore: factor out blob and file into separate libs

c43faf7

Gozala added 5 commits July 20, 2020 08:14

Merge remote-tracking branch 'upstream/master' into blobity-blob

631ebf3

fix: update test to account for lastModified field

2b92e4c

chore: disable test requiring mtime support in go

6421e24

fix: example test to account lastModified field

2fc990c

chore: revert changes to handle File's lastModifed

1859549

This was referenced Jul 20, 2020

Support web File's lastModified field #3187

Closed

Support node's ReadStream in ipfs.add instead of AsyncIterable<Uint8Array> / Iterable<Uint8Array> #3188

Closed

fix: reflect removed lastModified->mtime in tests

dcedb66

Gozala requested review from lidel and achingbrain July 21, 2020 04:31

achingbrain mentioned this pull request Jul 21, 2020

fix: send blobs when running ipfs-http-client in the browser #3184

Merged

lidel mentioned this pull request Jul 22, 2020

Embracing web native FormData / File where possible #3029

Closed

Gozala mentioned this pull request Jul 23, 2020

Implementation bug in normaliseInput #3138

Closed

achingbrain closed this Jul 24, 2020
