Skip to content
This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

feat: non-bufferring multipart body encoder #3151

Closed
wants to merge 31 commits into from

Conversation

Gozala
Copy link
Contributor

@Gozala Gozala commented Jul 7, 2020

Context:

Status

  • Implementation is complete, some tests need to be updated.

Overview

Normalization

Before

normaliseInput used to normalize arbitrary input taken by ipfs.add into AsyncIterable<FileObject> where FileObject is:

type FileObject = {
  path:string,
  content?: AsyncIterable<ArrayBufferView|ArrayBuffer>,
  mtime?: number | [number, number] | Date | { secs: number, nsecs?: number },
  mode:string|number
}

There was (implicit) invariant that if FileObject doesn't have content it represents a directory.

However representing content as AsyncIterable<ArrayBufferView|ArrayBuffer> is what lead to buffering in the browser as fetch still does support stream body.

After

This patch changes normaliseInput to produce a different output: AsyncIterable<ExtendedFile|FileStream|Directory> where

  • Directory is just like FileObject and does not have content.
  • ExtendedFile represents a FileObject with known size
    • It is a subclass of the File
      • Polyfill of File is used in node
        • Polyfill of Blob is used in node
    • It is created only from inputs of known sizes like strings, buffers, blobs, byte arrays etc...
    • It adds mtime, mode and path properties (assumed by ipfs-unixfs-importer).
    • It adds content getter which returns AsyncIterable<Uint8Array> of it's parts, which creates compatibility with FileObject interface.
  • FileStream is just like FileObject that does have a content.
    • It is created only when input is of unknown size like AsyncIterable<*>
    • It is different from ExtendedFile because multipartRequest can't add it to the FormData without buffering it's body, while it can do that with ExtendedFile.

Multipart Encoder

New FormDataEncoder class was added that provides can encode AsyncIterable<Part> into AsyncIterable<BlobPart> representing body of the multipart request, where Part is:

type Part = {
  name: string,
  content: void|Blob|AsyncIterable<ArrayBufferView|ArrayBuffer>,
  filename?: string,
  headers?: Record<string, string>
}

to-stream module had being replaced by to-body which turns AsyncIterable<BlobPart> to readable stream on node and into Blob in browser.

With above pieces in place multipartRequest now

  1. normalized input into AsyncIterable<ExtendedFile|FileStream|Directory>
  2. Turns it into intermediate representation of AsyncIterable<Part> (and ensures that ExtendedFile is passed as content instead of passing it's content, to avoid buffering)
  3. Turns it into multipart request body encoded as AsyncIterable<BlobPart> via FormDataEncoder.
  4. Turns that into request body via toBody (that in node produces readable stream and in browser produces blob).

Result

  1. ipfs.add can continue using normalizeInput as changes to it should be API (backwards) compatible.

  2. ipfs-http-client on node should continue using streams. Only thing that changed there is that some inputs are turned into Blobs instead of AsyncIterators but during form data encoding all gets flattened anyway.

  3. ipfs-http-client in browser will not buffer as long as input passed in isn't a stream and will fall back to buffering otherwise. E.g.

    • ipfs.add([ 'hello', await (await fetch(url)).blob(), { path: '/foo/bar', content: droppedFile } ]) will not incur buffering
    • ipfs.add([ 'hello', { path: '/foo', content: droppedFile.stream() }, await (await fetch(url)).blob() ]) will only buffer content's of the droppedFile and use other pieces as is.

I am not super happy with complexity of all this, nor with the fact that user can accidentally fall of happy path and incur buffering but I do not believe there is a better option without changing an API.

attempt to fix #3029

@Gozala Gozala marked this pull request as draft July 7, 2020 23:35
@Gozala Gozala requested a review from hugomrdias July 8, 2020 01:17
@Gozala
Copy link
Contributor Author

Gozala commented Jul 8, 2020

Reminder to myself to include tests discussed in #3138 here

@Gozala Gozala changed the title feat: bufferring free multipart body encoder feat: non-bufferring multipart body encoder Jul 8, 2020
@Gozala
Copy link
Contributor Author

Gozala commented Jul 9, 2020

@hugomrdias there one issue that I'm not sure how to resolve. It appears that electron-renderer chooses to load blob.js over blob.browser.js which is causing problems. Is there a way to make it pick up browser overrides ? Otherwise only I other thing I could think of is to do a runtime check.

@Gozala Gozala marked this pull request as ready for review July 9, 2020 06:54
@Gozala
Copy link
Contributor Author

Gozala commented Jul 9, 2020

All tests except the example one (that also fails on master) are passing now. I think this is ready for the review.

@Gozala Gozala requested a review from lidel July 9, 2020 17:47
@Gozala Gozala assigned Gozala and unassigned lidel Jul 9, 2020
@Gozala

This comment has been minimized.

@achingbrain
Copy link
Member

The test was failing due to a temporary infrastructure problem. All good now.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If merged, this is great performance win for browser devs, users, but also for IPFS Desktop users (ipfs/ipfs-webui#1529).

Side concerns:

  • I am worried about added complexity and potential regression in the future.
    Are we able to add tests/benchmarks that safeguard browser-related improvements?
  • As noted during our review call, problematic metadata is only supported by js-ipfs, we may want to look into tweaking HTTP API separately from this PR, to fix it before go-ipfs implements it.

packages/ipfs-core-utils/src/files/file.node.js Outdated Show resolved Hide resolved
@Gozala
Copy link
Contributor Author

Gozala commented Jul 11, 2020

  • I am worried about added complexity and potential regression in the future.

That worries me as well. I am also not happy with increased complexity. Only other way I can imagine going about this (that would not involve API changes) is to have a normalise-input.browser.js. In fact I have tried that approach but it had other major problems:

  1. It is where most complexity lies and having two different implementations are very likely to diverge unintentionally.
  2. Normalized input is consumed by actual ipfs.add (ipfs-unixfs-importer) and the http client.
    • ipfs-unixfs-importer expects file content to be AsyncIterable<ArrayBufferView>. We could push some complexity from here and teach it how to handle blobs, but it seemed to spread complexity instead of reducing it.
    • This also required diverging ipfs.add implementation of http client because in node we'd get normalized files in one form and in browser in the other form.

I think there is opportunity to simplify this approach a bit by using our custom types instead of Blob and File although that trade-off there would be more custom stuff in browser while Blob and File fit perfectly well.

Are we able to add tests/benchmarks that safeguard browser-related improvements ?

I was trying to come up with some approach here, e.g.

  • We can have an endpoint on echo server that generates as many gigs of data as client asks
  • We can have a browser test that does ipfs.add(await (await fetch('gen-data?size=3gb')).blob())

However I do not think we can not have a way to tell if browser did any buffering or not. Only thing I could come up with is to generate fragment of data from echo server stop writing until corresponding put occurs on other endpoint. However that is really complex and we need to go through some hoops. There is also no guarantee that browser doesn't read say 2 two chunks at a time.

I think better strategy is to test that when we put in blobs (and alike) what we get on the other end is blobs (not objects with async iterate content). That is a lot easier to test and is free from breaking when browser changes (e.g. how much it fetches before it starts upload).

@Gozala
Copy link
Contributor Author

Gozala commented Jul 14, 2020

Added more tests to ensure that result of normaliseInput does not use streams / async iterators unless absolutely necessary. Which hopefully addresses some of the @lidel's concern

I am worried about added complexity and potential regression in the future.
Are we able to add tests/benchmarks that safeguard browser-related improvements?

There is the caveat, this will not catch all regression e.g. if for some reason normaliseInput would e.g. read input blobs and use that to produce output files, but I think such regression is highly unlikely.

@Gozala
Copy link
Contributor Author

Gozala commented Jul 14, 2020

Test are failing now due to #3169

@Gozala
Copy link
Contributor Author

Gozala commented Jul 15, 2020

I had a conversation with @achingbrain earlier today and we have decided:

  1. It would be best to factor out DOM File and Blob poly-fills into separate library.
  2. Attempt to converge factored out Blob with fetch-blob. But that could happen in that factored out library.

I think it might also make sense to factor out introduced FileStream. However mtime is a bit unusual so it may be too specific to js-ipfs.

@Gozala
Copy link
Contributor Author

Gozala commented Jul 20, 2020

Externalized File and Blob implementations.

@achingbrain
Copy link
Member

I've merged /pull/3184 in favour of this. I hope that it's taken on some of the good ideas from this PR.

It bums me out a little, because you've clearly spent a lot of time and effort on this, but ultimately I think requiring people to use non-standard Blob/FormData/etc implementations to use our HTTP API is a step too far, and taking on the long-term maintenance burden of those custom implementations is not something we should be doing given the available dev capacity.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Embracing web native FormData / File where possible
3 participants