Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for inlinling via the id-hash #5281

Merged
merged 3 commits into from
Aug 24, 2018
Merged

Conversation

kevina
Copy link
Contributor

@kevina kevina commented Jul 25, 2018

Add support for inlining via the id-hash to the ipfs add command using the new Builder interface.

@kevina kevina added the status/WIP This a Work In Progress label Jul 25, 2018
@kevina kevina requested a review from Kubuxu as a code owner July 25, 2018 04:59
@ghost ghost assigned kevina Jul 25, 2018
@ghost ghost added the status/in-progress In progress label Jul 25, 2018
@kevina kevina force-pushed the kevina/inline-cids branch 2 times, most recently from 56a41e4 to de8b51f Compare July 25, 2018 05:38
@kevina

This comment has been minimized.

@kevina

This comment has been minimized.

@kevina kevina force-pushed the kevina/inline-cids branch 2 times, most recently from 0da66f9 to dd9e76b Compare August 14, 2018 20:34
@kevina kevina changed the title [WIP] Add support for inlinling via the id-hash Add support for inlinling via the id-hash Aug 14, 2018
@kevina kevina removed the status/WIP This a Work In Progress label Aug 14, 2018
@kevina kevina requested a review from Stebalien August 14, 2018 20:36
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! That's really clean and elegant. However, we'll probably want to move that helper somewhere else (e.g., go-cid). I can do the gx stuff if you make the PR.

@kevina
Copy link
Contributor Author

kevina commented Aug 14, 2018

Nice! That's really clean and elegant. However, we'll probably want to move that helper somewhere else (e.g., go-cid).

We are putting to much stuff in go-cid. We should factor that stuff out into a new package go-cid-util to avoid having to update so many packages when functionality not related to the core Cid functionality is changed.

@Stebalien
Copy link
Member

We are putting to much stuff in go-cid. We should factor that stuff out into a new package go-cid-util to avoid having to update so many packages when functionality not related to the core Cid functionality is changed.

I kind of agree but that won't help. If we put a bunch of useful stuff in go-cid-util, we'll run into the same problem (depend on it everywhere). Really, we just need to make updating packages in gx less of a pain.

@kevina
Copy link
Contributor Author

kevina commented Aug 14, 2018

I kind of agree but that won't help. If we put a bunch of useful stuff in go-cid-util, we'll run into the same problem (depend on it everywhere).

I do not agree with this assessment. For one thing just to use the Cid type you need to import the go-cid package. Although I am not 100% most packages don't need to do much with the Cid other then pass it around or possible convert it to a string or byte-string. Anything that is really useful should stay in go-cid the stuff that should be moved in less commonly used code.

See ipfs/go-cid#68

@Stebalien
Copy link
Member

Hm. Yeah, you're probably right.

License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UI nits. I'd rather not bother explaining "identity hashing" to the user. They can just know that we're inlining the data into the CID.

@@ -120,6 +123,8 @@ You can now check what blocks have been created by:
cmdkit.BoolOption(fstoreCacheOptionName, "Check the filestore for pre-existing blocks. (experimental)"),
cmdkit.IntOption(cidVersionOptionName, "CID version. Defaults to 0 unless an option that depends on CIDv1 is passed. (experimental)"),
cmdkit.StringOption(hashOptionName, "Hash function to use. Implies CIDv1 if not sha2-256. (experimental)").WithDefault("sha2-256"),
cmdkit.BoolOption(inlineOptionName, "Inline small objects using identity hash. (experimental)"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just say "inline small blocks (or objects?) into CIDs". Users won't really care about how we do this.

@@ -44,6 +45,8 @@ const (
fstoreCacheOptionName = "fscache"
cidVersionOptionName = "cid-version"
hashOptionName = "hash"
inlineOptionName = "inline"
idHashLimitOptionName = "id-hash-limit"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--inline-limit.

@@ -120,6 +123,8 @@ You can now check what blocks have been created by:
cmdkit.BoolOption(fstoreCacheOptionName, "Check the filestore for pre-existing blocks. (experimental)"),
cmdkit.IntOption(cidVersionOptionName, "CID version. Defaults to 0 unless an option that depends on CIDv1 is passed. (experimental)"),
cmdkit.StringOption(hashOptionName, "Hash function to use. Implies CIDv1 if not sha2-256. (experimental)").WithDefault("sha2-256"),
cmdkit.BoolOption(inlineOptionName, "Inline small objects using identity hash. (experimental)"),
cmdkit.IntOption(idHashLimitOptionName, "Identity hash maxium size. (experimental)").WithDefault(64),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Maximum inline block size" (or something like that).


Also, should this imply --inline?

if inline {
fileAdder.CidBuilder = cidutil.InlineBuilder{
Builder: fileAdder.CidBuilder,
Limit: idHashLimit}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💢

    Limit: idHashLimit,
}

go fmt is usually pretty good about being pedantic about things like this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran go format and it was happy. Very well I will change it.

@ghost ghost assigned Stebalien Aug 22, 2018
@@ -123,7 +123,7 @@ You can now check what blocks have been created by:
cmdkit.BoolOption(fstoreCacheOptionName, "Check the filestore for pre-existing blocks. (experimental)"),
cmdkit.IntOption(cidVersionOptionName, "CID version. Defaults to 0 unless an option that depends on CIDv1 is passed. (experimental)"),
cmdkit.StringOption(hashOptionName, "Hash function to use. Implies CIDv1 if not sha2-256. (experimental)").WithDefault("sha2-256"),
cmdkit.BoolOption(inlineOptionName, "Inline small blcoks into CID. (experimental)"),
cmdkit.BoolOption(inlineOptionName, "Inline small blocks into CIDs. (experimental)"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Np. I didn't want to bother you with it (and then made more work for myself by forgetting about GitCop 🙄).

License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
@kevina
Copy link
Contributor Author

kevina commented Aug 23, 2018

Fixed the failed test. It was not obvious what was going on see: #5398

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final comment from @whyrusleeping. On further thought, I'd like to get the IPFS in browsers working group to comment on this.

@@ -120,6 +123,8 @@ You can now check what blocks have been created by:
cmdkit.BoolOption(fstoreCacheOptionName, "Check the filestore for pre-existing blocks. (experimental)"),
cmdkit.IntOption(cidVersionOptionName, "CID version. Defaults to 0 unless an option that depends on CIDv1 is passed. (experimental)"),
cmdkit.StringOption(hashOptionName, "Hash function to use. Implies CIDv1 if not sha2-256. (experimental)").WithDefault("sha2-256"),
cmdkit.BoolOption(inlineOptionName, "Inline small blocks into CIDs. (experimental)"),
cmdkit.IntOption(inlineLimitOptionName, "Maximum block size to inline. (experimental)").WithDefault(64),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whyrusleeping says: don't add this right now, just make it 64. If users complain, we can revisit it later; this is the most conservative approach and should be good enough for pretty much all cases. That fine with you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to know the reasons for this. Is it the change in the API string because of the use of WithDefault() or something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning is simply "fewer knobs" (or "be conservative"). That is, we don't really need this knob so we might as well add it later when needed (when a user presents a use-case for changing the default).

Copy link
Contributor Author

@kevina kevina Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily agree with that philosophy but won't object strongly.

I can see a case of wanting either 32 or 64 and don't think there is a clear answer. I am okay with limiting it to either 32 or 64 if you think that will be better.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User here looking forward to this feature, I'm mirroring a large-ish dataset using filestore to avoid disk duplication, in this case there is a few files being updated often which is causing issues with missing data, these are 41-42 bytes in size. example: http://distfiles.gentoo.org/experimental/timestamp.x
So minimal size here is 64 bytes - but I don't see any good reason to not add the option for the user to change this. Not adding this option could also result in assumptions being made about the actual max of ident hashes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that we may actually need to define a limit on the maximum CID size and 64 bytes is really long (see my comment below). However, having to define a new block for a ~40 byte file really does suck.

Let's wait for the browser working group to comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 64 bytes is long, but if it is part of a directory entry, the user is ever unlikely to need to actual CID and it won't show up as part of the URL. All the more reason to keep this option.

I could change the default to 32 though so it won't create a problem. When someone knows what they are doing they can increase it to suit there needs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless a user just adds a file directly. E.g., /ipfs/z3Fxd5vkFxAu4YbpXLBj8EvUCEeXLhNRGaYpEL43EHt99kaB82LfGbEVBqUJPaSZsHtJANxV8EGDZ

Copy link
Contributor Author

@kevina kevina Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless a user just adds a file directly.

Correct. So make the default 32 to avoid increasing the size of hashes by default. When a user has a directory with lots of small files they want inline they can increase the limit .

Copy link
Member

@Stebalien Stebalien Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to talk to the browser people. I know there were talks about supporting https://CID.dweb.link and that won't work for long CIDs. The current plan is ipfs://CID and there may be length restrictions.

(nvm, didn't see the response from lidel)

@Stebalien
Copy link
Member

@lidel, @kyledrake

This change allows us to inline small blocks into CIDs. For performance, we'd like to be able to inline up to N bytes of data into the CID itself.

If --inline is passed (which will likely be the default once we hit 1.0), this patch will inline up to 64 bytes of data. Unfortunately, this makes CIDs (especially base32 CIDs) quite large:

  • zB7EoRYfU6DytqNStc9iYQs7JpDECadvProBjtcWGfP6NuxHUXRAfo2nDyGmZEQrsrhkH44mAHaFqGDTENFyVwcwhP5PF
  • bafkqaqdbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqqu

If we instead, e.g., limit the length to 32, we'd get:

  • z3bBQDkAHfWSimWfRNLCdZzu4iRV9B1B8GBntm9Ema16Ai6FdEh
  • bafkqailbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmefa

The benefits are:

  1. Shorter.
  2. Fits in a DNS component (e.g., a subdomain).

The drawbacks are:

  1. Shorter.
  2. We may switch to sha2-512 eventually (64 bytes long) so this may buy us nothing.

So, the question is: in the short term, how important is it to fit CIDs into a single subdomain?

This discussion is a continuation of multiformats/cid#21 which we had never really finished.

@lidel
Copy link
Member

lidel commented Aug 23, 2018

So, the question is: in the short term, how important is it to fit CIDs into a single subdomain?

@Stebalien It is not important. Don't worry about cid-in-subdomain, pick the best inline size for the job.

Longer answer:

  • I assume you are worried about DNS name length limits from RFC 1034. It defines two limits, one is per domain segment/label (63 characters) and another for entire FQDN (255 characters). So.. if someone exposes a subdomain-gateway under already-long DNS zone we hit the problem even without inlining data into root cid 🙃

  • We want to be able to use cid in subdomains short-term, but in practice we will be using only the root one. Everything else (including cids with inlined data) is hidden behind a path, just like @kevina noted.
    In URL context we only care about the root cid and mid-term future root cid will probably remain to be DNS-safe cidv1b32(sha2-256). Hopefully, by the time we need to switch default to longer hashes that do not fit in 63 characters we will either have an alternative(s) to DNS and/or a native protocol handler in browsers.

@kevina
Copy link
Contributor Author

kevina commented Aug 23, 2018

@Stebalien based on your @lidel and @NiKiZe comments I believe even more strongly that we should keep the --inline-limit option as there is no singe best value.

If we are going to consider enabling --inline by default at some point, we should likely change the default value to 32. This way using it won't increase the size of hashes and create possible problem when a user adds a single small file.

I am okay with limiting the value to a small set of fixed values (32, 64, maybe 128), but I see that as an arbitrary limitation. Bases on your comment at multiformats/cid#21 (comment), values in between 32 and 64 might be useful.

I am also okay with instead setting a maxim value. 64 might be okay, but 128 would increase the utility of this feature. Please do note, that we don't limit the size of CIDs anywhere else and I have in fact performed (informal) test with large CIDs and didn't ran into any problems, including (I believe) bitswap. When CIDs start getting extremely large we might run into filesystem limitations if they are ever stored, but 128 will not hit this limit. As the code is written now users can still create a Cid with the identity hash by using ipfs add --hash=id and we do not impose any limits when created that way.

If we still decide not to provide this option we should stick with 64, but I recommend we provide it.

@Stebalien
Copy link
Member

@lidel

It is not important

The issue is that this also applies to directories (it applies to all blocks). For example, if I create a directory and put a single index.html file in it (with some data), the entire directory will fit in an inline CID. On one hand, this is awesome; on the other, the CID becomes:

  • /ipfs/z5xqj8QmdfuAV5bZXhCUgnRaJNTPvY14aqwhZGRjnMxWUtsWFir1qgSBm3YnAperq9rztt7D7byfxXQwf7fkU

Will this cause a problem? The last time I discussed this with someone we discussed multiple sub-domains (break it up with dots every N bytes) but I'm not sure if that's a reasonable solution.

Please do note, that we don't limit the size of CIDs anywhere else and I have in fact performed (informal) test with large CIDs and didn't ran into any problems,

My primary concern is security. When coding, we tend to assume that CIDs are short so I worry that an attacker could cause problems by introducing large CIDs. If we limit the size of the CIDs we create, we can add hard limits later on without breaking everything. However, I think the performance limit here will be more like 1KiB (and we can just say "don't be absurd").

You've convinced me to keep the size option. Personally, I'd default to either 32 or 45 (it doesn't have to be a multiple of 2). The latter makes base32 encoded CIDs 80 characters long which I consider to be a reasonable upper limit.

45 byte max: /ipfs/bafkqallbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwcylbmfqwccq

@kevina
Copy link
Contributor Author

kevina commented Aug 24, 2018

I am leaning towards 32, just because at that size CIDs are guaranteed not to increase. I can be convinced otherwise and don't have a strong objection to some other number.

@lidel
Copy link
Member

lidel commented Aug 24, 2018

The last time I discussed this with someone we discussed multiple sub-domains (break it up with dots every N bytes) but I'm not sure if that's a reasonable solution.

It adds an ugly conversion step, but it is much better than not supporting long cids at all. If we have to convert, we could look into using encoding similar to one described in Human Friendly Names and putting every word as a separate DNS label (but that could hit 255 characters fast..).

Added the idea to subdomain notes at ipfs/in-web-browsers#89 (comment)

The issue is that this also applies to directories (it applies to all blocks).

Ouch, I did not think about inlining directories.

In that context setting default inline-limit to a value bigger than 32 will increase the surface for UX issues a bit:

  • copy&paste of inlined root cids into a location bar to open subdomain-gateway does not work
  • users are really confused why hashes generated with default settings have way different lengths

If we leave an option to change the default via --inline-limit, I'd go with 32 as well. It feels like a safe default considering we will be seeing cidv1b32(sha-256) for some time and want to avoid UX issues in subdomain phase.

(Still, I don't have strong feelings about this, we could work around issues introduced by 64 if needed).

@kyledrake
Copy link

Agreed with @lidel. Not a big deal, do what you want to here. The important thing for future browser support is cidv1b32 with go-ipfs. Once we can use the address bar, I believe the hashes can be very long and it won't cause any issues. The subdomain stuff is a short-term hack for security origins on the HTTP gateway, ideally we'll never even have to use it.

@NiKiZe
Copy link

NiKiZe commented Aug 24, 2018

As long as the option to change it is there, the default size won't be an issue.
Just a few notes, IMHO restricting this to be compatible with DNS names should not be a primary concern:

  • DNS names will be used in a minority of usecases (?)
  • Using long hashes if intended for DNS usage is a user problem, not developer problem. (just use proper documentation)
  • If I want to avoid DNS usage this is a feature not a bug
    If this feature was created for only being used in DNS names this would be different, but the usecases are far wider than that.

@kevina
Copy link
Contributor Author

kevina commented Aug 24, 2018

I think the consensuses is to for now, go with a conservative default of 32. This is not a very useful size, but if you have a use for the feature you can increase the default size.

At some point we should impose length limits on CIDs to keep things from getting out of hand, but we can do that later. For now I advice anyone using this feature to avoid setting the default too high to avoid running into this limit in the future. I can not see us imposing a limit smaller then 64 so that is very safe. Slightly larger values such as 128 may also be okay. From what @Stebalien told be going up to 1 KiB may even be okay, but I see that as really pushing it if the actual CID needs to be displayed.

License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
@Stebalien
Copy link
Member

DNS names will be used in a minority of usecases

The issue here is the gateway (which is, itself, a stop-gap). The only way to put different IPFS "websites" in different origins is to use sub-domains.


@kevina SGTM.

@Stebalien Stebalien merged commit 93c4f19 into master Aug 24, 2018
@ghost ghost removed the status/in-progress In progress label Aug 24, 2018
@Stebalien Stebalien deleted the kevina/inline-cids branch August 24, 2018 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants