Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce 'lenient' mode for hostname validation. #122

Closed

Conversation

remusao
Copy link
Collaborator

@remusao remusao commented Mar 16, 2018

This PR should address the long-standing #73 issue. It is now possible to enable a more permissive hostname validation using the lenientHostnameValidation option:

> var tldjs = require('tldjs');
> tldjs.isValid('foo.bar_baz.com')
false
> var customTld = tldjs.fromUserSettings({ lenientHostnameValidation: true });
> customTld.isValid('foo.bar_baz.com')
true

@remusao remusao force-pushed the accept-extra-chars-is-valid branch 2 times, most recently from a6a3f1b to 2c29345 Compare March 16, 2018 13:20
@thom4parisot
Copy link
Owner

thom4parisot commented Mar 16, 2018

Good thinking and thank you! 🙂 👍

  • I'm pondering if the setting should be only global (like you implemented) or global +local (at a function level with an extra options parameter).

    I'd be curious to hear more from @ikari-pl, @ZLightning, @caffeinewriter and @nebulade on this. They are users who expressed some interest on this topic in a_b_c.domain.com — Neither domain, nor publicSuffix? (but valid) #73.

  • When it is lenient, does it means it is less compliant with the RFC and closer to the real world implementations?

    I'm pondering if we should follow the de facto usage (lenient) or the RFC (not lenient).

Also, I wonder if that would make a difference for the @brave-intl team (@NejcZdovc and @mrose17 mostly as they depend on it in bat-publisher).

@remusao
Copy link
Collaborator Author

remusao commented Mar 16, 2018

Regarding the settings, I was thinking that since we already have a way to specify it via the factory, it could be a good starting point. No strong opinion on that though.

By lenient I meant more permissive. Which means that by default, underscores are not allowed in hostnames (strict mode, which is the current behavior). When lenientHostnameValidation is set to true, then underscores are allowed. This was based on the discussion at #73, but it could very well be that other constraints are relaxed in the future?

@remusao
Copy link
Collaborator Author

remusao commented Mar 19, 2018

@oncletom Since this PR also introduces some options, shall we open an issue about the API and the way to specify options and have a more high-level discussion there?

@remusao remusao force-pushed the accept-extra-chars-is-valid branch from 2c29345 to d4ef41b Compare March 19, 2018 07:50
@remusao remusao force-pushed the accept-extra-chars-is-valid branch from d4ef41b to 4d99d8e Compare March 19, 2018 11:00
@ZLightning
Copy link

I like the idea of lenient mode, but I think the implementation should include an option for only allowing the lenient mode on subdomains of the TLD/public suffix. Another option could be 3 modes, strict, lenient, and sloppy where sloppy would allow even base domains to contain non-standard characters, lenient would test that the base domain is RFC compliant and skip validating only the subdomain portion, and strict would test the whole FQDN for RFC compliance.

For general purposes, validating the base domain would still be of some importance. Registrars generally follow the RFCs on valid characters, though I am not sure all public suffixes would validate the subdomains users are allowed to create.

I believe the default should be what I describe as "lenient" mode, with "strict" and "sloppy" being options.

@ikari-pl
Copy link

ikari-pl commented Apr 9, 2018

I do support the idea of the fix, of course, as the problem was pretty annoying when it hit me.

It feel weird to me, though, that I suddenly have to create an instance of tldjs where I didn't. It could be, as @oncletom suggested, a per-call option. It would feel cleaner to only add this in the code where I need a specific mode, and would look more verbose in the app logic.

Either way, happy to see the discussion here.

@remusao
Copy link
Collaborator Author

remusao commented Apr 14, 2018

Thanks a lot for your feedback!

@ZLightning I understand that having more flexibility would be needed in some cases. On the other hand, we need to think about what goals tld.js has. We had a similar discussion regarding URL parsing, which is not a trivial task. In the end we decided to keep the good-enough implementation from Node.js but allow users of tld.js to plug their own hostname extraction functions to customize to their needs. Regarding domain validation, I wonder if a similar approach would make sense? The current function is simple but maybe good enough for most needs. Also, it is relatively straightforward to use, fast to run, and should work in most cases. When this is not enough, we could consider plugging another validation function to be used in tld.js instead (which could come from a validation library for example). In fact, to go a step further, it would make sense to allow overriding all functions from the public API which do not have to do with public suffix/tld extraction, but are provided for convenience (isValid, isIp, extractHostname).

@ikari-pl I agree with you, creating a new instance of tld.js to allow any form of customization is cumbersome. On the other hand, we need to be careful, as options applied to one function might impact the behavior of other functions of tld.js, in which case applying options globally by instantiating a new tld.js makes it clear that it might impact different parts of the library. In the end this is a trade-off, and maybe we need to allow both: give an extra argument to isValid, and also a global option which would change the behavior of isValid, as well as all other methods of the API depending on isValid.

What about this:

  • We keep the same API, but some function start accepting optional arguments to customize their behavior (e.g.: isValid accepts some option for lenient mode, as suggested by @ikari-pl)
  • fromUserSettings will also accept options to customize the behavior, and will customize methods the same way users can: by using their optional arguments.

Because these changes could introduce some breaking changes (not all of them of course), maybe this could be done in two times?

  1. Introduce optional options for some methods such as isValid could be done transparently, without breaking compatibility. That would be a low hanging-fruit and could already allow people to benefit from more granular customization of individual methods of tld.js.
  2. We could discuss deeper changes in the API, as initiated by @oncletom in API design considerations for tld.js@3.0 #124 where breaking changes can be considered.

What do you think?

@thom4parisot
Copy link
Owner

thom4parisot commented May 15, 2018

I'm all in favour of going step by step and to do so, to start by function arguments (rather than via a factory argument) 👍

I don't know how many people assume it's strict or lenient at the moment — from the comments I'd say lenient — and a semver bump is a good occasion to clarify the API intents and to set the good defaults.

@7c
Copy link

7c commented Sep 24, 2018

I think it is essential to distinguish between domain part and the host part. There are possibly no underscore domains available even though RFC would allow but there are tons of real world _ hostnames out there. Especially from big providers. I personally think you should allow _ in host part without any setting by default. I think we all want to use this wonderful repo in real world. I am parsing billions of logs daily with it and had to make some workarounds to allow _ in host.

@remusao
Copy link
Collaborator Author

remusao commented Sep 24, 2018

@taskinosman Thanks for your comment. As mentioned in a previous comment, if you need this change quickly you could give tldts a shot. By default it will allow _ in both host and domain parts, would that work for you? If not, we can work to make the behavior more accurate.

@remusao remusao closed this Jan 10, 2019
@remusao remusao deleted the accept-extra-chars-is-valid branch January 10, 2019 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants