Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to strip Unicode from entry filenames #1135

Merged
merged 13 commits into from
Mar 27, 2018
Merged

Conversation

tech4him1
Copy link
Contributor

@tech4him1 tech4him1 commented Feb 23, 2018

- Summary

The PR adds slug as a global config option.

The slug option allows you to change how filenames for entries are created and sanitized. For modifying the actual data in a slug, see the per-collection option below.

slug accepts multiple options:

  • encoding
    • unicode (default): Sanitize filenames (slugs) according to RFC3987 and the WHATWG URL spec. This spec allows non-ASCII (or non-Latin) characters to exist in URLs.
    • ascii: Sanitize filenames (slugs) according to RFC3986. The only allowed characters are (0-9, a-z, A-Z, _, -, ~).
  • clean_accents: Set to true to remove diacritics from slug characters before sanitizing. This is often helpful when using ascii encoding.

Closes #1012.
Also sanitizes media file slugs, closing #1196.

- Test plan

Added tests of functions. Manually tested all three options.

- Description for the changelog

Add option to strip Unicode from entry filenames.

- A picture of a cute animal (not mandatory but encouraged)

@decaporg decaporg deleted a comment from verythorough Feb 23, 2018
@verythorough
Copy link
Contributor

verythorough commented Feb 23, 2018

Deploy preview for cms-demo ready!

Built with commit 0d12e6b

https://deploy-preview-1135--cms-demo.netlify.com

@@ -60,6 +60,12 @@ public_folder: "/images/uploads"

Based on the settings above, if a user used an image widget field called `avatar` to upload and select an image called `philosoraptor.png`, the image would be saved to the repository at `/static/images/uploads/philosoraptor.png`, and the `avatar` field for the file would be set to `/images/uploads/philosoraptor.png`.

## Slug Type

By default, filenames (slugs) for entries created in the CMS are sanitized according to RFC3987 and the WHATWG URL spec. This spec allows non-ASCII (or non-Latin) characters to exist in URLs. However, for maximum compatibility, you can also set a different slugification option:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I mention that RFC3987 is still technically a draft, not an "official standard"?

@tech4him1 tech4him1 changed the title WIP: Add option to strip Unicode from entry filenames Add option to strip Unicode from entry filenames Feb 24, 2018
@tech4him1 tech4him1 mentioned this pull request Feb 24, 2018
@tech4him1 tech4him1 force-pushed the tech4him1-patch-1 branch 2 times, most recently from b81abe5 to 4cd81ac Compare February 24, 2018 00:38
@vencax
Copy link
Contributor

vencax commented Feb 24, 2018

If it converts all accent chars to asci (not strip them out) and if it apply to filenames of media as well, I am ok with that.

@tech4him1
Copy link
Contributor Author

@vencax Yes, you would set slug_type: latin in your config, and it would do both.

Copy link
Collaborator

@talves talves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM: Seems to work as expected. I will run some more to make sure, but nothing glaring on first run.

@rdela
Copy link

rdela commented Feb 25, 2018

@tech4him1
#1135 (comment)

Options are:

  • iri (default): Keeps Unicode characters in slugs, according the the IRI draft spec (RFC3987) and the WHATWG URL spec. (Does anyone have a better name for this option?)

After looking through IRIs RfC 3987 and WHATWG URL Standard I would say iri works unless you think url makes more sense, looks like WHATWG uses “(The) URL Standard” everywhere, so standard might be another option, but given comment below maybe that would be confusing for now.

https://github.com/netlify/netlify-cms/pull/1135/files#r170398959

Should I mention that RFC3987 is still technically a draft, not an "official standard"?

Linking to IRIs RfC 3987 2.1. Summary of IRI Syntax and/or WHATWG URL Standard 4.3. URL writing and/or https://github.com/whatwg/url somewhere in a code comment or docs might be a cool move if you have not done so already (just glanced through the PR quickly).

Copy link

@rdela rdela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔗🎉

@tech4him1
Copy link
Contributor Author

@rdela Thanks for the review (I updated the docs to add links). On option naming, the other choice that I could think of would be to name it default or normal (like Jekyll's options: https://jekyllrb.com/docs/templates/#options-for-the-slugify-filter).

@rdela
Copy link

rdela commented Feb 25, 2018

@tech4him1 Never a bad idea to follow Jekyll lead because of popularity, sheerly from GH pages use alone. default works best there, normal not as specific to me, and not in link:

The default is default.

That said, no other reason not to use iri, first instinct and all, plus maybe some people end up encouraged to look at the RfC and educate themselves a bit like I was.

@rdela
Copy link

rdela commented Feb 25, 2018

…which is now way easier thanks to doc links


- `iri` (default): Keeps Unicode characters in slugs, according the the IRI draft spec ([RFC3987](https://tools.ietf.org/html/rfc3987)) and the [WHATWG URL spec](https://url.spec.whatwg.org/).
- `latin`: Removes accents/diacritics from slug, then strips out all non-valid URL characters and periods (see `ascii` below).
- `ascii`: Strips out all characters except valid URI chars (RFC3986) or periods (0-9, a-z, A-Z, `_`, `-`, `~`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If latin is effectively ascii + accent removal, it'd be clearer to just state that. Also, we should order these options from most to least permissive, which just mean swapping the last two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If latin is effectively ascii + accent removal, it'd be clearer to just state that.

Agreed, can you suggest wording?

Also, we should order these options from most to least permissive, which just mean swapping the last two.

I originally ordered them in that way, but I'm not sure. You're going to end up with more chars left in latin than in a pure RFC3986 format (ascii option), since you're converting accented chars instead of just stripping them out.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you're converting accented chars instead of just stripping them out.

Then why does it not read:

Converts accent/diacritic characters from slug to ASCII equivalents, then strips out all non-valid URL characters and periods (see `ascii` below).

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdela That is what it does -- my wording needs to be cleared up there.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe they are in the right order after all and latin just needs better description like example suggestion above?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaces accent/diacritic characters from slug with ASCII equivalents, then strips out all non-valid URL characters and periods (see `ascii` below).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does make sense in that it is removing the diacritics and leaving the base chars intact

Oh yeah I can see that meaning now

Copy link
Contributor

@erquhart erquhart Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to wonder if we should do this a little different. This boils down to URI/3986 or IRI/3987, with the option to strip accents. Maybe a two-fold config approach:

slug: iri

# or

slug:
  - protocol: iri
  - strip_accents: true

Would stripping accents ever make sense when using IRI protocol?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erquhart Actually, I think that's the best option. Stripping accents is really going to be independent of the format.

Copy link
Contributor

@erquhart erquhart Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, so most folks can just do slug: uri or slug: iri, but others can provide object if they need tighter control.

Btw I love that pattern and am expecting to use it more as we expand and restructure the APIs - the pattern being config options accepting a primitive value for simplicity, but also an object for fine-grained settings.

**Example**

``` yaml
slug_type: "latin"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slug_type is a bit general, and could be confusing when we finally do more with slugs and potentially have more configuration options for slugs. What do you think about slug_characters or slug_chars?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe slug_charset, or is that too much overlap with HTTP/HTML type stuff?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slug-encoding?

Copy link
Contributor

@erquhart erquhart Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any of the above could work, I shied away from really specific terms like encoding because I'm not sure it's technically accurate, and yeah "charset" kind of has a specific meaning too. At any rate, just something accurate that isn't as broad as "type".

The spec refers to itself as a "protocol", so maybe slug_protocol. Accurate but possibly completely opaque to most folks 🤷‍♀️.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your worry about technical accuracy, but with the prevalence across languages of terms like url_encoded, I think slug-encoding/slug_encoding feels most naturally understandable.
(It also would be nice if we consistently knew whether to use - or _, but it's too late for that! 😝)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention used in Netlify CMS is underscore.

Encoding sounds great, I'm with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention used in Netlify CMS is underscore.

...except when it's hyphen, like yaml-frontmatter, or camel case, like valueField. ;)

Underscore is certainly the most common, though, and good to know that's the standard going forward. (I wonder if that makes sense to document in the contributor docs?)

Copy link
Contributor Author

@tech4him1 tech4him1 Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the options will be unicode and ascii now, instead of iri/uri. @verythorough @erquhart Do you think slug: encoding still makes sense for that?

@tech4him1 tech4him1 self-assigned this Feb 27, 2018
@tech4him1
Copy link
Contributor Author

tech4him1 commented Mar 17, 2018

@verythorough I'd appreciate a docs review on this as well! 😄

@tech4him1 tech4him1 removed their assignment Mar 18, 2018
@@ -81,6 +88,24 @@ describe('sanitizeSlug', ()=> {
).toEqual('This-that-one_or.the~other-123');
});

it('should remove accents if set', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'should remove accents with clean_accents set'

export function sanitizeSlug(str, options = Map()) {
const encoding = options.get('encoding', 'unicode');
const stripDiacritics = options.get('clean_accents', false);
const replacement = options.get('sanitize_replacement', '-');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally undocumented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if we wanted to wait until someone actually had a valid use case for it -- validating it in src/actions/config.js would take a bit of effort. Thoughts?

Copy link
Contributor

@erquhart erquhart Mar 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, I'm fine with it as is, was just curious.

Copy link
Contributor

@erquhart erquhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a couple of comments, but this looks solid!

@erquhart erquhart merged commit cd10a71 into master Mar 27, 2018
@erquhart erquhart deleted the tech4him1-patch-1 branch March 27, 2018 22:56
@tech4him1
Copy link
Contributor Author

@vencax This has been released as part of v1.4.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants