
Internationalization of page titles #140

replaid opened this issue Oct 23, 2022 · 8 comments
replaid commented Oct 23, 2022

As a wiki author working in a language with a non-Latin script, I want to be able to link to a page like [[Гильдии]] (the equivalent of [[Guilds]] in Russian). Currently wiki strips out all the non-Latin characters from page titles, so Гильдии converts to a slug that is the empty string.

In the specific case of the Russian language, the alphabet is very phonetic, so many Russian websites have software to solve this problem by mapping the Cyrillic letters to Latin letters or clusters of Latin letters, in this case Gil'dii. However, it seems likely that such a "Russian mode" would not be the best solution for wiki.

It looks to me like there is a fork in the road. I will call the two general paths I see "incompatible slug" and "compatible slug", as a best-effort description.

Incompatible slug

We could opt to present non-Latin page title characters directly in the URLs and let the backend map those to and from some kind of encoding for the filenames (or even filenames in a subset of UTF-8). This is the general approach Wikipedia takes. When someone links to [[Гильдии]], Wikipedia URLencodes that Unicode page title into the link like <a href="/wiki/%D0%93%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8">. Users see /wiki/Гильдии in the URL bar. I don't know what exactly happens on the back end.
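For reference, this encoding is just the browser's own primitives; a quick round trip in JavaScript:

    // percent-encode a Unicode title into an href, and decode it back
    encodeURIComponent('Гильдии')
    // => '%D0%93%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8'
    decodeURIComponent('%D0%93%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8')
    // => 'Гильдии'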

I suspect this approach has a different texture than the current approach of Federated Wiki. I think it would represent a change in direction and would have repercussions that would upset the "just enough" ethos that characterizes this project.

Compatible slug

Alternatively, we could convert non-Latin letters to some slug that is compatible with the current backend.

This is pretty much exactly how domain names with non-Latin letters are handled: they are mapped to Latin letters using Punycode (in which Гильдии is lowercased and encodes to xn--c1aclbap3j), and are then compatible with existing DNS. Slugs that are not based primarily on Latin characters are unreadable to humans, but they are unambiguously decodable to a lowercased version of the non-Latin input.

If we were to add Punycode encoding to the asSlug method, I think a lot of things would just work, especially given how well wiki already behaves when I create a page by typing [[по-русски]]: this discards all the letters and creates the slug -, which works fine as long as I don't make another such page.
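As a rough sketch of that idea (not the actual asSlug code; asSlugPuny is a hypothetical name, and this assumes the punycode.js npm package):

    // Hypothetical Punycode-aware slug: pure-Latin titles keep their
    // familiar slug; anything else is encoded with punycode.toASCII.
    const punycode = require('punycode/');  // userland punycode.js package

    const asSlugPuny = (name) => {
      const dashed = name.replace(/\s/g, '-').toLowerCase();
      const latin = dashed.replace(/[^a-z0-9-]/g, '');
      if (latin === dashed) return latin;  // e.g. 'Guilds' -> 'guilds'
      return punycode.toASCII(dashed);     // 'Гильдии' -> 'xn--c1aclbap3j'
    };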

I have not yet dug into the code for sitemaps and searches, but I would imagine that this code would need to become aware of Punycode-decoding slugs that begin with xn--. But once that's in, I think it would be quite transparent.

  1. Author links to [[Гильдии]]
  2. Slug is calculated to be xn--c1aclbap3j
  3. Click to create the page, the file is created with filename xn--c1aclbap3j
  4. Reader types гил into the search bar
  5. Search code has seen xn--c1aclbap3j in the sitemap information and decoded that to гильдии for search matching purposes
  6. гил matches as a substring of гильдии just as guil matches as a substring of guilds for a page named Guilds.
  7. Reader is given a link to the page with slug xn--c1aclbap3j and clicks it
  8. Wiki sees the Punycode and, while the page loads, displays the slug in lowercased non-Latin characters (гильдии), which is replaced with Гильдии once the page title loads

So the only real user-visible wart is the presence of the Punycode in the link itself.
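The decoding in steps 5 and 8 might look something like this sketch (displayForm is a hypothetical helper, again assuming the punycode package):

    // Map an xn-- slug back to its lowercased Unicode form, for search
    // matching and for display while the page loads.
    const punycode = require('punycode/');

    const displayForm = (slug) =>
      slug.startsWith('xn--') ? punycode.toUnicode(slug) : slug;

    displayForm('xn--c1aclbap3j');  // 'гильдии', so 'гил' matches as a substring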

This could even be a stepping stone to eventually keeping the client experience in native languages in future steps.

I may be missing other uses of the slug that need to be accounted for.

This issue is an offshoot of conversations at fedwiki/wiki-client#103 and #139.


replaid commented Oct 25, 2022

Question: Will changing the slug format break searching?
Answer: Apparently it will not.

I was concerned that changing the slug format was going to break searching, but it doesn’t seem that it will.

Searching uses JavaScript’s toLowerCase, and that works in non-Latin scripts. And I haven’t yet found a case where searching depends on the slug: when I create a page whose two-word name pairs a Latin-script word like Good with a non-Latin-script word like Время, whether the page appears in search results depends only on what wiki has loaded, not on whether I search with the Latin-script word (which appears in the slug) or the non-Latin-script word (which does not).
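A minimal illustration of why this works:

    // toLowerCase is Unicode-aware, so non-Latin search terms normalize fine
    'Время'.toLowerCase()                          // 'время'
    'good время'.includes('Время'.toLowerCase())   // true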

Current behavior:

  1. link to [[Good Время]]
  2. slug gets written as good- (Punycode would be xn--good--lwev0d8a5l)
  3. searching for good or Время or even время brings up the source page if it’s within 1 hop of the default page. None of these work if not.
  4. create the new page
  5. I search for good or Время or even время and the new page comes up
  6. Other wiki sites behave as in step 3: neither word brings up the page as a search term until the site with the page joins the neighborhood, at which point either word works.


replaid commented Oct 26, 2022

The next question is compatibility.

I have found a couple of important things:

  • At page creation the Unicode page name goes over the wire and the slug does not!
  • Slugs are computed from the Unicode page name both in the client and in the server. The computation is not canonicalized in just one place.

This suggests to me that it might be good to decrease the use of the slug elsewhere and use the narrower representation only for the persistence layer. Then, with the full version in most places, we can derive the old encoding wherever the legacy case needs handling.

What I mean is that if there’s a page [[Ramón]], the legacy slug is computed to be ramn, and any better slug algorithm would have additional information, such as xn--ramn-sqa.

Page links are stored on disk exactly as the author typed them:

    {
      "type": "edit",
      "id": "40b42210652a8c46",
      "item": {
        "type": "paragraph",
        "id": "40b42210652a8c46",
        "text": "[[Good Время]]"
      },
      "date": 1666688912903
    },

This is also what goes over the wire in the server’s response when pages are requested. The uses of the slug are fewer than I thought.

Pages are stored on disk under a filename corresponding to their slug, which of course is currently the old slug that loses non-Latin-character information.

Exploring possible solution: as a first step, the server learns to respond to Unicode “slugs”

One way to tackle this could be to speak Unicode instead of slugs over the wire, while maintaining a similar lossy behavior when slugs are spoken over the wire. It might go like this:

The upgraded server renames all page files to some non-lossy encoding of the Unicode page title (Punycode, actual Unicode name, URLencoding) and checks this at startup.

The upgraded server maintains a list of old slugs for all its pages, computed from the Unicode page titles.

When receiving a request that matches an old slug, the server returns the corresponding page. Its behavior when multiple pages match is undefined: it returns one of them, much as current wiki’s behavior is undefined when multiple Unicode page names resolve to the same slug. So the upgraded server behaves just like current wiki when sent a slug from current wiki.

When receiving a request in upgraded wiki’s new non-lossy format, the server doesn’t interpret it; it only maps it to whatever non-lossy filename format is used and returns the corresponding page (or a 404 if it doesn’t exist).

My remaining questions are related to other sites, both requesting their pages and interpreting their sitemap.json/site-index.json data.


replaid commented Oct 27, 2022

I have finished my pass through all the variables I saw discussed, through using wiki itself, and through the code I read. I think I have found a solution that will work well for humans, machines, and this project’s needs. So here is my proposal for what to do. I would appreciate the team’s feedback, especially @almereyda, @WardCunningham, and @paul90, who have been kindly discussing i18n with me in #139 and fedwiki/wiki-client#103.

I am ready to make a PR taking a stab at this if the team feels they’d be willing to merge a good implementation of it.

Thank you for your encouragement and willingness to look at this issue.

tl;dr

Unicode slugs with backward compatibility. Biggest impact outside the code base is renaming files in the wiki’s pages/ directory.

Summary of the proposal

  • there will be a new slug format (I’ll call this “Unislug” and the existing slug “OldSlug”)
  • the OldSlug is derivable from Unislug, which allows backward compatibility
  • the Unislug format will be (pseudocode) ~~rfc3491Nameprep(whitespaceToDash(unicodePageTitle))~~ unicodeAwareCharacterRestrict(whitespaceToDash(unicodePageTitle))
  • the HTML links that are output by upgraded clients will be the URLencoded Unislug (this is what Wikipedia does here)
  • the upgraded server stores wiki pages in some encoding of Unislug, either URLencoding or Punycode would work, or maybe just Unislug itself if that turns out to work
  • the server will serve requests for OldSlug (with behavior nearly identical to current behavior, only differing in which page gets served when there’s a collision due to the very OldSlug limitations being addressed in this issue)
  • there will be new sitemap-i18n.json/site-index-i18n.json endpoints that provide Unislugs (and change nothing else), and existing sitemap.json/site-index.json endpoints will continue to speak OldSlug
  • new clients will try the new Unislug sitemap-i18n.json/site-index-i18n.json endpoints first, then fall back to the existing OldSlug endpoints if the Unislug endpoints 404

New slug format (“Unislug”)

Non-lossy slugs, which are called Unislugs here, are generated from the Unicode page title as follows:

  1. Convert whitespace characters to hyphens as is currently done
  2. Then apply ~~RFC 3491 Nameprep~~ a Unicode version of the current regex replacement to wiki page names (basically lowercasing and sanitization).

The result after this preprocessing will continue to have non-Latin characters.

The Unislug format has the property that existing OldSlugs can be derived from Unislugs, which creates the opportunity for thorough backward compatibility.
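A sketch of that derivation (oldSlugFromUnislug is a hypothetical name; it just re-applies the current Latin-only restriction):

    // An OldSlug is the Unislug with everything outside [a-z0-9-] dropped.
    const oldSlugFromUnislug = (unislug) =>
      unislug.replace(/[^a-z0-9-]/g, '');
    // 'peña' -> 'pea', 'гильдии' -> '', 'по-русски' -> '-'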

Code changes

What changes will be made to the upgraded server code?

The upgraded server renames all page files to some non-lossy encoding of Unislug (this might be Punycoded Unislug, the Unislug itself, or URLencoded Unislug). This can be a one-time process that writes an upgrade-marker file somewhere when it has been done. The server checks for this at startup, and if it hasn’t been done, either does it or exits. Note that many wikis may have no pages with non-Latin characters in page titles and may need no changes at all, so we should not bother migrating those sites.
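A sketch of that one-time migration (hypothetical helpers: titleFor reads the title out of a page’s JSON, asUnislug is the new slug function; URLencoding stands in for whichever non-lossy encoding is chosen):

    const fs = require('fs');
    const path = require('path');

    function migratePages(pagesDir, titleFor, asUnislug) {
      const marker = path.join(pagesDir, '.unislug-upgraded');
      if (fs.existsSync(marker)) return;     // already migrated: skip
      for (const file of fs.readdirSync(pagesDir)) {
        const target = encodeURIComponent(asUnislug(titleFor(file)));
        if (target !== file)
          fs.renameSync(path.join(pagesDir, file), path.join(pagesDir, target));
      }
      fs.writeFileSync(marker, '');          // checked at startup
    }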

The upgraded server maintains a list of OldSlugs for all its pages, computed from the Unislugs.

When receiving a request that matches an OldSlug on the list, the server returns the corresponding page. Its behavior when multiple pages match is undefined: it returns one of them, much as current wiki’s behavior is undefined when multiple Unicode page names resolve to the same slug. So the upgraded server behaves just like current wiki when current wiki requests an OldSlug.

Currently wiki writes a file for the first page to match a particular slug, but now it will be possible for wiki to have multiple pages matching a particular slug. First of all, this feels like undefined behavior that can’t reasonably be considered part of any contract wiki makes. But even so, serving the matching page whose file has the earliest creation timestamp will very closely approximate wiki’s current behavior.

When receiving a request for upgraded wiki’s Unislug, the server doesn’t interpret it; it only maps it to whatever non-lossy filename format is used and returns the corresponding page (or a 404 if it doesn’t exist).
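Putting the two lookup rules together, resolution might look like this sketch (hypothetical names; URLencoded filenames assumed):

    const fs = require('fs');
    const path = require('path');

    function resolvePage(requested, pagesDir, oldSlugToUnislug) {
      // Unislug request: map straight to a filename, no interpretation
      const direct = path.join(pagesDir, encodeURIComponent(requested));
      if (fs.existsSync(direct)) return direct;
      // legacy OldSlug request: consult the map built at startup
      const unislug = oldSlugToUnislug[requested];
      if (unislug) return path.join(pagesDir, encodeURIComponent(unislug));
      return null;  // caller responds 404
    }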

What changes will be made to the upgraded client code?

Links to wiki page names are URLencoded in the HTML anchor tags, so that they display as Unislugs, including any non-Latin characters, in the URL bar, just as happens on Wikipedia.
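For example (a sketch using the browser primitive):

    // the href carries the URLencoded Unislug; the location bar shows
    // the decoded form
    const hrefFor = (unislug) => `/${encodeURIComponent(unislug)}.html`;
    hrefFor('гильдии');  // '/%D0%B3%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8.html'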

Sitemaps

sitemap.json

What’s currently in the sitemap.json endpoint’s output?

It’s an array of objects that have title, synopsis, date, slug, and a links hash whose keys are OldSlugs and whose values are wiki random IDs, apparently of journal events.

What slugs are published in the upgraded server’s sitemap.json?

This endpoint will keep publishing OldSlugs. A new sitemap-i18n.json will keep the same format but publish the non-lossy slugs.

How do legacy clients handle sitemap.json from the upgraded server?

Nothing changes for them: they request the old endpoint and get unchanged output.

How do upgraded clients handle sitemap.json from the upgraded server?

They request the new endpoint (perhaps sitemap-i18n.json) and run it through the existing code paths.

How do upgraded clients handle legacy sitemap.json?

Upgraded clients request the new endpoint, and when that is a 404, they request the legacy endpoint and run it through the existing code. They should not fall back to the legacy endpoint when encountering non-404 errors.
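A sketch of that fallback (fetchSitemap is a hypothetical name; the returned tag also decides which site-index endpoint to request later):

    async function fetchSitemap(origin) {
      const res = await fetch(`${origin}/sitemap-i18n.json`);
      if (res.ok) return { slugs: 'unislug', pages: await res.json() };
      if (res.status !== 404)
        throw new Error(`sitemap fetch failed: ${res.status}`);  // no fallback
      const legacy = await fetch(`${origin}/sitemap.json`);
      if (!legacy.ok)
        throw new Error(`sitemap fetch failed: ${legacy.status}`);
      return { slugs: 'oldslug', pages: await legacy.json() };
    }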

site-index.json

This is related to sitemap.json; we should load the OldSlug or Unislug version of it depending on which sitemap endpoint we loaded.

What’s currently in the site-index.json endpoint’s output?

An object with attributes including documentIds and index. documentIds is a hash mapping integers to OldSlugs.

What slugs are published in the upgraded server’s site-index.json?

This endpoint will keep publishing OldSlugs. A new site-index-i18n.json will keep the same format but publish Unislugs.

How do legacy clients handle site-index.json from the upgraded server?

Nothing changes for them: they request the old endpoint and get unchanged output.

How do upgraded clients handle site-index.json from the upgraded server?

They request whichever version matches the endpoint that worked for loading the sitemap, and run it through the existing code.

How do upgraded clients handle legacy site-index.json?

When upgraded clients fall back to the legacy sitemap.json endpoint, they request the legacy endpoint for the index also. There should not be an attempt to upgrade or downgrade from the version decided for the sitemap, since the slug formats are incompatible and these two requests are related.


replaid commented Oct 28, 2022

I dug in further and was glad to find that cooking down Unicode can be way simpler and much more like what wiki does today:

https://javascript.info/regexp-character-sets-and-ranges

asUnislug = (name) ->
  # whitespace becomes dashes; keep letters in any script, combining
  # marks, digits, and dashes; then lowercase (Unicode-aware)
  name.replace(/\s/g, '-').replace(
    /[^\p{Alphabetic}\p{Mark}0-9-]/gu, ''
  ).toLowerCase()

I hadn't looked into it before now, but it looks like these Unicode regexes are usable by 95% of browsers: CanIUse browser support statistics

This throws a SyntaxError in old Chrome and Firefox versions, and presumably in other old browsers too, so this could fall back to the old slug processing there. One caveat to test: a regex literal throws at parse time rather than at call time, so the Unicode pattern likely needs to be compiled with new RegExp for a fallback like the following to catch it.

// For the catch to fire at call time, the Unicode pattern inside
// asUnislug should be compiled with `new RegExp(...)` rather than a
// regex literal, which would throw while the script is parsed.
let slug;
try {
  slug = asUnislug(name);
} catch (e) {
  if (e instanceof SyntaxError) {
    slug = asSlug(name);  // old-browser fallback
  } else {
    throw e;
  }
}

WardCunningham commented

We do appreciate your effort here. Thanks.


replaid commented Oct 29, 2022

OldSlug vs. Unislug comparison

| Input | OldSlug | Unislug | Unislug HTML link | Unislug appearance in location bar |
| --- | --- | --- | --- | --- |
| [[Peña]] | pea | peña | /pe%C3%B1a.html | /peña.html |
| [[Pea]] | pea | pea | /pea.html | /pea.html |
| [[RAMN]] | ramn | ramn | /ramn.html | /ramn.html |
| [[Ramén]] | ramn | ramén | /ram%C3%A9n.html | /ramén.html |
| [[Ramón]] | ramn | ramón | /ram%C3%B3n.html | /ramón.html |
| [[Гильдии]] | (empty string) | гильдии | /%D0%B3%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8.html | /гильдии.html |
| [[По-русски]] | - | по-русски | /%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8.html | /по-русски.html |
| [[Что-то]] | - | что-то | /%D1%87%D1%82%D0%BE-%D1%82%D0%BE.html | /что-то.html |

Resolving conflicts

  1. New wiki ensures all page files are renamed to Unislugs derived from the page title at startup. So
    • the Peña page currently having the filename pea is renamed to peña,
    • Ramón from ramn to ramón, and
    • По-русски from - to по-русски.
      The file name can be the literal Unislug string, its URLencoded equivalent, its Punycode equivalent, or anything else that technically works and unambiguously maps to and from the Unislug.
  2. New wiki builds a hashmap that translates OldSlugs to Unislugs like
    {
      "pea": "pea",
      "ramn": "ramón",
      "-": "по-русски"
    }
    resolving conflicts by mapping to the file with the oldest creation time among the files involved in the name collision, to emulate the existing wiki behavior for such collisions (see the sketch after this list).
  3. New wiki renders the link using the URLencoded Unislug as shown in the above table, which results in the browser displaying the URL in the correct script.
  4. When the link is clicked, wiki-server checks its Unislug pages for a match, but if they don't contain a match, it tries the OldSlug mappings in step 2.
  5. The sitemap.json and site-index.json endpoints continue to present the OldSlug. New endpoints named something like sitemap-i18n.json and site-index-i18n.json provide the same data but with Unislugs.
  6. New wiki clients request the sitemap-i18n.json first for a given wiki entering the neighborhood, then fall back to sitemap.json if that is a 404. The site-index request goes along with whatever version was the result of the sitemap request.
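A sketch of step 2’s map construction (hypothetical unislugFor decodes a filename back to its Unislug; the oldest file wins each collision):

    const fs = require('fs');
    const path = require('path');

    function buildOldSlugMap(pagesDir, unislugFor) {
      const winners = {};  // oldSlug -> { unislug, birthtimeMs }
      for (const file of fs.readdirSync(pagesDir)) {
        const unislug = unislugFor(file);
        const oldSlug = unislug.replace(/[^a-z0-9-]/g, '');
        const { birthtimeMs } = fs.statSync(path.join(pagesDir, file));
        if (!winners[oldSlug] || birthtimeMs < winners[oldSlug].birthtimeMs)
          winners[oldSlug] = { unislug, birthtimeMs };
      }
      return Object.fromEntries(
        Object.entries(winners).map(([old, w]) => [old, w.unislug]));
    }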

I am unaware of anything that would break in this process. The collisions discussed are already present in current wiki. We continue to strictly limit the character set usable in slugs.


replaid commented Oct 29, 2022

Got some feedback. Next actions:

  • look at viewing new pages in old client
  • look at forking new pages in old client
  • understand @dobbs’s thoughts on, and insights from, his static hosting efforts
  • see if I can find potential breakage in the plugins
  • make a new issue to summarize what has been learned in this issue, and close this one

almereyda commented

Best @replaid, many thanks for paving the way here. Looking forward to reading http://john.permakultura.wiki/guilds.html in many other languages too, once we are through with #139.

Minor observations made me slow down during reading. They were:

  • If Unislug has a name that hints at its encoding, could OldSlug move past its relative perspective, and identify itself with similar semantics, like LatinSlug?
    Is UniSlug then also its sibling's more commonalised name?
  • Is it a sane way to consider trying to write sanitised Unicode file names to disk, and fall back to other -Slug approaches transparently?
    In which order would PunySlug, URLSlug and LatinSlug be tried?
  • Would UniSlug be restricted to UTF-8 Unicode codepoints, or also support UTF-16 and following for different normalisation forms?
  • Is there a way in which we could consider a Unicode-safe processing of slugs the new default, not having to rely on appended i18n fragments to associated resources?
    Could content negotiation, canonicalised paths for versioned APIs, or URL parameters step in here? As in:
    • Accept: application/ld+json
      and have it serve a plain JSON file for the client with an HTTP response header referencing an IRI of a JSON-LD frame for reconstructing the implied @context,
    • /sitemap.json?v2 or
    • /v2/site-index.json?
  • In doing all of the above, are we switching from URLs to IRIs?
    UTF-8 would be a hard limit here, until RFC 3987 is amended or replaced.

With the beautiful examples that you collected, and with the anticipated changes that you stepped through for us, I am very confident that we can grow our local(-ised 🌐) wiki communities.
