Meta: Improve splitting of multipage output #2552

Merged
merged 3 commits into from
Jun 8, 2017

Conversation

sideshowbarker
Contributor

This addresses whatwg/wattsi#27

wattsi already supports splitting out a separate multipage file from a heading
element at any level, for any heading element with a split-filename attribute.
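
To illustrate the mechanism described above, here is a minimal Python sketch that scans markup for heading elements carrying a split-filename attribute and lists the files a split would produce. It is not wattsi's actual implementation; the names `SplitPointFinder` and `find_split_points`, and the choice to append `.html` to the attribute value, are assumptions made for illustration.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SplitPointFinder(HTMLParser):
    """Collects (heading tag, output filename) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.split_points = []

    def handle_starttag(self, tag, attrs):
        # Only headings with a non-empty split-filename start a new file.
        if tag in HEADINGS:
            filename = dict(attrs).get("split-filename")
            if filename:
                # Assumption: the output file is the attribute value plus ".html".
                self.split_points.append((tag, filename + ".html"))


def find_split_points(source):
    finder = SplitPointFinder()
    finder.feed(source)
    return finder.split_points


sample = """
<h3 split-filename="embedded-content" id="embedded-content">Embedded content</h3>
<h4 split-filename="images">Images</h4>
<h4>Media elements</h4>
"""
print(find_split_points(sample))
```

The heading level does not matter here: any of `h1` to `h6` with the attribute is treated as a split point, and headings without it stay in the current file.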

So this change adds the split-filename attribute to a bunch more headings in
the interest of making the splits more logical and usable and also in the
interest of reducing the file size of some of the larger splits (before this
change, the forms.html file was 1.1MB, and there were several other files that
were larger than 500KB).

@domenic
Member

domenic commented Apr 18, 2017

Woah, this is a big change and will cause a lot of incoming links to redirect. I'm not against it necessarily, but my original thinking was that we'd have special dev-edition-only splits, since the dev edition is less about "load as much as we can without hanging your browser" (my interpretation of today's multipage) and is more about "give me a reasonable split as if this were a book chapter" (see today's https://developers.whatwg.org/).

What do you think?

So excited you're taking on the overall issue, BTW!!

@domenic
Member

domenic commented Apr 18, 2017

Whichever way we go, I do think that we might want to ape the table of contents at https://developers.whatwg.org/ more closely, e.g. keeping "Introduction" as a single page.

Member

@zcorpan zcorpan left a comment

I haven't checked how developers.whatwg.org currently splits, but offer a knee-jerk reaction.

source Outdated
@@ -24788,7 +24788,7 @@ interface <dfn>HTMLModElement</dfn> : <span>HTMLElement</span> {



-  <h3 split-filename="embedded-content" id="embedded-content">Embedded content</h3>
+  <h3 split-filename="img-picture-source" id="embedded-content">Embedded content</h3>
Member

I'd prefer to not change this one.

Contributor Author

-  <h3 split-filename="embedded-content" id="embedded-content">Embedded content</h3>
+  <h3 split-filename="img-picture-source" id="embedded-content">Embedded content</h3>

I'd prefer to not change this one.

OK, have restored it

source Outdated
@@ -101535,7 +101537,7 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {

<div w-nodev>

-  <h4><dfn>Tokenization</dfn></h4>
+  <h4 split-filename="tokenization"><dfn>Tokenization</dfn></h4>
Member

Do we really want to split up the parsing section? Having named character references split out seems OK but I think the rest should probably be one page, so it's easier to find things.

Contributor Author

+  <h4 split-filename="tokenization"><dfn>Tokenization</dfn></h4>

Do we really want to split up the parsing section? Having named character references split out seems OK but I think the rest should probably be one page, so it's easier to find things.

OK, yeah agreed. Have changed it back to that. I think for the case of the parsing section it makes sense to keep it in one file, even though it ends up being very large.

source Outdated
@@ -31531,7 +31532,7 @@ interface <dfn>HTMLTrackElement</dfn> : <span>HTMLElement</span> {
</div>

<!--TOPIC:Video and Audio-->
-  <h4>Media elements</h4>
+  <h4 split-filename="media">Media elements</h4>
Member

Isn't this better together with video-and-audio?

Contributor Author

+  <h4 split-filename="media">Media elements</h4>

Isn't this better together with video-and-audio?

That’s another case where the file size is excessive with them combined

Member

Ok

@@ -25611,7 +25611,7 @@ the time Maria had stuck her tongue out...&lt;/p></pre>
</div>


-  <h4>Images</h4>
+  <h4 split-filename="images">Images</h4>
Member

I think this is better together with embedded-content.

Contributor Author

+  <h4 split-filename="images">Images</h4>

Why? With those two together in the same file, the size is 1MB+. Do we not care about that?

Member

That's still about half of the average web page. 😁

I guess we do care but a conflicting goal is to have related things together. I could go either way here though.

@sideshowbarker
Contributor Author

Woah, this is a big change and will cause a lot of incoming links to redirect.

Yeah, it’s not good to add disruption unless there’s real value to it. In this case, having spent time looking at the existing splits, it’s clear to me at least that they’re not adequate and never really have been. At least not if our goal here is to optimize for the needs of the people choosing to read and use the multipage spec.

I'm not against it necessarily, but my original thinking was that we'd have special dev-edition-only splits,

It seems strange to me to have the dev spec split out differently than the full spec, but if that’s what we want I am happy to hack something in to wattsi to handle it

since the dev edition is less about "load as much as we can without hanging your browser" (my interpretation of today's multipage)

Yeah, I personally think that’s not a super-admirable choice we’ve been making for the readers of the spec.

and is more about "give me a reasonable split as if this were a book chapter" (see today's https://developers.whatwg.org/).

I guess I’m at the point where, after having seen a lot of hours put into producing the dev version in the past without seeing very many people really make it their version of choice (they just read the full spec instead), I would sorta prefer to make the full spec more usable to everybody (by improving the information design as we did in the link-element section, and by trying harder to not end up with multiple 1MB+ files among the multipage split files) instead of trying to solve some problems for some people by making a secondary version.

@domenic
Member

domenic commented Apr 18, 2017

Yeah, I personally think that’s not a super-admirable choice we’ve been making for the readers of the spec.

Can you say more?

To me most implementers and readers of the full spec are better served by single-page, but that hangs browsers, so it's only usable by those who are willing to take a hit (e.g. people like us who leave it open in a tab pretty constantly).

For that audience, the multipage is then basically "singlepage split minimally so that it doesn't hang browsers". So from this point of view, fewer splits are better.

It seems strange to me to have the dev spec split out differently than the full spec, but if that’s what we want I am happy to hack something in to wattsi to handle it

Well, let's definitely discuss a bit more. It may be my perspective on the multipage is off.

To me the dev edition is a pretty different beast. It's meant to be more like an online "book", IMO; something like https://doc.rust-lang.org/book/. So to me the splits there represent logical "chapters". Indeed, I have fond memories of reading some HTML 4.01 reference books; if we had a nice developer edition of the spec at that time, I'm sure I would have loved that.

Whereas, per my above perspective, the splits on the multipage version are more just about mitigating the file size.

I'm curious if anyone else sees things that way.

I guess I’m at the point where, after having seen a lot of hours put into producing the dev version in the past without seeing very many people really make it their version of choice (they just read the full spec instead), I would sorta prefer to make the full spec more usable to everybody (by improving the information design as we did in the link-element section, and by trying harder to not end up with multiple 1MB+ files among the multipage split files) instead of trying to solve some problems for some people by making a secondary version.

I think there's room for both. I remember the dev edition being very active, especially around when I was getting into HTML. It was definitely where I linked people for many years, until I came to realize it was lagging. It contains just the right content for a web developer to really read it straight through (again, like a book), especially in parts like the HTML element definitions, where it hides away all the implementation algorithms but gives central focus to the semantics and examples and such. I think there's a good niche for us to work on here where we make the dev edition something we're proud to point people to, and people reference often.

But yes, we should definitely make the original spec better too!! The link element PR is a great example of that. And the fact that we're building from the same source should help benefit both goals at once.

That leaves us with the issue discussed above: IMO 1 MB+ files are not really a problem for original-spec multipage readers. But I'm curious to hear what others think.

@annevk
Member

annevk commented Apr 19, 2017

My main perspective on multipage at the moment is that I like that it's stable (the redirects tend to break somehow) and also that I mostly don't use it because dfn.js doesn't work across pages.

@sideshowbarker
Contributor Author

I mostly don't use [multipage] because dfn.js doesn't work across pages

Yeah, agreed that’s a big priority. I’ll make time this weekend to look at finally implementing it

@sideshowbarker
Contributor Author

since the dev edition is less about "load as much as we can without hanging your browser" (my interpretation of today's multipage)

Yeah, I personally think that’s not a super-admirable choice we’ve been making for the readers of the spec.

Can you say more?

We seem to be making our own conclusions about who the readers of the spec are, and what they want—without any clear evidence to support the conclusions.

Lacking clear evidence, I believe it would be more friendly and accommodating to all readers (in the potentially very broad set of readers we have for the spec) to make the spec follow some general usability best practices for web documents. We know one common best practice is to keep file sizes of documents lower if possible, for a number of reasons, including to make the documents more usable on mobile and in general to reduce page-load times instead of making readers wait. And we know another common best practice is, when possible, to split documents into logical, somewhat-discrete chunks that are easier to consume than larger documents with disparate parts.

To me most implementers and readers of the full spec are better served by single-page, but that hangs browsers, so it's only usable by those who are willing to take a hit (e.g. people like us who leave it open in a tab pretty constantly).

I think there are a lot more readers of the spec than just us and the implementors and other people we know and interact with. I personally think many other people would be much better off reading MDN rather than trying to make their way through the spec to understand the requirements, but from what I can see, no matter how great the information at MDN is, there are still a lot of people who turn to reading the spec to get information, and we should be considering them in trying to make the user experience as good as we possibly can for the broad range of people using it.

For that audience, the multipage is then basically "singlepage split minimally so that it doesn't hang browsers". So from this point of view, fewer splits are better.

I don’t know what evidence we have that the audience of people who see multipage as being "singlepage split minimally so that it doesn't hang browsers" is bigger or more important than the audience of general readers who want the spec to be more aligned with common best practices for usability of web documents.

IMO 1 MB+ files are not really a problem for original-spec multipage readers.

I think it’s reasonable to not see it as a problem but I wonder what the cutoff point on file size actually is. Clearly we consider a file size of 8MB to be too big to not provide an alternative for, because otherwise we wouldn’t be investing in making the multipage version.

That said, I’m not sure what should otherwise be considered too big in multipage. But the split in this PR is proof that if we want, we can get the spec into logical, somewhat-discrete chunks that are easier to consume, with the size of most files below 500KB.

Below is what the file-size results are after the splits in this PR.

mike@Darwin ~/workspace/html-build/output/multipage
$ ls -S
total 25472
-rw-r--r--  1 mike   2.7M Apr 19 22:34 fragment-links.json
-rw-r--r--  1 mike   628K Apr 19 22:34 parsing.html
-rw-r--r--  1 mike   522K Apr 19 22:34 input.html
-rw-r--r--  1 mike   482K Apr 19 22:34 media.html
-rw-r--r--  1 mike   434K Apr 19 22:34 canvas.html
-rw-r--r--  1 mike   391K Apr 19 22:34 indices.html
-rw-r--r--  1 mike   323K Apr 19 22:34 form-control-infrastructure.html
-rw-r--r--  1 mike   312K Apr 19 22:34 webappapis.html
-rw-r--r--  1 mike   255K Apr 19 22:34 rendering.html
-rw-r--r--  1 mike   245K Apr 19 22:34 form-elements.html
-rw-r--r--  1 mike   221K Apr 19 22:34 named-characters.html
-rw-r--r--  1 mike   213K Apr 19 22:34 images.html
-rw-r--r--  1 mike   203K Apr 19 22:34 text-level-semantics.html
-rw-r--r--  1 mike   197K Apr 19 22:34 document-metadata.html
-rw-r--r--  1 mike   187K Apr 19 22:34 obsolete.html
-rw-r--r--  1 mike   186K Apr 19 22:34 offline.html
-rw-r--r--  1 mike   185K Apr 19 22:34 interaction.html
-rw-r--r--  1 mike   185K Apr 19 22:34 links.html
-rw-r--r--  1 mike   182K Apr 19 22:34 browsing-the-web.html
-rw-r--r--  1 mike   180K Apr 19 22:34 iframe-embed-object.html
-rw-r--r--  1 mike   176K Apr 19 22:34 tables.html
-rw-r--r--  1 mike   162K Apr 19 22:34 scripting.html
-rw-r--r--  1 mike   162K Apr 19 22:34 workers.html
-rw-r--r--  1 mike   149K Apr 19 22:34 dnd.html
-rw-r--r--  1 mike   144K Apr 19 22:34 history.html
-rw-r--r--  1 mike   142K Apr 19 22:34 sections.html
-rw-r--r--  1 mike   142K Apr 19 22:34 interactive-elements.html
-rw-r--r--  1 mike   138K Apr 19 22:34 mdvocabs.html
-rw-r--r--  1 mike   135K Apr 19 22:34 dependencies.html
-rw-r--r--  1 mike   125K Apr 19 22:34 common-microsyntaxes.html
-rw-r--r--  1 mike   124K Apr 19 22:34 global-attributes.html
-rw-r--r--  1 mike   119K Apr 19 22:34 grouping-content.html
-rw-r--r--  1 mike   117K Apr 19 22:34 xhtml.html
-rw-r--r--  1 mike   113K Apr 19 22:34 system-state.html
-rw-r--r--  1 mike   112K Apr 19 22:34 index.html
-rw-r--r--  1 mike   108K Apr 19 22:34 browsers.html
-rw-r--r--  1 mike   108K Apr 19 22:34 forms.html
-rw-r--r--  1 mike   106K Apr 19 22:34 custom-elements.html
-rw-r--r--  1 mike    98K Apr 19 22:34 window-object.html
-rw-r--r--  1 mike    88K Apr 19 22:34 web-messaging.html
-rw-r--r--  1 mike    86K Apr 19 22:34 common-dom-interfaces.html
-rw-r--r--  1 mike    85K Apr 19 22:34 structured-data.html
-rw-r--r--  1 mike    81K Apr 19 22:34 dom.html
-rw-r--r--  1 mike    80K Apr 19 22:34 video-and-audio.html
-rw-r--r--  1 mike    79K Apr 19 22:34 syntax.html
-rw-r--r--  1 mike    75K Apr 19 22:34 content-models.html
-rw-r--r--  1 mike    73K Apr 19 22:34 embedded-content.html
-rw-r--r--  1 mike    69K Apr 19 22:34 network.html
-rw-r--r--  1 mike    66K Apr 19 22:34 microdata.html
-rw-r--r--  1 mike    64K Apr 19 22:34 animations-and-imagebitmap.html
-rw-r--r--  1 mike    63K Apr 19 22:34 server-sent-events.html
-rw-r--r--  1 mike    56K Apr 19 22:34 selectors.html
-rw-r--r--  1 mike    55K Apr 19 22:34 map-and-area.html
-rw-r--r--  1 mike    54K Apr 19 22:34 webstorage.html
-rw-r--r--  1 mike    53K Apr 19 22:34 elements.html
-rw-r--r--  1 mike    53K Apr 19 22:34 dynamic-markup-insertion.html
-rw-r--r--  1 mike    44K Apr 19 22:34 conformance-for-authors.html
-rw-r--r--  1 mike    40K Apr 19 22:34 timers-and-user-prompts.html
-rw-r--r--  1 mike    39K Apr 19 22:34 origin.html
-rw-r--r--  1 mike    39K Apr 19 22:34 iana.html
-rw-r--r--  1 mike    39K Apr 19 22:34 references.html
-rw-r--r--  1 mike    37K Apr 19 22:34 quick-intro-to-html.html
-rw-r--r--  1 mike    37K Apr 19 22:34 common-idioms.html
-rw-r--r--  1 mike    35K Apr 19 22:34 acknowledgements.html
-rw-r--r--  1 mike    34K Apr 19 22:34 sandboxing.html
-rw-r--r--  1 mike    34K Apr 19 22:34 edits.html
-rw-r--r--  1 mike    32K Apr 19 22:34 infrastructure.html
-rw-r--r--  1 mike    30K Apr 19 22:34 introduction.html
-rw-r--r--  1 mike    28K Apr 19 22:34 svg-and-mathml.html
-rw-r--r--  1 mike    27K Apr 19 22:34 fetching-resources.html
-rw-r--r--  1 mike    26K Apr 19 22:34 conformance-classes.html
-rw-r--r--  1 mike    26K Apr 19 22:34 urls.html
-rw-r--r--  1 mike    26K Apr 19 22:34 structure.html
-rw-r--r--  1 mike    22K Apr 19 22:34 extensibility.html
-rw-r--r--  1 mike    21K Apr 19 22:34 comms.html
-rw-r--r--  1 mike    18K Apr 19 22:34 semantics.html
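
For anyone wanting to repeat this check against their own build output, the following is a minimal Python sketch that lists files above a size cutoff, largest first. The function name `files_over` and the 500KB cutoff in the usage note are illustrative choices, not part of html-build or wattsi.

```python
import os


def files_over(directory, limit_bytes):
    """Return (name, size) pairs for regular files larger than limit_bytes, largest first."""
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and anything else that is not a regular file.
            if entry.is_file():
                size = entry.stat().st_size
                if size > limit_bytes:
                    found.append((entry.name, size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling `files_over("output/multipage", 500 * 1024)` on a directory like the one listed above would report only the files still over 500KB; the path is an assumption based on the prompt shown in the listing.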

@annevk
Member

annevk commented Apr 19, 2017

-rw-r--r-- 1 mike 2.7M Apr 19 22:34 fragment-links.json

Wow, we really need something better than this. A server-side oracle in the form of a Python script that just keeps this all in memory?
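
A rough sketch of what such an in-memory oracle could look like follows. It assumes, without confirmation from this thread, that fragment-links.json is a flat JSON object mapping each fragment ID to the name of the page that contains it; the class name `FragmentOracle`, the method names, and the `/multipage/` path prefix are hypothetical.

```python
import json


class FragmentOracle:
    """Keeps the fragment-to-page mapping in memory and answers redirect lookups."""

    def __init__(self, mapping):
        self._pages = dict(mapping)

    @classmethod
    def from_json(cls, text):
        # Assumption: the JSON is a flat {"fragment-id": "page-name"} object.
        return cls(json.loads(text))

    def redirect_for(self, fragment):
        """Return the multipage URL path for a fragment, or None if it is unknown."""
        page = self._pages.get(fragment)
        if page is None:
            return None
        return f"/multipage/{page}.html#{fragment}"


oracle = FragmentOracle.from_json('{"attr-fe-disabled": "form-control-infrastructure"}')
print(oracle.redirect_for("attr-fe-disabled"))
```

With this design the client would send only the fragment it needs resolved instead of downloading the whole 2.7M mapping.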

@annevk
Member

annevk commented Apr 19, 2017

FWIW, I think you have convinced me, but I also think that maybe we should prioritize fragment-links.js and dfn.js over splitting disruption since they'll become even more of a bottleneck.

@domenic
Member

domenic commented Apr 19, 2017

I want to respond to the rest of the thread soon, but before I forget,

(the redirects tend to break somehow)

I don't think this is super-accurate. We had a brief one-week-ish period where they were broken as we were re-jiggering fragment-links.json, which may have damaged your trust in the redirect system, but apart from that period it seems 100% reliable to me.

@annevk
Member

annevk commented Apr 19, 2017

Maybe it's just something in Firefox. https://html.spec.whatwg.org/multipage/#the-worker's-lifetime for instance doesn't work. It's probably something between the address bar doing weird things with special URL code points and the lookup script not decoding them or some such.
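
The suspected failure mode can be sketched in Python: if the address bar percent-encodes the apostrophe and the lookup compares the raw string, the key will not match. This only illustrates the guess above and is not a confirmed diagnosis of the redirect script; `normalize_fragment` is a hypothetical helper name.

```python
from urllib.parse import quote, unquote


def normalize_fragment(raw):
    """Drop a leading '#' and percent-decode the rest before any lookup."""
    if raw.startswith("#"):
        raw = raw[1:]
    return unquote(raw)


# What the fragment may look like after a browser percent-encodes the apostrophe.
encoded = "#" + quote("the-worker's-lifetime")
print(encoded)
print(normalize_fragment(encoded))
```

Decoding before the lookup means the encoded and unencoded forms of the fragment resolve to the same key.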

@sideshowbarker
Contributor Author

I also think that maybe we should prioritize fragment-links.js and dfn.js over splitting disruption since they'll become even more of a bottleneck

Yeah if we are going to have multipage at all, I think the lack of dfn.js in multipage is a bigger pain point for readers who are trying to actually use multipage as an information tool, closer to the way they can with the single-page version.

And thinking about the fact that the even-more-common use case for the single-page version is to do full-text search of the whole spec, I realized it would be pretty useful to also add some kind of full-text search utility to multipage output. So #2565 is an attempt at doing that.

@domenic
Member

domenic commented Apr 21, 2017

I guess I am also convinced of the general idea of splitting more. But let me suggest a slightly less aggressive split. I will do so via uploading a patch onto this branch since that's easier to illustrate things, but please feel free to debate it.

@domenic
Member

domenic commented Apr 21, 2017

With my patch here is the split:

$ ls -s -S -w 1 -h
total 13M
2.7M fragment-links.json
628K parsing.html
548K media.html
524K input.html
436K canvas.html
384K indices.html
324K form-control-infrastructure.html
312K webappapis.html
300K dom.html
256K rendering.html
244K form-elements.html
224K named-characters.html
216K images.html
204K semantics.html
204K text-level-semantics.html
192K microdata.html
188K obsolete.html
188K offline.html
188K interaction.html
184K links.html
184K browsing-the-web.html
184K infrastructure.html
180K iframe-embed-object.html
176K tables.html
164K workers.html
164K scripting.html
152K dnd.html
148K history.html
144K interactive-elements.html
144K sections.html
128K common-microsyntaxes.html
120K xhtml.html
120K grouping-content.html
116K system-state.html
112K index.html
112K browsers.html
108K forms.html
108K custom-elements.html
104K introduction.html
100K window-object.html
 92K web-messaging.html
 88K common-dom-interfaces.html
 88K structured-data.html
 84K semantics-other.html
 80K syntax.html
 76K embedded-content.html
 72K web-sockets.html
 64K imagebitmap-and-animations.html
 64K origin.html
 64K server-sent-events.html
 56K image-maps.html
 56K webstorage.html
 56K dynamic-markup-insertion.html
 44K urls-and-fetching.html
 40K timers-and-user-prompts.html
 40K iana.html
 40K references.html
 36K acknowledgements.html
 36K edits.html
 32K embedded-content-other.html
 24K comms.html

Contributor Author

@sideshowbarker sideshowbarker left a comment

LGTM

@domenic domenic force-pushed the sideshowbarker/multipage-splitting-improve branch from 2120ca3 to 91a5084 Compare May 1, 2017 17:56
@domenic
Member

domenic commented May 2, 2017

Some offline discussion with @annevk concluded that we should not merge this until we have multipage dfn.js working (which is in progress!! whatwg/wattsi#46). As such, tagging it as "do not merge yet".

@domenic domenic added the "do not merge yet" label (Pull request must not be merged per rationale in comment) May 2, 2017
inikulin pushed a commit to HTMLParseErrorWG/html that referenced this pull request May 9, 2017
inikulin pushed a commit to HTMLParseErrorWG/html that referenced this pull request May 9, 2017
@sideshowbarker
Contributor Author

Some offline discussion with @annevk concluded that we should not merge this until we have multipage dfn.js working (which is in progress!! whatwg/wattsi#46). As such, tagging it as "do not merge yet"

Now that we got multipage dfn.js merged, seems like this is ready to merge at this point too?

@zcorpan zcorpan removed the "do not merge yet" label (Pull request must not be merged per rationale in comment) Jun 8, 2017
@domenic domenic merged commit 7ad824b into master Jun 8, 2017
@domenic domenic deleted the sideshowbarker/multipage-splitting-improve branch June 8, 2017 21:13
alice pushed a commit to alice/html that referenced this pull request Jan 8, 2019
36degrees added a commit to 36degrees/content that referenced this pull request Jul 2, 2021
The current link (https://html.spec.whatwg.org/multipage/forms.html#attr-input-disabled) does not point to an anchor that exists on the page. The term 'disabled' does not appear within `forms.html` at all.

I think the correct destination should be https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#attr-fe-disabled.

I think this has been incorrect since the multipage splitting was changed in whatwg/html#2552 when `forms.html` was split up and `form-control-infrastructure.html` was introduced.
ericwbailey pushed a commit to mdn/content that referenced this pull request Jul 2, 2021
The current link (https://html.spec.whatwg.org/multipage/forms.html#attr-input-disabled) does not point to an anchor that exists on the page. The term 'disabled' does not appear within `forms.html` at all.

I think the correct destination should be https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#attr-fe-disabled.

I think this has been incorrect since the multipage splitting was changed in whatwg/html#2552 when `forms.html` was split up and `form-control-infrastructure.html` was introduced.
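
Broken deep links like the one described above can be caught by checking whether a URL's fragment matches an `id` on the target page. The sketch below does that for HTML that has already been fetched; `IdCollector` and `anchor_exists` are hypothetical names, and the sample markup is a stand-in, not the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class IdCollector(HTMLParser):
    """Gathers every id attribute value found in a page."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def anchor_exists(url, page_html):
    """Return True if the URL has no fragment or the fragment matches an id in page_html."""
    fragment = urldefrag(url).fragment
    if not fragment:
        return True
    collector = IdCollector()
    collector.feed(page_html)
    return fragment in collector.ids


# Stand-in markup for illustration only.
page = '<dfn id="attr-fe-disabled">disabled</dfn>'
print(anchor_exists("https://html.spec.whatwg.org/multipage/forms.html#attr-input-disabled", page))
```

Running a check like this over outbound spec links after a split would flag anchors that moved to a different file.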

4 participants