Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send custom serializer through options object to enable XHTML output #605

Closed
danburzo opened this issue Jul 24, 2020 · 6 comments
Closed

Comments

@danburzo
Copy link
Contributor

danburzo commented Jul 24, 2020

I'm trying to bundle some web pages filtered through Readability as EPUB, and the format requires the content to be XHTML. However, an element's .innerHTML returns a HTML, not XHTML serialization.

I would like to propose a serializer option, that defaults to serializer: el => el.innerHTML but which can be swapped to serializer: (new window.XMLSerializer()).serializeToString when returning data here:

readability/Readability.js

Lines 2056 to 2065 in 52ab9b5

return {
title: this._articleTitle,
byline: metadata.byline || this._articleByline,
dir: this._articleDir,
content: articleContent.innerHTML,
textContent: textContent,
length: textContent.length,
excerpt: metadata.excerpt,
siteName: metadata.siteName || this._articleSiteName
};

@gijsk
Copy link
Contributor

gijsk commented Jul 24, 2020

I don't really understand - Readability doesn't control the DOM's innerHTML output itself. This depends on the DOM implementation used for the DOM you pass readability, right?

Unfortunately readability has to be able to work with a simulated DOM so we cannot assume window or XMLSerializer exists.

@gijsk
Copy link
Contributor

gijsk commented Jul 24, 2020

Put differently: can you solve your problem by making sure the DOM you pass readability is an XHTML DOM (as implemented by jsdom or similar) ?

@danburzo
Copy link
Contributor Author

danburzo commented Jul 24, 2020

Unfortunately, as far as I can tell, jsdom can only decide at instantiation time whether a document should be treated as HTML or XML/XHTML via the contentType option. However, loading a normal HTML as XML/XHTML understandably fails due to syntax mismatch. Therefore it's only at serialization-time that we can create the XHTML representation.

For a HTML document, .innerHTML returns the content formatted as HTML (which is invalid XHTML, and causes the EPUB to stop displaying content). A solution to this problem is to allow Readability to get a custom serializer for the content:

return { 
   title: this._articleTitle, 
   byline: metadata.byline || this._articleByline, 
   dir: this._articleDir, 
   content: options.serializer ? options.serializer(articleContent) : articleContent.innerHTML, 
   textContent: textContent, 
   length: textContent.length, 
   excerpt: metadata.excerpt, 
   siteName: metadata.siteName || this._articleSiteName 
}

And call Readability with the appropriate option:

let dom = new JSDOM('content');
let serializer = new dom.window.XMLSerializer();
new Readability(dom.window.document, {
  serializer: el => serializer.serializeToString(el)
});

There may be a way to hijack the innerHTML implementation to point to XMLSerializer, but I'm not sure what effect it has on Readability's function, since it's used throughout the code.

Additionally, if there really is no use for decoupling the serialization from the original DOM, I think an inefficient but workable solution is to first obtain the HTML DOM, serialize all of it as XHTML, then re-parse it as XHTML to pass it to Readability...

Later edit: Updated code snippet.

@danburzo
Copy link
Contributor Author

Additionally, if there really is no use for decoupling the serialization from the original DOM

In all honesty, up until I set to produce EPUBs by hand, I had not been aware of the XHTML requirement — it's only when I opened the file in Apple Books that I saw the error:

Screenshot 2020-07-24 at 22 31 06

That being said, I don't think producing EPUBs is a far-fetched use-case for Readability in a Node.js environment.

@gijsk
Copy link
Contributor

gijsk commented Jul 24, 2020

We assign back to innerHTML in some cases inside readability, so only changing the serializer (and not the parser) gets messy, I suspect...

Unfortunately, as far as I can tell, jsdom can only decide at instantiation time whether a document should be treated as HTML or XML/XHTML via the contentType option. However, loading a normal HTML as XML/XHTML understandably fails due to syntax mismatch. Therefore it's only at serialization-time that we can create the XHTML representation.

Right, but you could solve this just as easily from the consumer by doing something like:

let dom = new JSDOM('content');
let serializer = new dom.window.XMLSerializer();
let serializedXHTML = serializer.serialize(dom)
let xDOM = new JSDOM(serializedXHTML, {contentType: someOptionToAlwaysUseXHTML});
// then invoke readability with xDOM

It's maybe slightly less performant given you're parsing twice, but it should already work, right?

Anyway, I could see us accepting a patch, but I'm afraid I don't have cycles to write one for you...

@danburzo
Copy link
Contributor Author

I've created a PR with the proposed solution.

And, for completeness, the workaround code with the current API:

let html_dom = new JSDOM('My cat: <img>');
let xhtml_dom = new JSDOM(
  new html_dom.window.XMLSerializer().serializeToString(html_dom.window.document),
  { contentType: 'application/xhtml+xml' }
);
new Readability(xhtml_dom.window.document).parse();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants