Send custom serializer through options object to enable XHTML output #605
I don't really understand - Readability doesn't control the DOM's Unfortunately Readability has to be able to work with a simulated DOM so we cannot assume
Put differently: can you solve your problem by making sure the DOM you pass Readability is an XHTML DOM (as implemented by
Unfortunately, as far as I can tell, For an HTML document,

    return {
      title: this._articleTitle,
      byline: metadata.byline || this._articleByline,
      dir: this._articleDir,
      // Use the caller-supplied serializer when present, else fall back to innerHTML.
      content: options.serializer ? options.serializer(articleContent) : articleContent.innerHTML,
      textContent: textContent,
      length: textContent.length,
      excerpt: metadata.excerpt,
      siteName: metadata.siteName || this._articleSiteName
    }

And call Readability with the appropriate option:

    let dom = new JSDOM('content');
    let serializer = new dom.window.XMLSerializer();
    new Readability(dom.window.document, {
      serializer: el => serializer.serializeToString(el)
    });

There may be a way to hijack the Additionally, if there really is no use for decoupling the serialization from the original DOM, I think an inefficient but workable solution is to first obtain the HTML DOM, serialize all of it as XHTML, then re-parse it as XHTML to pass it to Readability...

Later edit: Updated code snippet.
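The fallback in the proposed `content` line can be tried in isolation. What follows is only a minimal sketch under stated assumptions: `serializeContent` and `fakeElement` are hypothetical names that do not exist in Readability, and a plain object stands in for a DOM element so the snippet runs without jsdom. The custom serializer here is a toy string replacement, not a real XHTML serializer.

```javascript
// Hypothetical helper (not part of Readability): resolve the serializer
// from the options object, falling back to innerHTML when none is given.
function serializeContent(articleContent, options = {}) {
  const serializer =
    typeof options.serializer === "function"
      ? options.serializer
      : (el) => el.innerHTML;
  return serializer(articleContent);
}

// Stand-in for a DOM element; an assumption made so the sketch is self-contained.
const fakeElement = { innerHTML: "<p>My cat: <img></p>" };

// No option supplied: the default innerHTML path is used.
const htmlOutput = serializeContent(fakeElement);

// Option supplied: the caller decides how the element is turned into a string.
// This toy serializer only self-closes the one <img> tag for illustration.
const customOutput = serializeContent(fakeElement, {
  serializer: (el) => el.innerHTML.replace("<img>", "<img />"),
});
```

With a real DOM, the custom function would instead delegate to an `XMLSerializer` instance, as in the snippet above.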
In all honesty, until I set out to produce EPUBs by hand, I had not been aware of the XHTML requirement; it was only when I opened the file in Apple Books that I saw the error. That being said, I don't think producing EPUBs is a far-fetched use case for Readability in a Node.js environment.
We assign back to
Right, but you could solve this just as easily from the consumer by doing something like:
It's maybe slightly less performant given you're parsing twice, but it should already work, right? Anyway, I could see us accepting a patch, but I'm afraid I don't have cycles to write one for you...
I've created a PR with the proposed solution. And, for completeness, the workaround code with the current API:

    // Parse as HTML first, then serialize the whole document as XML.
    let html_dom = new JSDOM('My cat: <img>');
    // Re-parse the serialized markup as XHTML so Readability sees an XHTML DOM.
    let xhtml_dom = new JSDOM(
      new html_dom.window.XMLSerializer().serializeToString(html_dom.window.document),
      { contentType: 'application/xhtml+xml' }
    );
    new Readability(xhtml_dom.window.document).parse();
I'm trying to bundle some web pages filtered through Readability as EPUB, and the format requires the content to be XHTML. However, an element's .innerHTML returns an HTML, not an XHTML, serialization.

I would like to propose a serializer option that defaults to

    serializer: el => el.innerHTML

but which can be swapped to

    serializer: el => new window.XMLSerializer().serializeToString(el)

when returning data here: readability/Readability.js, lines 2056 to 2065 in 52ab9b5.
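One detail worth keeping in mind when supplying a serializer: a method taken off an instance without its receiver loses `this`, and a DOM `XMLSerializer` typically throws when `serializeToString` is called that way. That is why the snippets in this thread wrap the call in an arrow function. The sketch below illustrates the difference with a hypothetical stand-in class (`StandInSerializer` is not a real API) so it runs without a DOM.

```javascript
// Hypothetical stand-in that, like a DOM XMLSerializer, depends on its receiver.
class StandInSerializer {
  constructor() {
    this.prefix = "<!-- serialized -->";
  }
  serializeToString(node) {
    // Reads instance state, so `this` must be the serializer instance.
    return this.prefix + node.markup;
  }
}

const node = { markup: "<img />" }; // stand-in for a DOM node
const instance = new StandInSerializer();

// Detached method reference: `this` is undefined inside the call.
const detached = instance.serializeToString;
let detachedFailed = false;
try {
  detached(node);
} catch (err) {
  detachedFailed = err instanceof TypeError;
}

// Two ways to keep the receiver: an arrow wrapper or an explicit bind.
const wrapped = (el) => instance.serializeToString(el);
const bound = instance.serializeToString.bind(instance);

const wrappedOutput = wrapped(node);
const boundOutput = bound(node);
```

Either the arrow wrapper or `bind` should work as the value of the `serializer` option; the arrow form matches the style used elsewhere in this thread.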