Feature request: Integrate a native PDF renderer #6861
Comments
Well, even with this library, it's a pretty massive undertaking you're talking about -- manually laying out text in a PDF. Not to mention math layout and the complexities that brings. |
This is out of scope. PDF is a different case than every other format Pandoc handles. Needing external dependencies to handle it makes perfect sense. From another perspective, all document formats that Pandoc handles require external dependencies to render.
Even markdown requires some form of text editor that handles things like line wrap (layout and typesetting), or conversion to another format for rendering. Why should PDF be any different? The only difference is expecting a pre-rendered output with layout and typesetting work done already. Just because the final viewing step is separated from the layout and typesetting steps doesn't mean it should get special treatment. Pandoc is not a layout engine and does not do typesetting. It is a document format conversion tool. Trying to make it do layout and typesetting would be wildly out of scope, out of character, and frankly just not that feasible. If you want lightweight PDF renderers that do layout and typesetting there are lots to choose from. They all have different strengths and weaknesses because this is a huge job with lots of decisions to make that are not part of the document content. Take it from someone who writes layout and typesetting tools, this is not something that should be shoehorned into Pandoc. |
Ok, I see your point. My use case would be converting from Markdown to PDF. So just some headings and text blocks. Think contracts, letters, text only ebooks, …. I'd be happy with even the most basic implementation. |
PDF files are basically just PostScript with some fancy trappings. The same argument would apply to PostScript: in order to generate it you would have to convert raw document content (the Pandoc AST) to a rendered form that has all the physical shape (layout and typesetting) done. This requires things like canvas size, fonts, text shaping, line breaking, styling, and so on and so forth. None of these things are the purview of a document conversion tool.
Okay, so use a lightweight layout engine. I don't think you realize how complex the "simple" cases you are talking about can be, but there are a number of options for doing page layout and typesetting whether from Markdown directly or from one of many formats that Pandoc converts to. |
Ok, I thought PS might also have some more high level constructs.
I think I tried out most of them by now, and all of them have some considerable issues. I guess |
I think the most promising Haskell library for this purpose is HPDF, which includes some functions that fill boxes with text. |
Well, maybe |
I fooled around a bit and got some text laid out with this:

    {-# LANGUAGE OverloadedStrings #-}
    module Main where

    import Graphics.PDF
    import qualified Data.Text as T
    import Data.List (intersperse)
    import Debug.Trace

    main :: IO ()
    main = do
      let rect = PDFRect 0 0 600 400
      Just timesRoman <- mkStdFont Times_Roman
      runPdf "test.pdf" standardDocInfo rect $ do
        theDoc timesRoman

    theDoc :: AnyFont -> PDF ()
    theDoc font = do
      page1 <- addPage Nothing
      drawWithPage page1 $ drawing font

    drawing :: AnyFont -> Draw ()
    drawing font = do
      let black = Rgb 0 0 0
      let white = Rgb 1 1 1
      -- Heading: laid out directly into a fixed rectangle.
      let hsty = Font (PDFFont font 26) white black
      let hrect = Rectangle (100 :+ 320) (500 :+ 360)
      displayFormattedText hrect NormalParagraph hsty $ heading
      -- Paragraph: broken into vboxes first, then poured into containers.
      let psty = Font (PDFFont font 16) white black
      let prect = Rectangle (100 :+ 100) (500 :+ 300)
      let vboxes = getBoxes NormalParagraph psty para
      let verstate =
            VerState { baselineskip = (12, 0.17, 0.0)
                     , lineskip = (3.0, 0.33, 0.0)
                     , lineskiplimit = 2
                     , currentParagraphStyle = NormalParagraph }
      -- Fill a first container; the boxes that did not fit come back as vboxes'.
      let (dr, newc, vboxes') = fillContainer
                                  verstate
                                  (mkContainer 50 300 100 100 1)
                                  vboxes
      dr
      trace (show $ containerContentHeight newc) (return ())
      -- Fill a second container, placed below the first, with the leftover boxes.
      let (dr', _, _) = fillContainer
                          verstate
                          (mkContainer 50 (300 - containerContentHeight newc - 20) 200 100 1)
                          vboxes'
      dr'
      -- displayFormattedText prect NormalParagraph psty $ para

    heading :: TM StandardParagraphStyle StandardStyle ()
    heading = do
      paragraph $ do
        startPara
        sequence_ $ intersperse (glue 5 2 2)
                      (map txt $ take 2 $ T.words lorem)
        endPara

    para :: TM StandardParagraphStyle StandardStyle ()
    para = do
      setJustification FullJustification
      setBaseLineSkip 20 1 1
      paragraph $ do
        startPara
        sequence_ $ intersperse (glue 4 4 4 >> txt " ") (map txt $ T.words lorem)
        endPara

    lorem :: T.Text
    lorem = "Nisi cömmodo arcu, vitae cursus neque ante sed elit. Sed sit amet erat. Phasellus luctus cursus risus. Phasellus ac felis. Proin nec eros quis ipsum pellentesque congue. Curabitur et diam sed odio accumsan cursus. Pellentesque ultricies. Quisque aliquam. Sed nisi velit, consectetuer eget, dictum ac, molestie a, magna. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Curabitur consequat leo et dui. Aenean ligula mi, dignissim ut, imperdiet tristique, interdum a, dolor."

This shows how you can fill a rectangle as much as possible, and get a list of the remaining vboxes to fill another rectangle (which is what you need to do at a page break). |
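For anyone who wants the page-break idea without the HPDF types: below is a rough, library-free sketch of the same "fill what fits, hand back the rest" loop. It does not use HPDF at all; `Height`, `fillBox`, and `paginate` are made-up names for illustration, and each vbox is reduced to just its height in points, which is a big simplification.

```haskell
-- A laid-out line, reduced to its height in points
-- (a hypothetical stand-in for a real vbox).
type Height = Double

-- | Place as many boxes as fit into a container of the given height.
-- Returns the boxes that were placed and the boxes left over.
fillBox :: Height -> [Height] -> ([Height], [Height])
fillBox _ [] = ([], [])
fillBox avail (h:hs)
  | h <= avail = let (placed, rest) = fillBox (avail - h) hs
                 in (h : placed, rest)
  | otherwise  = ([], h : hs)

-- | Break boxes into pages by filling one container after another.
-- A box taller than a whole page gets a page to itself, so the
-- recursion always consumes at least one box per page.
paginate :: Height -> [Height] -> [[Height]]
paginate _ [] = []
paginate pageHeight boxes =
  case fillBox pageHeight boxes of
    ([], b:rest) -> [b] : paginate pageHeight rest
    (page, rest) -> page : paginate pageHeight rest
```

A real implementation would also have to deal with inter-line glue, widows and orphans, and footnotes, none of which this sketch attempts.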
feels to me like re-inventing tex ... starts easy and limited and in the end we get a pandoc-latex ... |
I think it is great that there is Haskell development in this area. It is good to have options. Thank you for pointing the Haskell library out. |
I agree it's out of scope for pandoc – better to leave this concern to a separate program – and we support already quite a few pdf-engines. And automatic layout and typesetting is indeed a very difficult problem (kerning, widows, orphans, hyphenation using language dictionaries, etc. etc.), which is part of the reason TeX is still in use (of all the open source engines, it still produces the best typographic output). While pandoc happily supplies the semantic markup to those programs, people will always want to send layout instructions along as well. That's where a custom LaTeX template or CSS comes in. Personally, I feel CSS is a much nicer way to declaratively instruct a pdf engine on layout customizations – but browser vendors don't care about pages and care more about not doing too many passes (CSS flex-box takes 2 passes to lay out, CSS grid 3) than optimal typography, and the other open source implementations are currently all still somewhat lacking. Anyway, guess that's not the OP's use-case either. So yes, what's wrong with |
I think it could be in-scope, potentially. I can see the advantages to being able to render PDF without external tools. One worry is that the original creator of HPDF hasn't done anything on the project since 2016. Someone else has taken it over and seems to be maintaining it now, so maybe that's okay (though I note they've disabled issues and PRs on the repository, not a great sign). But one might worry about depending on it. In my experimentation, the main stumbling block I see is with fonts. Using the built-in Times Roman, Helvetica, and Courier (which probably only support the latin1 glyphs) is too limiting. I tried loading a type 1 font with the included functions, but had no success yet. This also requires file paths to .pfb and .afm files; we'd need something higher level that gets system fonts on all the major platforms. |
You make it sound like this would be bad. It's probably the best thing that could happen to tex 😛. |
well, don't get me wrong: i would very much welcome a pdf generator embedded in pandoc, especially since it would eliminate the need to install other tools. but in reality, this is quite an effort to do something that other (already existing) tools simply do better. my bet would be that this would start small with just a few features, but it would soon get attention and requests to do this or that and to support package xyz ... in the end, the quality of the generated pdfs will inevitably be compared to latex or other tools. i would rather stick to the good old unix tradition: a tool should do just one task and do it well. then it can be combined with other tools to achieve something bigger ... and using docker there is no need to maintain a latex installation anymore ... |
Well, I'm currently stuck on fonts. If HPDF allowed loading of TrueType fonts, then I think there'd be potential here. I can see the advantages of something that doesn't require external tools and is configurable in a simpler way than LaTeX. And my tinkering self likes the idea of controlling the whole typesetting from top to bottom. However, I can't even get type 1 fonts working, so I'm stuck. I don't know how hard it would be to add truetype to HPDF; maybe someone would like to take that on. |
One example of reinventing LaTeX is https://github.com/sile-typesetter/sile . Years ago someone mentioned they want to develop a pandoc writer to write to this language and convert to PDF. |
Interesting point from the SILE Manual:
Though looking at this screenshot, seems like SILE's typographic output quality is still somewhat lacking... |
@ickc / @tarleb Yup yup. I didn't mention it in the discussion above because it's not a candidate for a built-in typesetter in a Haskell environment, but when I say typesetting is more complicated than people think my opinion is based on considerable experience in trying to make it simpler! @mb21 Fair point, but allow me to point out that the line space glitch is a known bug (see SILE issues №560 and №860) related to floating figures. That, and its corollary dropcaps (see SILE issue №394), are very sticky issues we haven't shaved off the rough edges from yet. The "three things people actually use" (and much more besides) work pretty well and there are several publishing companies using it exclusively for book publishing workflows (including drop-caps, but with extra care!). It's also been used for Unicode proposals and other tricky stuff. |
I believe that HPDF has all of these things. The one thing it doesn't have is reasonable font handling. |
Is a port of TeX really necessary? What about a simpler text2pdf routine, with only paragraph and page breaking, without the hyphenation and box&glue, leaving the text flush left? Otherwise, I fear it would be easier to implement pandoc in LuaTeX. |
I think the only concrete option mentioned above is to use HPDF. It is not a full port, which, as you said, is unnecessary. Another thing to mention is that the ability to cross compile to JS/web assembly is nice to have. Currently people have been able to cross compile to web assembly, and people in the past have been able to compile to JavaScript when pandoc had fewer non Haskell dependencies. One issue here is tracking non Haskell dependencies. |
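To make the "simpler text2pdf" idea concrete, here is a minimal sketch of greedy, flush-left line breaking plus fixed-size page breaking. It is only an illustration: the names `wrapWords`, `breakParagraph`, and `pages` are hypothetical, and width is counted in characters, which is an assumption standing in for real font metrics.

```haskell
-- | Greedy, ragged-right line breaking: no hyphenation, no stretchable glue.
-- The current line is kept as a reversed list of words. A single word
-- longer than maxWidth is placed on a line of its own and overflows.
wrapWords :: Int -> [String] -> [String]
wrapWords maxWidth = go []
  where
    go [] []     = []
    go cur []    = [unwords (reverse cur)]
    go [] (w:ws) = go [w] ws
    go cur (w:ws)
      | length (unwords (reverse (w : cur))) <= maxWidth = go (w : cur) ws
      | otherwise = unwords (reverse cur) : go [w] ws

-- | Break one paragraph of plain text into lines.
breakParagraph :: Int -> String -> [String]
breakParagraph maxWidth = wrapWords maxWidth . words

-- | Split lines into pages of at most n lines each.
-- A non-positive n yields no pages rather than looping forever.
pages :: Int -> [String] -> [[String]]
pages n ls
  | n <= 0 || null ls = []
  | otherwise = let (p, rest) = splitAt n ls in p : pages n rest
```

Even this toy version ignores proportional fonts, headings that must stay with the following paragraph, and anything beyond plain paragraphs.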
Box and glue and hyphenation are already handled pretty well by HPDF. As noted above, the thing HPDF doesn't handle well is fonts. That's what blocks progress. EDIT: To expand on this: If you just use the latin1 character set and you don't mind using the standard fonts, HPDF is okay. But that's just not enough for pandoc. We need to support multilingual content and math characters. |
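As a rough illustration of where that limit bites, a sketch like the following could flag input that falls outside Latin-1 before attempting a standard-font render. The names `nonLatin1` and `fitsLatin1` are hypothetical, and treating "code point at most 255" as "renderable" is an assumption; the actual glyph coverage of the standard fonts may differ.

```haskell
import Data.Char (ord)

-- | Characters outside the Latin-1 range (code points above 255).
nonLatin1 :: String -> String
nonLatin1 = filter ((> 255) . ord)

-- | True when every character is within Latin-1.
fitsLatin1 :: String -> Bool
fitsLatin1 = null . nonLatin1
```

Accented Western European text passes this check, while math symbols and most non-Latin scripts do not, which is the gap described above.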
btw. depending on your use-case, you can also just export to HTML and then do print-to-PDF in a browser. That's basically what the "print" button in the PanWriter app's preview does (because it's an electron app, it ships with a browser). But yes, if for some reason you need this functionality bundled into the pandoc binary, then integrating HPDF into pandoc would probably be the best way... |
Hey, just been directed here while talking to Albert about bundled PDF support, and I'd like to also weigh in on the possibility of having a very simple built-in PDF converter. The "print"-button in Electron-apps is a subpar alternative, and standalone libraries for PDF-generation are actually really sparse. My use-case would be, since we're now bundling Pandoc with Zettlr, to offer a boiled-down PDF generation option for people who don't want to install anything additionally. And I think RStudio might also benefit from this …? Not sure though as I'm on Python … However, I see the issues with font handling (don't know Haskell, but Rust should be comparable from the pure mechanics of font-file handling) and that it's super difficult to get this working. So I can fully understand that this does not have priority, but if that comes one day, I'd be super happy! :) And, btw, to those fearing that this could open the gates to a flood of feature requests, one could wrap the "new box, new image, new page" configuration options into a small API that can be controlled using Lua scripts …? This way everyone could implement their own logic if they so wish without putting any burden here! |
Could you elaborate on that? You think the quality of the PDF is lacking (not sure if we could hope for anything better by rolling our own), or is it because of the UX of the print dialog, or...? |
It's the UX - if you want to produce a PDF you expect to press a button and be done. With Electron's print abilities, however, this requires opening a separate window, and overall I think that "printing" is not completely the same as rendering. It does work, but it's, as I said, somehow subpar. |
I would love to have a native PDF writer in pandoc. In addition to the issue with the truetype fonts, I would like to point out a few more things that will need to be considered:
Of course some of that customization can be left out, provided using filters or using external programs, but this way the advantage of a built-in implementation is slowly lost. |
I think here we enter the realm which the integrated PDF writer should (in my opinion) not fulfill. Headers and Footers, yes, a basic title page, yes, but not much above that. For that, there are already a lot of good external writers out here. Remember: This issue is about adding something so that in many simple situations you don't need a full TeX installation. I guess, in general we should have a list of stuff that should be included and what won't be included, and which could be defined as just too much, because any additional feature here will add to the maintenance costs of the team, which we shouldn't overstrain. I think @jgm and @tarleb et al. should set this (because they have a better overview over what's possible and what not).
EPS is a dead and gruesomely complicated format (source: my mother is a graphics designer, nobody uses that anymore)
Guessing from my own experience, I'd say that would be a requirement for A LOT of people. Luckily, given good fonts, that can be done using fonts only (if I remember correctly, Pandoc already does this if MathJax/KaTeX is not available …?) |
Sure, but even basic styling and content adjustment is non trivial. If the CSS route is taken, the converter will need to be able to parse it and do something meaningful with it. For every thing that is not supported, someone will open a bug about it. I am not familiar with paged media, but can I specify that I want the page number on the header and a date on the footer? If not, where would that happen? Or that I want Chapter instead of Section (or Kapitel or Chapitre)
Maybe nobody uses it in graphics design, but in other areas it could be the format used by some older specialized software. Even if EPS is out of scope, SVG is also non trivial to parse and embed in a PDF. Asciidoctor-pdf leverages a 3kloc library for rendering SVGs into PDFs, but can't use the SVGs from draw.io directly without some minor adjustments. There are also other things I forgot to mention. The converter will need to be able to layout tables with all their complications, e.g. borders, column widths, multirow or multicolumn cells, page breaks in the middle of a table, page breaks in the middle of a tall row, etc. Footnotes seem to be also non-trivial, see pagedjs or asciidoctor-pdf. Some will also expect generated table of contents or lists of tables and figures.
In my opinion "simple situations" is difficult to define. For physical sciences math is part of simple, for other disciplines it is footnotes, for legal it could be support for 2 columns and for business customizable headers and footers or generation of table of contents. I don't want to be negative, but this is a big undertaking even if the scope is limited. Anyway, maybe the maintainers have a better idea on how to move this forward. I just wanted to give some of my experience of generating PDFs using asciidoctor as a user. |
Yes, but: If we go down the CSS route it could be extremely easy IF (!) and only if there is some Blink-port or something similar. If there isn't and we would need to implement CSS parsing ourselves (rather: The pandoc team because I'm dumb when it comes to Haskell) then it's gruesome, and I would advise against it.
Except, you clearly put that into the docs. Then you can easily close as "Out of scope; use LaTeX" or similar.
Exactly, but we should not support outdated formats (just as Pandoc, for instance, is not built in 32 bit, even though it would be trivial to implement) to not foster dependence on those.
Absolutely, but one format less is one format less to implement.
I totally agree. Which is why I explicitly stated that the Pandoc team should simply make a decision and then be done with it. After all, if such a bundled PDF writer comes, it's a nice thing of the Pandoc guys, not something that would be absolutely necessary. I think they will be able to make a good decision. In the end, many people would be happy even with a very simple, lossy, PDF generation at first. We shouldn't push boundaries too much here. But, alas, I'm not able to code anything for Pandoc, so I'll leave my two-cent-comments with this and keep looking forward to a simple integrated PDF writer :) |
If you want to style with CSS, just use the If we ever do add a native PDF renderer, it's going to be somewhat basic and limited. But I think it's a nonstarter unless HPDF gets better font support, and since HPDF seems to be a more or less dead project now, I don't see much hope of that. It is also an important point that laying out footnotes and tables is nontrivial. |
I note there's a font-related change in the latest version of HPDF: |
For future reference: My preferred way of rendering PDFs is now |
This is the one big thing I'm still missing from Pandoc: An easy, cross-platform way to generate PDFs, without having to rely on any external dependencies.
I understand that this will be a massive undertaking, but I think even a simple implementation, which only supports printing some simple text or graphics, would already be really helpful.
Rasterific (https://github.com/Twinside/Rasterific) looks like it could be a good library to achieve something like this.