Make `Format` an enumerated type #547

anton-k · 2012-06-26T17:22:58Z

Format is a synonym for String. Users have to look at the source code to find out the right values for this type. (It can be "html" or "Html" or "latex" or "LaTeX" or "tex".) It's not clear which from the docs alone. Maybe it's better to define a new data type:

data Format = FormatHtml | FormatTex | ...
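
A minimal sketch of how such a type could be used, assuming a hypothetical three-constructor list and a hypothetical `parseFormat` helper (neither is taken from pandoc). The point is that a misspelled name is rejected at the boundary instead of travelling through the program as an arbitrary String:

```haskell
import Data.Char (toLower)

-- Hypothetical enumeration; a real one would list many more formats.
data Format = FormatHtml | FormatLatex | FormatTex
  deriving (Show, Eq, Ord, Enum, Bounded)

-- Accept the spellings mentioned above, ignoring case.
-- Anything else yields Nothing rather than an unchecked String.
parseFormat :: String -> Maybe Format
parseFormat s = case map toLower s of
  "html"  -> Just FormatHtml
  "latex" -> Just FormatLatex
  "tex"   -> Just FormatTex
  _       -> Nothing
```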


mpickering · 2014-11-07T19:02:28Z

+1 to this

ghost · 2015-03-04T08:28:19Z

I couldn't find any actual type synonym definition for Format.
I used ag -s type | ag -s Format and the results were useless.

Is the type synonym referred to here a philosophical one? Or has it been fixed already?

timtylin · 2015-03-04T08:45:46Z

Format is defined in pandoc-types, specifically in Text.Pandoc.Definition.

Note that it already has an "it just works" instance of Eq that ignores case:

newtype Format = Format String
               deriving (Read, Show, Typeable, Data, Generic)

instance IsString Format where
  fromString f = Format $ map toLower f

instance Eq Format where
  Format x == Format y = map toLower x == map toLower y
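
To see what the case-insensitive instance does and does not cover, here is a self-contained sketch that copies the definitions above, with the deriving list trimmed to `Show` so that no extra imports are needed. The `sameFormat` helper is hypothetical and only there for illustration:

```haskell
import Data.Char (toLower)
import Data.String (IsString (..))

newtype Format = Format String
  deriving (Show)

-- Values built via fromString are lowercased on construction.
instance IsString Format where
  fromString f = Format $ map toLower f

-- Comparison ignores case even for values built with the constructor directly.
instance Eq Format where
  Format x == Format y = map toLower x == map toLower y

-- Illustrative helper: compare two raw spellings.
sameFormat :: String -> String -> Bool
sameFormat a b = Format a == Format b
```

Under these definitions `sameFormat "LaTeX" "latex"` holds, but `sameFormat "tex" "latex"` does not, so the instance handles capitalization and nothing else.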

I seem to recall that the reason Format is a String is mainly due to the way extensions are specified, where they are just concatenated onto Format using the + and - char. This is not just the way the CLI works, but also how the module itself exposes getReader and getWriter (i.e., the formats get passed through Text.Pandoc.parseFormatSpec).

I mean it's arguably a hack, but changing it to an actual sum type (in fact a product type when you consider the set of extensions) will definitely break backward compatibility, so this is most likely going to be a 2.0 thing.

ghost · 2015-03-04T09:47:18Z

Oh, so it's a different package. Thanks for the info.
I'll drop it for the time being.

jgm · 2015-03-04T17:07:00Z

+++ Tim T.Y. Lin [Mar 04 15 00:45 ]:

> I seem to recall that the reason Format is a String is mainly due to the way extensions are specified, where they are just concatenated onto Format using the + and - char. This is not just the way the CLI works, but also how the module itself exposes getReader and getWriter (i.e., the formats get passed through Text.Pandoc.parseFormatSpec).

No, the extensions are not part of the string on Format.

> I mean it's arguably a hack, but changing it to an actual sum type (in fact a product type when you consider the set of extensions) will definitely break backward compatibility, so this is most likely going to be a 2.0 thing.

Right. In principle, a sum type would be better. However, it's a big
change and would break lots of existing filters, so it's not clear it's
worth it.

timtylin · 2015-03-04T21:34:22Z

> No, the extensions are not part of the string on Format.

Right, it turns out getReader/getWriter take a String and not a Format. Well, that just makes it even more inconsequential then. Does anything directly use Format when interfacing with Pandoc, other than filters written in Haskell?

tarleb · 2019-02-05T07:28:16Z

One question is whether it should be possible to pass a custom Format, or whether Format can only contain known formats. I.e., should we use

data Format = Markdown | Docx | ReStructuredText | …

or rather

data KnownFormat = Markdown | Docx | ReStructuredText | …

data Format = Format KnownFormat
            | CustomFormat String
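
To make the difference between the two variants concrete, here is a rough sketch. The three-constructor list, the name table in `knownName`, and both parse functions are hypothetical choices for this illustration, not pandoc's:

```haskell
import Data.Char (toLower)

-- Hypothetical, deliberately short list of known formats.
data KnownFormat = Markdown | Docx | ReStructuredText
  deriving (Show, Eq, Ord, Enum, Bounded)

data Format
  = Format KnownFormat
  | CustomFormat String
  deriving (Show, Eq)

-- Names chosen for this sketch only.
knownName :: KnownFormat -> String
knownName Markdown         = "markdown"
knownName Docx             = "docx"
knownName ReStructuredText = "rst"

-- Finite variant: unknown names are rejected.
parseKnown :: String -> Maybe KnownFormat
parseKnown s =
  lookup (map toLower s) [(knownName f, f) | f <- [minBound .. maxBound]]

-- Extensible variant: unknown names survive as CustomFormat.
parseFormat :: String -> Format
parseFormat s = maybe (CustomFormat s) Format (parseKnown s)
```

With the finite variant the caller has to decide what to do with `Nothing`; with the extensible one the decision is pushed to whoever later pattern-matches on `CustomFormat`.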

I can see arguments for both variants; most arguments in favor of a finite sum type are given above. On the other hand, we'd limit users in their ability to pass format information to filters, custom writers, and programs built on top of pandoc's library.

Personally, I lean towards a finite sum type, as I feel the advantages outweigh the slight loss in flexibility. The only real problem I see is how to handle unknown format specifications during parsing: Should those be turned into a default format, or maybe a code block?

jgm · 2019-02-06T17:44:33Z

I'm not sure about the finite vs extensible question, but like you I lean towards finite.
The obvious approach would be to just omit raw content with an unknown format, with a log warning.
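
A sketch of that approach on a toy AST, assuming a reader first produces blocks with unvalidated format names. All types and function names here are hypothetical stand-ins, not pandoc-types definitions, and the "log" is just a list of warning strings:

```haskell
import Data.Either (partitionEithers)

data KnownFormat = Html | Latex
  deriving (Show, Eq)

data Block
  = Para String
  | RawBlock KnownFormat String
  deriving (Show, Eq)

-- What a reader might hold before the format name is validated.
data Parsed
  = ParsedPara String
  | ParsedRaw String String   -- format name, raw content

parseKnown :: String -> Maybe KnownFormat
parseKnown "html"  = Just Html
parseKnown "latex" = Just Latex
parseKnown _       = Nothing

-- Keep blocks with a known format; turn the rest into a warning.
resolve :: Parsed -> Either String Block
resolve (ParsedPara t) = Right (Para t)
resolve (ParsedRaw name body) = case parseKnown name of
  Just f  -> Right (RawBlock f body)
  Nothing -> Left ("dropping raw block with unknown format: " ++ name)

-- Returns (warnings, surviving blocks), each in document order.
resolveAll :: [Parsed] -> ([String], [Block])
resolveAll = partitionEithers . map resolve
```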

If we're thinking about this question, I think we might want to address a bigger issue about raw blocks. This has come up with ipynb. Jupyter notebook code cells will often generate output in multiple formats: for example, a table might be produced in text/html and text/plain. The plain version is a fallback, so if you're converting to HTML, the HTML version will be used; if to LaTeX, the fallback would be to include the plain text version in a verbatim environment.

It's tough to handle this properly in pandoc. Given that the behavior of the reader is supposed to be independent of the writer, we can either (a) include both the HTML version as a raw block and the plain text version as a code block, with the result that you'll see TWO versions of the table when it's converted to HTML, or (b) just include the HTML version, with the result that there will be no fallback when it's converted to LaTeX or other formats. Neither choice is good, and this makes it impossible to fully emulate nbconvert.

One thing that would help here would be an AST element that includes content conditionally on the format. Something like this:

[ IfFormat HTML [RawBlock "<table>..."]
, IfFormat LaTeX [CodeBlock "..."]
]

With this kind of structure one could remove the Format specifier from the RawBlock itself.
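
As a rough illustration of how a writer-side pass could resolve such elements, here is a sketch on a toy AST. The `Block` type, the `resolveFor` function and the `example` value are hypothetical and much smaller than anything in pandoc-types:

```haskell
data Format = HTML | LaTeX | Markdown
  deriving (Show, Eq)

data Block
  = RawBlock String          -- no Format field, as suggested above
  | CodeBlock String
  | IfFormat Format [Block]  -- contents kept only for this output format
  deriving (Show, Eq)

-- Resolve conditionals for a given output format, recursively.
resolveFor :: Format -> [Block] -> [Block]
resolveFor out = concatMap go
  where
    go (IfFormat f bs)
      | f == out  = resolveFor out bs
      | otherwise = []
    go b = [b]

-- The structure from the example above.
example :: [Block]
example =
  [ IfFormat HTML [RawBlock "<table>..."]
  , IfFormat LaTeX [CodeBlock "..."]
  ]
```

In this sketch `resolveFor Markdown example` is empty: with exact-match conditions, any output format that is not listed gets nothing at all.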

But thinking about the fallback part of this, one sees a need for format specifications that encompass multiple formats, like HTML OR Markdown or NOT(HTML OR Markdown). (Format could perhaps be a Boolean algebra, https://hackage.haskell.org/package/cond-0.4.1/candidate/docs/Data-Algebra-Boolean.html)
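
One way to picture such specifications is a small expression type with an evaluator. This is only a sketch with hypothetical names (`FormatSpec`, `matches`, `rich`, `fallback`); it does not use the Data.Algebra.Boolean class linked above:

```haskell
data Format = HTML | LaTeX | Markdown | Plain
  deriving (Show, Eq)

-- A small expression language over formats.
data FormatSpec
  = Is Format
  | Or FormatSpec FormatSpec
  | And FormatSpec FormatSpec
  | Not FormatSpec
  | Any
  deriving (Show, Eq)

-- Does the output format satisfy the specification?
matches :: FormatSpec -> Format -> Bool
matches (Is f)    out = f == out
matches (Or a b)  out = matches a out || matches b out
matches (And a b) out = matches a out && matches b out
matches (Not a)   out = not (matches a out)
matches Any       _   = True

-- "HTML OR Markdown" and its complement, the fallback case.
rich, fallback :: FormatSpec
rich     = Is HTML `Or` Is Markdown
fallback = Not rich
```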

mb21 · 2019-06-15T13:33:55Z

> Jupyter notebook code cells will often generate output in multiple formats

Could you give a couple more examples? Is the fallback always plain-text? Or are the fallbacks at least ordered? Like try html but if you cannot do that try some format and if all else fails try plain text?

Just a thought: instead of going with a whole boolean algebra, the ipynb reader could also put in a Raw "ipynb" ... and then we would put in a pandoc filter (which would know what the input and output formats are) that does the right thing. But yeah, maybe that's not actually better.

jgm · 2019-06-15T17:18:04Z

What I ended up doing is putting a little filter filterIpynbOutput in T.P.App; if --ipynb-output=best is selected, this tries to determine the best raw block to use, given the output format, and strips the others. So, a bit like your idea.
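
The general idea of picking the best alternative for a given output format can be sketched as follows. This is not the actual filterIpynbOutput code; the types, the `preference` table and the `best` function are assumptions made for the illustration:

```haskell
import Data.Maybe (listToMaybe)

data Format = HTML | LaTeX | Plain
  deriving (Show, Eq)

-- One representation of a notebook output: its format and its content.
type Alternative = (Format, String)

-- Preference order per output format (an assumption for this sketch).
preference :: Format -> [Format]
preference HTML  = [HTML, Plain]
preference LaTeX = [LaTeX, Plain]
preference Plain = [Plain]

-- Pick the first alternative, in preference order, that is present.
best :: Format -> [Alternative] -> Maybe Alternative
best out alts =
  listToMaybe [ (f, body) | f <- preference out, (g, body) <- alts, f == g ]
```

For an output offered as HTML and plain text, `best LaTeX` falls back to the plain version while `best HTML` keeps the HTML one.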

despresc · 2020-09-03T04:00:53Z

A few thoughts on a new Format type.

Having a Formats algebra to specify ranges of formats like in the stalled pull request is a good idea, as is having something like IfFormatBlock and IfFormatInline constructs (from this comment). I don't think the If* constructors remove the need for the Format in the Raw* constructors, though, based on current usage. In Writers.Markdown, as an example, the format of a RawBlock influences how it's rendered, not just whether or not it's rendered.

One outline of a design is to include something like this in pandoc-types:

module Text.Pandoc.Format where

-- Absolutely anything that might occur in Format right now is included. Requires a look through
-- the pandoc code base to get everything, I think.
data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3 | ...
  deriving (..., Enum, Bounded)

-- The Formats boolean algebra is just the normal one for Set Format.
newtype Formats = Formats (Set Format)

-- As a format specifier or selector, Formats x means "any of the formats in x".
matchesFormat :: Formats -> Format -> Bool
(Formats s) `matchesFormat` f = f `Set.member` s

anyOf :: [Format] -> Formats
anyOf = Formats . Set.fromList

anyFormat :: Formats
anyFormat = anyOf [minBound..maxBound]

notFormat :: Formats -> Formats
notFormat (Formats s) = Formats $ t `Set.difference` s
  where Formats t = anyFormat

-- and various other boolean operations on Formats
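
For anyone who wants to try this out, here is a compilable version of the outline with a shortened, hypothetical constructor list standing in for the elided one. The names `orFormat` and `andFormat` are likewise hypothetical examples of the other boolean operations:

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Shortened constructor list, for illustration only.
data Format = HTML | HTML5 | EPUB | LaTeX
  deriving (Show, Eq, Ord, Enum, Bounded)

newtype Formats = Formats (Set Format)
  deriving (Show, Eq)

matchesFormat :: Formats -> Format -> Bool
matchesFormat (Formats s) f = f `Set.member` s

anyOf :: [Format] -> Formats
anyOf = Formats . Set.fromList

anyFormat :: Formats
anyFormat = anyOf [minBound .. maxBound]

-- Complement relative to the full set of formats.
notFormat :: Formats -> Formats
notFormat (Formats s) = Formats (t `Set.difference` s)
  where Formats t = anyFormat

-- Union and intersection as "or" and "and".
orFormat, andFormat :: Formats -> Formats -> Formats
orFormat  (Formats a) (Formats b) = Formats (a `Set.union` b)
andFormat (Formats a) (Formats b) = Formats (a `Set.intersection` b)
```

Because everything reduces to set operations over a bounded enumeration, the usual laws (double negation, De Morgan) hold by construction.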

The Format type supports a sub-format relation, where x is a sub-format of y if a raw element of format x can always be included in an output format y. This (with helper functions) should make it easier to figure out when IfFormat* and Raw* elements should be rendered. The two functions below should represent that relation, the actual definitions requiring a look through pandoc to make sure they're accurate.

-- List the sub-formats of the given format
includesFormats :: Format -> Formats
includesFormats HTML = anyOf [HTML, HTML4, HTML5, EPUB, EPUB2, EPUB3]
includesFormats HTML5 = anyOf [HTML5, EPUB3]
includesFormats EPUB = anyOf [EPUB, EPUB2, EPUB3]
-- etc.

-- List the super-formats of the given format
includedByFormats :: Format -> Formats
includedByFormats HTML = anyOf [HTML]
includedByFormats HTML5 = anyOf [HTML, HTML5]
includedByFormats EPUB = anyOf [HTML, EPUB]
-- etc.
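
Since the two functions describe the same relation from opposite directions, one option is to write only one of them by hand and compute the other. The sketch below does that with plain `Set Format` values instead of `Formats`. The names `includes`, `includedBy` and `isSubFormatOf` are hypothetical, and the catch-all case (a format includes only itself) is an assumption, not something checked against pandoc:

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3
  deriving (Show, Eq, Ord, Enum, Bounded)

-- Sub-formats of a format, following the entries listed above.
includes :: Format -> Set Format
includes HTML  = Set.fromList [HTML, HTML4, HTML5, EPUB, EPUB2, EPUB3]
includes HTML5 = Set.fromList [HTML5, EPUB3]
includes EPUB  = Set.fromList [EPUB, EPUB2, EPUB3]
includes f     = Set.singleton f  -- assumption for the cases not listed

-- Super-formats, obtained by inverting `includes`,
-- so the two directions cannot drift apart.
includedBy :: Format -> Set Format
includedBy x =
  Set.fromList [ y | y <- [minBound .. maxBound], x `Set.member` includes y ]

-- x is a sub-format of y.
isSubFormatOf :: Format -> Format -> Bool
x `isSubFormatOf` y = x `Set.member` includes y
```

The derived `includedBy` agrees with the three hand-written super-format entries above for HTML, HTML5 and EPUB.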

It would be simpler to have only concrete, fully-specified formats in Format (and maybe consolidate formats that are indistinguishable from each other), but that would complicate things for Writers.Markdown, which needs to be able to render a Format when writing a RawBlock or RawInline. That also means that Format can't easily be replaced by Formats in those constructors.

Having a "big" Format type should at least allow it to be used in places where Text is used currently, like reader specification in Reader.readers, or default extension selection in Extensions.

despresc · 2020-09-03T16:57:39Z

Currently, Format is used only by the writers to figure out how to render a RawBlock and RawInline. I have noticed a couple of things in pandoc that have implications for the sub-format relation:

- All of the markdown* formats are equivalent to each other in the sub-format sense, in that any raw element in one markdown* format can always be included in the output for any other. The only way they differ seems to be in choosing default extensions (and that happens via a Text string, not a Format).
- Many output formats like commonmark*, epub*, slideous, and so on, are not related to any other format in the sub-format sense, even themselves: they are never included in any output at all.

If Format is to be used in more places, it might be helpful also to have a

toConcreteFormat :: Format -> Format
toConcreteFormat HTML = HTML5
toConcreteFormat HTML5 = HTML5
toConcreteFormat EPUB = EPUB3
-- etc.

that takes under-specified formats and chooses a default concrete one for them, like the --to option currently does.

despresc · 2020-09-03T21:47:49Z

Maybe a better way to define a sub-format is to say that x is a sub-format of y if whenever a raw element of format y can be included somewhere, a raw element of format x can be included in the same place and in the same way.

mb21 · 2020-09-04T08:25:34Z

I have the feeling there are a few different "sub-format" relations:

- you have the RawInline and RawBlock AST elements, which enable you to include raw snippets of format X when doing -t X, but also to include tex in markdown, for example (while other writers would drop raw tex)
- same writer, different extensions enabled:
  - e.g. -t markdown_phpextra
  - similarly, when doing pandoc -t html, it's a synonym for -t html5
- different writer, but it uses another writer:
  - when doing -t epub, there is an epub writer, which however calls the html writer
  - -t pdf uses either the latex or the html writer
  - -t odt basically zips what -t opendocument would produce AFAIK

despresc · 2020-09-04T19:19:17Z

Yes, I think there are a few relevant relations. There are:

- the Raw* one, so that writers can test how a Raw* element should be included, if at all
- the IfFormatBlock and IfFormatInline one, once they exist, so that conditional rendering happens properly
- the --to one, where some formats are aliases for other formats

I think jgm/pandoc-types#78 deals with the first two. The -t one can be solved by making sure Writers.writers is kept up-to-date, and maybe writing a toConcreteFormat :: Format -> Format function.

I think the writers using other writers as intermediates sorts itself out naturally from the perspective of the first two relations, based on the current pandoc behaviour. Right now it's stated in the manual that raw blocks need to use an html* format to be included in epub* output, and that the format to be included in -t pdf is whatever the engine is, so I think there's an expectation that the format used to render the document initially won't be the same as the final output format.

The formats representing different extensions problem should also hopefully be solved in that pull request, for instance by considering all the markdown* formats to be sub-formats of each other.

mpickering added the complexity:low label Dec 7, 2014

jgm mentioned this issue Jan 2, 2015: Task list (e.g. for GSOC) #1852 (Closed)

jgm added the AST change label Dec 9, 2016

jgm removed the complexity:low label Mar 9, 2017

jgm mentioned this issue Dec 4, 2018: Use ADT to represent input formats #5118 (Closed)

tarleb mentioned this issue Apr 18, 2019: Not reading image links in org-mode #5454 (Open)

jgm mentioned this issue Jun 14, 2019: List of projects #5581 (Closed, 9 tasks)

tarleb linked a pull request Jun 15, 2019 that will close this issue: Use enumerable set of known formats jgm/pandoc-types#56 (Open)

jgm mentioned this issue Jan 13, 2020: roff->md conversion incorrectly converts \- to - #6041 (Closed)

despresc linked a pull request Sep 4, 2020 that will close this issue: Write new Format and Formats types, some helper functions jgm/pandoc-types#78 (Open)
