Removing non-Haskell dependencies (C dependencies) #4535
The following cabal files in pandoc's transitive dependencies contain c-sources stanzas.
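A scan like this can be approximated with a few lines of script. The sketch below is only an illustration, not the method used in this thread: it assumes that the fields `c-sources`, `extra-libraries` and `pkgconfig-depends` are the ones worth flagging, and the package text in `example` is hypothetical.

```python
import re

# Cabal fields that usually signal non-Haskell code or a system library.
# Assumption: this short list is enough for a first pass.
FOREIGN_FIELDS = ("c-sources", "extra-libraries", "pkgconfig-depends")

def foreign_fields(cabal_text):
    """Return the sorted foreign-related field names used in a .cabal file."""
    found = set()
    for line in cabal_text.splitlines():
        stripped = line.strip()
        # Skip cabal line comments, which start with "--".
        if stripped.startswith("--"):
            continue
        match = re.match(r"([A-Za-z-]+)\s*:", stripped)
        if match and match.group(1).lower() in FOREIGN_FIELDS:
            found.add(match.group(1).lower())
    return sorted(found)

# Hypothetical package description, for illustration only.
example = """\
name: hypothetical-pkg
library
  build-depends: base
  c-sources: cbits/parser.c
  -- extra-libraries: commented-out
"""
print(foreign_fields(example))  # ['c-sources']
```

A line-based match is crude (it ignores conditionals and multi-line field values), but it is enough to produce a first list of candidates to inspect by hand.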
Just to clarify, is the plan to offer a pure Haskell implementation as an optional substitute, or to make it mandatory (removing those dependencies completely)? I think some people would like to be able to compile with the higher-performance C implementation when available.
It would not be an option, probably.
But, honestly, the places where we use C libraries are
not the performance bottlenecks. You shouldn't see
much difference, e.g., because of the switch-out of the
YAML parser.
If it is one way or the other, then I think pure Haskell with flexibility is better than being a bit faster (honestly, pandoc’s power has always been its flexibility, not its performance, and flexibility is more important most of the time).
I have to revise what I said about the YAML parser.
I've got a
I'm releasing skylighting 0.9, with no C dependencies.
Hi, just trying to clarify about the YAML situation in pandoc. First a short summary,
So the question is: which direction are we going? HsYAML or yaml? YAML 1.1 or 1.2? Thanks. Edit: added the reply from below to the summary and reformatted the summary a bit.
Answer: I don't know. I like avoiding the C dependency. I don't know how much the performance issue is affecting users. For now, we stay with HsYAML. But I'd be happier with that if upstream were more responsive in looking at the performance issue.
Running the command in #6084, pandoc 2.7.3 takes 1.42s and pandoc 2.11.3 takes 16.84s on my computer, so the ratio is about a factor of 12. I don't know if the gap is big enough to make a difference: how close to C performance can we expect to get? Even if they can reduce it by a factor of 2, it would still be ~8s, which to a human is not that discernible a difference. Also, I think it is still roughly linear time (#6084 (comment)), so is the problem really that serious? Lastly, the example over there is probably a practical upper bound (how big do we expect an injected bibliography to be?). Less than a minute is still quite good for a worst-case scenario. Another idea: could we add a JSON reader in citeproc and ask people with large bibliographies to inject JSON instead? Then people can preprocess YAML to JSON and inject the JSON into pandoc/citeproc (on the assumption that reading JSON should be much faster).
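For reference, the arithmetic behind those figures can be checked with a short snippet. The two timings are the ones quoted above from a single machine; the variable names are only illustrative.

```python
# Timings quoted above (seconds), measured on one machine.
old_s = 16.84 - 15.42  # written out below as 1.42 for clarity
old_s = 1.42           # pandoc 2.7.3
new_s = 16.84          # pandoc 2.11.3

ratio = new_s / old_s  # slowdown factor
halved = new_s / 2     # run time if the parser became twice as fast

print(round(ratio, 1), round(halved, 2))  # 11.9 8.42
```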
Yes, it's linear. But the difference between 25 seconds and 5 seconds is a significant one, probably enough to discourage anyone using a YAML bibliography that size. Maybe that's okay. They can always use CSL JSON (already possible), which is pretty fast. BibTeX will have performance intermediate between the two.
I’d say a factor of 5 between Haskell and C is pretty good. Pandoc vs. the fastest markdown parser probably has a much greater ratio. On the other hand, a very big bibliography (from which the user might cite only a few entries), disproportionate to the length of the markdown, seems a more probable scenario than a huge markdown file.
Reasons:
- Performance: HsYAML is around 20 times slower in parsing large YAML bibliographies (#6084).
- An issue was submitted to HsYAML, but it hasn't gotten any attention. HsYAML seems borderline unmaintained; it hasn't had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves from C dependencies (#4535). But I don't see a better alternative until a better pure Haskell parser is available.

Closes #6084.

Notes:
- We've removed the FromYAML instances for all types that had them, since this is a HsYAML-specific typeclass [API change]. (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values. Users may need to quote these when they are meant to be interpreted as strings. Similarly, 'null' is parsed as a YAML null value (and will be treated as an empty string by pandoc rather than the string 'null'). Quoting it will force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
- Pandoc now behaves better when the YAML metadata contains escaping errors: instead of just falling back on treating the section as a table, it raises a YAML parsing error.
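The quoting advice in the notes above could be applied mechanically when generating metadata. The helper below is a hypothetical sketch, not part of pandoc or the yaml package: it assumes the affected scalars are exactly the spellings listed in the commit message and does not cover any other case variants.

```python
# Plain scalars the commit message lists as no longer read as strings.
# Assumption: matched exactly as spelled there; other spellings are not covered.
SPECIAL_SCALARS = {"Y", "N", "Yes", "No", "On", "Off", "null"}

def quote_if_special(value):
    """Wrap value in double quotes if it is one of the listed special scalars."""
    return f'"{value}"' if value in SPECIAL_SCALARS else value

print(quote_if_special("No"))      # "No"
print(quote_if_special("Norway"))  # Norway
```

A metadata field written as `answer: "No"` would then stay a string, whereas an unquoted `answer: No` would be read as a boolean under the behaviour described in the notes.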
We should be much closer now that we've separated out Lua. Is there something left? Edit: I forgot that we switched back from HsYAML to yaml, so I guess that's the last roadblock now.
Off the top of my head, yaml is probably still a C dependency because of the performance issue.
Following from this discussion, there seems to be value in locating non-Haskell dependencies and replacing them with Haskell.
@jgm has identified the following (comments his):
- skylighting - depends on pcre - not clear how we can eliminate that without a pure Haskell regex library that matches pcre's features. (Perhaps we could provide a compiler flag to build without highlighting support, however.)
- yaml - depends on libyaml - there are some proposals in Google Summer of Code for writing a better pure Haskell yaml parser, so this may be done in the relatively near future.
- commonmark/gfm - depends on libcmark - I have been working on a pure Haskell commonmark parser, which isn't yet published but which could substitute for libcmark (at the cost of reduced performance). https://github.com/jgm/commonmark-hs
- Confirm the absence of other dependencies by using HackageDB to chase down all of pandoc's transitive dependencies and exhaustively enumerate those that depend on foreign libraries.
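The exhaustive enumeration in the last item is a plain graph traversal. The sketch below shows the idea on a toy graph; the package names, the `deps` map and the `has_foreign` set are all hypothetical stand-ins for data that would in practice be extracted from the package index.

```python
def foreign_in_closure(root, deps, has_foreign):
    """Walk the dependency graph from root and return, sorted, every
    reachable package (including root) that is flagged as foreign."""
    seen, stack, flagged = set(), [root], []
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue  # already visited via another path
        seen.add(pkg)
        if pkg in has_foreign:
            flagged.append(pkg)
        stack.extend(deps.get(pkg, []))
    return sorted(flagged)

# Hypothetical dependency graph: package -> direct dependencies.
deps = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": ["lib-c"],
    "lib-c": [],
}
# Hypothetical set of packages whose cabal files reference foreign code.
has_foreign = {"lib-b", "lib-c"}

print(foreign_in_closure("app", deps, has_foreign))  # ['lib-b', 'lib-c']
```

The `seen` set keeps shared dependencies from being visited twice and also protects against cycles in the input data.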