Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing non-Haskell dependencies (C dependencies) #4535

Open
3 of 4 tasks
aolney opened this issue Apr 5, 2018 · 14 comments
Open
3 of 4 tasks

Removing non-Haskell dependencies (C dependencies) #4535

aolney opened this issue Apr 5, 2018 · 14 comments

Comments

@aolney
Copy link

aolney commented Apr 5, 2018

Following from this discussion there seems to be a value to locating non-Haskell dependencies and replacing them with Haskell.

@jgm has identified the following (comments his):

  • skylighting - depends on pcre - not clear how we can eliminate that without
    a pure Haskell regex library that matches pcre's features. (Perhaps we could
    provide a compiler flag to build without highlighting support, however.)

  • yaml - depends on libyaml - there are some proposals in Google Summer
    of Code for writing a better pure Haskell yaml parser, so this may be
    done in the relatively near future.

  • commonmark/gfm - depends on libcmark - I have been working on a pure
    Haskell commonmark parser, which isn't yet published but which could
    substitute for libcmark (at the cost of reduced performance).
    https://github.com/jgm/commonmark-hs

  • confirm absence of other dependencies using HackageDB to chase down all of pandoc's transitive dependencies and exhaustively enumerate those that depend on foreign libraries.

@jgm
Copy link
Owner

jgm commented Jun 1, 2018

The following cabal files in pandoc's transitive dependencies contain c-sources stanzas.
In some cases they may only be used for testing or benchmarking.

time.cabal:72:    c-sources: lib/cbits/HsTime.c
time.cabal:179:    c-sources: test/unix/Test/Format/FormatStuff.c
aeson.cabal:156:    c-sources: cbits/unescape_string.c
aeson.cabal:166:  c-sources: cbits/unescape_string.c
cryptonite.cabal:240:  C-sources:         cbits/cryptonite_chacha.c
cryptonite.cabal:273:    C-sources:         cbits/decaf/p448/arch_ref64/f_impl.c
cryptonite.cabal:284:    C-sources:         cbits/decaf/p448/arch_32/f_impl.c
cryptonite.cabal:296:    C-sources: cbits/curve25519/curve25519-donna-c64.c
cryptonite.cabal:298:    C-sources: cbits/curve25519/curve25519-donna.c
cryptonite.cabal:315:    c-sources:      cbits/cryptonite_rdrand.c
cryptonite.cabal:321:    C-sources:       cbits/aes/x86ni.c
cryptonite.cabal:326:    C-sources:       cbits/aes/generic.c
cryptonite.cabal:331:    C-sources:      cbits/blake2/sse/blake2s.c
cryptonite.cabal:337:    C-sources:      cbits/blake2/ref/blake2s-ref.c
cryptonite.cabal:346:  C-sources:      cbits/argon2/argon2.c
basement.cabal:158:  c-sources:         cbits/foundation_mem.c
network.cabal:74:  c-sources: cbits/HsNet.c
zlib.cabal:89:      c-sources:   cbits/adler32.c cbits/compress.c cbits/crc32.c
integer-gmp.cabal:65:  c-sources:
hashable.cabal:64:  C-sources:
hashable.cabal:71:    c-sources:         cbits/getRandomBytes.c
hashable.cabal:141:  c-sources:
hashable.cabal:149:    c-sources:
hashable.cabal:154:      c-sources:
hashable.cabal:161:    c-sources:         cbits/getRandomBytes.c
primitive.cabal:59:  c-sources: cbits/primitive-memops.c
clock.cabal:74:      c-sources:         cbits/hs_clock_darwin.c
clock.cabal:76:      c-sources:         cbits/hs_clock_win32.c
cmark-gfm.cabal:82:    c-sources:         cbits/houdini_html_u.c
streaming-commons.cabal:61:  c-sources:           cbits/zlib-helper.c
text-short.cabal:51:     c-sources:        cbits/memcmp.c
text-short.cabal:65:  c-sources: cbits/cbits.c
ghc-prim.cabal:70:    c-sources:
process.cabal:66:    c-sources:
bytestring.cabal:129:  c-sources:         cbits/fpstring.c
bytestring.cabal:157:  c-sources:        cbits/fpstring.c
bytestring.cabal:178:  c-sources:        cbits/fpstring.c
bytestring.cabal:221:  c-sources:        cbits/fpstring.c
text.cabal:93:  c-sources:    cbits/cbits.c
text.cabal:173:  c-sources:      cbits/cbits.c
unix.cabal:133:    c-sources:
old-time.cabal:48:    c-sources:
hourglass.cabal:52:     c-sources:      cbits/unix.c
foundation.cabal:169:  c-sources:         cbits/foundation_random.c
criterion.cabal:77:  c-sources: cbits/cycles.c
criterion.cabal:79:    c-sources: cbits/time-osx.c
criterion.cabal:82:      c-sources: cbits/time-windows.c
criterion.cabal:84:      c-sources: cbits/time-posix.c
yaml.cabal:77:    c-sources:       c/helper.c
yaml.cabal:84:            c-sources:       libyaml/api.c,
binary.cabal:113:  c-sources: benchmarks/CBenchmark.c
hslua.cabal:103:    c-sources:          safer-api/safer-api.c
hslua.cabal:124:    c-sources:          lua-5.3.4/lapi.c
regex-pcre-builtin.cabal:73:  C-sources:
base.cabal:324:    c-sources:

jgm added a commit that referenced this issue Jun 29, 2018
yaml wraps a C library; HsYAML is pure Haskell.
Closes #4747.  Advances #4535.
@ickc
Copy link
Contributor

ickc commented Oct 16, 2018

Just to clarify, is the plan to provide pure Haskell implementation a substitute as an option or mandatory (completely removing those dependency)? I think some people will like to be able to compile with the higher performance C implementation when available.

@jgm
Copy link
Owner

jgm commented Oct 16, 2018 via email

@ickc
Copy link
Contributor

ickc commented Oct 16, 2018

If it is one way or the other, then I think pure Haskell with flexibility is better than a bit faster (honestly pandoc’s power is always its flexibility not performance, which is more important most of the time.)

@jgm
Copy link
Owner

jgm commented Jul 20, 2020

I have to revise what I said about the YAML parser.
See #6084 - we may need to go back to using yaml.

@jgm
Copy link
Owner

jgm commented Jul 27, 2020

I've got a newregex branch of skylighting now that uses a Haskell implementation of KDE regexes.
It's 2-4 times slower on benchmarks. (EDIT: added later, I've managed to improve the benchmarks quite a bit. Most are about 2x now.)
Profiling suggests that the time is spent on compiling rather than matching regexes.

21.1  13.4     -  Text.Regex.KDE.Compile compileRegex (665)
21.1  13.4     -    Text.Regex.KDE.Compile compileRegex.parser (665)
21.1  13.4   1.1      Data.Attoparsec.Internal.Types >>= (150410)
20.0  12.3   3.5        Data.Attoparsec.Internal.Types >>=.\ (372896)
15.2   8.6   7.8          Data.Attoparsec.Internal.Types >>=.\.succ' (250817)
 2.4     -     -            Text.Regex.KDE.Compile pRegex (1631)
 1.3    .2     -            Data.Attoparsec.Internal.Types fail (98371)
 1.3    .2    .3              Data.Attoparsec.Internal.Types fail.\ (98384)
  .5     -     -                Text.Regex.KDE.Compile pRegex (0)
  .3     -     -                Text.Regex.KDE.Compile pSuffix (0)
  .3     -     -                Data.Attoparsec.Internal.Types plus (0)
 1.3     -     -            Data.Attoparsec.Internal.Types pure (70980)
  .5     -     -            Data.Attoparsec.Internal.Types plus (20678)
  .5     -     -            Text.Regex.KDE.Compile char (0)
  .5    .2    .3            Text.Regex.KDE.Compile pRegexCharClass.getC (0)
  .3     -     -              Text.Regex.KDE.Compile char (0)
  .5     -     -            Text.Regex.KDE.Compile pSuffix (6817)
  .3    .3     -            Data.Attoparsec.Internal.Types many (0)
  .3    .3    .3              Data.Attoparsec.Internal.Types many.some_v (0)
  .8     -     -          Text.Regex.KDE.Compile pRegex (1229)
  .5    .2     -          Data.Attoparsec.Internal demandInput (0)
  .5    .2     -            Data.Attoparsec.Internal demandInput.\ (8577)
  .5    .2    .3              Data.Attoparsec.Internal.Types >>=.\.succ' (3825)
  .3     -     -                Text.Regex.KDE.Compile pRegex (0)


 2.4   2.4     -  Data.Attoparsec.Internal.Types plus (52569)
 2.4   2.4   1.6    Data.Attoparsec.Internal.Types plus.\ (121358)
  .8    .8    .6      Data.Attoparsec.Internal.Types plus.\.lose' (54707)
  .3    .3    .3        Text.Regex.KDE.Compile pRegexCharClass (0)


 1.9   1.9   1.9  Skylighting.Syntax.Php syntax (1)


 3.7   1.8   1.7  Text.Regex.KDE.Compile pRegexPart (2860)
 1.8    .2    .3    Text.Regex.KDE.Compile pRegexChar (2860)
 1.3     -     -      Data.Attoparsec.Internal.Types plus (11440)
  .3     -     -      Text.Regex.KDE.Compile pRegexAsciiChar (0)
  .3     -     -        Data.Attoparsec.Internal.Types fmap (0)
  .3     -     -    Text.Regex.KDE.Compile withGroupModifiers (0)
  .3     -     -      Text.Regex.KDE.Compile char (0)


 2.0   1.7     -  Data.Attoparsec.Internal.Types fmap (8365)
 2.0   1.7   1.1    Data.Attoparsec.Internal.Types fmap.\ (57527)
  .6    .6    .6      Data.Attoparsec.Internal.Types fmap.\.succ' (4518)
  .3     -     -      Text.Regex.KDE.Compile pRegexPart (0)

@jgm
Copy link
Owner

jgm commented Jul 31, 2020

I'm releasing skylighting 0.9, with no C dependencies.
Once pandoc depends on this, we'll have checked off the main three C dependencies listed above.
It would be worth checking more carefully to see if there are others downstream.

@ickc
Copy link
Contributor

ickc commented Dec 18, 2020

Hi, just trying to clarify about the YAML situation in pandoc.

First a short summary,

So the question is which direction are we going? HsYAML or yaml? YAML 1.1 or 1.2?

Thanks.

Edit: add reply from below to summary, reformat summary a bit.

@jgm
Copy link
Owner

jgm commented Dec 18, 2020

Answer: I don't know. I like avoiding the C dependency. I don't know how much the performance issue is affecting users. For now, we stay with HsYAML. But I'd be happier with that if upstream were more responsive in looking at the performance issue.

@ickc
Copy link
Contributor

ickc commented Dec 18, 2020

Running the command in #6084, pandoc 2.7.3 vs pandoc 2.11.3 differs from 1.42s to 16.84s on my computer. So the ratio is like a factor of dozen. So I don't know if the gap is big enough to make a difference—how close to C performance can we expect? Even if they can reduce it by a factor of 2, it would still be ~8s which to a human is not that discernible.

Also, I think it is still roughly linear time (#6084 (comment)), so is the problem is really that problematic?

Lastly the example over there probably is a practical upper bound (how big do we expect people injecting bibliography?) Less than a minute is still quite good in worst case scenario.

Another idea: could we add a JSON reader in citeproc and ask people for large bibliography to inject JSON instead? Then people can preprocess YAML to JSON, and inject the JSON to pandoc/citeproc. (With the assumption that reading JSON should be much faster.)

@jgm
Copy link
Owner

jgm commented Dec 18, 2020

Yes, it's linear. But the difference between 25 seconds and 5 seconds is a significant one, probably enough to discourage anyone using a YAML bibliography that size. Maybe that's okay. They can always use CSL JSON (already possible), which is pretty fast. Bibtex will have performance intermediate between the two.

@ickc
Copy link
Contributor

ickc commented Dec 18, 2020

I’d say a factor of 5 between Haskell and C is pretty good. Pandoc vs the fastest markdown parser probably has a much greater ratio.

But on the other hand, the scenario to have a very big bibliography (where the user might only cite a few from it) disproportional to the length of what they may have in markdown seems more probable (comparing to having a huge markdown.)

jgm added a commit that referenced this issue Oct 27, 2021
Reasons:

- Performance: HsYAML is around 20 times slower in parsing
  large YAML files, such as bibliographies (#6084).
  An issue was submitted to HsYAML, but it hasn't gotten
  any attention.
- HsYAML seems borderline unmaintained; it hasn't had a
  commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves
  from C dependencies (#4535).  But I don't see a better alternative
  until a better pure Haskell parser is available.

Closes #6084.

Notes:

- We've removed the FromYAML instances for all types that had
  them, since this is a HsYAML-specific typeclass [API change].
  (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml
  parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values.
  Users may need to quote these when they are meant to be
  interpreted as strings.  Similarly, 'null' is parsed as
  a YAML null value (and will be treated as an empty string
  by pandoc rather than the string 'null').  Quoting it will
  force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
jgm added a commit that referenced this issue Oct 27, 2021
Reasons:

- Performance: HsYAML is around 20 times slower in parsing
  large YAML bibliographies (#6084).
- An issue was submitted to HsYAML, but it hasn't gotten
  any attention.  HsYAML seems borderline unmaintained; it hasn't
  had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves
  from C dependencies (#4535).  But I don't see a better alternative
  until a better pure Haskell parser is available.

Closes #6084.

Notes:

- We've removed the FromYAML instances for all types that had
  them, since this is a HsYAML-specific typeclass [API change].
  (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml
  parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values.
  Users may need to quote these when they are meant to be
  interpreted as strings.  Similarly, 'null' is parsed as
  a YAML null value (and will be treated as an empty string
  by pandoc rather than the string 'null').  Quoting it will
  force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
jgm added a commit that referenced this issue Oct 27, 2021
Reasons:

- Performance: HsYAML is around 20 times slower in parsing
  large YAML bibliographies (#6084).
- An issue was submitted to HsYAML, but it hasn't gotten
  any attention.  HsYAML seems borderline unmaintained; it hasn't
  had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves
  from C dependencies (#4535).  But I don't see a better alternative
  until a better pure Haskell parser is available.

Closes #6084.

Notes:

- We've removed the FromYAML instances for all types that had
  them, since this is a HsYAML-specific typeclass [API change].
  (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml
  parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values.
  Users may need to quote these when they are meant to be
  interpreted as strings.  Similarly, 'null' is parsed as
  a YAML null value (and will be treated as an empty string
  by pandoc rather than the string 'null').  Quoting it will
  force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
- Pandoc now behaves better when the YAML metadata contains
  escaping errors: instead of just falling back on treating
  the section as a table, it raises a YAML parsing error.
jgm added a commit that referenced this issue Oct 27, 2021
Reasons:

- Performance: HsYAML is around 20 times slower in parsing
  large YAML bibliographies (#6084).
- An issue was submitted to HsYAML, but it hasn't gotten
  any attention.  HsYAML seems borderline unmaintained; it hasn't
  had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves
  from C dependencies (#4535).  But I don't see a better alternative
  until a better pure Haskell parser is available.

Closes #6084.

Notes:

- We've removed the FromYAML instances for all types that had
  them, since this is a HsYAML-specific typeclass [API change].
  (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml
  parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values.
  Users may need to quote these when they are meant to be
  interpreted as strings.  Similarly, 'null' is parsed as
  a YAML null value (and will be treated as an empty string
  by pandoc rather than the string 'null').  Quoting it will
  force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
- Pandoc now behaves better when the YAML metadata contains
  escaping errors: instead of just falling back on treating
  the section as a table, it raises a YAML parsing error.
@tarleb
Copy link
Collaborator

tarleb commented Oct 22, 2022

We should be much closer now that we've separated out Lua. Is there something left?

Edit: I forgot that we switched back from HsYAML to yaml, so I guess that's the last roadblock now.

@ickc
Copy link
Contributor

ickc commented Oct 22, 2022

On top of my head, probably yaml still is a C dependency because of performance issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants