parsers,lisp: speed up token lookup #3741

dtikhonov · 2023-05-31T04:56:44Z

No description provided.

masatake · 2023-05-31T06:07:02Z

parsers/lisp.c

@@ -149,106 +149,94 @@ static int  lisp_hint2kind (const vString *const hint)
 	return k;
 }

-/* TODO: implement this in hashtable. */
+/* Lookup code produced by gperf version 3.1 */


This is helpful for maintenance.
Could you write more about this?

I wonder how to generate the table.

@masatake I put a comment into the code. I think that adding gperf to the required toolchain is too much of a pain. Do you?

Give me time to answer. I have some ideas about this change.

In the build process of ctags, we use two code generators: packcc and optlib2c.

packcc is developed at an external project. With "git subtree," the
source code of packcc is integrated into our source tree. On UNIX-alike
systems, packcc is built before building ctags. packcc is run for
generating .c and .h files. If I understand
correctly, on Windows+msvc , packcc is not built. So parsers requiring
packcc are not built on Windows+msvc. The *.peg files under peg/ are
input for packcc. Generated .c and .h are not in our git repo.

optlib2c is a code generator. It is in our source tree (misc/optlib2c)
and we maintain it as part of ctags. It is written in Perl.
optlib2c CAN generate .c files from optlib/*.ctags files.
We put both .ctags files and generated .c files into our git repo.
So, unlike .peg based parsers, .ctags parsers are built on any platform
including Windows+msvc.

I decided to put .c files derived from .ctags into our git repo because
I know optlib2c generates the same C code for the same .ctags file.

I decided not to put .c and .h files derived from .peg into our git repo because
I should not expect packcc generates the same code for the same .peg file; packcc
is developed at the external project.

Even .ctags parser, there is a limitation. As far as not modifying a .ctags
parser, one can build the .ctags parser even on Windows+msvc. However,
one cannot modify .ctags because the build script for Windows+msvc has no
rule to run optlib2c.

Conclusion:

optlib2c based parsers (.ctags parsers or optlib parsers) can be built
on Windows+msvc unless modified. If modified, it is hard to
build the parser on the platform.

packcc based parsers (.peg parsers) cannot be built on Windows+msvc.

We can enhance the build scripts on Windows+msvc. However, it is
not a topic here.

What should we do about gperf?

In my understanding, gperf is a rather stable program.
So we can expect it generates the same C code for the same input.
So, like optlib2c, we can put generated code into our git source tree.
That means Windows+msvc users can build the gperf based parser.
However, unlike optlib2c and packcc, we don't put the source code
of gperf into our source tree. One wanting to modify a gperf based parser
must install gperf in one's workspace.

So what we have to do:

TODO1. we must choose a version of gperf all developers and contributors use
to ensure the expectation.

TODO2. we must provide a container file (Dockerfile, Contaienrfile),
or a container image to run the version of gperf.

TODO3. we must deploy a detector on our CI environments for comparing
the code generated by the version of gperf and code submitted by
a contributor. If a pull request includes code generated by an unexpected version of gperf installed in the contributor's workspace, merging
should be prevented automatically.

Above TODO items are about our build-infrastructure.

TODO4. A generated code and manually written code must be separated at the file level.

lisp.c: this file should not include any generated code.

lisp_gperf.c: this file should include only generated.

TODO5. An input for gperf (.gperf file) must be in our git repo.

lisp_def.gperf: one wanting to add a new keyword may edit this file.

lisp_gperf.c and lisp_def.gperf are the new source code for ctags.

TODO6. source.mak must be extended to accept the new files.
TODO7. The way to update the win32 build script must be provided.

Achieving these items may be possible but challenging.

Thank you for reading, and sorry for my broken English.
I burned out.

Thank you, @masatake, this is well detailed. It is a lot of work: which is why I integrated the generated code by hand.

I think this is too much work for small payoff. How important is it to be able to add keywords easily? Do we think that new Emacs Lisp keywords will be added with some frequency? If not, I think this PR is good enough. We can take the lazy approach: if new keywords are to be added, we can update the mechanism as you outlined above.

How important is it to be able to add keywords easily?

Just another critical question. Give me time to answer again. I have too many words about this.

Sorry to be late to answer.

I think this is too much work for small payoff.
Yes if we apply the technique only to the EmacsLisp parser.
No because we can apply the technique to many other parsers.

How important is it to be able to add keywords easily? Do we think that new Emacs Lisp keywords will be added with some frequency?

Conceptually, keywords will be added frequently.
Let me explain.

I have planned to unify parsers with S expression-based syntaxes like Lisp, EmacsLisp, Scheme, and Clojure for years:

Add parser classes, xml, sexp, ini, and toml #519

clojure parser needs improvement #988

What I have to do may be:

S expression-based meta parser. It is a library to develop a concrete parser specialized to a real language.

Rewrite Lisp, EmacsLisp, Scheme, and Clojure based on the meta parser.

If the way to add a keyword is not sophisticated, it will be an issue in realizing the plan.

This is not only the reason. Developing in-language DSL is a viral technique among Lisp programmers. They may want to make tags for their own def-something.
See https://docs.ctags.io/en/latest/running-multi-parsers.html?highlight=agent#subparser-tagging-definitions-of-higher-upper-level-language.

Universal Ctags provides a concept, subparser, to satisfy these needs.
If a programmer knows C language, one can add a subparser specialized to one's DSL.
A subparser is a set of callback routines called from a base parser.
Moose parser is an example of subparser. It is called from Perl parser.

The concept of subparser is not enough for real demands.
Only a few people want to write C code for extracting tags from one's DSL input.
Only a few people like me want to pay the cost.
The greatest and the most loved feature of ctags may be --regex-=....
It will be nice if ctags provides a way to define a subparser for a Lisp-based parser from the command line like --Sexp-=...
In this case, a user may define a keyword from the command line.

In these grand stories, it is YES for the question "keywords will be added with some frequency?".

However, what I wrote is not more than a plan. It has not been realized for ... 7 years.
The current development should be limited to plans.

I can only ask you to make a way to add a keyword (1) easy, (2) well-documented, or (3) integrated into our build system.

Thank you for reading.

Thank you for the explanation. I understand what you are trying to accomplish. I will give this some thought.

codecov · 2023-05-31T07:14:38Z

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.02 ⚠️

Comparison is base (b3e1ad7) 82.99% compared to head (156c835) 82.98%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3741      +/-   ##
==========================================
- Coverage   82.99%   82.98%   -0.02%     
==========================================
  Files         226      226              
  Lines       55039    54994      -45     
==========================================
- Hits        45680    45635      -45     
  Misses       9359     9359

Impacted Files	Coverage Δ
parsers/lisp.c	`85.12% <100.00%> (-4.71%)`	⬇️

... and 6 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

parsers,lisp: speed up token lookup

0f3e009

masatake reviewed May 31, 2023

View reviewed changes

dtikhonov added 2 commits May 31, 2023 12:13

Make more maintainable

45214ed

Update comment: no code changes

156c835

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

parsers,lisp: speed up token lookup #3741

parsers,lisp: speed up token lookup #3741

dtikhonov commented May 31, 2023

masatake May 31, 2023 •

edited

Loading

dtikhonov May 31, 2023

masatake May 31, 2023

masatake Jun 7, 2023

dtikhonov Jun 7, 2023

masatake Jun 7, 2023

masatake Jun 23, 2023

dtikhonov Jun 24, 2023

codecov bot commented May 31, 2023 •

edited

Loading

parsers,lisp: speed up token lookup #3741

Are you sure you want to change the base?

parsers,lisp: speed up token lookup #3741

Conversation

dtikhonov commented May 31, 2023

masatake May 31, 2023 • edited Loading

Choose a reason for hiding this comment

dtikhonov May 31, 2023

Choose a reason for hiding this comment

masatake May 31, 2023

Choose a reason for hiding this comment

masatake Jun 7, 2023

Choose a reason for hiding this comment

dtikhonov Jun 7, 2023

Choose a reason for hiding this comment

masatake Jun 7, 2023

Choose a reason for hiding this comment

masatake Jun 23, 2023

Choose a reason for hiding this comment

dtikhonov Jun 24, 2023

Choose a reason for hiding this comment

codecov bot commented May 31, 2023 • edited Loading

Codecov Report

masatake May 31, 2023 •

edited

Loading

codecov bot commented May 31, 2023 •

edited

Loading