Escape leading spaces in tag names instead of stripping them #1580

b4n · 2017-10-17T05:32:16Z

Since #1567 leading spaces are stripped from tag names, but this breaks legitimate use of leading spaces in tag names (and actually changed behavior with unproblematic leading characters like \n (LF) or \t (HT)).

Instead, escape those characters when rendering instead. It also makes more sense to handle this when rendering as the effective problem is with the output format only.

@techee I'm asking for your review here merely for the HTML parser changes, but they should be straightforward.

masatake · 2017-10-17T05:35:40Z

@b4n, do you have any real example that "legitimate use of leading spaces in tag names" ?

b4n · 2017-10-17T05:38:05Z

main/field.c

+		 * pseudo-tags when sorting.  Anything with a lower byte value is
+		 * escaped by renderEscapedString() already. */
+		unexpected_byte = *s;
+		vStringPut (b, '\\');


The escape sequence itself is at your discretion, here I merely prefixed with a backslash, but it might make more sense to output \x20 for space and \x21 for an exclamation mark (their hex escape sequence, like we would have \x1f for Unit Separator (US)).

b4n · 2017-10-17T05:41:51Z

@masatake Yes: for example JavaScript allows any string as an object property name, like this:

var obj = {
  ' ': function() {}, // one space
  '\t': function() {}, // one tab
};

obj[' ']();
obj['\t']();

And I wanted to add tests for this when importing your signature normalization code to make sure it wasn't normalizing something it shouldn't be normalizing.

b4n · 2017-10-17T05:53:22Z

(I'll fix the test failure, sorry)

b4n · 2017-10-17T06:09:58Z

main/field.c

-			break;
-		}
-		else if (c == '\\')
-			break;


@masatake why no warning when there is a \? I wasn't emitting a warning for the scope "\\Broken\tContext" in the self-test as it stopped at the leading \ even though there is a \t, but I hardly see the point here; should I still make my change behave the same? I would rather think that this was an implementation detail and that emitting the warning in this case is the expected behavior, what do you think?

if ((c > 0x00 && c <= 0x1F) || c == 0x7F) { verbose ("Unexpected character (0 < *c && *c < 0x20) included in a tagEntryInfo: %s\n", base); verbose ("File: %s, Line: %lu, Lang: %s, Kind: %c\n", tag->inputFileName, tag->lineNumber, getLanguageName(tag->langType), tag->kind->letter); verbose ("Escape the character\n"); break; } else if (c == '\\') break; else continue;

c in ((c > 0x00 && c <= 0x1F) || c == 0x7F) is not printable, so ctags makes the warning.
Passing such c to the main part from a parser may be a bug. So ctags make the warning.

In other hand \ is a printable. Passing \ to the main part from a parser is valid. There for
ctags doesn't emit a warning. Actually \ is a separator in php. However, ctags must convert \' to \. breakin(c == '\')makes it possible to passsincluding\` to renderEscapedString called at the
end of renderEscapedName.

Do I answer your question?

I read your code and understand how you mean "implementation detail".
You are correct. Handling the backslash here is just for optimization.

coveralls · 2017-10-17T06:37:36Z

Coverage decreased (-0.01%) to 85.957% when pulling b4b7a4a on b4n:name-escape into ccac4df on universal-ctags:master.

masatake · 2017-10-17T06:56:39Z

var obj = {
  ' ': function() {}, // one space
  '\t': function() {}, // one tab
};

obj[' ']();
obj['\t']();

Oh, I see.
When I wrote the original patch, I just assumed the following case:

<h1>
HELLO
</h1>

or

<H1> Hello</H1>

In these case I think it is allowed to remove the prefixed whitespaces.
I will remember my original intention and I update this comment tonight in JST.

masatake · 2017-10-18T03:23:04Z

I caught a cold. Please, wait for awhile.

coveralls · 2017-10-18T03:36:47Z

Coverage decreased (-0.006%) to 85.963% when pulling 7dfde3e on b4n:name-escape into ccac4df on universal-ctags:master.

masatake · 2017-10-19T20:46:36Z

Units/parser-javascript.r/js-odd-method-names.d/input.js

@@ -0,0 +1,10 @@
+var object = {
+  '!hello': function(){},
+  ' hello': function(){},


Could you add a key which starts from \ ?

masatake · 2017-10-19T20:48:32Z

Other than the one item about test case, LGTM.

What I have to do

(1) update format section of ctags.1. (the highest priority), and
(2) add a test case derived from js-odd-method-names.d with --_output-format=e-ctags.

b4n · 2017-11-01T07:26:29Z

Could you add a key which starts from '' ?

Done, plus a few other odd characters. I also fixed the conflicts that appeared since last time I touched this.

coveralls · 2017-11-01T07:33:19Z

Coverage decreased (-0.006%) to 85.992% when pulling f24820b on b4n:name-escape into 66fcc1d on universal-ctags:master.

coveralls · 2017-11-01T07:33:22Z

Coverage decreased (-0.006%) to 85.992% when pulling f24820b on b4n:name-escape into 66fcc1d on universal-ctags:master.

masatake · 2017-11-01T09:41:25Z

@b4n, thank you. LGTM.

I think we may have to define version 3 of tags file format. The current readtags command may not work
well with the output of u-ctags. Though we have

!_TAG_OUTPUT_MODE	u-ctags	/u-ctags or e-ctags/

in tags file but the data structure of readtags doesn't have field for storing tags output mode. Just format version number can be stored to the data structure...

techee · 2017-11-01T21:57:24Z

@b4n Sorry for the delay. I had a look at the patches and they look good to me (the HTML part is trivial). I had a look at the rest as well and makes sense - I'm not so familiar with the ctags file format and don't know what other applications expect there to be but escaping ! and seems good enough to fix the problem.

b4n · 2017-11-15T03:36:05Z

@techee thanks :)

@masatake Should I try and update readtags to handle the amendment to the format? Should I update some docs too maybe?

b4n · 2017-11-15T08:16:59Z

I just added a commit updating the format documentation, and a another one (76a2697) with a proposal for a possibly more sensible escaping scheme. The rationale here is that we already have \xHH-style escaping (so there's little to no modification for a client to read it), and that this kind of escaping is fairly common for programmers anyway (so it's easier on readers). But of course I can drop it if you prefer the previous state.

coveralls · 2017-11-15T08:18:04Z

Coverage increased (+0.05%) to 86.05% when pulling 76a2697 on b4n:name-escape into 66fcc1d on universal-ctags:master.

masatake · 2017-11-15T09:01:24Z

@masatake Should I try and update readtags to handle the amendment to the format? Should I update some docs too maybe?

REALLY, REALLY, appreciated. Especially about docs/format.rst . Any level of helps are welcome.
Currently we are using _TAG_OUTPUT_MODE ptag for notifying the output format to client tools:

[yamato@master]~/var/util-linux% u-ctags -o - --extras=+p sys-utils/lsns.c
u-ctags -o - --extras=+p sys-utils/lsns.c
!_TAG_FILE_FORMAT	2	/extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED	1	/0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_MODE	u-ctags	/u-ctags or e-ctags/
!_TAG_PROGRAM_AUTHOR	Universal Ctags Team	//
...

I have not wanted to updating _TAG_FILE_FORMAT to 3 because it may require more work for thinking about compatibilities.
However, I recognize we cannot escape from thinking about them because the data structure defined in readtags.[ch] doesn't have a field for keeping _TAG_OUTPUT_MODE. It has only a field for _TAG_FILE_FORMAT.

We have to think about version 3 and define it.
We have to do 3 things.

make ctags command emits !_TAG_FILE_FORMAT 3 by default.
Version 2 output mode must be supported, too. It means --format=level option must be updated, too.
u-ctags of --output-format=u-ctags|e-ctags|etags|xref|json option means version 3 format.
e-ctags of the option means version 2.
Test cases are needed.
Extend readtags to accept version 3 format.
Update document.
Both docs/format.rst and man/ctags.1.rst .
ctags.1.rst will be written based on format.rst.

The most important one is docs/format.rst.

b4n · 2017-11-24T02:39:38Z

@masatake I just added a commit updating readtags so it can handle escape sequences. As it's done, it handles them both in the "query" string (I did not touch QUALIFIER part) and in the tag name. AFAICT readtags didn't handle any escaping before, not even the Exuberant-CTags ones, so I had to add the whole idea preprocessing might be needed.

I'm not sure it's the best way to do it, but it was the easiest without rewriting a lot of code. Please test and comment. BTW, this as said readtags was already not handling some valid output, so I don't think fixing it necessarily has to be part of this PR, so I can move it out to another one after if you prefer.

b4n · 2017-11-24T02:42:22Z

@masatake

man/ctags.1.rst

Does that mention the escaping anywhere? I couldn't find it at all, so I didn't update it yet.

This reverts commit 2bcd419.

This is an alternative fix for universal-ctags#1141 instead of stripping which was initially implemented in universal-ctags#1567, but stripping prevents legitimate tag names with whitespaces.

b4n · 2017-11-24T03:40:42Z

Rebased against master to fix conflicts; no changes.

coveralls · 2017-11-24T03:44:26Z

Coverage decreased (-0.008%) to 86.066% when pulling d349eee on b4n:name-escape into 157918c on universal-ctags:master.

masatake · 2017-11-24T19:00:47Z

Sorry to be late. I still be in busy. But I don't want to be a blocker.

I'm introducing ctags-tags.5 man page. It is dedicated to explain tags file format.
About version 1, and 2, I will move description in "TAG FILE FORMAT" section as "overview".
I will copy the contents
from http://docs.ctags.io/en/latest/format.html to ctags-tags.5 man page as "detail".

What we have done is this project will be explained as version 3 format.

@b4n, please look at docs/format.rst ? escaping is explained in the file. As I wrote I will move the
content of format into ctags-tags.5 man page. However, it is not ready yet. So I would like you to
update docs/format.rst.

When I modified format.rst, I took two approaches. One is putting EXCEPTION inline. Here EXCEPTION
means incompatible output of Universal-ctags. Another one is Exceptions section at the end of document. I would like you to add what you did in this PR in the two approaches.
They will be the main part of description of version 3 format in the future.

b4n · 2017-11-24T19:11:07Z

@masatake no worries, take your time.

@b4n, please look at docs/format.rst ? escaping is explained in the file.

I know, and I I updated it already (maybe not all of it, I'll check with your last remark), but as you mentioned needing to update the man page I checked it, and couldn't find anything in it. So if it's just for future, all is good.

coveralls · 2017-11-24T19:15:24Z

Coverage decreased (-0.008%) to 86.066% when pulling 87ceba7 on b4n:name-escape into 157918c on universal-ctags:master.

coveralls · 2017-11-24T19:22:48Z

Coverage decreased (-0.008%) to 86.066% when pulling d7fd404 on b4n:name-escape into 157918c on universal-ctags:master.

coveralls · 2017-11-24T19:30:05Z

Coverage decreased (-1.8%) to 84.25% when pulling 185fdb9 on b4n:name-escape into 157918c on universal-ctags:master.

coveralls · 2017-11-24T21:35:58Z

Coverage decreased (-1.8%) to 84.25% when pulling 7ebe53b on b4n:name-escape into 157918c on universal-ctags:master.

This is more in line with the other escaping rules already present than adding 2 new specific and uncommon sequences.

coveralls · 2017-11-25T08:09:34Z

Coverage decreased (-0.008%) to 86.066% when pulling 7195e6a on b4n:name-escape into 157918c on universal-ctags:master.

b4n · 2017-11-25T08:10:28Z

OK, I updated the Exceptions section of format.rst so it's up-to-date.

I removed the changes to readtags and plan on submitting them in another PR, because of the testing mess I couldn't yet figure out for Windows. As said earlier, I don't think it's a requirement for this very PR, so sounds reasonable as a separate one (esp. once this one is dealt with, although I could first submit the part that is unrelated to the changes here if you prefer).

Escape leading spaces in tag names instead of stripping them

b4n assigned masatake Oct 17, 2017

b4n requested review from masatake and techee October 17, 2017 05:32

b4n commented Oct 17, 2017

View reviewed changes

masatake approved these changes Oct 19, 2017

View reviewed changes

masatake requested changes Oct 19, 2017

View reviewed changes

b4n force-pushed the name-escape branch 3 times, most recently from a79ac51 to f24820b Compare November 1, 2017 07:24

masatake approved these changes Nov 1, 2017

View reviewed changes

masatake assigned b4n and unassigned masatake Nov 1, 2017

b4n added 6 commits November 23, 2017 18:42

html: Strip leading whitespaces in header tags as well as trailing ones

4b4bf44

html: Remove redundant condition

34b7ed4

Revert "main: trim whitespace prefix from tag name"

93e255b

This reverts commit 2bcd419.

Escape leading spaces and exclamation marks in the name field

26b8988

This is an alternative fix for universal-ctags#1141 instead of stripping which was initially implemented in universal-ctags#1567, but stripping prevents legitimate tag names with whitespaces.

Add a test case for various uncommon leading characters in tag names

a33f509

Update format documentation to reflect current state

fadc2d2

b4n force-pushed the name-escape branch from 6aa00cf to d349eee Compare November 24, 2017 03:39

Escape leading space and exclamation mark with their hex value

7195e6a

This is more in line with the other escaping rules already present than adding 2 new specific and uncommon sequences.

b4n force-pushed the name-escape branch from 7ebe53b to 7195e6a Compare November 25, 2017 08:05

b4n merged commit 7195e6a into universal-ctags:master Nov 28, 2017

b4n added a commit that referenced this pull request Nov 28, 2017

Merge pull request #1580 from b4n/name-escape

248cffc

Escape leading spaces in tag names instead of stripping them

b4n mentioned this pull request Nov 29, 2017

WIP: Update readtags to support escape sequences and fix & improve its testing #1605

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Escape leading spaces in tag names instead of stripping them #1580

Escape leading spaces in tag names instead of stripping them #1580

b4n commented Oct 17, 2017

masatake commented Oct 17, 2017

b4n Oct 17, 2017

b4n commented Oct 17, 2017

b4n commented Oct 17, 2017

b4n Oct 17, 2017

masatake Oct 19, 2017

masatake Oct 19, 2017

coveralls commented Oct 17, 2017

masatake commented Oct 17, 2017

masatake commented Oct 18, 2017

coveralls commented Oct 18, 2017

masatake Oct 19, 2017 •

edited

Loading

masatake commented Oct 19, 2017

b4n commented Nov 1, 2017 via email

coveralls commented Nov 1, 2017

coveralls commented Nov 1, 2017

masatake commented Nov 1, 2017

techee commented Nov 1, 2017

b4n commented Nov 15, 2017

b4n commented Nov 15, 2017

coveralls commented Nov 15, 2017

masatake commented Nov 15, 2017 •

edited by b4n

Loading

b4n commented Nov 24, 2017

b4n commented Nov 24, 2017

b4n commented Nov 24, 2017

coveralls commented Nov 24, 2017

masatake commented Nov 24, 2017

b4n commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 25, 2017

b4n commented Nov 25, 2017

Escape leading spaces in tag names instead of stripping them #1580

Escape leading spaces in tag names instead of stripping them #1580

Conversation

b4n commented Oct 17, 2017

masatake commented Oct 17, 2017

b4n Oct 17, 2017

Choose a reason for hiding this comment

b4n commented Oct 17, 2017

b4n commented Oct 17, 2017

b4n Oct 17, 2017

Choose a reason for hiding this comment

masatake Oct 19, 2017

Choose a reason for hiding this comment

masatake Oct 19, 2017

Choose a reason for hiding this comment

coveralls commented Oct 17, 2017

masatake commented Oct 17, 2017

masatake commented Oct 18, 2017

coveralls commented Oct 18, 2017

masatake Oct 19, 2017 • edited Loading

Choose a reason for hiding this comment

masatake commented Oct 19, 2017

b4n commented Nov 1, 2017 via email

coveralls commented Nov 1, 2017

coveralls commented Nov 1, 2017

masatake commented Nov 1, 2017

techee commented Nov 1, 2017

b4n commented Nov 15, 2017

b4n commented Nov 15, 2017

coveralls commented Nov 15, 2017

masatake commented Nov 15, 2017 • edited by b4n Loading

b4n commented Nov 24, 2017

b4n commented Nov 24, 2017

b4n commented Nov 24, 2017

coveralls commented Nov 24, 2017

masatake commented Nov 24, 2017

b4n commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 24, 2017

coveralls commented Nov 25, 2017

b4n commented Nov 25, 2017

masatake Oct 19, 2017 •

edited

Loading

masatake commented Nov 15, 2017 •

edited by b4n

Loading