Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape leading spaces in tag names instead of stripping them #1580

Merged
merged 7 commits into from
Nov 28, 2017

Conversation

b4n
Copy link
Member

@b4n b4n commented Oct 17, 2017

Since #1567 leading spaces are stripped from tag names, but this breaks legitimate use of leading spaces in tag names (and actually changed behavior with unproblematic leading characters like \n (LF) or \t (HT)).

Instead, escape those characters when rendering instead. It also makes more sense to handle this when rendering as the effective problem is with the output format only.

@techee I'm asking for your review here merely for the HTML parser changes, but they should be straightforward.

@masatake
Copy link
Member

@b4n, do you have any real example that "legitimate use of leading spaces in tag names" ?

main/field.c Outdated
* pseudo-tags when sorting. Anything with a lower byte value is
* escaped by renderEscapedString() already. */
unexpected_byte = *s;
vStringPut (b, '\\');
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escape sequence itself is at your discretion, here I merely prefixed with a backslash, but it might make more sense to output \x20 for space and \x21 for an exclamation mark (their hex escape sequence, like we would have \x1f for Unit Separator (US)).

@b4n
Copy link
Member Author

b4n commented Oct 17, 2017

@masatake Yes: for example JavaScript allows any string as an object property name, like this:

var obj = {
  ' ': function() {}, // one space
  '\t': function() {}, // one tab
};

obj[' ']();
obj['\t']();

And I wanted to add tests for this when importing your signature normalization code to make sure it wasn't normalizing something it shouldn't be normalizing.

@b4n
Copy link
Member Author

b4n commented Oct 17, 2017

(I'll fix the test failure, sorry)

main/field.c Outdated
break;
}
else if (c == '\\')
break;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masatake why no warning when there is a \? I wasn't emitting a warning for the scope "\\Broken\tContext" in the self-test as it stopped at the leading \ even though there is a \t, but I hardly see the point here; should I still make my change behave the same? I would rather think that this was an implementation detail and that emitting the warning in this case is the expected behavior, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ((c > 0x00 && c <= 0x1F) || c == 0x7F)
		{
			verbose ("Unexpected character (0 < *c && *c < 0x20) included in a tagEntryInfo: %s\n", base);
			verbose ("File: %s, Line: %lu, Lang: %s, Kind: %c\n",
				 tag->inputFileName, tag->lineNumber, getLanguageName(tag->langType), tag->kind->letter);
			verbose ("Escape the character\n");
			break;
		}
		else if (c == '\\')
			break;
		else
			continue;

c in ((c > 0x00 && c <= 0x1F) || c == 0x7F) is not printable, so ctags makes the warning.
Passing such c to the main part from a parser may be a bug. So ctags make the warning.

In other hand \ is a printable. Passing \ to the main part from a parser is valid. There for
ctags doesn't emit a warning. Actually \ is a separator in php. However, ctags must convert \' to \. breakin(c == '\')makes it possible to passsincluding\` to renderEscapedString called at the
end of renderEscapedName.

Do I answer your question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read your code and understand how you mean "implementation detail".
You are correct. Handling the backslash here is just for optimization.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.01%) to 85.957% when pulling b4b7a4a on b4n:name-escape into ccac4df on universal-ctags:master.

@masatake
Copy link
Member

var obj = {
  ' ': function() {}, // one space
  '\t': function() {}, // one tab
};

obj[' ']();
obj['\t']();

Oh, I see.
When I wrote the original patch, I just assumed the following case:

<h1>
HELLO
</h1>

or

<H1> Hello</H1>

In these case I think it is allowed to remove the prefixed whitespaces.
I will remember my original intention and I update this comment tonight in JST.

@masatake
Copy link
Member

I caught a cold. Please, wait for awhile.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.006%) to 85.963% when pulling 7dfde3e on b4n:name-escape into ccac4df on universal-ctags:master.

@@ -0,0 +1,10 @@
var object = {
'!hello': function(){},
' hello': function(){},
Copy link
Member

@masatake masatake Oct 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a key which starts from \ ?

@masatake
Copy link
Member

Other than the one item about test case, LGTM.

What I have to do

(1) update format section of ctags.1. (the highest priority), and
(2) add a test case derived from js-odd-method-names.d with --_output-format=e-ctags.

@b4n b4n force-pushed the name-escape branch 3 times, most recently from a79ac51 to f24820b Compare November 1, 2017 07:24
@b4n
Copy link
Member Author

b4n commented Nov 1, 2017 via email

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.006%) to 85.992% when pulling f24820b on b4n:name-escape into 66fcc1d on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.006%) to 85.992% when pulling f24820b on b4n:name-escape into 66fcc1d on universal-ctags:master.

@masatake masatake assigned b4n and unassigned masatake Nov 1, 2017
@masatake
Copy link
Member

masatake commented Nov 1, 2017

@b4n, thank you. LGTM.

I think we may have to define version 3 of tags file format. The current readtags command may not work
well with the output of u-ctags. Though we have

!_TAG_OUTPUT_MODE	u-ctags	/u-ctags or e-ctags/

in tags file but the data structure of readtags doesn't have field for storing tags output mode. Just format version number can be stored to the data structure...

@techee
Copy link
Contributor

techee commented Nov 1, 2017

@b4n Sorry for the delay. I had a look at the patches and they look good to me (the HTML part is trivial). I had a look at the rest as well and makes sense - I'm not so familiar with the ctags file format and don't know what other applications expect there to be but escaping ! and seems good enough to fix the problem.

@b4n
Copy link
Member Author

b4n commented Nov 15, 2017

@techee thanks :)

@masatake Should I try and update readtags to handle the amendment to the format? Should I update some docs too maybe?

@b4n
Copy link
Member Author

b4n commented Nov 15, 2017

I just added a commit updating the format documentation, and a another one (76a2697) with a proposal for a possibly more sensible escaping scheme. The rationale here is that we already have \xHH-style escaping (so there's little to no modification for a client to read it), and that this kind of escaping is fairly common for programmers anyway (so it's easier on readers). But of course I can drop it if you prefer the previous state.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.05%) to 86.05% when pulling 76a2697 on b4n:name-escape into 66fcc1d on universal-ctags:master.

@masatake
Copy link
Member

masatake commented Nov 15, 2017

@masatake Should I try and update readtags to handle the amendment to the format? Should I update some docs too maybe?

REALLY, REALLY, appreciated. Especially about docs/format.rst . Any level of helps are welcome.
Currently we are using _TAG_OUTPUT_MODE ptag for notifying the output format to client tools:

[yamato@master]~/var/util-linux% u-ctags -o - --extras=+p sys-utils/lsns.c
u-ctags -o - --extras=+p sys-utils/lsns.c
!_TAG_FILE_FORMAT	2	/extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED	1	/0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_MODE	u-ctags	/u-ctags or e-ctags/
!_TAG_PROGRAM_AUTHOR	Universal Ctags Team	//
...

I have not wanted to updating _TAG_FILE_FORMAT to 3 because it may require more work for thinking about compatibilities.
However, I recognize we cannot escape from thinking about them because the data structure defined in readtags.[ch] doesn't have a field for keeping _TAG_OUTPUT_MODE. It has only a field for _TAG_FILE_FORMAT.

We have to think about version 3 and define it.
We have to do 3 things.

  1. make ctags command emits !_TAG_FILE_FORMAT 3 by default.
    Version 2 output mode must be supported, too. It means --format=level option must be updated, too.
    u-ctags of --output-format=u-ctags|e-ctags|etags|xref|json option means version 3 format.
    e-ctags of the option means version 2.
    Test cases are needed.

  2. Extend readtags to accept version 3 format.

  3. Update document.
    Both docs/format.rst and man/ctags.1.rst .
    ctags.1.rst will be written based on format.rst.

The most important one is docs/format.rst.

@b4n
Copy link
Member Author

b4n commented Nov 24, 2017

@masatake I just added a commit updating readtags so it can handle escape sequences. As it's done, it handles them both in the "query" string (I did not touch QUALIFIER part) and in the tag name. AFAICT readtags didn't handle any escaping before, not even the Exuberant-CTags ones, so I had to add the whole idea preprocessing might be needed.

I'm not sure it's the best way to do it, but it was the easiest without rewriting a lot of code. Please test and comment. BTW, this as said readtags was already not handling some valid output, so I don't think fixing it necessarily has to be part of this PR, so I can move it out to another one after if you prefer.

@b4n
Copy link
Member Author

b4n commented Nov 24, 2017

@masatake

man/ctags.1.rst

Does that mention the escaping anywhere? I couldn't find it at all, so I didn't update it yet.

@b4n
Copy link
Member Author

b4n commented Nov 24, 2017

Rebased against master to fix conflicts; no changes.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.008%) to 86.066% when pulling d349eee on b4n:name-escape into 157918c on universal-ctags:master.

@masatake
Copy link
Member

Sorry to be late. I still be in busy. But I don't want to be a blocker.

I'm introducing ctags-tags.5 man page. It is dedicated to explain tags file format.
About version 1, and 2, I will move description in "TAG FILE FORMAT" section as "overview".
I will copy the contents
from http://docs.ctags.io/en/latest/format.html to ctags-tags.5 man page as "detail".

What we have done is this project will be explained as version 3 format.

@b4n, please look at docs/format.rst ? escaping is explained in the file. As I wrote I will move the
content of format into ctags-tags.5 man page. However, it is not ready yet. So I would like you to
update docs/format.rst.

When I modified format.rst, I took two approaches. One is putting EXCEPTION inline. Here EXCEPTION
means incompatible output of Universal-ctags. Another one is Exceptions section at the end of document. I would like you to add what you did in this PR in the two approaches.
They will be the main part of description of version 3 format in the future.

@b4n
Copy link
Member Author

b4n commented Nov 24, 2017

@masatake no worries, take your time.

@b4n, please look at docs/format.rst ? escaping is explained in the file.

I know, and I I updated it already (maybe not all of it, I'll check with your last remark), but as you mentioned needing to update the man page I checked it, and couldn't find anything in it. So if it's just for future, all is good.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.008%) to 86.066% when pulling 87ceba7 on b4n:name-escape into 157918c on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.008%) to 86.066% when pulling d7fd404 on b4n:name-escape into 157918c on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.8%) to 84.25% when pulling 185fdb9 on b4n:name-escape into 157918c on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.8%) to 84.25% when pulling 7ebe53b on b4n:name-escape into 157918c on universal-ctags:master.

This is more in line with the other escaping rules already present than
adding 2 new specific and uncommon sequences.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.008%) to 86.066% when pulling 7195e6a on b4n:name-escape into 157918c on universal-ctags:master.

@b4n
Copy link
Member Author

b4n commented Nov 25, 2017

OK, I updated the Exceptions section of format.rst so it's up-to-date.

I removed the changes to readtags and plan on submitting them in another PR, because of the testing mess I couldn't yet figure out for Windows. As said earlier, I don't think it's a requirement for this very PR, so sounds reasonable as a separate one (esp. once this one is dealt with, although I could first submit the part that is unrelated to the changes here if you prefer).

@b4n b4n merged commit 7195e6a into universal-ctags:master Nov 28, 2017
b4n added a commit that referenced this pull request Nov 28, 2017
Escape leading spaces in tag names instead of stripping them
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants