Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Update readtags to support escape sequences and fix & improve its testing #1605

Closed
wants to merge 30 commits into from

Conversation

b4n
Copy link
Member

@b4n b4n commented Nov 29, 2017

Following what I started in #1580, this PR aims at updating readtags to properly support the current format, as well as fixing & improving its automated test cases (roundtrip).

@b4n b4n added the readtags label Nov 29, 2017
@b4n b4n requested a review from masatake November 29, 2017 23:48
@b4n b4n changed the title WIP: Update readtags to support escap sequences and fix & improve its testing WIP: Update readtags to support escape sequences and fix & improve its testing Nov 29, 2017
@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.325% when pulling 2b51abe on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.325% when pulling 0ac3a99 on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.325% when pulling 0d502b3 on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.325% when pulling 0d502b3 on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

* don't stop all processing when one input line has no escaping
* don't stop expansion after an unhandled escape sequence
@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.328% when pulling 168eb2a on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

@b4n
Copy link
Member Author

b4n commented Dec 2, 2017

OK I don't know how to avoid the issue yet, e.g. :: leads to readtags receiving ;, leading /es to C:/…, etc. I tried escaping with \ and ^, doesn't seem to help.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.328% when pulling 0ed2741 on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.328% when pulling 5d05ee6 on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 84.328% when pulling 341db19 on b4n:readtags-update-and-test into 248cffc on universal-ctags:master.

@b4n
Copy link
Member Author

b4n commented Dec 2, 2017

OK I improved the situation with the last commit after finding https://github.com/msys2/msys2/wiki/Porting#user-content-filesystem-namespaces, but it's still not perfect: there's encoding issues, and some exclude patterns break legitimate stuff (like * and /). All in all, there's still 14 tests failing on MSYS2.

I don't really know what I can do, but add the ability for readtags to read the tag name from a file instead of the command line. It's not very nice because I can't really see a real use for it, but it would solve the whole problem at once I guess.

@masatake
Copy link
Member

(I will work on this item in the new year day.)

masatake added a commit to masatake/ctags that referenced this pull request Dec 26, 2017
…put-format=u-ctags|e-ctags option

To record meta characters like tab in a name, we have designed new file format.

The file format itself was already introduced. TAG_OUTPUT_MODE pseudo tag was used
as marker telling which format is used in a tag file: Exuberant-ctags(e-ctags)
compatible format or Universal-ctags(u-ctags) file format. u-ctags was the default
format. A user could choose the format with --output-format option.

However, we recognize that the hack using TAG_OUTPUT_MODE pseudo doesn't work well
with readtags command. A data structure of readtags doesn't have a field for
recording TAG_OUTPUT_MODE. It means readtags cannot switch the file format version
dynamically. The pseudo tags can be used in readtags for the purpose, recognizing
the file format is only TAG_FILE_FORMAT.

This commit introduces following SELF INCOMPATIBLE changes:

1. --output-format=u-ctags and --output-format=e-ctags are removed.
   They are unified to --output-format=ctags.

2. TAG_OUTPUT_MODE pseudo tag is removed.
   Instead TAG_FILE_FORMAT=3 is introduced.
   TAG_FILE_FORMAT=2 is compatible with e-ctags.
   TAG_FILE_FORMAT=3 is default. A user can choose one of the formats
   with --format=3 or --format=2.

The new version 3 may break many client tools.
If you are just user of such client tools, use --format=2 option to generate
tags file with compatible file format.
If you are a developer of such client tools, please, consider to support
--format=3.

The rest of work are:

1. more Tmain test cases,
2. more precise documentation about the format 3.
3. updating readtags to support format 3.

About readtags, the most of all work is done by @b4n.
I must merge his changes(universal-ctags#1605).

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this pull request Dec 27, 2017
…put-format=u-ctags|e-ctags option

To record meta characters like tab in a name, we have designed new file format.

The file format itself was already introduced. TAG_OUTPUT_MODE pseudo tag was used
as marker telling which format is used in a tag file: Exuberant-ctags(e-ctags)
compatible format or Universal-ctags(u-ctags) file format. u-ctags was the default
format. A user could choose the format with --output-format option.

However, we recognize that the hack using TAG_OUTPUT_MODE pseudo doesn't work well
with readtags command. A data structure of readtags doesn't have a field for
recording TAG_OUTPUT_MODE. It means readtags cannot switch the file format version
dynamically. The pseudo tags can be used in readtags for the purpose, recognizing
the file format is only TAG_FILE_FORMAT.

This commit introduces following SELF INCOMPATIBLE changes:

1. --output-format=u-ctags and --output-format=e-ctags are removed.
   They are unified to --output-format=ctags.

2. TAG_OUTPUT_MODE pseudo tag is removed.
   Instead TAG_FILE_FORMAT=3 is introduced.
   TAG_FILE_FORMAT=2 is compatible with e-ctags.
   TAG_FILE_FORMAT=3 is default. A user can choose one of the formats
   with --format=3 or --format=2.

The new version 3 may break many client tools.
If you are just user of such client tools, use --format=2 option to generate
tags file with compatible file format.
If you are a developer of such client tools, please, consider to support
--format=3.

The rest of work are:

1. more Tmain test cases,
2. more precise documentation about the format 3.
3. updating readtags to support format 3.

About readtags, the most of all work is done by @b4n.
I must merge his changes(universal-ctags#1605).

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this pull request Dec 29, 2017
…put-format=u-ctags|e-ctags option

To record meta characters like tab in a name, we have designed new file format.

The file format itself was already introduced. TAG_OUTPUT_MODE pseudo tag was used
as marker telling which format is used in a tag file: Exuberant-ctags(e-ctags)
compatible format or Universal-ctags(u-ctags) file format. u-ctags was the default
format. A user could choose the format with --output-format option.

However, we recognize that the hack using TAG_OUTPUT_MODE pseudo doesn't work well
with readtags command. A data structure of readtags doesn't have a field for
recording TAG_OUTPUT_MODE. It means readtags cannot switch the file format version
dynamically. The pseudo tags can be used in readtags for the purpose, recognizing
the file format is only TAG_FILE_FORMAT.

This commit introduces following SELF INCOMPATIBLE changes:

1. --output-format=u-ctags and --output-format=e-ctags are removed.
   They are unified to --output-format=ctags.

2. TAG_OUTPUT_MODE pseudo tag is removed.
   Instead TAG_FILE_FORMAT=3 is introduced.
   TAG_FILE_FORMAT=2 is compatible with e-ctags.
   TAG_FILE_FORMAT=3 is default. A user can choose one of the formats
   with --format=3 or --format=2.

The new "version 3" may break many client tools.
If you are just user of such client tools, use --format=2 option to generate
tags file with compatible file format.
If you are a developer of such client tools, please, consider to support
--format=3.

The rest of work are:

1. more Tmain test cases,
2. more precise documentation about the format 3.
3. updating readtags to support format 3.

About readtags, the most of all work is done by @b4n.
I must merge his changes(universal-ctags#1605).

docs/news.rst are rewritten by @b4n and @codebrainz.

Ussing stringfy for printing default version of file format in --help
message is suggested by @b4n.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
@masatake masatake removed their request for review December 29, 2017 16:45
@masatake masatake self-assigned this Dec 29, 2017
@masatake
Copy link
Member

I will take time for this important one.

@masatake
Copy link
Member

masatake commented Jan 3, 2018

@b4n, I'm reading your patches.
I will devide your patches into two groups; one group(A) gathers the part of patches handling escape sequence, and the other group(B) gathers the part of patches general purpose, nothing about escape sequence. I will merge B into master branch first.

misc/roundtrip Outdated
@@ -112,8 +114,10 @@ OLD_IFS="$IFS"
IFS='
'
set -x
for tags in $(find "$UNITS" -name expected.tags); do
for name in $(sed -e 's/^\([^ ]*\) .*/\1/' "$tags"); do
tagfiles="$(find "$UNITS" -name expected.tags)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first and the last double quote characters are needed.

Does the following code do the same?

tagfiles=$(find "$UNITS" -name expected.tags)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, doesn't matter

misc/roundtrip Outdated
for name in $(sed -e 's/^\([^ ]*\) .*/\1/' "$tags"); do
tagfiles="$(find "$UNITS" -name expected.tags)"
for tags in $tagfiles; do
tagnames="$(sed -e 's/^\([^ ]*\) .*/\1/' "$tags")"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as the last my comment.

@masatake
Copy link
Member

masatake commented Jan 3, 2018

I will merge following changes to master branch.
(The commit ids are changed because I make your branch be rebased on master.)

52c638a8 circleci: Install find for roundtrip
c079b0cc roundtirp: Improve message for failing tests
dc0bce22 readtags: Build with coverage support if requested
c7817168 roundtrip: Cleanup and fix so it actually does check something
85358ece roundtrip: Fix running on non-C locales
16260d46 testing: Define $(SILENT)
85525454 testing: Don't unconditionally silent test rules
                 roundtrip: Exit on command failure

misc/roundtrip Outdated
@@ -53,7 +53,7 @@ expandEscapeSequences()
:again
s/\\/__BACKSLASH__/
T out
s/__BACKSLASH__\\/__LASTBACKSLASH__/; t again
s/__BACKSLASH__\\/__LITBACKSLASH__/; t again
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you tell me why this change is needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not needed per se, but I changed the name for clarity as it's not only used for the last backslash but for all literal backslashes

@masatake
Copy link
Member

masatake commented Jan 6, 2018

@b4n, my understanding is that many changes are about \n and \t.
If we use GNU sed on MacOSX, can we make changes simpler?

@b4n
Copy link
Member Author

b4n commented Jan 6, 2018

@b4n, my understanding is that many changes are about \n and \t.
If we use GNU sed on MacOSX, can we make changes simpler?

Not really, having fixups here is just that I'm too used to GNU sed so I don't think about its extensions in supporting escape sequences; but ultimately the changes are the same just with a slightly different syntax (using the literal value rather than the GNU sed escape sequence).

@masatake
Copy link
Member

masatake commented Feb 6, 2018

This is privte note for reviewing task. I have been thinking about code for comparison.

My testing:

[yamato@master tmp]$ /usr/bin/ctags --version
/usr/bin/ctags --version
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Aug  2 2017, 19:41:49
  Addresses: <dhiebert@users.sourceforge.net>, http://ctags.sourceforge.net
  Optional compiled features: +wildcards, +regex
[yamato@master tmp]$ cat /tmp/foo.unknown 
cat /tmp/foo.unknown 
_a	b_ def
[yamato@master tmp]$ /usr/bin/ctags --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/foo.unknown 
/usr/bin/ctags --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/foo.unknown 
a	b	/tmp/foo.unknown	/^_a	b_ def$/;"	k
[yamato@master tmp]$ u-ctags --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/foo.unknown 
u-ctags --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/foo.unknown 
a\tb	/tmp/foo.unknown	/^_a	b_ def$/;"	k
[yamato@master tmp]$ u-ctags --format=2 --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/foo.unknown 
u-ctags --format=2 --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/foo.unknown 
a\tb	/tmp/foo.unknown	/^_a	b_ def$/;"	k

A tab character is put between a and b in input.unknown.

Let's see if a whitespace is placed instead of the tab char.

[yamato@master tmp]$ /usr/bin/ctags --format=2 --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/bar.unknown 
/usr/bin/ctags --format=2 --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/bar.unknown 
a b	/tmp/bar.unknown	/^_a b_ def$/;"	k
[yamato@master tmp]$ u-ctags --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/bar.unknown 
u-ctags --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/bar.unknown 
a b	/tmp/bar.unknown	/^_a b_ def$/;"	k
[yamato@master tmp]$ u-ctags --format=2 --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/bar.unknown 
u-ctags --format=2 --langdef=Z --regex-Z='/_(.*)_ def$/\1/k,kind,kinds/' --language-force=Z  -o - /tmp/bar.unknown 

e-ctags violates the v2 format. u-ctags --format=2 violates the format. u-ctags (version 3) violates the format.

I guess none seriously follows v2 format.
So thinking about following v2 format seriously in u-ctags doesn't make sense.
We should fix the output of v2 on-issue base.

Instead, we should refine v3 format as possible and make a reliable format.

o.k. About the tags generator side, I understand where I should go.

How about reader side?

I assume the original readtags works well with the output of e-ctags though the generator side doesn't follow the v2 format spec:-P.

So I think the original code of readtags code for handling e-ctags output as is.

My idea is that the code introduced by @b4n's patch should be activated for handling u-ctags v2 and v3 formats output only.

The good new is that tagFileInfo and TagFile data structures provide enough information to detect whether e-ctags, u-ctags v2 or u-ctags v3 generates the current output. In other word, using strcmp for e-ctags and tagcmp for u-ctags is my prefer.

Keeping the strcmp based code makes the source code (and test cases) complicated a bit.
This is trading off.


About v2 format.

format.rst which explains v2 tags format says:

{tagname}
	Any identifier, not containing white space..

However, as above output shows, e-ctags doesn't follow the v2 tags format at least using regex parser.

@masatake
Copy link
Member

masatake commented Feb 6, 2018

Comparing the output is the part of library. So I am conservative. However, about command line interface we can introduce changes, as we did in ctags command itself.

The question is whether we should print the result in escaped way or un'escaped way?
... This question is the out of scope of the changes. However, the output format has impacts of roundtrip test driver.

I will find more time for this issue.
Very sorry that it takes long time for merging this.

@b4n
Copy link
Member Author

b4n commented Feb 9, 2018

My idea is that the code introduced by @b4n's patch should be activated for handling u-ctags v2 and v3 formats output only.

As said, I think that my changes are backward compatible. And I think it is not only with the v2 format, but also the e-ctags format: the only thing that would risk being a problem is if e-ctags could output unescaped \es, in which case the updated code would erroneously try and decode it. However, IIRC e-ctags doesn't do so, in which case we should be safe.

@masatake
Copy link
Member

Thank you very much. I merged all your changes.

@masatake masatake closed this Feb 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants