Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question mark character not treated as special in path segment #39

Closed
dlong500 opened this issue Feb 26, 2021 · 23 comments
Closed

Question mark character not treated as special in path segment #39

dlong500 opened this issue Feb 26, 2021 · 23 comments

Comments

@dlong500
Copy link

What were you expecting to happen?

The string base/folder?/file1.txt should return base as the non-magic parent path

What actually happened?

globParent('base/folder?/file1.txt') returns base/folder?

Please provide the following information:

  • OS & version: Windows 10 20H2, also Ubuntu 16.04
  • node version (run node -v): 15.10
  • yarn version: 1.22.5

Additional information

I've been trying to track down the source of my issue where the del library wasn't deleting expected folders based on a pattern like the one mentioned above (after del switched to using fast-glob). This led me to follow the dependency tree from del to fast-glob to glob-parent.

Unless I'm completely mistaken about how this library works I would expect a question mark to be treated like other special characters. For example, globParent('base/folder*/file1.txt') returns base.

@phated
Copy link
Member

phated commented Mar 6, 2021

@dlong500 thanks for investigating this! It doesn't seem like we have a test for this case. If you'd want to send a test and possibly a fix, I can get that in our next major release. We are actually making a few fixes related to the globs not catching certain things, and we need to bump a major just in case someone relying on the mistaken behavior.

@dlong500
Copy link
Author

dlong500 commented Mar 7, 2021

@phated I can take a stab at it. So are you in agreement that my expected behavior is correct? It looks like the issue stems from is-glob returning false, and that module does have a strict flag that can be disabled. If I replace isGlob(str) with isGlob(str, { strict: false }) I get the expected result.

@phated
Copy link
Member

phated commented Mar 7, 2021

@dlong500 I'm not exactly sure. The question mark seems to be an "extglob" which usually only matters when prefixed to a glob type pattern with parens.

Why do you think that question mark should be considered a globbing pattern when not part of an extglob? I'd need something to reference to determine.

I don't think we should turn off strict mode.

@dlong500
Copy link
Author

dlong500 commented Mar 7, 2021

I'm going based off of the syntax specified in the fast-glob documentation where a question mark is supposed to "match any single character except slashes". The del module that I had been using for a while was working fine to delete files based on a similar pattern (something like base/folder?/file?.txt), and then they switched from glob to fast-glob in del v5 at which point things broke in my app.

Maybe there is another way I can specify the same type of pattern using extglobs or something, but all I know is it was working with del version previous to v5, and the fast-glob documentation also seems to indicate being able to use question marks as a single character wildcard. You can see more details on my initial issue in the del repo that I linked to up above.

@phated
Copy link
Member

phated commented Mar 7, 2021

Hmm, well major versions are supposed to contain breaking changes and if they weren't following actual glob "specs" (for as much as there is one) then I definitely could see how that wouldn't be compatible between different libraries. We've tried to stick to libraries that follow as closely to bash as possible.

@phated
Copy link
Member

phated commented Mar 7, 2021

According to the is-glob tests at https://github.com/micromatch/is-glob/blob/master/test.js#L54-L56, they don't expect questionmark to be a glob. However, in the picomatch docs at https://github.com/micromatch/picomatch#basic-globbing, they do treat it as you described.

I think there could be a case for getting that matcher added to is-glob's regexp. Can you open a PR adding the test and fix to the strict regexp and we can discuss then I can get it merged and released?

@dlong500
Copy link
Author

dlong500 commented Mar 7, 2021

Are you saying I should open a PR in the is-glob repo? I can try that, though I'm not sure how people would feel about changing the strict behavior since they have a flag to disable strict mode (which is specifically there to allow for flexibility since there isn't really an actual glob spec).

Maybe there should be a strict option in glob-parent that corresponds to the is-glob strict option? Of course that means modules like fast-glob would need to also add such an option so it could be cascaded down through the dependencies...

@phated
Copy link
Member

phated commented Mar 7, 2021

I'm a maintainer on is-glob. I checked the strict regexp and it just seems to be missing the generic questionmark path.

@dlong500
Copy link
Author

dlong500 commented Mar 8, 2021

I've had a tiny bit of time to work on this and it's easy to add the question mark character to the strict regex but that trips up a lot of the tests related to invalid regex patterns and capture groups. Without knowing the history of why all the existing tests are there I don't want to be messing with a bunch of other tests. What would you recommend is the best path forward?

@phated
Copy link
Member

phated commented Mar 8, 2021

Yeah, I'd guess it would be a breaking change and the tests would need to be updated. We are in the process of making a breaking release to this library, too, so it could be upstreamed here.

I was mostly asking for the implementation/breaking tests proposed as a discussion point. I can update all the broken tests if everyone agrees that ? should be treated as a glob over there.

@jonschlinkert
Copy link
Contributor

jonschlinkert commented Mar 9, 2021

But that result is correct. I assume that if you want ? to work the way it works in regular expressions, you are trying to conditionally match the r in folder in this example. The correct way to achieve this is with a brace expression or an extglob:

'base/folde{r,}/file1.txt'
'base/folde?(r)/file1.txt'

This should also return the correct result with is-glob. The decision to not treat ? as a special character was based on bash glob, and other glob implementations, and the fact that question marks are valid path characters in most file systems besides Windows.

I don't usually have strong opinions on these kinds of decisions, I just follow the logic and read up on prior work to see where that leads me. If the consensus is that ? should be a special character, then we can start implementing that. However, it's worth considering that whenever we decide to go against established conventions we're potentially taking on lots of other edge cases that other people have dealt with or avoided over the years.

Let me know your thoughts.

@phated
Copy link
Member

phated commented Mar 9, 2021

@jonschlinkert I'm fine with whatever we decide, but I'm curious why picomatch says that ? is a globbing pattern.

@dlong500
Copy link
Author

dlong500 commented Mar 9, 2021

I expect there can be some other way to accomplish what I'm after and I'm perfectly happy to leave things as-is if I can adjust my patterns/regex to work. What's confusing is the del library was working fine with the question mark representing a simple single character (non-extglob) wildcard prior to v5.

Here is a more detailed example of the type of thing I've been doing that works with del v4:
del('data/100-123?_files/@(0|1|2|3)')

This would delete folders like:

data/100-123A_files/0/
data/100-123A_files/1/
data/100-123A_files/2/
data/100-123A_files/3/
data/100-123B_files/0/
data/100-123B_files/1/
data/100-123B_files/2/
data/100-123B_files/3/

but not folders like:

data/100-123A_files/4/
data/100-123AX_files/0/

If I'm reading the docs correctly for extglobs, maybe I can achieve the same thing using a pattern like:
data/100-123@(?)_files/@(0|1|2|3)

Is that correct?

@jonschlinkert
Copy link
Contributor

I'm fine with whatever we decide, but I'm curious why picomatch says that ? is a globbing pattern.

Hmm, good point. Sometimes after I go back and forth on something so many times I forget what the final decision was. I'll go back and refresh myself on what the expected behavior, then I'll come back with my thoughts.

@dlong500 sorry for any confusion if I got this wrong. I'll also read over #39 (comment) and see if I have any suggestions or if this is indeed a bug.

@jonschlinkert
Copy link
Contributor

jonschlinkert commented Mar 9, 2021

Here is a refresher on what ? is supposed to do in various contexts:

  • ?(a|b|c) - an extglob that conditionally matches any of the characters inside the parentheses
  • \\? - matches a literal ?
  • ? matches any single character.

To conditionally match any character or nothing, you could do {,?}.

Specific number of characters

Since you can match a specific number of characters using ?, to match the paths in your example, you could do the following:

data/100-123{,?,??}_files/[0-3]/

min/max

However, you can use the following "trick" to use a regex quantifier to match any number of characters:

picomatch('data/100-123?\\{0,3}_files/[0-3]/', { unescape: true });

Essentially, the \\ before the first curly brace prevents picomatch's parser from evaluating the brace expression so that it will be compiled as literal characters, and passing the unescape option tells picomatch to remove the \\ when compiling.

Please let me know if this works for you.

edit: I just updated the globs to use [0-3] instead of @(0|1|2|3), but you can use either, or you can even do {0,1,2,3}. (FWIW, all three syntaxes work so that we can selectively toggle certain features on/off when there are conflicts).

@phated
Copy link
Member

phated commented Mar 9, 2021

Here is a refresher on what ? is supposed to do in various contexts:

  • ?(a|b|c) - an extglob that conditionally matches any of the characters inside the parentheses
  • \? - matches a literal ?
  • ? matches any single character.

So, does this mean that there is indeed a bug in is-glob because ? is always treated as a literal (without the escaping)?

@dlong500
Copy link
Author

dlong500 commented Mar 9, 2021

@jonschlinkert Thanks for your time looking into this, and sorry to be dense, but I'm still a bit confused by the examples you gave. I'm specifically trying to match a single character (not zero or more than one) in a path segment.

I want to match folders like:

data/100-123A_files/0/
data/100-123A_files/1/
data/100-123A_files/2/
data/100-123A_files/3/
data/100-123B_files/0/
data/100-123B_files/1/
data/100-123B_files/2/
data/100-123B_files/3/

But NOT the following folders:

data/100-123_files/0/
data/100-123AX_files/0/

Matching the 0-3 in the subfolders is working fine with an extglob (and as you mention can be done with other methods as well), but what is the simplest way to match the middle part of the path (100-123?_files) where the question mark represents exactly one character.

edit I tried using an extglob to match a single character like 100-123@(?)_files but that doesn't seem to match what I'm after using the fast-glob module.

@jonschlinkert
Copy link
Contributor

jonschlinkert commented Mar 9, 2021 via email

@dlong500
Copy link
Author

dlong500 commented Mar 10, 2021

I'm still perplexed about how to match only a single character (not zero or more than one) in a path segment. Using brace expansion requires a comma which means you can have it match:
zero and one character: {,?}
one and two characters: {?,??}
As far as I can tell the comma requirement prevents brace expansion from matching only a single wildcard character.

In theory using an extglob such as @(?) should match only one wildcard character but that doesn't seem to work with fast-glob using a string like 100-123@(?)_files. The only way I can get it to work is to put the entire path segment inside of the extglob like @(100-123?_files). I wonder about the performance of putting all that inside the extglob though. I'll stop bugging you after this. This single character wildcard thing is just frustrating me a bit.

EDIT Ok, I'm stupid. I realize now that you were using regex quantifiers--not brace expansion--in your picomatch example. The only thing I'll have to figure out is if that will work with the del library since I'm not directly using picomatch. I'm not sure if I can pass the unescape flag there or not. I'll do some testing tonight.

UPDATE So it doesn't look like I can pass any options from del (or even from fast-glob) to set the unescape flag in micromatch/picomatch. This is definitely frustrating. I wish there really was a definitive glob spec as it might make implementations a lot easier to learn and more compatible. As far as I can tell the only way I can get things to work with del or fast-glob as things stand now is by using the method I mentioned above where an entire path segment is inside an extglob like @(100-123?_files). Any feedback on whether that is a bad idea or not?

@jonschlinkert
Copy link
Contributor

Sorry for the late response. I'm not sure what else I can do to help but I'm happy to assist if you

Any feedback on whether that is a bad idea or not?

Probably not. It seems fine IMHO, but I wouldn't know without actually testing the generated regex against huge sets of files.

Meaning, like all code, it depends. Picomatch generates a regular expression from a user-defined pattern, so skepticism is always a healthy and natural default position. Here are a couple of things that might speed up your matches (you're probably already doing these things but in case it helps you or anyone else who sees this discussion):

  1. Compile once - don't recompile the regex on every match. Create a "matcher" function outside of any loops or iterators and call that function on each file.
  2. A/B test your patterns - when I am concerned about a regex being poorly optimized, I sometimes add the following code to a file then do informal benchmarks with a few different globs, including one that matches everything /./ and one that matches nothing /$^/, to see if the glob I'm using is causing performance issues.
    const start = new Date();
    process.on('exit', () => console.log(`Time: ${Date.now() - start}ms`));
    

Since globs are used in scripts and code that's executed inside CI/CD pipelines, thinking about performance is a good thing.

I wish there really was a definitive glob spec as it might make implementations a lot easier to learn and more compatible

Lol indeed. I wanted to start on this and I never got around to it. My goal was to create this guide to globbing then use that to inspire the major concepts that would need to be addressed in the spec.

There are many, many nuances and different similar implementations. Not just in terms of matchers (bash glob, wildmat, extglob, grep, gitignore, and many others -- not to mention regex and awk), but also differences in implementations across different languages. There are also many features that make sense when globbing files with JavaScript that do not make sense with bash, and vice versa.

If anyone wants to help tackle this, I'd be thrilled to help you get started.

In summary, it seems like the main blocker is how options are passed to fast-glob and del. I'm not very familiar with the internals of either, but I'd be happy to answer questions or help in whatever way I can.

@phated
Copy link
Member

phated commented Mar 17, 2021

I feel like we are doing 2 separate things in this issue right now.

  1. @jonschlinkert is trying to solve @dlong500's problem with no changes to this library or is-glob
  2. I am trying to figure out if this library and is-glob should actually be treating ? as a glob pattern that matches a single character (it currently rejects ? unless it is part of an extglob).

@jonschlinkert
Copy link
Contributor

jonschlinkert commented Mar 18, 2021 via email

@phated
Copy link
Member

phated commented May 3, 2021

I still am not sure if ? should actually be handled as a glob character by this library, but I'm going to close this out. If anyone has a more concrete example of why & how it should work, please comment here and we'll re-open!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants