[BUG] search strings containing umlaut fails to find any results #1535
Comments
This is likely a duplicate of #638. Is the search using U+0075 and U+0308 (a "u" followed by a combining diaeresis) while the filename uses U+00FC (a single "ü" character), or vice versa?
@tmccombs Yeah, it would be vice versa. macOS stores filenames in normalization form NFD (D for decomposed), so the actual filenames will have combining characters while most everything else uses the precomposed characters.
Oh, I guess my info is out of date. That's true for HFS+, but APFS is normalization-insensitive rather than actually normalizing. So file paths will use whatever normalization you used to create the file, but you can access it by other normalizations too (kinda like how Finder still uses NFD though.
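To make the two forms discussed above concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper names `code_points` and `utf8_hex` are made up for this illustration and are not part of fd or any library.

```python
import unicodedata

precomposed = "\u00fc"   # "ü" as a single code point (NFC form)
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS (NFD form)


def code_points(s: str) -> list[str]:
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


def utf8_hex(s: str) -> str:
    """Return the UTF-8 bytes of s as space-separated hex."""
    return s.encode("utf-8").hex(" ")


# Both strings render the same, but they are different sequences.
print(code_points(precomposed), utf8_hex(precomposed))
print(code_points(decomposed), utf8_hex(decomposed))
print(precomposed == decomposed)

# NFD decomposes and NFC composes, so each maps one form onto the other.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

A plain string or byte comparison treats the two as different, which is why a tool has to normalize explicitly if it wants them to compare equal.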
@tavianator I'm on a case-sensitive APFS. However, it seems I can't pipe to
What does
Just copy-pasting from the OP shows what's happening:
tavianator@graphene $ echo 'Bestätigung' | xxd
00000000: 4265 7374 61cc 8874 6967 756e 670a Besta..tigung.
tavianator@graphene $ echo 'bestätigung' | xxd
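00000000: 6265 7374 c3a4 7469 6775 6e67 0a best..tigung.
The difference (apart from the case of the first letter) is in the ä: the first string has "a" followed by the combining diaeresis U+0308 (bytes 61 cc 88), while the second has the precomposed U+00E4 (bytes c3 a4). I suspect if you manually search for the decomposed form, something like
$ fd $'besta\xcc\x88tigung'
it will find it.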
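The mismatch can be reproduced outside fd. The sketch below uses Python's `re` module as a stand-in for a regex search that does no normalization. The filename `Bestätigung.html` is a hypothetical example (the thread does not show the real name), and case-insensitive matching is assumed.

```python
import re
import unicodedata

# Hypothetical filename, stored decomposed ("a" + U+0308) like the first dump.
filename = "Besta\u0308tigung.html"
# Pattern as typed by the user, precomposed (U+00E4) like the second dump.
pattern = "best\u00e4tigung"


def matches(pat: str, name: str) -> bool:
    """Case-insensitive regex search with no Unicode normalization."""
    return re.search(pat, name, re.IGNORECASE) is not None


def nfd(s: str) -> str:
    """Decompose s into normalization form NFD."""
    return unicodedata.normalize("NFD", s)


print(matches("gung.html", filename))   # ASCII-only pattern
print(matches(pattern, filename))       # precomposed pattern, decomposed name
print(matches(nfd(pattern), filename))  # pattern decomposed by hand
```

Decomposing this pattern works only because it is a plain literal. A pattern containing regex operators needs more care.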
Yep, that's right! Cheers!
I agree that this is not a good user experience. Unfortunately, it is also a very difficult problem to solve. The library we use for regex doesn't support normalization, and probably won't anytime soon. See rust-lang/regex#404 (comment). The workaround suggested there, normalizing both the regex and the input, is much easier said than done. Normalizing all the filenames significantly hurts performance. And normalizing the regex isn't as straightforward as normalizing the regex's source string. For example, "ä?" would need to be converted to "(a\u0308)?". Perhaps the best path would be to have an option to transform the regex to accept either equivalent form. So, for example, ä would be transformed into "(ä|a\u0308)". I'm not familiar enough with Unicode to know how feasible that would be in general, or how to create those transformation tables.
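As a rough illustration of the "accept either form" idea, here is a Python sketch that rewrites each precomposed literal into an alternation of both forms. It is not how fd or the Rust regex crate works. The function name `accept_either_form` is invented for this example, it uses a non-capturing group `(?:...)` instead of `(...)` so existing group numbers are not shifted, and it deliberately ignores character classes, escapes, and patterns that are already decomposed.

```python
import re
import unicodedata


def accept_either_form(pattern: str) -> str:
    """Naively rewrite precomposed literals so both forms match.

    Assumption: the pattern has no character classes or escapes that
    contain precomposed characters; those would be broken by this rewrite.
    """
    out = []
    for ch in pattern:
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            # Grouping keeps a following quantifier such as "?" attached
            # to the whole character, not just the combining mark.
            out.append(f"(?:{ch}|{decomposed})")
        else:
            out.append(ch)
    return "".join(out)


rewritten = accept_either_form("best\u00e4tigung")
for name in ("Besta\u0308tigung.html", "Best\u00e4tigung.html"):
    print(re.search(rewritten, name, re.IGNORECASE) is not None)
```

Here the per-character table comes from `unicodedata.normalize`. A Rust implementation would need an equivalent source of decomposition data.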
I think the worst case here is character classes like
Here is a quick proof of concept for NFD-izing a regex: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=2ed2dfc074864bbffa5b85f685349d71
Perhaps we could just do the replacement on literals, and not worry about ranges?
Describe the bug you encountered:
[I] ➜ ~ fd gung.html
finds the file as expected.
But
[I] ➜ ~ fd bestätigung
doesn't find anything, even if run with --unrestricted.
This does not seem to be a general Unicode issue, as files and folders with emoji in their names are found properly.
Describe what you expected to happen:
Same output as first command
What version of fd are you using?
fd 9.0.0
Which operating system / distribution are you on?
Great application nevertheless! Love it!