-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trailing whitespace not removed after max_filename_length truncation #496
Comments
Thanks for pointing this out. It is indeed an issue. Specifically, in library.py, we call |
I think if you only call sanitize_path at the end, it could result in replacements which take the filename size back over the limit. One option could just be to remove trailing spaces from filenames after truncation. Other than that, perhaps a loop which cycles over sanitize_path and truncate until the filename is stable? |
Huh, that's a great point. Maybe special-casing the whitespace removal (i.e., don't make it part of the "replace" option) is the best tactic here. |
I made a temporary fix for this today by changing:
to
in util/init.py: truncate_path This would probably work as a permanent fix (unless you expect other characters to cause similar issues after path truncation). |
Thanks for the pointer, but yes, we also need to fix this for other replacements (e.g., trailing dots) |
Hello, Briefly, we are confused about how whitespace is left over after the truncation. In the python prompt, we tried to reproduce the bit of code that does the truncating (in truncate_path, in beets/util/init.py) with dummy variables: length = 4 # if MAX_FILENAME_LENGTH = 4 Here, ‘bin’ and ‘usr’ are less than 4 and have no trailing whitespace, and ‘funnnnnnnnnn’ is greater than 4 and gets truncated to ‘funn’ with no trailing whitespace. The only way we can generate trailing whitespace with this snippet of code is if one of comps already has trailing whitespace; e.g., if max length is 4 and c in comps is ‘a ‘, it truncates to ‘a ‘. But this possibility is unlikely, however, since Adrian previously stated that sanitize_path is called before truncate_path. Furthermore, these examples only consider whitespace -- how do “other replacements (e.g., trailing dots)” make their way into truncate_path if they go through sanitize_path first? Doesn't sanitize_path get rid of all the regular expressions in CHAR_REPLACE? Also, is there a way we can get detailed instructions for reproducing the bug? Thanks in advance! |
If you have a look at the destination function in library.py, the general process for generating the output filename is as follows:
Now, during step 3, only whitespace and dots at the end of each component are removed. If step 4 truncates the components, then the end of each component will be at a different point, and characters previously in the middle of the string may be moved to the end. If one of those characters is a dot or a space, then the bad character sequence is once again in the path. For example, the filename "/An Artist/2008. An Album/005. A Track" is perfectly legal, and would pass through sanitization unchanged. If you truncate it with a maximum length of 5, then you get "/An Art/2008./005. " which contains illegal characters at the end of the middle and final components. |
That's an excellent summary, @LordSputnik. This bug arises because of the order of the operations on the path. @dnnovo, perhaps a good first step would be to write a small test case that produces this behavior following @LordSputnik's recipe. |
Thank you for the very helpful summary, @LordSputnik! @sampsyo, we will write a small test as our first step. Thank you both. |
We are still working on making a test case that produces this behavior following @LordSputnik's recipe. It would be helpful to get instructions that are even more detailed, since we are still new to the software. It may be premature, but we would like to propose a fix: Since the only characters that should manifest from the behavior outlined by @LordSputnik are dots and whitespace and since they should only be on the right-hand side of the string, changing
to
might do the trick. rstrip() takes in characters to strip away from the right hand side of the string. For instance,
So in the example @LordSputnik illustrated
should truncate to (with max_filename_length = 5)
Thanks again for the help! |
You're right to think of On constructing a test case: how far have you gotten and what can I help explain in more detail? I'd be happy to help. |
Quick question to whomever. Why allow the user to entirely disable the default regex replacements in place of their own? I understand for customization purposes, but isn't it highly recommended in the documentation to keep the defaults because of the multi OS wide illegal characters. I was just wondering how a user could could benefit from not using them. I only ask because the way i was heading in fixing this bug i was thinking of making those particular regex default replacements immutable (still keeping the user defined replacements of course.) Thanks for the help! |
Some people do want this for purely aesthetic reasons. For example, if you know you're on Unix and never have to deal with Windows, you might want to name your artist folder "M.I.A." instead of "M.I.A_", which admittedly looks weird. |
Sorry about that Adrian. After reading the docs more thoroughly i saw your example you just mentioned above. A couple follow up questions though. For one (i probably should have stated this first) i'm actually still having trouble reproducing the bug so any help in that direction would be greatly appreciated. So far i've modified the max file length size to 5 (as suggested by LordSputnik) then uninstalled and reinstalled and confirmed that my change took hold by checking the modified file in the beets directory in the python site-packages. Then i attempt to import using the file structure suggested by LordSputnik "/An Artist/2008. An Album/005. A Track" which afterwards i see no changes. Hopefully that's clear if not i can definitely provided more details. Another more hypothetical question: Could you do something like this: Run the sanitize path (with either the default replacements or the user defined ones), then run truncate, then run the sanitize path one last time but with the condition that if there is user defined replacements make sure they are 1:1 replacements in order to keep the size. This is all hypothetical of course. Thanks again for any help provided! |
I added a unit test. See the above commit; the comments should illuminate the issue. |
Sorry for the delayed response. And thanks for the unit test. That helped a ton in me being able to recreate the issue. I'm having trouble of thinking of hypotheticals so i was wondering what you might think of the modified code segment below. This would also include some extra code in the destination function which would just be an additional call to sanitize path (with the optional passThrough parameter set equal to 2) after the call to truncate. I would have tested it myself with the test suite but am currently having difficulty running testall.py in pycharm because of an enum error. Sorry again if this is completely off base. This problem is definitely a little more complex then i first assumed. Thanks again for any help! |
Can you please explain the logic behind the The enum error is probably because you've installed the wrong enum package. You want |
Of course! It would seem though that you've guessed my logic. Essentially my idea is that in the second pass through the only items being replaced are less than or equal to (this needs to be changed in my above code) 1-character so the max length doesn't go over again. I think i understand what you're saying but could you give me an example potentially? As for the enum error i imagine you are correct and i'll attempt your suggested fix and report back. Sorry again for my slow uptake. I'm very much a novice when it comes to python. |
Here's the issue: what if there's a replacement that says It seems to me like the best strategy may be closer to doing truncation first, then replacement. Of course then there's a potential problem with replacements that then make the string longer, but perhaps we should just disallow those when the filename is long? Or some other fallback policy? And no worries about being a novice; everyone starts somewhere! We're happy to help. And that particular issue is legitimately confusing. |
That's what i thought you may have meant. I've created a temporary solution to that particular case (see image below.) Though i'm now stuck trying to think of how to solve the exact opposite case than your example above. For instance having xx$ replaced with yyyy. This would not trigger in the current code because the length of the replacement will always be more than the replaced characters. I'm going to think over this and try to think of a solution but of course any suggestions would be gladly appreciated. Disallowing for long file names seems almost too strict. A fallback policy's not a bad idea. Probably going to try and think of some possibilities for that as well. As for the testing suite. The enum34 trick seemed to solve the dilemma and i was able to run the testall.py file. Though i believe something may be wrong because all of the output in the console is red and it returned with an 'OK' status. Also it stated that it skipped 17 test. This may be normal but i figured i would ask anyways. |
Thanks! It could be that there's a fundamental conflict—we can't both honor the strict length limit and appear as though all replacements were applied. Fundamentally, replacements that make the string longer would seem to make it an impossible problem. So an easy-to-understand policy about when replacements can be honored may be a good idea. Yes, the test skips you saw are normal. In the future, could you please paste text with your code instead of screenshots? (GitHub markdown's fenced code blocks can help.) This makes it much easier to read and review. |
Wouldn't it be possible to set up a while loop to repeat the sanitize and truncate steps until the path is stable? I don't have the truncate/sanitize definitions in front of me, but something like:
|
Well, that's sort of what I meant by a fundamental conflict: a replacement that lengthens the string back to its original state could cause an infinite loop. For example, |
Ok, I've come up with a solution, which should work for any character set or path length. I've written Python code as well as the summary below, which I'll post here if that's OK with @dnnovo and @wootencl (as they might need to come up with the code themselves for their course). Here's the steps of the algorithm:
This deals with both the infinite loop problem by disabling the problem substitution. The next time sanitize path is called, the single remaining replacement at the end of the string will be performed, with the troublesome string being sanitized to "_". I aimed to make the solution general, so it shouldn't introduce any issues, and should be encoding/language agnostic. Having tested on the example I gave earlier in the thread, it seems to work. Effectively, this is the policy about honouring substitutions you spoke of @sampsyo. The user substitutions will be honoured until they prevent a stable path being generated, at which point they'll be standardised to "_". A possible alternative could be removing characters from a position besides the end of the string, but thinking that through, it didn't seem as reliable (eg - if one component is only made up of characters to be replaced, then no truncation could take place). |
Seems quite reasonable to me! Thanks for thinking deeply about this, @LordSputnik. Fortunately, weird loops like this should be rare, so a simple/obvious policy along these lines seems like the right solution. |
Sorry again for the delayed response. Slightly hectic work week. @LordSputnik, thanks for checking with us first, but please go ahead and move forward with the code implementation. That's an outstanding fix and definitely outside the thought process we were having with the problem. @dnnovo and I are just here to help with the project in any way we can and are happy that potentially our pursuit of the bug may have provoked some thoughts in other developers. @dnnovo hopes to commit some unit tests for this particular bug that will hopefully aid in the implementation of your fix @LordSputnik. Thanks again for all the help! |
…st_truncation_does_not_conflict_with_replacement test. Fixes beetbox#496.
…st_truncation_does_not_conflict_with_replacement test. Fixes beetbox#496.
Fixed issue #496 - sanitize/truncate bug
Just been updating all my path names to try to fit them onto an NTFS drive legally. To do this, I set "max_filename_length: 80". However, on a few folders, this seems to have set the folder name so that it ends with a space.
Since I haven't set a custom "replace: ..." parameter in my configuration, I believe that beets should be using the default character replacement set, which includes a regex for removing trailing spaces ("\s+$").
Therefore, I think that the character replacement isn't being applied to the path name again after it's been truncated.
The text was updated successfully, but these errors were encountered: