Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trailing whitespace not removed after max_filename_length truncation #496

Closed
LordSputnik opened this issue Jan 16, 2014 · 26 comments
Closed
Labels
bug bugs that are confirmed and actionable

Comments

@LordSputnik
Copy link
Contributor

Just been updating all my path names to try to fit them onto an NTFS drive legally. To do this, I set "max_filename_length: 80". However, on a few folders, this seems to have set the folder name so that it ends with a space.

Since I haven't set a custom "replace: ..." parameter in my configuration, I believe that beets should be using the default character replacement set, which includes a regex for removing trailing spaces ("\s+$").

Therefore, I think that the character replacement isn't being applied to the path name again after it's been truncated.

@sampsyo
Copy link
Member

sampsyo commented Jan 17, 2014

Thanks for pointing this out. It is indeed an issue.

Specifically, in library.py, we call sanitize_path (which does the replacement) before we call truncate_path inside destination. We should invert these.

@LordSputnik
Copy link
Contributor Author

I think if you only call sanitize_path at the end, it could result in replacements which take the filename size back over the limit. One option could just be to remove trailing spaces from filenames after truncation.

Other than that, perhaps a loop which cycles over sanitize_path and truncate until the filename is stable?

@sampsyo
Copy link
Member

sampsyo commented Jan 17, 2014

Huh, that's a great point. Maybe special-casing the whitespace removal (i.e., don't make it part of the "replace" option) is the best tactic here.

@LordSputnik
Copy link
Contributor Author

I made a temporary fix for this today by changing:

out = [c[:length] for c in comps]

to

out = [c[:length].strip() for c in comps]

in util/init.py: truncate_path

This would probably work as a permanent fix (unless you expect other characters to cause similar issues after path truncation).

@sampsyo
Copy link
Member

sampsyo commented Aug 27, 2014

Thanks for the pointer, but yes, we also need to fix this for other replacements (e.g., trailing dots)

@dnnovo
Copy link

dnnovo commented Feb 12, 2015

Hello,
I’m a member of a student group for a software engineering class, and we chose Beets as our open-source project and are thus new to the project. We previously posted this question in the mailing list (Carter Wooten, "Discussion of Bug #496"), and were told that it would be better to ask questions of this sort in the Issues Tracker.

Briefly, we are confused about how whitespace is left over after the truncation. In the python prompt, we tried to reproduce the bit of code that does the truncating (in truncate_path, in beets/util/init.py) with dummy variables:

length = 4 # if MAX_FILENAME_LENGTH = 4
a = [‘bin’, ‘usr’, ‘funnnnnnnnn’] # “a” being a dummy “comps”
out = [b[:4] for b in a] # “b” being a dummy “c”
out
[‘bin’, ‘usr’, ‘funn’]

Here, ‘bin’ and ‘usr’ are less than 4 and have no trailing whitespace, and ‘funnnnnnnnnn’ is greater than 4 and gets truncated to ‘funn’ with no trailing whitespace. The only way we can generate trailing whitespace with this snippet of code is if one of comps already has trailing whitespace; e.g., if max length is 4 and c in comps is ‘a ‘, it truncates to ‘a ‘. But this possibility is unlikely, however, since Adrian previously stated that sanitize_path is called before truncate_path. Furthermore, these examples only consider whitespace -- how do “other replacements (e.g., trailing dots)” make their way into truncate_path if they go through sanitize_path first? Doesn't sanitize_path get rid of all the regular expressions in CHAR_REPLACE?

Also, is there a way we can get detailed instructions for reproducing the bug?

Thanks in advance!

@LordSputnik
Copy link
Contributor Author

If you have a look at the destination function in library.py, the general process for generating the output filename is as follows:

  1. Generate the "ideal" filename, by substituting metadata information into the path template.
  2. Normalize unicode characters.
  3. Sanitize the path, replacing bad regular expressions in the path.
  4. Truncate the path, shortening each of the components to the desired length.

Now, during step 3, only whitespace and dots at the end of each component are removed. If step 4 truncates the components, then the end of each component will be at a different point, and characters previously in the middle of the string may be moved to the end. If one of those characters is a dot or a space, then the bad character sequence is once again in the path.

For example, the filename "/An Artist/2008. An Album/005. A Track" is perfectly legal, and would pass through sanitization unchanged. If you truncate it with a maximum length of 5, then you get "/An Art/2008./005. " which contains illegal characters at the end of the middle and final components.

@sampsyo
Copy link
Member

sampsyo commented Feb 14, 2015

That's an excellent summary, @LordSputnik. This bug arises because of the order of the operations on the path.

@dnnovo, perhaps a good first step would be to write a small test case that produces this behavior following @LordSputnik's recipe.

@dnnovo
Copy link

dnnovo commented Feb 15, 2015

Thank you for the very helpful summary, @LordSputnik! @sampsyo, we will write a small test as our first step. Thank you both.

@dnnovo
Copy link

dnnovo commented Feb 19, 2015

We are still working on making a test case that produces this behavior following @LordSputnik's recipe. It would be helpful to get instructions that are even more detailed, since we are still new to the software. It may be premature, but we would like to propose a fix:

Since the only characters that should manifest from the behavior outlined by @LordSputnik are dots and whitespace and since they should only be on the right-hand side of the string, changing

out = [c[:length] for c in comps]

to

out = [c[:length].rstrip(". ") for c in comps]

might do the trick.

rstrip() takes in characters to strip away from the right hand side of the string. For instance,

>>>'cat. dog..... '.rstrip(". ")
'cat. dog'

So in the example @LordSputnik illustrated

/An Artist/2008. An Album/005. A Track

should truncate to (with max_filename_length = 5)

/An Ar/2008/005

Thanks again for the help!

@sampsyo
Copy link
Member

sampsyo commented Feb 19, 2015

You're right to think of rstrip. But there's a subtlety here that the proposed fix misses: the replacements are user configurable using regular expressions. We can't hard-code just those two replacements (whitespace and dots); the user can disable those or provide new ones we need to support through the replacements argument to sanitize_path.

On constructing a test case: how far have you gotten and what can I help explain in more detail? I'd be happy to help.

@wootencl
Copy link

Quick question to whomever. Why allow the user to entirely disable the default regex replacements in place of their own? I understand for customization purposes, but isn't it highly recommended in the documentation to keep the defaults because of the multi OS wide illegal characters. I was just wondering how a user could could benefit from not using them. I only ask because the way i was heading in fixing this bug i was thinking of making those particular regex default replacements immutable (still keeping the user defined replacements of course.) Thanks for the help!

@sampsyo
Copy link
Member

sampsyo commented Feb 24, 2015

Some people do want this for purely aesthetic reasons. For example, if you know you're on Unix and never have to deal with Windows, you might want to name your artist folder "M.I.A." instead of "M.I.A_", which admittedly looks weird.

@wootencl
Copy link

Sorry about that Adrian. After reading the docs more thoroughly i saw your example you just mentioned above. A couple follow up questions though. For one (i probably should have stated this first) i'm actually still having trouble reproducing the bug so any help in that direction would be greatly appreciated. So far i've modified the max file length size to 5 (as suggested by LordSputnik) then uninstalled and reinstalled and confirmed that my change took hold by checking the modified file in the beets directory in the python site-packages. Then i attempt to import using the file structure suggested by LordSputnik "/An Artist/2008. An Album/005. A Track" which afterwards i see no changes. Hopefully that's clear if not i can definitely provided more details. Another more hypothetical question: Could you do something like this: Run the sanitize path (with either the default replacements or the user defined ones), then run truncate, then run the sanitize path one last time but with the condition that if there is user defined replacements make sure they are 1:1 replacements in order to keep the size. This is all hypothetical of course. Thanks again for any help provided!

sampsyo added a commit that referenced this issue Feb 25, 2015
@sampsyo
Copy link
Member

sampsyo commented Feb 25, 2015

I added a unit test. See the above commit; the comments should illuminate the issue.

@wootencl
Copy link

Sorry for the delayed response. And thanks for the unit test. That helped a ton in me being able to recreate the issue. I'm having trouble of thinking of hypotheticals so i was wondering what you might think of the modified code segment below. This would also include some extra code in the destination function which would just be an additional call to sanitize path (with the optional passThrough parameter set equal to 2) after the call to truncate. I would have tested it myself with the test suite but am currently having difficulty running testall.py in pycharm because of an enum error. Sorry again if this is completely off base. This problem is definitely a little more complex then i first assumed. Thanks again for any help!

screen shot 2015-03-09 at 10 29 30 pm

@sampsyo
Copy link
Member

sampsyo commented Mar 12, 2015

Can you please explain the logic behind the passThrough parameter? I can't really follow it without comments. Are you proposing to special-case the second pass to whitespace only (and other 1-character replacements)? If so, that's not quite sufficient—truncation could create a different kind of pattern, not just whitespace, that would need sanitation.

The enum error is probably because you've installed the wrong enum package. You want pip install enum34, not enum (which you probably now need to uninstall).

@wootencl
Copy link

Of course! It would seem though that you've guessed my logic. Essentially my idea is that in the second pass through the only items being replaced are less than or equal to (this needs to be changed in my above code) 1-character so the max length doesn't go over again. I think i understand what you're saying but could you give me an example potentially?

As for the enum error i imagine you are correct and i'll attempt your suggested fix and report back. Sorry again for my slow uptake. I'm very much a novice when it comes to python.

@sampsyo
Copy link
Member

sampsyo commented Mar 12, 2015

Here's the issue: what if there's a replacement that says xxxx$ needs to be replaced with yy? That is, four Xs at the end of the string should be replaced with two Ys. Then there's a filename that ends in ...aaaaxxxxaaaa that, when truncated, becomes ...aaaaxxxx. The replacement should trigger.

It seems to me like the best strategy may be closer to doing truncation first, then replacement. Of course then there's a potential problem with replacements that then make the string longer, but perhaps we should just disallow those when the filename is long? Or some other fallback policy?

And no worries about being a novice; everyone starts somewhere! We're happy to help. And that particular issue is legitimately confusing.

@wootencl
Copy link

That's what i thought you may have meant. I've created a temporary solution to that particular case (see image below.) Though i'm now stuck trying to think of how to solve the exact opposite case than your example above. For instance having xx$ replaced with yyyy. This would not trigger in the current code because the length of the replacement will always be more than the replaced characters. I'm going to think over this and try to think of a solution but of course any suggestions would be gladly appreciated. Disallowing for long file names seems almost too strict. A fallback policy's not a bad idea. Probably going to try and think of some possibilities for that as well.

As for the testing suite. The enum34 trick seemed to solve the dilemma and i was able to run the testall.py file. Though i believe something may be wrong because all of the output in the console is red and it returned with an 'OK' status. Also it stated that it skipped 17 test. This may be normal but i figured i would ask anyways.

screen shot 2015-03-12 at 11 03 00 am

@sampsyo
Copy link
Member

sampsyo commented Mar 12, 2015

Thanks!

It could be that there's a fundamental conflict—we can't both honor the strict length limit and appear as though all replacements were applied. Fundamentally, replacements that make the string longer would seem to make it an impossible problem. So an easy-to-understand policy about when replacements can be honored may be a good idea.

Yes, the test skips you saw are normal.

In the future, could you please paste text with your code instead of screenshots? (GitHub markdown's fenced code blocks can help.) This makes it much easier to read and review.

@LordSputnik
Copy link
Contributor Author

Wouldn't it be possible to set up a while loop to repeat the sanitize and truncate steps until the path is stable?

I don't have the truncate/sanitize definitions in front of me, but something like:

def finalize_path(path):
    previous_path = path
    path = sanitize(truncate(path))
    while previous_path != path:
        path = sanitize(truncate(path))

    return path

@sampsyo
Copy link
Member

sampsyo commented Mar 12, 2015

Well, that's sort of what I meant by a fundamental conflict: a replacement that lengthens the string back to its original state could cause an infinite loop. For example, xx to xxyy where truncation brings the string to ...xx.

@LordSputnik
Copy link
Contributor Author

Ok, I've come up with a solution, which should work for any character set or path length. I've written Python code as well as the summary below, which I'll post here if that's OK with @dnnovo and @wootencl (as they might need to come up with the code themselves for their course).

Here's the steps of the algorithm:

Inputs: path, length, replacements
1. Create a list of "path candidates". Initialize the list with 1 element, an empty string.
2. Loop until the path is equal to the last element in the path candidates list.
    a. Loop until the path has previously been entered into the path candidates list.
        i. Append the path variable to the candidates list.
        ii. Sanitize and truncate the path, assigning the output to the path variable.
    b. Replace all the values in the user replacements dictionary with "_". If any key is equal to "_", replace the value for that key with a blank string, "".

This deals with both the infinite loop problem by disabling the problem substitution. The next time sanitize path is called, the single remaining replacement at the end of the string will be performed, with the troublesome string being sanitized to "_". I aimed to make the solution general, so it shouldn't introduce any issues, and should be encoding/language agnostic.

Having tested on the example I gave earlier in the thread, it seems to work.

Effectively, this is the policy about honouring substitutions you spoke of @sampsyo. The user substitutions will be honoured until they prevent a stable path being generated, at which point they'll be standardised to "_".

A possible alternative could be removing characters from a position besides the end of the string, but thinking that through, it didn't seem as reliable (eg - if one component is only made up of characters to be replaced, then no truncation could take place).

@sampsyo
Copy link
Member

sampsyo commented Mar 13, 2015

Seems quite reasonable to me! Thanks for thinking deeply about this, @LordSputnik. Fortunately, weird loops like this should be rare, so a simple/obvious policy along these lines seems like the right solution.

@wootencl
Copy link

Sorry again for the delayed response. Slightly hectic work week. @LordSputnik, thanks for checking with us first, but please go ahead and move forward with the code implementation. That's an outstanding fix and definitely outside the thought process we were having with the problem. @dnnovo and I are just here to help with the project in any way we can and are happy that potentially our pursuit of the bug may have provoked some thoughts in other developers. @dnnovo hopes to commit some unit tests for this particular bug that will hopefully aid in the implementation of your fix @LordSputnik. Thanks again for all the help!

LordSputnik added a commit to LordSputnik/beets that referenced this issue Mar 17, 2015
…st_truncation_does_not_conflict_with_replacement test. Fixes beetbox#496.
LordSputnik added a commit to LordSputnik/beets that referenced this issue May 10, 2015
…st_truncation_does_not_conflict_with_replacement test. Fixes beetbox#496.
@sampsyo sampsyo closed this as completed in d07c8fd Jul 8, 2015
sampsyo added a commit that referenced this issue Jul 8, 2015
Fixed issue #496 - sanitize/truncate bug
sampsyo added a commit that referenced this issue Jul 8, 2015
sampsyo added a commit that referenced this issue Jul 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs that are confirmed and actionable
Projects
None yet
Development

No branches or pull requests

4 participants