max_filename_length creates invalid unicode characters #1418
Thank you! This is related to #496, where truncation was also creating invalid filenames by bypassing the … To resolve this, you could imagine doing one of two things:
The trouble with option 1 is that, on Unix, paths are byte strings—so a filesystem's maximum length is specified in bytes, not characters. (Not sure about Windows; it's probably characters there.) And the difficulty with 2 is that finding the character boundaries amounts to re-decoding back to Unicode. I'm sure there's something clever we can do to get this right, but I'm not quite sure what it is yet. 😃
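To make the bytes-versus-characters distinction concrete, here is a small illustrative sketch (the example string is made up; it is not from the issue) showing how the two counts diverge for non-ASCII names, and that the Unix limit is queryable in bytes:

```python
import os

name = "весна"                 # five Cyrillic letters, 2 UTF-8 bytes each
raw = name.encode("utf-8")

print(len(name))               # 5  -- what a "character" limit would count
print(len(raw))                # 10 -- what the Unix filesystem actually counts

# On Unix, the per-component filename limit is reported in bytes:
print(os.pathconf(".", "PC_NAME_MAX"))   # e.g. 255 on ext4
```

Any truncation scheme therefore has to decide which of the two quantities the configured limit refers to.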
I'm not sure if this is the correct solution (I am not a Python developer), but perhaps it's a path towards one? See the referenced commit.
It's getting there, but since the truncation still happens after the original Unicode becomes bytes, it assumes a specific encoding (UTF-8). We should somehow do the truncation on the original Unicode string and only encode to bytes once.
Do you plan to make …?
Very good point, @untitaker—without looking again, I think the right answer will be characters on Windows and bytes on Unix. In which case you're probably right—we could just be sure to use the same encoding we used originally to get to bytes. One trick though: in some encodings, it may be possible that half a character is actually a valid, but different, character. (I don't think it's possible in UTF-8, but it could happen elsewhere. Even in UTF-8, this could interact poorly with combining characters.) In which case we'd want to avoid encoding, splitting, and re-decoding for fear of losing information about character boundaries. I don't see an obvious way around that without encoding one character at a time, though…
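The combining-character interaction can be seen without leaving UTF-8. This tiny illustration (the word is made up for the example) truncates on a perfectly valid codepoint boundary and still ends up with a different, but valid, name:

```python
# "café" spelled with a combining acute accent: 5 codepoints, not 4.
name = "cafe\u0301"
print(name)        # café
print(name[:4])    # cafe -- valid Unicode, but no longer the same word
```

So even a truncation that respects codepoint boundaries can silently change the text, which is a weaker failure mode than the invalid-bytes one but still surprising.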
I hadn't thought about that, but it is possible in UTF-8.
I've created a branch where exactly this happens.
Very cool; this looks like it's on the right track. For maintainability's sake, it would be awesome to combine the encoding and truncation steps. That is, rather than encoding, decoding again, and then re-encoding one byte at a time to truncate, we should do the encoding and truncation all at once. Where we currently call …. This way, we won't have to mentally maintain the invariant that the same encoding (with the same Unicode normalization?) has to be used in both places—something that I could imagine biting us in the future. It would be fine to not truncate in …
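A sketch of what that combined step could look like — the function name and signature here are hypothetical, not beets code, and it assumes a stateless encoding such as UTF-8:

```python
def encode_and_truncate(text, max_bytes, encoding="utf-8"):
    """Encode text, keeping only whole characters that fit in max_bytes.

    Hypothetical sketch: encoding one character at a time means the
    cutoff can never land inside a multi-byte sequence.  Not suitable
    for stateful encodings (e.g. ISO-2022), where a character's bytes
    depend on the characters before it.
    """
    out = bytearray()
    for ch in text:
        encoded = ch.encode(encoding)
        if len(out) + len(encoded) > max_bytes:
            break
        out.extend(encoded)
    return bytes(out)
```

With encoding and truncation in one place, callers never see a byte string that is invalid in the chosen encoding, and there is no decode/re-encode round trip whose two encodings could drift apart.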
How will this distinction hold up with Python 3? |
I believe it will be the same: these constraints come from the OS, and (fortunately) don't have to do with Python's representation. |
There are some restrictions coming from Python itself; concretely, Python 3 can't deal with non-ASCII byte paths, so we'd have to convert those to strings every time, while Python 2 doesn't seem to have any problems with unicode paths.
I'd suggest just following Python 3's lead and making everything unicode -- the stability problems that approach has in certain UNIX environments are already present through its stdlib anyway.
In my understanding/experience, bytestring paths work fine on Python 3:
In fact, when you use Unicode paths on Python 3 on Unix, it actually just encodes them to bytes before passing them off to the OS. The surrogate codepoints let you shuttle arbitrary bytes through a Unicode representation, but that doesn't change the fact that you're still dealing with a bytes API. In any case, I absolutely agree with you that we should stick with Unicode internally when we switch to Python 3—but, eventually, the limit enforced by the OS won't care about our internal representation. I believe (although it's unverified!) that all the length restrictions on Unix filenames, for any Unixy OS I know of, are in terms of bytes, so we'll need to enforce the truncation that way even if we use a Unicode representation internally. Path encodings are fun! 😱
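The surrogate-codepoint mechanism referred to above is Python 3's surrogateescape error handler, used by os.fsdecode/os.fsencode. A quick POSIX-flavored illustration (the byte value is arbitrary; the exact surrogate shown assumes a UTF-8 locale):

```python
import os

raw = b"caf\xe9"                  # not valid UTF-8
name = os.fsdecode(raw)           # under a UTF-8 locale: 'caf\udce9'
print(type(name))                 # <class 'str'>

# The original bytes survive the str round trip losslessly:
assert os.fsencode(name) == raw
```

So arbitrary filename bytes can ride through a str-based internal representation, but the OS-facing side remains a bytes API with byte-denominated limits.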
Ok well, I've tested with …
Oddly, that works fine for me here too:
Argh, nevermind.
When using the max_filename_length config option to truncate long filenames, setting it to a number that is not a multiple of 4 seems to create invalid unicode characters at the end. Example from my server (CentOS 6.6, Beets 1.3.11), with an album by Belye Flagi Zazhigayte Medlenno, the full album name is:
Даже если пролетариат возьмёт власть в свои руки, весна всё равно достанется нам, а цели войны останутся целями войны
I set max_filename_length to 25, run beet mv and then beet ls -p and get this:
But if I try to actually ls the folder, when I try to tab-complete I get this:
The final character is invalid, and I can't seem to input it, so I can't actually interact with the folder without using the inode. The same happens if I use a longer value like 100, with the exception that the full artist name comes out while the longer album name has the invalid character at the end.
I noticed that, in the case above using 25, the artist name was truncated even though it's only 13 unicode characters (<25), so I'm assuming that it's counting the bytes that make up each 1–4-byte unicode character rather than the characters themselves. I haven't tested this on a non-unicode album name, but I did test it with different numbers: even numbers don't necessarily work (100 did not), but powers of 2 (64) and multiples of 4 (24) both work without causing any invalid characters. Since beets displays the path without issues, I suspect it has something to do with the way python handles the characters vs. the way bash/ls does.
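The byte-counting hypothesis is consistent with how the album name encodes: each Cyrillic letter takes 2 bytes in UTF-8, so a byte-count cutoff that lands mid-letter leaves a dangling lead byte that the shell cannot render. A small reproduction using one word from the reported title:

```python
word = "пролетариат"             # 11 Cyrillic letters, 2 UTF-8 bytes each
raw = word.encode("utf-8")
print(len(raw))                  # 22 bytes

print(raw[:4].decode("utf-8"))   # пр -- cutoff on a letter boundary, fine
try:
    raw[:5].decode("utf-8")      # cutoff splits the third letter
except UnicodeDecodeError:
    print("dangling lead byte: the shell sees an invalid name")
```

ASCII characters such as spaces in the full title are 1 byte each and shift the letter boundaries around, so the observation that multiples of 4 happened to work is a coincidence of this particular string, not a general rule.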