Garbled textual metadata #365

Open
rastaman111 opened this issue Oct 22, 2024 · 10 comments

@rastaman111

Hello,
My file was encoded with LAME 3.93.
How can I read its metadata through your library without getting garbled characters?

I get the following data: "Ñòî ÷àñîâ" and "Þðèé Ëîçà"

@sbooth
Owner

sbooth commented Oct 22, 2024

Could you share a file that exhibits the problem?

@sbooth sbooth self-assigned this Oct 22, 2024
@rastaman111
Author

@sbooth
Owner

sbooth commented Oct 23, 2024

Thank you for the test case. Was the ID3 tag in the file generated using LAME also?

The problem seems to be one of text encoding. ID3v1 tags nominally use the ISO 8859-1 charset, although the machine's local encoding is sometimes used instead, such as Windows-1251, which appears to be the correct encoding for this particular ID3v1 tag. ID3v2 text frames declare their own encoding, either ISO 8859-1 or a Unicode encoding. It seems the text in this file's ID3v2 tag is in neither but rather in a different character set, most likely Windows-1251, the same as the ID3v1 tag.

Take the "word" Þðèé (should be Юрий) from the TPE1 frame. The octets have the following hex values and map to these characters in ISO 8859-1 and Windows-1251:

| ISO 8859-1 | Hex value | Windows-1251 |
|---|---|---|
| Þ | 0xDE | Ю |
| ð | 0xF0 | р |
| è | 0xE8 | и |
| é | 0xE9 | й |

The octet values interpreted as ISO 8859-1 give Þðèé, while interpreted as Windows-1251 they give Юрий. So it seems that the text in both the ID3v1 and ID3v2 tags in this file is incorrectly encoded.
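The two readings of these octets can be checked with a few lines of Python, used here purely for illustration. `latin-1` and `cp1251` are Python's codec names for ISO 8859-1 and Windows-1251; nothing in this sketch comes from SFBAudioEngine or TagLib.

```python
# The four octets of the first word in the artist frame.
raw = bytes.fromhex("DE F0 E8 E9")

# Same bytes, two different single-byte character sets.
as_latin1 = raw.decode("latin-1")  # ISO 8859-1 reading
as_cp1251 = raw.decode("cp1251")   # Windows-1251 reading

print(as_latin1)
print(as_cp1251)
```

Both decodes succeed because every octet is valid in both character sets, which is why the mistake cannot be caught by a decoding error alone.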

@sbooth sbooth changed the title from "Lame encoding" to "Garbled textual metadata" Oct 23, 2024
@rastaman111
Author

It's strange that Apple's native player recognizes the text without trouble, as does Google Translate.

I'll try to search for similar files and let you know the result

@sbooth
Owner

sbooth commented Oct 23, 2024

That is interesting. I will take a closer look at the file's tag to make sure it is being handled correctly. I've heard of charset detection for ID3v1 tags but for ID3v2 I don't think there should be any guessing involved.

@rastaman111
Author

rastaman111 commented Oct 23, 2024

Apple Music says it's version 3

I also ran it through several libraries and they all say that it is version 3

@sbooth
Owner

sbooth commented Oct 23, 2024

It is an ID3v2.3 tag. The TPE1 frame for example contains the following bytes:

| Field | Hex Bytes | Meaning |
|---|---|---|
| Frame ID | 54 50 45 31 | TPE1 |
| Size | 00 00 00 0A | 10 |
| Flags | 00 00 | |
| Text Encoding | 00 | ISO 8859-1 |
| Information | DE F0 E8 E9 20 CB EE E7 E0 | |
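For readers who want to reproduce this breakdown, here is a minimal Python sketch that splits the same bytes into the fields listed above. It assumes the ID3v2.3 layout (4-byte frame ID, 4-byte big-endian size, 2 flag bytes, then a one-byte text encoding for text frames); the function name `parse_id3v23_frame` is hypothetical and is not part of TagLib or SFBAudioEngine.

```python
import struct

def parse_id3v23_frame(data: bytes):
    """Split one ID3v2.3 text frame into header fields and text bytes."""
    # 10-byte header: frame ID, big-endian size, flags.
    frame_id, size, flags = struct.unpack(">4sIH", data[:10])
    payload = data[10:10 + size]
    # In a text frame the first payload byte names the text encoding
    # (0x00 means ISO 8859-1); the rest is the text itself.
    encoding_byte, text = payload[0], payload[1:]
    return frame_id.decode("ascii"), size, flags, encoding_byte, text

# The TPE1 frame bytes from this file, header first.
frame = bytes.fromhex("54504531 0000000A 0000 00 DEF0E8E920CBEEE7E0")
frame_id, size, flags, enc, text = parse_id3v23_frame(frame)
print(frame_id, size, flags, enc, text.hex())
```

The size of 10 covers the encoding byte plus the nine text bytes, so the frame is internally consistent; only the declared encoding disagrees with the actual content.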

It's possible that Music runs text reported as ISO 8859-1 through a character detection library. Based on the ID3v2 tag itself, TagLib (the metadata library used by SFBAudioEngine) is interpreting the data correctly.

It shouldn't be terribly hard to wrap uchardet to add an option for character set detection for ID3v1 tags, or for ID3v2 tags using ISO 8859-1, but I haven't investigated what it would entail.

@rastaman111
Author

I have the following question.
How can I determine from the text which encoding it uses, so that I can show the user the correctly decoded text?

@sbooth
Owner

sbooth commented Oct 26, 2024

Algorithms for character set detection are something I know little to nothing about. Perhaps an educated guess is made based on a frequency analysis of octets in the input?
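As a rough illustration of what such a frequency-based guess could look like, here is a deliberately naive Python sketch. It is not a description of how uchardet or any real detector works. The function name `looks_like_cp1251` and the 0.6 threshold are assumptions made up for this example.

```python
def looks_like_cp1251(raw: bytes, threshold: float = 0.6) -> bool:
    """Guess whether 8-bit text is Windows-1251 Cyrillic, not ISO 8859-1.

    Assumption: in Windows-1251 the octets 0xC0-0xFF are the Cyrillic
    letters, so Russian text is dominated by them, while Western European
    text in ISO 8859-1 is mostly ASCII letters with few high octets.
    """
    # Count only letter-like octets: ASCII letters plus anything >= 0x80.
    letters = [b for b in raw if b >= 0x80 or chr(b).isalpha()]
    if not letters:
        return False
    high = sum(1 for b in letters if b >= 0xC0)
    return high / len(letters) >= threshold

# Text bytes from the TPE1 frame, then a Western European sample.
print(looks_like_cp1251(bytes.fromhex("DEF0E8E920CBEEE7E0")))
print(looks_like_cp1251("Café del Mar".encode("latin-1")))
```

A heuristic this simple will misfire on short strings and on other 8-bit Cyrillic encodings, which is a good reason to prefer an existing detection library over a hand-rolled one.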

For the file that you shared it should be possible to feed the C strings from the metadata to uchardet or a similar library and see what it comes back with, and then use iconv to convert to UTF-8.
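The conversion step might look like the following Python sketch, which stands in for the iconv call. It assumes the string was already decoded as ISO 8859-1 and that the real encoding is known or has been guessed; here Windows-1251 is hard-coded as the default. The function name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(text: str, actual_encoding: str = "cp1251") -> str:
    """Undo an ISO 8859-1 decode of bytes really in `actual_encoding`."""
    try:
        raw = text.encode("latin-1")        # recover the original octets
        return raw.decode(actual_encoding)  # reinterpret them
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as ISO 8859-1, or not valid in the target
        # character set: leave the text untouched.
        return text

# The two strings reported at the top of this issue.
print(repair_mojibake("Þðèé Ëîçà"))
print(repair_mojibake("Ñòî ÷àñîâ"))
```

Because ISO 8859-1 maps every octet to exactly one code point, the round trip through `latin-1` is lossless, so a wrong guess can be retried with a different target encoding.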

@rastaman111
Author

This problem is not limited to this file. I found a large number of such files. In the native Files and Music applications the data is displayed as expected, but using the standard API does not give the desired result :(

