2024-05-01 MP3 Detection improvements #63

@NebularNerd NebularNerd commented May 1, 2024

Closes #32

MP3s are a strange beast: many bits have been grafted on over the decades, and the word 'standard' requires a big ⭐ next to it when talking about them.

To get a higher (and hopefully definitive) match, I've added versioned main fingerprints and a lot of multi-match data (seriously, loads). This should match pretty much any MP3 you come across with a confidence of 0.8 (assuming the correct extension), beating false .koz matches into the dirt. I left the non-versioned .mp3 match in and added a TAG multi-match to allow for fringe cases.

The .json has grown somewhat in file size to accommodate these matches. Technically we could strip some of the 4-letter matches from v2.3 if I could find which ones did not apply until v2.4; however, there is little data on exactly which ones were added. To ensure the best confidence I also had to duplicate the 4-letter matches for both v2.3 and v2.4. It would also be possible to sacrifice some of the more obscure 3/4-letter matches, but as there is no set rule for the ordering of the frame headers, there is the potential for fringe cases where a rarely used one comes first.

Main fingerprints:

  • 0x4944330200 / ID3 = ID3v2.2.0, Rare
  • 0x4944330300 / ID3 = ID3v2.3.0, Common
  • 0x4944330400 / ID3 = ID3v2.4.0, Common
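
As a rough illustration of how the three main fingerprints separate the versions, here is a minimal Python sketch. The names `ID3V2_FINGERPRINTS` and `id3v2_version` are made up for this example and are not taken from the .json or from puremagic itself.

```python
from typing import Optional

# Main fingerprints from the list above: 5-byte prefix -> tag version.
# (Hypothetical lookup table for illustration only.)
ID3V2_FINGERPRINTS = {
    bytes.fromhex("4944330200"): "ID3v2.2.0",
    bytes.fromhex("4944330300"): "ID3v2.3.0",
    bytes.fromhex("4944330400"): "ID3v2.4.0",
}


def id3v2_version(data: bytes) -> Optional[str]:
    """Return the ID3v2 version named by the first five bytes, or None."""
    return ID3V2_FINGERPRINTS.get(bytes(data[:5]))


if __name__ == "__main__":
    print(id3v2_version(b"ID3\x03\x00\x00\x00\x01KT"))
```

A bare `ID3` with no version bytes returns `None` here, which is the case the non-versioned .mp3 match is left in to cover.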

Multi-Part fingerprints:

  • 0x544147 / TAG = v1.x tag marker at -128 bytes (128 bytes before the end of the file); if a file has v1 tags it will have this. NOTE: It is possible, but unlikely, for a v2 file to have no v1 tags.
  • AENC, APIC, ASPI, COMM, COMR, ENCR, EQU2, ETCO, GEOB, GRID, LINK, MCDI, MLLT, OWNE, PRIV, PCNT, POPM, POSS, RBUF, RVA2, RVRB, SEEK, SIGN, SYLT, SYTC, TALB, TBPM, TCOM, TCON, TCOP, TDEN, TDLY, TDOR, TDRC, TDRL, TDTG, TENC, TEXT, TFLT, TIPL, TIT1, TIT2, TIT3, TKEY, TLAN, TLEN, TMCL, TMED, TMOO, TOAL, TOFN, TOLY, TOPE, TOWN, TPE1, TPE2, TPE3, TPE4, TPOS, TPRO, TPUB, TRCK, TRSN, TRSO, TSOA, TSOP, TSOT, TSRC, TSSE, TSST, TXXX, UFID, USER, USLT, WCOM, WCOP, WOAF, WOAR, WOAS, WORS, WPAY, WPUB, WXXX = 4-letter frame codes at byte 10 for v2.3/v2.4 files
  • BUF, CNT, COM, CRA, CRM, ETC, EQU, GEO, IPL, LNK, MCI, MLL, PIC, POP, REV, RVA, SLT, STC, TAL, TBP, TCM, TCO, TCR, TDA, TDY, TEN, TFT, TIM, TKE, TLA, TLE, TMT, TOA, TOF, TOL, TOR, TOT, TP1, TP2, TP3, TP4, TPA, TPB, TRC, TRD, TRK, TSI, TSS, TT1, TT2, TT3, TXT, TXX, TYE, UFI, ULT, WAF, WAR, WAS, WCM, WCP, WPB, WXX = 3-letter frame codes used by v2.2 files
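
The multi-part idea can be sketched in Python as below. This is only an illustration under some assumptions: the frame sets are a small hand-picked subset of the lists above, `id3_evidence` is a hypothetical helper rather than anything in puremagic, and the sketch assumes the first frame starts directly at byte 10 (a v2.3/v2.4 tag with the extended-header flag set would have other bytes there).

```python
from typing import List

# Small subsets of the frame codes listed above (illustrative selection;
# the real .json carries the full lists).
FRAMES_V22 = {b"TT2", b"TP1", b"TAL", b"COM", b"PIC"}
FRAMES_V23_V24 = {b"TIT2", b"TPE1", b"TALB", b"MCDI", b"PRIV", b"APIC"}


def id3_evidence(data: bytes) -> List[str]:
    """List the fingerprint parts found in data (a whole file's bytes)."""
    found = []
    if len(data) >= 10 and data[:3] == b"ID3":
        major = data[3]
        # v2.2 uses 3-byte frame IDs, v2.3 and v2.4 use 4-byte frame IDs.
        if major == 2 and data[10:13] in FRAMES_V22:
            found.append("frame:" + data[10:13].decode("ascii"))
        elif major in (3, 4) and data[10:14] in FRAMES_V23_V24:
            found.append("frame:" + data[10:14].decode("ascii"))
    # An ID3v1 tag occupies the last 128 bytes and starts with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        found.append("v1:TAG")
    return found


if __name__ == "__main__":
    print(id3_evidence(b"ID3\x02\x00\x00\x00\x00\x10BTT2"))
```

Either part alone is enough to lift a versioned match above the plain `ID3` prefix, which is why a file with no v1 tag can still be matched through its first frame code.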

Test file:

This is a weird one I found in a corner of a drive. It would not have matched as a .koz as it's v2.2, but it would equally have had only a low-confidence match as it has no v1 tags; using the additional 3-letter frame match you get a solid match. The output comes from my own confidence test script, so I can easily see and test patterns.
congratulations.zip

congratulations.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.2.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x02\x00\x00\x00\x00\x10BTT2'
Hex:           4944 3302 0000 0000 1042 5454 32
String:        ID3BTT2

Alternate match #1
Format:        MPEG-1 Audio Layer 3 ID3v2.2.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x02\x00'
Hex:           4944 3302 00
String:        ID3

Alternate match #2
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3
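
To make the 13 matched bytes of the test file easier to read, here is a small Python sketch that decodes them field by field. It follows the ID3v2 header layout (3-byte `ID3` marker, version, revision, flags, then a 4-byte synchsafe size); `Id3v2Header` and `parse_id3v2_header` are names invented for this example and are not part of my test script or puremagic.

```python
from typing import NamedTuple


class Id3v2Header(NamedTuple):
    major: int
    revision: int
    flags: int
    size: int  # tag size in bytes, excluding the 10-byte header


def parse_id3v2_header(data: bytes) -> Id3v2Header:
    """Decode the 10-byte ID3v2 header at the start of data."""
    if len(data) < 10 or data[:3] != b"ID3":
        raise ValueError("no ID3v2 header")
    major, revision, flags = data[3], data[4], data[5]
    # The size is "synchsafe": four bytes carrying 7 bits each.
    size = 0
    for byte in data[6:10]:
        size = (size << 7) | (byte & 0x7F)
    return Id3v2Header(major, revision, flags, size)


matched = b"ID3\x02\x00\x00\x00\x00\x10BTT2"  # bytes from the output above
header = parse_id3v2_header(matched)
first_frame = matched[10:13].decode("ascii")  # v2.2 frame IDs are 3 bytes
print(header, first_frame)
```

The `\x10B` that looks like stray text in the matched bytes is the tail of the size field, and `TT2` at offset 10 is the first frame code.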

Example matches:

(01) Adamski - Killer.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x03\x00\x00\x00\x01 \x1eMCDI'
Hex:           4944 3303 0000 0001 201e 4d43 4449
String:        ID3 MCDI

Alternate match #2
Format:        Sprint Music Store audio
Confidence:    70.0%
Extension:     .koz
MIME:
Offset:        0
Bytes Matched: b'ID3\x03\x00\x00\x00'
Hex:           4944 3303 0000 00
String:        ID3

Alternate match #3
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #4
Format:        MPEG-1 Audio Layer 3 ID3v2.3.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x03\x00'
Hex:           4944 3303 00
String:        ID3

Alternate match #5
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

(01) Ash - Girl from Mars.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x03\x00\x00\x00\x01KTTPE1'
Hex:           4944 3303 0000 0001 4b54 5450 4531
String:        ID3KTTPE1

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Alternate match #2
Format:        Sprint Music Store audio
Confidence:    70.0%
Extension:     .koz
MIME:
Offset:        0
Bytes Matched: b'ID3\x03\x00\x00\x00'
Hex:           4944 3303 0000 00
String:        ID3

Alternate match #3
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #4
Format:        MPEG-1 Audio Layer 3 ID3v2.3.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x03\x00'
Hex:           4944 3303 00
String:        ID3

Alternate match #5
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

B007G7MTR2_(disc_1)_05_-_I'm_Too_Fat.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.4.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x04\x00TAG'
Hex:           4944 3304 0054 4147
String:        ID3TAG

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.4.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x04\x00\x00\x00\x10\x08\x1ePRIV'
Hex:           4944 3304 0000 0010 081e 5052 4956
String:        IDPRIV

Alternate match #2
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #3
Format:        MPEG-1 Audio Layer 3 ID3v2.4.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x04\x00'
Hex:           4944 3304 00
String:        ID3

Alternate match #4
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

Links:

There's a lot here now, should cover all but fringe cases
You know it would help if I added the main fingerprints
@cdgriffith cdgriffith changed the base branch from master to develop May 3, 2024 15:33
@cdgriffith cdgriffith merged commit 916415b into cdgriffith:develop May 3, 2024
5 checks passed
Successfully merging this pull request may close these issues.

same (mp3) file, different name ... different output: mp3 versus koz