Filenames containing ? give warning : 'extension mismatch' #129

Closed
workflowsguy opened this issue Jun 24, 2019 · 2 comments

@workflowsguy

When files are processed with sf, those whose names contain a question mark (here at the end of the base name, just before the extension) are identified with the correct type, but an "extension mismatch" warning is still output, viz.

sf "/Volumes/Public/bearbeiten/Dateien/ermitteln Dateityp/Salzburger Nachtstudio.2019-06-19 - Kulturkampf im Klassenzimmer?.mp3"
---
siegfried   : 1.7.12
scandate    : 2019-06-24T16:27:08+02:00
signature   : default.sig
created     : 2019-06-15T12:22:38+02:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V95.xml; container-signature-20180917.xml'
---
filename : '/Volumes/Public/bearbeiten/Dateien/ermitteln Dateityp/Salzburger Nachtstudio.2019-06-19 - Kulturkampf im Klassenzimmer?.mp3'
filesize : 74564436
modified : 2019-06-21T17:03:54+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/134'
    format  : 'MPEG 1/2 Audio Layer 3'
    version : 
    mime    : 'audio/mpeg'
    basis   : 'byte match at [[0 3] [74560365 1151] [74562035 1151] [74563705 3]] (signature 1/8)'
    warning : 'extension mismatch'

I am running on macOS, where ? is an allowed character for filenames.

Thanks!

@richardlehane richardlehane self-assigned this Jun 25, 2019
@richardlehane
Owner

Thanks for this report, workflowsguy. An interesting bug! I'll look into it.

@richardlehane
Owner

I've found the offending code: https://github.com/richardlehane/siegfried/blob/master/internal/namematcher/namematcher.go#L149

The issue is that some file names are embedded in URLs (because of WARC scanning). Where sf thinks a name is a URL, it strips everything following a "?", because in a URL that marks the start of the query string. E.g. it is trying to extract the name from a string like "http://www.mysite.com/file.pdf?user=richard".

But in your case, where the ? is legitimately part of a regular file name, this breaks extension matching.
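
To make the failure mode concrete, here is a minimal sketch of the behaviour described above. It is not the code from namematcher.go; `naiveExt` is a hypothetical helper written only for illustration, and it assumes the stripping is a plain cut at the first "?".

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// naiveExt is a hypothetical helper, NOT siegfried's actual code.
// It treats everything from the first "?" onwards as a URL query string,
// drops it, and then takes the extension of whatever is left.
func naiveExt(name string) string {
	if i := strings.Index(name, "?"); i >= 0 {
		name = name[:i]
	}
	return strings.TrimPrefix(filepath.Ext(name), ".")
}

func main() {
	// URL case: the query string is removed and "pdf" is found.
	fmt.Printf("%q\n", naiveExt("http://www.mysite.com/file.pdf?user=richard")) // "pdf"

	// Regular file name: the real extension comes after the "?" and is lost.
	fmt.Printf("%q\n", naiveExt("Kulturkampf im Klassenzimmer?.mp3")) // ""
}
```

Under this sketch the ".mp3" suffix never reaches the extension matcher, which would be consistent with a correct byte match combined with an "extension mismatch" warning.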

I'll have a think about how to re-jig this bit of the code to fix it.
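
One possible direction, offered only as a sketch and not as the fix that was or will be applied in siegfried: drop the query string only when the name parses as an absolute URL (scheme and host both present), and otherwise treat "?" as an ordinary character. `extOf` is a hypothetical name, and the scheme-plus-host test is an assumption about what should count as a URL.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"path/filepath"
	"strings"
)

// extOf is a hypothetical helper, NOT siegfried's actual fix.
// Assumption: a name counts as a URL only if it has both a scheme and a host.
func extOf(name string) string {
	if u, err := url.Parse(name); err == nil && u.Scheme != "" && u.Host != "" {
		// Absolute URL: u.Path already excludes the query string.
		return strings.TrimPrefix(path.Ext(u.Path), ".")
	}
	// Ordinary file name: "?" is kept as a literal character.
	return strings.TrimPrefix(filepath.Ext(name), ".")
}

func main() {
	fmt.Printf("%q\n", extOf("http://www.mysite.com/file.pdf?user=richard")) // "pdf"
	fmt.Printf("%q\n", extOf("Kulturkampf im Klassenzimmer?.mp3"))           // "mp3"
}
```

The trade-off is that URLs recorded without a scheme would be handled as plain file names, so whether this heuristic is acceptable depends on how WARC entries are passed in.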
