Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preparing changes to add format types (classification) from DROID sig file #226

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

ross-spencer
Copy link
Collaborator

A handful of changes left over from #209 that weren't merged plus the completion of tests to back up some of those assertions.

Code changes assume the use of an element for file format types in a DROID signature file, but can be easily changed to an attribute. We'll rebase/edit the commit if that happens. Tests should still work. The third commit should be dropped before merging.

Obviously following review there may be more changes to wrap into this.

assumed XML format:

        <FileFormat ID="655" MIMEType="video/x-msvideo"
            Name="Audio/Video Interleaved Format" PUID="fmt/5">
            <InternalSignatureID>51</InternalSignatureID>
            <Extension>avi</Extension>
            <!-- THIS FILE WILL EVENTUALLY BE REPLACED -->
            <FormatTypes>Audio, Video</FormatTypes>
        </FileFormat>

Connected to #207

@@ -316,6 +316,7 @@ func IsArchive(id string) Archive {
// Clear clears loc and mimeinfo details to avoid pollution when creating multiple identifiers in same session
func Clear() func() private {
return func() private {
identifier.noContainer = false
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently we need a change like this, but we're not then resetting reports to pronom so I am not sure which way is correct of if there is still some pollution somewhere?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetting reports to pronom probably a good idea too. In principle I'd say it's good to add anything to Clear where you are setting config options that change defaults.

@@ -119,7 +119,7 @@ func New(opts ...config.Option) (core.Identifier, error) {
pronom = identifier.ApplyConfig(pronom)
id := &Identifier{
Base: identifier.New(pronom, config.ZipPuid()),
hasClass: config.Reports() != "" && !config.NoClass(),
hasClass: !config.NoClass(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is all we need to do here when this is possible with the signature file too?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest dropping the extra comments added here.

Agree absolutely that anything exported should be commented (and much to do here) but these structs aren't part of the public API (hidden by the "internal" path) & are only exported internally for the sake of unmarshalling (the golang Unmarshal funcs can only assign to exported fields).

Given these structs are all just mappings to the XML file it is hard to say anything else meaningful about them apart from what's said at the top of the file - so in this case I think it is best to just keep this mapping file lean and clean (even if linters scream about it!)

Copy link
Collaborator Author

@ross-spencer ross-spencer Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if linters scream about it!

That's just it:

C:\Users\Spencer\code\git\siegfried\pkg\pronom>%gopath%\bin\golint ./...
internal\mappings\droid.go:22:6: exported type Droid should have comment or be unexported
internal\mappings\droid.go:29:6: exported type InternalSignature should have comment or be unexported
internal\mappings\droid.go:34:6: exported type ByteSeq should have comment or be unexported
internal\mappings\droid.go:39:6: exported type SubSequence should have comment or be unexported
internal\mappings\droid.go:48:6: exported type Fragment should have comment or be unexported
internal\mappings\droid.go:55:6: exported type FileFormat should have comment or be unexported
... ~45 more warnings ...  

So, as with other code you may have noticed, I follow the process of -- if I am in the code, I try to add placeholder comments so that we can say more in future, or see fewer warnings. It's probably more idiomatic not to have any comment to have some comment rather than none, but maybe there are some project level settings that makes the intent to not address these easier for maintainers? I.e. something a linter can pick up and run with?

Given these structs are all just mappings to the XML file it is hard to say anything else meaningful about them apart from what's said at the top of the file

My documentation isn't that useful, but I don't agree everyone has the same knowledge about these objects that means it's a given this resource doesn't add more. Godoc defaults its view to show the layout and comments on internal packages: https://pkg.go.dev/github.com/richardlehane/siegfried@v1.10.0/pkg/pronom/internal/mappings which does lead to interesting possibilities in digipes.

I'll have a think about some sort of pre-commit settings and rules to ignore internal packages, but otherwise this is a simple enough commit to drop when I rebase later on.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a discussion of this on the golint issues, e.g. win-t feels my pain: golang/lint#186 (comment)

golint is deprecated now. I'm using staticcheck for my linter these days which I don't think nags about this (though it does have all sorts of other "opinions" about our code!).

we should agree on a common linter and document on the wiki: e.g. a short style guide in the contributors' docs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me and I'm very happy to help draft/review. I keep a number of references/ideas in this area which may be useful.. I'll check out static check as well and see what tooling it offers.

@richardlehane
Copy link
Owner

thanks @ross-spencer, nice to be prepped for a change to the droid file. Did you get a sense from the TNA crew that this might be on the horizon?

@ross-spencer
Copy link
Collaborator Author

Did you get a sense from the TNA crew that this might be on the horizon?

No problem. And yes, I've been putting the case forward at the PRONOM bi-weeklys and Francesca has been following up their end. Last update is they'll likely reach out to you soon (a month? a few months maybe?) to ensure there are no other incompatibilities in the code here for any proposed changes to the signature file. There should be some other interesting things coming from that change too - but I won't ruin the surprise! (NB. shouldn't result in any further code changes).

@@ -12,30 +12,40 @@
// See the License for the specific language governing permissions and
// limitations under the License.

// Package mappings contains struct mappings to unmarshal three
// Package mappings contains struct mappings to unmarshall three
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While totally inconsequential, I just wanted to share my observation that the previous spelling was already the correct one (see https://en.wiktionary.org/wiki/unmarshal)?

@ross-spencer
Copy link
Collaborator Author

Update from the PRONOM team is that this is likely to appear: 27th/28th/29th November- will be the same content as v.115 but with the addition of the formattypes tag -- so I will look at correcting these issues this weekend or next weekend. I have been experimenting with staticcheck in other projects but it seems reasonable to park that conversation for another time.

@ross-spencer
Copy link
Collaborator Author

hi @richardlehane I think the necessary changes are in here now (including reverting the docs), a8c5b49

Changes are based on the latest sample DROID file: https://github.com/digital-preservation/PRONOM_Research/blob/0d0869ba854baf44e25e9f8543f4f0c4dc98273c/Test%20Releases/DROID_SignatureFile_Classification_V2.xml -- and no further changes have been discussed in the PRONOM bi-weekly.

It looks like format type/classification will appear as an attribute, so:

<FileFormat FormatType="Image (Raster), Dataset" ID="1807" Name="Nearly Raw Raster Data"
    PUID="fmt/1002" Version="1">
    <InternalSignatureID>1357</InternalSignatureID>
    <Extension>nrrd</Extension>
</FileFormat>

or:

<FileFormat FormatType="Audio, Video" ID="655" MIMEType="video/x-msvideo"
    Name="Audio/Video Interleaved Format" PUID="fmt/5">
    <InternalSignatureID>51</InternalSignatureID>
    <Extension>avi</Extension>
    <HasPriorityOverFileFormatID>2741</HasPriorityOverFileFormatID>
</FileFormat>

Building with noreports:

---
siegfried   : 1.11.0
scandate    : 2023-11-05T19:41:35+01:00
signature   : default.sig
created     : 2023-11-05T19:41:19+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V114.xml; container-signature-20230822.xml; built without reports'
---
filename : 'testdata/skeleton-suite/fmt/fmt-1002-signature-id-1357.nrrd'
filesize : 9
modified : 2023-09-16T22:53:24+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1002'
    format  : 'Nearly Raw Raster Data'
    version : '1'
    mime    : 
    class   : 'Image (Raster), Dataset'
    basis   : 'extension match nrrd; byte match at 0, 9'
    warning : 

and reports:

---
siegfried   : 1.11.0
scandate    : 2023-11-05T19:42:18+01:00
signature   : default.sig
created     : 2023-11-05T19:42:07+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V114.xml; container-signature-20230822.xml'
---
filename : 'testdata/skeleton-suite/fmt/fmt-1002-signature-id-1357.nrrd'
filesize : 9
modified : 2023-09-16T22:53:24+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1002'
    format  : 'Nearly Raw Raster Data'
    version : '1'
    mime    : 
    class   : 'Image (Raster), Dataset'
    basis   : 'extension match nrrd; byte match at 0, 9'
    warning :

Looks equivalent. New tests pass as anticipated. Let me know if it needs further revision. I can rebase into a single commit once it looks okay.

Not sure how to handle replacing the test signature file when time comes to merge? I think it will eventually just get replaced with the new one?

@richardlehane
Copy link
Owner

thanks @ross-spencer this all looks great to me!

Changes to DROID signature file parsing to enable reading of file
format types (FormatType). 'FormatType' is included in soon to be
released DROID signature files as a 'FormatType' attribute within
the FileFormat elements of the FileFormatCollection section of the
signature file.
@ross-spencer ross-spencer marked this pull request as ready for review November 6, 2023 19:45
@ross-spencer
Copy link
Collaborator Author

Note from TNA but they haven't managed to export the changes on their side just yet, from how it sounds it seems like there's an additional process happening during signature export/publication that's removing the expected data. It's something they're looking at.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants