Add API and DICOM-specific implementations for providing supplemental metadata during conversion #4016

melissalinkert · 2023-06-10T00:08:28Z

Writers now have the option of implementing the loci.formats.out.IExtraMetadataWriter interface, which indicates that the writer can accept additional metadata beyond what might have been in the input data. bfconvert now includes an -extra-metadata option, which takes a string that can be passed to the writer if it implements IExtraMetadataWriter. In current implementations, that string will be a file path.
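To illustrate the opt-in pattern described above, here is a minimal Python sketch; it is not the actual Java API. `ExtraMetadataWriter`, `set_extra_metadata`, and `configure_writer` are hypothetical stand-ins for `IExtraMetadataWriter`, its setter, and the option handling in bfconvert. Skipping writers that do not implement the interface is an assumption of this sketch.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ExtraMetadataWriter(Protocol):
    """Hypothetical Python stand-in for loci.formats.out.IExtraMetadataWriter."""

    def set_extra_metadata(self, metadata_source: str) -> None: ...


class PlainWriter:
    """A writer that does not accept supplemental metadata."""


class DicomLikeWriter:
    """A writer that opts in by providing set_extra_metadata."""

    def __init__(self) -> None:
        self.metadata_source = None

    def set_extra_metadata(self, metadata_source: str) -> None:
        # In the current implementations the string is a file path.
        self.metadata_source = metadata_source


def configure_writer(writer, extra_metadata):
    """Pass the option value through only if the writer supports it.

    Returns True when the value was handed to the writer.
    """
    if extra_metadata is not None and isinstance(writer, ExtraMetadataWriter):
        writer.set_extra_metadata(extra_metadata)
        return True
    return False
```

The converter only needs to know about the small interface, not about DICOM, which is what would let other writers adopt the same option later.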

DicomWriter now implements IExtraMetadataWriter, and can parse metadata from either a .dcdump file (a subset of the output of dcdump, see loci.formats.dicom.DCDumpProvider) or a .json file in a specific layout (see loci.formats.dicom.DicomJSONProvider). The former is meant to be a "quick start" for workflows that need to copy metadata from an existing DICOM dataset. There are almost certainly some gaps in parsing functionality, but some basic examples like this should work:

$ cat basic-test.dcdump
(0x0048,0x0302)  ?  - Warning - Unrecognized tag - assuming explicit value representation OK
(0x0048,0x0303)  ?  - Warning - Unrecognized tag - assuming explicit value representation OK
(0x0040,0x0710)  ?  - Warning - Unrecognized tag - assuming explicit value representation OK
(0x0008,0x0008) CS Image Type 	 VR=<CS>   VL=<0x001e>  <DERIVED\PRIMARY\OVERVIEW\NONE > 
(0x0018,0x0015) CS Body Part Examined 	 VR=<CS>   VL=<0x0006>  <BRAIN > 
(0x0018,0x1000) LO Device Serial Number 	 VR=<LO>   VL=<0x0008>  <UNKNOWN >
$ bfconvert -extra-metadata basic-test.dcdump test.fake basic-test-dcdump.dcm -debug
# conversion should succeed; observe that the "Warning" lines are skipped and the "Image Type" tag is ignored since the writer provides it; "Body Part Examined" should be visible in the output file
$ cat hierarchy-test.dcdump 
(0x0018,0xa001) SQ Contributing Equipment Sequence 	 VR=<SQ>   VL=<0xffffffff>  
  ----:
    > (0x0008,0x0070) LO Manufacturer 	 VR=<LO>   VL=<0x0008>  <PixelMed> 
    > (0x0008,0x0080) LO Institution Name 	 VR=<LO>   VL=<0x0008>  <PixelMed> 
    > (0x0008,0x0081) ST Institution Address 	 VR=<ST>   VL=<0x000a>  <Bangor, PA> 
    > (0x0008,0x1040) LO Institutional Department Name 	 VR=<LO>   VL=<0x0014>  <Software Development> 
    > (0x0008,0x1090) LO Manufacturer's Model Name 	 VR=<LO>   VL=<0x0020>  <com.pixelmed.convert.TIFFToDicom> 
    > (0x0018,0x1020) LO Software Version(s) 	 VR=<LO>   VL=<0x0022>  <Vers. Wed Jun  2 14:36:49 EDT 2021> 
    > (0x0018,0xa002) DT Contribution DateTime 	 VR=<DT>   VL=<0x0018>  <20210710234601.105+0000 > 
    > (0x0018,0xa003) ST Contribution Description 	 VR=<ST>   VL=<0x0018>  <TIFF to DICOM conversion> 
    > (0x0040,0xa170) SQ Purpose of Reference Code Sequence 	 VR=<SQ>   VL=<0xffffffff>  
  ----:
    > (0x0008,0x0100) SH Code Value 	 VR=<SH>   VL=<0x0006>  <109103> 
    > (0x0008,0x0102) SH Coding Scheme Designator 	 VR=<SH>   VL=<0x0004>  <DCM > 
    > (0x0008,0x0104) LO Code Meaning 	 VR=<LO>   VL=<0x0014>  <Modifying Equipment > 
$ bfconvert -extra-metadata hierarchy-test.dcdump test.fake hierarchy-test-dcdump.dcm -debug
# conversion should succeed; observe that all input metadata is visible in the output file
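
As a rough illustration of the kind of line parsing involved, the following Python sketch extracts tag, VR, name, and value from dcdump-style lines like the ones above. It is not the DCDumpProvider code; `TAG_LINE` and `parse_dcdump` are hypothetical names, and the sketch returns a flat list without tracking which sequence a nested tag belongs to.

```python
import re

# Matches lines such as:
# (0x0018,0x0015) CS Body Part Examined   VR=<CS>   VL=<0x0006>  <BRAIN >
TAG_LINE = re.compile(
    r"\(0x([0-9a-fA-F]{4}),0x([0-9a-fA-F]{4})\)\s+"  # group and element
    r"[A-Z]{2}\s+(.*?)\s+"                            # VR column, then tag name
    r"VR=<([A-Z]{2})>\s+VL=<0x[0-9a-fA-F]+>"          # explicit VR and length
    r"\s*(?:<(.*)>)?"                                 # optional value
)


def parse_dcdump(text):
    """Return (tag, vr, name, value) tuples, skipping unparseable lines."""
    tags = []
    for line in text.splitlines():
        m = TAG_LINE.search(line)
        if m is None:
            continue  # e.g. "Warning" lines and "----:" item separators
        group, element, name, vr, value = m.groups()
        tag = "({},{})".format(group.lower(), element.lower())
        # Values are padded to even length in dcdump output, so trim them.
        tags.append((tag, vr, name, value.strip() if value is not None else None))
    return tags
```

Lines whose VR column is "?" (the "Warning" lines) do not match the pattern and are dropped, which mirrors the skipping behaviour noted in the first example.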

The JSON files are where I would propose to focus further effort. As noted in the header comment in DicomJSONProvider, these are loosely based on existing work in https://github.com/QIICR/dcmqi/tree/master/doc/examples, but explicitly define tags and VRs. Some basic examples are:

$ cat basic-test.json
{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   }
}
$ bfconvert -extra-metadata basic-test.json test.fake basic-test-json.dcm
$ cat hierarchy-test.json 
{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   },
   "ContributingEquipmentSequence": {
     "VR": "SQ",
     "Tag": "(0018,a001)",
     "Sequence": {
       "Manufacturer": {
         "Value": "PixelMed",
         "VR": "LO",
         "Tag": "(0008,0070)"
       },
       "ContributionDateTime": {
         "Value": "20210710234601.105+0000",
         "VR": "DT",
         "Tag": "(0018,a002)"
       }
     }
   }
}
$ bfconvert -extra-metadata hierarchy-test.json test.fake hierarchy-test-json.dcm
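
For readers evaluating the layout, this Python sketch walks a file in the proposed structure and lists every tag, including those nested under "Sequence". It only assumes the keys shown in the examples above ("Tag", "VR", "Value", "Sequence"); `flatten_tags` and `load_tags` are hypothetical helper names and are not part of DicomJSONProvider.

```python
import json


def flatten_tags(node, path=()):
    """Yield (path, tag, vr, value) for every entry in the proposed layout.

    `node` maps a keyword to an object with "Tag" and "VR", plus either
    "Value" or a nested "Sequence" that uses the same layout.
    """
    for keyword, entry in node.items():
        here = path + (keyword,)
        yield here, entry.get("Tag"), entry.get("VR"), entry.get("Value")
        if "Sequence" in entry:
            # Nested tags are reported with the full keyword path.
            yield from flatten_tags(entry["Sequence"], here)


def load_tags(text):
    """Parse JSON text and return the flattened tag list."""
    return list(flatten_tags(json.loads(text)))
```

Calling `load_tags(open("hierarchy-test.json").read())` on the second example would list BodyPartExamined, ContributingEquipmentSequence, and then the two tags nested inside the sequence.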

General API, implementation, and overall usability feedback is welcome as always. Some things to consider in particular:

If extra metadata parsing fails, or tag validation fails (e.g. mismatched VR and value), should that fail the whole conversion?
Is the proposed JSON structure usable? If not, are there concrete suggestions for improvement?
Conflict resolution is very simple at the moment; anything defined by the writer takes precedence, and any extra metadata that would overwrite is ignored. Is this sufficient, or should we consider a subset of the writer-defined metadata as overwritable? I could imagine wanting to overwrite e.g. Pixel Spacing, but allowing e.g. Total Pixel Matrix Rows to be overwritten is a bad idea.

I considered making this more DICOM-specific, using a writer option and the existing -option flag in bfconvert. However, I can imagine that we might want to make use of this feature in other writers in the future. Adding an optional lightweight API seemed like the most flexible path forward.

This will require a minor release (due to API updates; this will not affect readers or memo files). Note also that this introduces an org.json:json dependency in formats-bsd (similar to formats-gpl). If that's a problem, we can consider other JSON parsers. Opening as a draft PR for now, for 6.14.0 consideration.

/cc @dclunie, @fedorov

Fiddled with the API a bit to make this work in a more extensible way. Supplying an extra metadata location is no longer tied to DicomWriter specifically, but is in an extra interface that can be implemented by writers that support this feature. Considered implementing bfconvert connectivity via an option in DicomWriter, instead of a new command line argument in bfconvert. Either way would work, but the approach here would allow us to implement a similar feature in other writers later on (if we choose to do so).

dclunie · 2023-06-18T13:05:50Z

I tried:

echo '{ "PatientID": { "Value": "1234", "VR": "LO", "Tag": "(0010,0020)" } }' >/tmp/crap.json

./bfconvert -extra-metadata /tmp/crap.json -noflat -tilex 256 -tiley 256 -compression JPEG CMU-1.svs /tmp/wsiconverted/crap.dcm

But it gave the following warning(s):

Ignoring tag Patient ID = 1234 from provider loci.formats.dicom.DicomJSONProvider@3246fb96

and the supplied PatientID value was not present in the output:

dckey -k PatientID /tmp/wsiconverted/crap_0_3.dcm
Error - Not found - (0x0010,0x0020) LO Patient ID

dclunie · 2023-06-18T13:20:19Z

I think the JSON format is better than the dcdump format for supplying the metadata, and we do not need both.

The JSON format to use for this sort of thing is always a challenge, since what is in the standard for DICOMweb is not very user friendly, and what is in dcmqi or my own SetCharacteristicsFromSummary both use only keywords, so they then depend on the conversion tool having a DICOM data dictionary to determine the Data Element Tag and VR.

So your approach is a reasonable compromise, though it will be more irritating to use since the caller will have to supply that information.

I have no problem adding a BSD-licensed JSON parser dependency.

dclunie · 2023-06-18T13:51:26Z

How do you see merging or replacing nested metadata working?

For example, a common use case is to supply content within SpecimenDescriptionSequence, specifically multiple items of the nested SpecimenPreparationStepContentItemSequence within SpecimenPreparationSequence that describe staining with H&E, fixation and embedding with FFPE, etc. Your default behavior populates a few attributes within SpecimenDescriptionSequence such as the SpecimenUID that is created by the convertor.

The simplest solution is probably to allow overwriting the entire sequence.

Attached please find an example of some relatively complex nested metadata that describes the sort of thing I normally supply, using the JSON syntax for my own conversion tool:

example_rms_wsi_metadata.json.zip

I agree that preventing overwriting of structural metadata (things like Rows, Columns) is probably a good idea, though I usually don't prevent that in my own tools, just assume the caller isn't going to do that sort of thing (unless they want to create a deliberately bad object for validator testing).

dclunie · 2023-06-18T14:21:15Z

I also found that supplying ContainerTypeCodeSequence was ignored - it seems that anything you are populating with default values for standard compliance cannot be overridden (yet). E.g.:

cat <<EOF >/tmp/crap.json
{
  "ContainerTypeCodeSequence": {
     "VR": "SQ",
     "Tag": "(0040,0518)",
     "Sequence": {
       "CodeValue": {
         "Value": "433466003",
         "VR": "SH",
         "Tag": "(0008,0100)"
       },
       "CodingSchemeDesignator": {
         "Value": "SCT",
         "VR": "CS",
         "Tag": "(0008,0102)"
       },
       "CodeMeaning": {
         "Value": "Microscope slide",
         "VR": "LO",
         "Tag": "(0008,0104)"
       }
     }
   }
}
EOF

rm -f /tmp/wsiconverted/*
./bfconvert -extra-metadata /tmp/crap.json -noflat -tilex 256 -tiley 256 -compression JPEG CMU-1.svs /tmp/wsiconverted/crap.dcm

Ignoring tag Container Type Code Sequence = null from provider loci.formats.dicom.DicomJSONProvider@6b8ca3c8

Or maybe I got the syntax wrong, though it didn't complain.

Also, how do you plan to allow multiple items in one sequence to be specified? The syntax you describe only seems to allow for one item.

Each tag contained in a JSON file may now optionally contain a "ResolutionStrategy" property set to "IGNORE", "REPLACE", or "APPEND". IGNORE means that the tag will be ignored if the same tag code has been defined already. REPLACE means that the tag will be used to replace any existing tag with the same code. APPEND means that if there is an existing tag with the same code, the current tag's value will be appended to the pre-existing tag's value array.

melissalinkert · 2023-08-24T20:38:29Z

With a bunch of testing over the last few days, I think the current state of this PR with cc3549e and fb647dd addresses comments so far.

For anything defined in JSON that is not a sequence (VR SQ), the default behavior will now be to replace what was defined by the writer (or simply insert if no previous definition). For anything defined in JSON that is a sequence, the default behavior is to append to the existing sequence defined by the writer, or insert if the sequence is not defined by the writer.

This behavior is now configurable within the JSON, by setting ResolutionStrategy to REPLACE, APPEND, or IGNORE (ignores the JSON metadata in favor of writer-defined metadata, or inserts if the writer did not define anything). ResolutionStrategy is optional, and defaults to REPLACE (non-SQ) or APPEND (SQ). I'd be happy to hear other thoughts on how to do this, but figured something more flexible would be useful - I can imagine cases where you would want a mix of behavior in a single conversion operation.
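
The rules above can be sketched in a few lines of Python. This is a simplified model, not the DicomWriter code: writer-defined metadata is represented as a dict mapping a tag to a list of values (or sequence items), and `default_strategy` and `merge_tag` are hypothetical names.

```python
def default_strategy(vr):
    """REPLACE for ordinary tags, APPEND for sequences (VR "SQ")."""
    return "APPEND" if vr == "SQ" else "REPLACE"


def merge_tag(existing, tag, vr, value, strategy=None):
    """Merge one supplemental tag into `existing` (dict: tag -> list of values).

    `strategy` is the optional ResolutionStrategy from the JSON; when it is
    missing, the default depends on whether the tag is a sequence.
    """
    strategy = strategy or default_strategy(vr)
    if strategy not in ("IGNORE", "REPLACE", "APPEND"):
        raise ValueError("unknown ResolutionStrategy: " + strategy)
    if tag not in existing:
        existing[tag] = [value]        # nothing to conflict with: insert
    elif strategy == "REPLACE":
        existing[tag] = [value]        # drop whatever the writer defined
    elif strategy == "APPEND":
        existing[tag].append(value)    # add to the writer-defined values
    # IGNORE with an existing tag: keep the writer-defined value untouched
    return existing
```

Note that every strategy inserts when the writer has not defined the tag; the strategies only differ when there is a conflict.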

The restriction on overwriting "important" metadata such as Rows and Columns has been lifted, which means that anything can now be overwritten. I debated adding special cases to the default ResolutionStrategy behavior that would set Rows, Columns, etc. to IGNORE by default, but ultimately that ended up looking much more confusing. If useful, a -dry-run option in bfconvert could be added to print the metadata that will be written, without actually converting.

An example that demonstrates multiple items in a sequence, and different combinations of ResolutionStrategy:

{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   },
   "SpecimenLabelInImage": {
      "Value": "NO",
      "VR": "CS",
      "Tag": "(0048,0010)",
      "ResolutionStrategy": "IGNORE"
   },
   "ContributingEquipmentSequence": {
     "VR": "SQ",
     "Tag": "(0018,a001)",
     "Sequence": {
       "Manufacturer": {
         "Value": "PixelMed",
         "VR": "LO",
         "Tag": "(0008,0070)"
       },
       "ContributionDateTime": {
         "Value": "20210710234601.105+0000",
         "VR": "DT",
         "Tag": "(0018,a002)"
       }
     }
   },
   "OpticalPathSequence": {
    "VR": "SQ",
    "Tag": "(0048,0105)",
    "Sequence": {
      "IlluminationTypeCodeSequence": {
        "VR": "SQ",
        "Tag": "(0022,0016)",
        "Sequence": {
          "CodeValue": {
            "VR": "SH",
            "Tag": "(0008,0100)",
            "Value": "111743"
          },
          "CodingSchemeDesignator": {
            "VR": "SH",
            "Tag": "(0008,0102)",
            "Value": "DCM"
          },
          "CodeMeaning": {
            "VR": "LO",
            "Tag": "(0008,0104)",
            "Value": "Epifluorescence illumination"
          }
        }
      },
      "IlluminationWaveLength": {
        "VR": "FL",
        "Tag": "(0022,0055)",
        "Value": "488.0"
      },
      "OpticalPathIdentifier": {
        "VR": "SH",
        "Tag": "(0048,0106)",
        "Value": "1"
      },
      "OpticalPathDescription": {
        "VR": "ST",
        "Tag": "(0048,0107)",
        "Value": "replacement channel"
      }
    },
    "ResolutionStrategy": "REPLACE"
   }
}

The expected behavior in this example is:

BodyPartExamined is inserted.
SpecimenLabelInImage is ignored since it is defined by the writer; remove ResolutionStrategy or change it to REPLACE to see the difference.
ContributingEquipmentSequence is inserted, since no matching sequence is defined by the writer.
OpticalPathSequence replaces the entire OpticalPathSequence defined by the writer. Change ResolutionStrategy to APPEND to see an additional optical path added to the OpticalPathSequence defined by the writer.

I need to add some unit/integration tests here (which I will do early next week), and then propose to take out of draft status and review for the next minor release.

dclunie · 2023-08-28T11:40:39Z

The ResolutionStrategy approach looks sound to me ... while it gives the caller the ability to really screw things up if they try, it is also powerful enough to satisfy any use-case.

The one thing that would irritate me is the need to specify the tag number (e.g., (0018,0015) for BodyPartExamined). Can you look these up in the data dictionary for standard data elements? That would save the user having to do that and supply it. Only if a (relatively recently added to the standard) keyword is not recognized (error message) should the tag number be required, or for a private data element (for which the caller would also need to know that they need to add a private creator element as well, but that's OK).

fedorov · 2023-08-28T20:14:10Z

> The one thing that would irritate me is the need to specify the tag number (e.g., (0018,0015) for BodyPartExamined). Can you look these up in the data dictionary for standard data elements?

Same would apply to VR, right? I think those should also come from the data dictionary.

melissalinkert · 2023-09-01T18:58:26Z

Last few commits here add a bunch of unit tests and make Tag and VR optional in the provided JSON. I believe that addresses all comments to date, so removing "draft" status and assigning to @dgault and @sbesson for wider review. As I am on leave the week of September 4, I will address any further comments upon return the following week.

It looks like the Maven macos and ubuntu builds are struggling to start (~45 minutes with no indication that the build has actually started), so these may need to be restarted. The Maven Windows builds did pass though, so I'd be surprised if that's a problem with this PR specifically.

joshmoore · 2023-11-24T17:58:18Z

All, very sorry for the slowness in the response here. I’ve at least partially been the holdup, since in discussions about this PR I keep coming back to the issue of not having a second format where this is implemented. I agree that adding the IExtraMetadataWriter interface is fairly low-impact, but without validating it against another format, it’s hard to know if this interface will be usable elsewhere or if this is ultimately more of an implementation detail of the DICOM writer itself.

I’ve failed to find time, but I think what would really help us understand the impact of the void setExtraMetadata(String metadataSource); method would be, for example, introducing this into the writing of OME-TIFF. A user has some non-DICOM, non-OME-TIFF that they would like to convert into both of those formats and attach metadata. I think it’s fair to say that no one would particularly expect the input format to be the same for both.

In the case of OME-TIFF, though, I don’t see how to make use of the new "-extra-metadata" argument to specifically attach metadata to one of possibly many images. Is that also a matter of the input format? And if so, are there any rules on that format? Or is each Writer type intended to bring their own “ExtraMetadataFormat”? If in the case of DICOM that’s already a well-known format, then would an -option dicom.metadata=${filepath} option suffice?

melissalinkert · 2023-11-28T16:41:15Z

> A user has some non-DICOM, non-OME-TIFF that they would like to convert into both of those formats and attach metadata. I think it’s fair to say that no one would particularly expect the input format to be the same for both.

It's definitely up to each writer that implements IExtraMetadataWriter to determine which metadata formats are accepted. In the case of OMETiffWriter, I could imagine a lenient approach of attaching any input format other than OME-XML as an unlinked annotation.

> In the case of OME-TIFF, though, I don’t see how to make use of the new "-extra-metadata" argument to specifically attach metadata to one of possibly many images. Is that also a matter of the input format?

Yes, that would be down to the input format and the writer implementation.

> And if so, are there any rules on that format?

No, it's entirely up to the implementing writer to say what it allows.

> Or is each Writer type intended to bring their own “ExtraMetadataFormat”?

Exactly, yes.

> If in the case of DICOM that’s already a well-known format, then would an -option dicom.metadata=${filepath} option suffice?

As noted in the PR description, I considered doing exactly this, but thought a more generic bfconvert option would give us the flexibility to use this feature elsewhere if we choose. If making this more DICOM-specific would help to get through review at this point, that's fine; it's not realistic to add and test another writer that implements IExtraMetadataWriter for 7.1.0.

dgault

In general I like the concept of being able to attach extra metadata, and the API additions here look to be fairly flexible and low impact in their current state. As Josh mentioned, in an ideal world we would have other writer implementations such as OME-TIFF to really be able to test the wider impact, but that is certainly something we can look at implementing in the future. For now having the DICOM implementation alone will provide benefit for end users.

Tested using bfconvert with the new option and some of the sample JSON provided in the PR. I tested converting both existing DICOM with new extra metadata as well as converting OME-TIFF to DICOM with extra metadata. In both cases the conversion completed successfully, and the resulting file could be read and displayed without any exceptions. Inspecting the metadata showed that the additional metadata values were all correct and present.

The resolution strategy also makes sense to me and provides enough power and flexibility to give the user the desired level of control.

Overall the code changes look good from my side and the new tests look to have good coverage. All other builds and tests have remained green with this included, so it looks good to merge to me. I will open a follow-up docs PR to document the new functionality.

melissalinkert added 7 commits May 30, 2023 15:46

Initial version of interface and writer API for accepting extra DICOM tags

07d8014

Add initial implementation of ITagProvider that parses dcdump

280b43d

Not fully functional, but a place to start connecting bfconvert.

Add basic JSON provider for DICOM tags

85433d3

...and fix up some minor issues. A simple test like this should now work:

$ cat test.json
{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   }
}
$ bfconvert -extra-metadata test.json test.fake test.dcm
$ dcdump test.dcm

Fix up some string parsing

c71011b

Allow tag hierarchies in dcdump provider

c55f807

Allow tag hierarchies in DICOM JSON provider

c5b3427

melissalinkert added this to the 6.14.0 milestone Jun 10, 2023

melissalinkert marked this pull request as draft June 10, 2023 00:09

melissalinkert mentioned this pull request Jun 16, 2023

Allow DICOM writer to accept additional DICOM-specific metadata #3744

Closed

dgault removed this from the 6.14.0 milestone Jun 26, 2023

melissalinkert added 5 commits July 18, 2023 16:30

Remove dcdump metadata provider

c03a28d

Fix some sorting issues and allow appending to existing sequences

fb647dd

Merge branch 'develop' of github.com:openmicroscopy/bioformats into dicom-provide-metadata

72831ff

Make sure trailing padding has even length

52dfc64

Attempt to lookup tags by name if not specified in JSON

2159cd4

This is a bit lenient as it will ignore case and whitespace, e.g. "Study Date", "StudyDate", and "sTuDydAte " should all resolve to 0008,0020. Also fixes a couple of small discrepancies in the dictionary.
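
The lenient matching described in this commit note could look roughly like the following Python sketch. The three-entry `DICTIONARY` is a tiny illustrative excerpt using tags that appear in this thread, and `normalize` and `lookup_tag` are hypothetical names rather than the actual Bio-Formats dictionary code.

```python
# A tiny excerpt of a data dictionary; the real dictionary is far larger.
DICTIONARY = {
    "StudyDate": "(0008,0020)",
    "PatientID": "(0010,0020)",
    "BodyPartExamined": "(0018,0015)",
}


def normalize(keyword):
    """Remove all whitespace and lowercase, so spelling variants compare equal."""
    return "".join(keyword.split()).lower()


# Index the dictionary once by normalized keyword.
LOOKUP = {normalize(keyword): tag for keyword, tag in DICTIONARY.items()}


def lookup_tag(keyword):
    """Return the tag for a keyword, or None if it is not in the dictionary."""
    return LOOKUP.get(normalize(keyword))
```

With this, `lookup_tag("Study Date")`, `lookup_tag("StudyDate")`, and `lookup_tag("sTuDydAte ")` all return `"(0008,0020)"`, while an unknown keyword returns `None` and would still need an explicit "Tag" in the JSON.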

melissalinkert added 3 commits August 30, 2023 20:55

Add VR lookup

afcce2b

If a VR is not defined, the default is used. If a VR other than the default is defined, a warning will be shown but the user-defined VR will be used.
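
A minimal Python sketch of that VR rule follows, assuming a small hypothetical `DEFAULT_VR` table in place of the real data dictionary. `resolve_vr` is an illustrative name, and using Python's `warnings` module is a stand-in for whatever logging the writer actually uses.

```python
import warnings

# Hypothetical excerpt: default VR per tag, as a data dictionary would give it.
DEFAULT_VR = {
    "(0018,0015)": "CS",  # Body Part Examined
    "(0008,0070)": "LO",  # Manufacturer
}


def resolve_vr(tag, user_vr=None):
    """Pick the VR to write for `tag`.

    Falls back to the dictionary default when the JSON omits "VR"; when the
    JSON supplies a different VR, warn but still honour the user's choice.
    """
    default = DEFAULT_VR.get(tag)
    if user_vr is None:
        return default
    if default is not None and user_vr != default:
        warnings.warn(
            "VR {} for {} differs from default {}".format(user_vr, tag, default)
        )
    return user_vr
```

Keeping the user-supplied VR, rather than silently correcting it, leaves the caller in control at the cost of allowing a mismatched VR through with only a warning.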

Add some unit tests for supplemental metadata

7ece5a4

Expand unit tests to cover sequences and ResolutionStrategy

c7fd528

Also fixes tag/VR lookup for sequences.

melissalinkert marked this pull request as ready for review September 1, 2023 18:58

melissalinkert requested a review from dgault September 1, 2023 18:58

melissalinkert requested a review from sbesson September 1, 2023 18:58

dgault added this to the 7.1.0 milestone Sep 4, 2023

dgault requested a review from joshmoore September 4, 2023 13:27

melissalinkert added 3 commits September 11, 2023 13:06

Temporarily comment out all but one test, for build failure investigation

09f65ff

Revert "Temporarily comment out all but one test, for build failure investigation"

5a787f6

This reverts commit 09f65ff.

Turn off file grouping in test step that reads converted DICOM files

caccc63

This prevents test run times from depending on the contents of the temp directory. Without this change, some builds failed as the tests didn't finish.

melissalinkert mentioned this pull request Sep 11, 2023

Update GitHub Actions from checkout v2 to v3 #4096

Merged

dgault approved these changes Dec 5, 2023

dgault merged commit 766577d into ome:develop Dec 5, 2023
17 checks passed

sbesson mentioned this pull request Dec 5, 2023

Bump org.json:json from 20230227 to 20231013 in /components/formats-bsd #4123

Merged

melissalinkert mentioned this pull request Dec 6, 2023

7.1.0 API updates ome/bio-formats-documentation#349

Merged

joshmoore mentioned this pull request Jun 21, 2024

Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

Merged

melissalinkert deleted the dicom-provide-metadata branch September 6, 2024 19:01
