Add API and DICOM-specific implementations for providing supplemental metadata during conversion #4016

Merged
merged 19 commits into from
Dec 5, 2023

melissalinkert
Member

Writers now have the option of implementing the loci.formats.out.IExtraMetadataWriter interface, which indicates that the writer can accept additional metadata beyond what might have been in the input data. bfconvert now includes an -extra-metadata option, which takes a string that can be passed to the writer if it implements IExtraMetadataWriter. In current implementations, that string will be a file path.

DicomWriter now implements IExtraMetadataWriter, and can parse metadata from either a .dcdump file (a subset of the output of dcdump, see loci.formats.dicom.DCDumpProvider) or a .json file in a specific layout (see loci.formats.dicom.DicomJSONProvider). The former is meant to be a "quick start" for workflows that need to copy metadata from an existing DICOM dataset. There are almost certainly some gaps in parsing functionality, but some basic examples like this should work:

$ cat basic-test.dcdump
(0x0048,0x0302)  ?  - Warning - Unrecognized tag - assuming explicit value representation OK
(0x0048,0x0303)  ?  - Warning - Unrecognized tag - assuming explicit value representation OK
(0x0040,0x0710)  ?  - Warning - Unrecognized tag - assuming explicit value representation OK
(0x0008,0x0008) CS Image Type 	 VR=<CS>   VL=<0x001e>  <DERIVED\PRIMARY\OVERVIEW\NONE > 
(0x0018,0x0015) CS Body Part Examined 	 VR=<CS>   VL=<0x0006>  <BRAIN > 
(0x0018,0x1000) LO Device Serial Number 	 VR=<LO>   VL=<0x0008>  <UNKNOWN >
$ bfconvert -extra-metadata basic-test.dcdump test.fake basic-test-dcdump.dcm -debug
# conversion should succeed, observe that the "Warning" lines are skipped and "Image Type" tag is ignored since the writer provides it; "Body Part Examined" should be visible in output file
$ cat hierarchy-test.dcdump 
(0x0018,0xa001) SQ Contributing Equipment Sequence 	 VR=<SQ>   VL=<0xffffffff>  
  ----:
    > (0x0008,0x0070) LO Manufacturer 	 VR=<LO>   VL=<0x0008>  <PixelMed> 
    > (0x0008,0x0080) LO Institution Name 	 VR=<LO>   VL=<0x0008>  <PixelMed> 
    > (0x0008,0x0081) ST Institution Address 	 VR=<ST>   VL=<0x000a>  <Bangor, PA> 
    > (0x0008,0x1040) LO Institutional Department Name 	 VR=<LO>   VL=<0x0014>  <Software Development> 
    > (0x0008,0x1090) LO Manufacturer's Model Name 	 VR=<LO>   VL=<0x0020>  <com.pixelmed.convert.TIFFToDicom> 
    > (0x0018,0x1020) LO Software Version(s) 	 VR=<LO>   VL=<0x0022>  <Vers. Wed Jun  2 14:36:49 EDT 2021> 
    > (0x0018,0xa002) DT Contribution DateTime 	 VR=<DT>   VL=<0x0018>  <20210710234601.105+0000 > 
    > (0x0018,0xa003) ST Contribution Description 	 VR=<ST>   VL=<0x0018>  <TIFF to DICOM conversion> 
    > (0x0040,0xa170) SQ Purpose of Reference Code Sequence 	 VR=<SQ>   VL=<0xffffffff>  
  ----:
    > (0x0008,0x0100) SH Code Value 	 VR=<SH>   VL=<0x0006>  <109103> 
    > (0x0008,0x0102) SH Coding Scheme Designator 	 VR=<SH>   VL=<0x0004>  <DCM > 
    > (0x0008,0x0104) LO Code Meaning 	 VR=<LO>   VL=<0x0014>  <Modifying Equipment > 
$ bfconvert -extra-metadata hierarchy-test.dcdump test.fake hierarchy-test-dcdump.dcm -debug
# conversion should succeed, observe that all input metadata is visible in output file
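
For readers who want to see how lines like the ones above break down, here is a minimal Python sketch of a line parser. It is not the logic in loci.formats.dicom.DCDumpProvider; the regular expression, the `parse_dcdump_line` name, and the returned dictionary layout are assumptions made for illustration. It only shows why the "Warning" lines and the `----:` item delimiters can be skipped: they do not match the attribute pattern.

```python
import re

# Pattern for one attribute line of dcdump-style output, for example:
#   (0x0018,0x0015) CS Body Part Examined   VR=<CS>   VL=<0x0006>  <BRAIN >
# Nested attributes carry a leading "> " marker, which is accepted but not tracked.
# This pattern is an illustrative assumption, not the Bio-Formats implementation.
ATTRIBUTE_LINE = re.compile(
    r"^\s*(?:>\s*)?"
    r"\((?P<group>0x[0-9a-fA-F]{4}),(?P<element>0x[0-9a-fA-F]{4})\)\s+"
    r"(?P<vr>[A-Z]{2})\s+"
    r"(?P<name>.*?)\s+VR=<[A-Z]{2}>\s+VL=<0x[0-9a-fA-F]+>"
    r"(?:\s+<(?P<value>.*)>)?\s*$"
)


def parse_dcdump_line(line):
    """Return a dict describing one attribute line, or None if the line is not one.

    Warning lines (VR shown as "?") and item delimiters ("----:") do not match
    the pattern, so they are skipped by returning None.
    """
    match = ATTRIBUTE_LINE.match(line)
    if match is None:
        return None
    value = match.group("value")
    return {
        "tag": (int(match.group("group"), 16), int(match.group("element"), 16)),
        "vr": match.group("vr"),
        "name": match.group("name"),
        # Sequence (SQ) lines have no inline value; padding spaces are trimmed.
        "value": value.strip() if value is not None else None,
    }
```

Feeding each line of a .dcdump file through `parse_dcdump_line` and discarding the `None` results gives a flat list of attributes; rebuilding the sequence hierarchy from the `>` markers is left out of this sketch.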

The JSON files are where I would propose to focus further effort. As noted in the header comment in DicomJSONProvider, these are loosely based on existing work in https://github.com/QIICR/dcmqi/tree/master/doc/examples, but explicitly define tags and VRs. Some basic examples are:

$ cat basic-test.json
{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   }
}
$ bfconvert -extra-metadata basic-test.json test.fake basic-test-json.dcm
$ cat hierarchy-test.json 
{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   },
   "ContributingEquipmentSequence": {
     "VR": "SQ",
     "Tag": "(0018,a001)",
     "Sequence": {
       "Manufacturer": {
         "Value": "PixelMed",
         "VR": "LO",
         "Tag": "(0008,0070)"
       },
       "ContributionDateTime": {
         "Value": "20210710234601.105+0000",
         "VR": "DT",
         "Tag": "(0018,a002)"
       }
     }
   }
}
$ bfconvert -extra-metadata hierarchy-test.json test.fake hierarchy-test-json.dcm

General API, implementation, and overall usability feedback is welcome as always. Some things to consider in particular:

  • If extra metadata parsing fails, or tag validation fails (e.g. mismatched VR and value), should that fail the whole conversion?
  • Is the proposed JSON structure usable? If not, are there concrete suggestions for improvement?
  • Conflict resolution is very simple at the moment; anything defined by the writer takes precedence, and any extra metadata that would overwrite is ignored. Is this sufficient, or should we consider a subset of the writer-defined metadata as overwritable? I could imagine wanting to overwrite e.g. Pixel Spacing, but allowing e.g. Total Pixel Matrix Rows to be overwritten is a bad idea.

I considered making this more DICOM-specific, using a writer option and the existing -option flag in bfconvert. However, I can imagine that we might want to make use of this feature in other writers in the future. Adding an optional lightweight API seemed like the most flexible path forward.
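
The optional-interface idea can be sketched in a few lines of Python. This is only an analogy for the Java design: `ExtraMetadataWriter`, `PlainWriter`, `DicomLikeWriter`, and `configure_writer` are hypothetical names, and the choice to silently skip writers that lack the method is an assumption of the sketch, not a statement about what bfconvert does.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class ExtraMetadataWriter(Protocol):
    """Hypothetical stand-in for an optional 'accepts extra metadata' interface."""

    def set_extra_metadata(self, metadata_source: str) -> None: ...


class PlainWriter:
    """A writer that does not opt in to extra metadata."""


class DicomLikeWriter:
    """A writer that opts in by providing set_extra_metadata."""

    def __init__(self) -> None:
        self.metadata_source: Optional[str] = None

    def set_extra_metadata(self, metadata_source: str) -> None:
        # In this sketch the string is simply stored; a real writer would
        # decide how to interpret it (for example, as a file path).
        self.metadata_source = metadata_source


def configure_writer(writer: object, extra_metadata: Optional[str]) -> bool:
    """Pass the option value to the writer only if it supports the interface."""
    if extra_metadata is not None and isinstance(writer, ExtraMetadataWriter):
        writer.set_extra_metadata(extra_metadata)
        return True
    return False
```

The point of the design is that the command-line tool needs no knowledge of any particular writer: it checks for the capability and hands over an opaque string.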

This will require a minor release (due to API updates; this will not affect readers or memo files). Note also that this introduces an org.json:json dependency in formats-bsd (similar to formats-gpl). If that's a problem, we can consider other JSON parsers. Opening as a draft PR for now, for 6.14.0 consideration.

/cc @dclunie, @fedorov

Not fully functional, but a place to start connecting bfconvert.
Fiddled with the API a bit to make this work in a more extensible way.
Supplying an extra metadata location is no longer tied to DicomWriter
specifically, but is in an extra interface that can be implemented
by writers that support this feature.

Considered implementing bfconvert connectivity via an option in DicomWriter,
instead of a new command line argument in bfconvert. Either way would work,
but the approach here would allow us to implement a similar feature in
other writers later on (if we choose to do so).
...and fix up some minor issues. A simple test like this should now work:

$ cat test.json
{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   }
}
$ bfconvert -extra-metadata test.json test.fake test.dcm
$ dcdump test.dcm
@dclunie
dclunie commented Jun 18, 2023

I tried:

echo '{ "PatientID": { "Value": "1234", "VR": "LO", "Tag": "(0010,0020)" } }' >/tmp/crap.json

./bfconvert -extra-metadata /tmp/crap.json -noflat -tilex 256 -tiley 256 -compression JPEG CMU-1.svs /tmp/wsiconverted/crap.dcm

But it gave the following warning(s):

Ignoring tag Patient ID = 1234 from provider loci.formats.dicom.DicomJSONProvider@3246fb96

and the supplied PatientID value was not present in the output:

dckey -k PatientID /tmp/wsiconverted/crap_0_3.dcm
Error - Not found - (0x0010,0x0020) LO Patient ID

@dclunie
dclunie commented Jun 18, 2023

I think the JSON format is better than the dcdump format for supplying the metadata, and we do not need both.

The JSON format to use for this sort of thing is always a challenge, since what is in the standard for DICOMweb is not very user friendly, and what is in dcmqi or my own SetCharacteristicsFromSummary both use only keywords, so they then depend on the conversion tool having a DICOM data dictionary to determine the Data Element Tag and VR.

So your approach is a reasonable compromise, though it will be more irritating to use since the caller will have to supply that information.

I have no problem adding a BSD-licensed JSON parser dependency.

@dclunie
dclunie commented Jun 18, 2023

How do you see merging or replacing nested metadata working?

For example, a common use case is to supply content within SpecimenDescriptionSequence, specifically multiple items of the nested SpecimenPreparationStepContentItemSequence within SpecimenPreparationSequence that describe staining with H&E, fixation and embedding with FFPE, etc. Your default behavior populates a few attributes within SpecimenDescriptionSequence such as the SpecimenUID that is created by the converter.

The simplest solution is probably to allow overwriting the entire sequence.

Attached please find an example of some relatively complex nested metadata that describes the sort of thing I normally supply, using the JSON syntax for my own conversion tool:

example_rms_wsi_metadata.json.zip

I agree that preventing overwriting of structural metadata (things like Rows, Columns) is probably a good idea, though I usually don't prevent that in my own tools, and just assume the caller isn't going to do that sort of thing (unless they want to create a deliberately bad object for validator testing).

@dclunie
dclunie commented Jun 18, 2023

I also found that supplying ContainerTypeCodeSequence was ignored - it seems that anything you are populating with default values for standard compliance cannot be overridden (yet). E.g.:

cat <<EOF >/tmp/crap.json
{
  "ContainerTypeCodeSequence": {
     "VR": "SQ",
     "Tag": "(0040,0518)",
     "Sequence": {
       "CodeValue": {
         "Value": "433466003",
         "VR": "SH",
         "Tag": "(0008,0100)"
       },
       "CodingSchemeDesignator": {
         "Value": "SCT",
         "VR": "CS",
         "Tag": "(0008,0102)"
       },
       "CodeMeaning": {
         "Value": "Microscope slide",
         "VR": "LO",
         "Tag": "(0008,0104)"
       }
     }
   }
}
EOF

rm -f /tmp/wsiconverted/*
./bfconvert -extra-metadata /tmp/crap.json -noflat -tilex 256 -tiley 256 -compression JPEG CMU-1.svs /tmp/wsiconverted/crap.dcm

Ignoring tag Container Type Code Sequence = null from provider loci.formats.dicom.DicomJSONProvider@6b8ca3c8

Or maybe I got the syntax wrong, though it didn't complain.

Also, how do you plan to allow multiple items in one sequence to be specified? The syntax you describe only seems to allow for one item.

@dgault dgault removed this from the 6.14.0 milestone Jun 26, 2023
Each tag contained in a JSON file may now optionally contain a
"ResolutionStrategy" property set to "IGNORE", "REPLACE", or "APPEND".
IGNORE means that the tag will be ignored if the same tag code has been defined already.
REPLACE means that the tag will be used to replace any existing tag with the same code.
APPEND means that if there is an existing tag with the same code, the current tag's
value will be appended to the pre-existing tag's value array.
@melissalinkert
Member Author

With a bunch of testing over the last few days, I think the current state of this PR with cc3549e and fb647dd addresses comments so far.

For anything defined in JSON that is not a sequence (VR SQ), the default behavior will now be to replace what was defined by the writer (or simply insert if no previous definition). For anything defined in JSON that is a sequence, the default behavior is to append to the existing sequence defined by the writer, or insert if the sequence is not defined by the writer.

This behavior is now configurable within the JSON, by setting ResolutionStrategy to REPLACE, APPEND, or IGNORE (ignores the JSON metadata in favor of writer-defined metadata, or inserts if the writer did not define anything). ResolutionStrategy is optional, and defaults to REPLACE (non-SQ) or APPEND (SQ). I'd be happy to hear other thoughts on how to do this, but figured something more flexible would be useful - I can imagine cases where you would want a mix of behavior in a single conversion operation.
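
The rules above can be summarized in a short Python sketch. It models each tag as a dictionary with a `VR` and a list-valued `Value`, which is an assumption made for illustration and not the data model used by DicomWriter; `default_strategy` and `merge_tags` are hypothetical names.

```python
VALID_STRATEGIES = {"IGNORE", "REPLACE", "APPEND"}


def default_strategy(vr):
    """Sequences default to APPEND; everything else defaults to REPLACE."""
    return "APPEND" if vr == "SQ" else "REPLACE"


def merge_tags(writer_tags, extra_tags):
    """Combine writer-defined tags with tags supplied as extra metadata.

    Both arguments map a tag code such as "(0018,0015)" to a dict with a "VR"
    and a list-valued "Value"; extra tags may also carry "ResolutionStrategy".
    The inputs are not modified.
    """
    merged = {
        tag: {"VR": entry["VR"], "Value": list(entry["Value"])}
        for tag, entry in writer_tags.items()
    }
    for tag, entry in extra_tags.items():
        strategy = entry.get("ResolutionStrategy") or default_strategy(entry["VR"])
        if strategy not in VALID_STRATEGIES:
            raise ValueError(f"unknown ResolutionStrategy: {strategy}")
        supplied = {"VR": entry["VR"], "Value": list(entry["Value"])}
        if tag not in merged:
            # Nothing defined by the writer: every strategy simply inserts.
            merged[tag] = supplied
        elif strategy == "REPLACE":
            merged[tag] = supplied
        elif strategy == "APPEND":
            merged[tag]["Value"] = merged[tag]["Value"] + supplied["Value"]
        # IGNORE: keep the writer-defined tag untouched.
    return merged
```

Because the strategy is resolved per tag, a single file can mix behaviors: one tag can defer to the writer with IGNORE while a sequence in the same file is replaced wholesale.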

The restriction on overwriting "important" metadata such as Rows and Columns has been lifted, which means that anything can now be overwritten. I debated adding special cases to the default ResolutionStrategy behavior that would set Rows, Columns, etc. to IGNORE by default, but ultimately that ended up looking much more confusing. If useful, a -dry-run option in bfconvert could be added to print the metadata that will be written, without actually converting.

An example that demonstrates multiple items in a sequence, and different combinations of ResolutionStrategy:

{
   "BodyPartExamined": {
     "Value": "BRAIN",
     "VR": "CS",
     "Tag": "(0018,0015)"
   },
   "SpecimenLabelInImage": {
      "Value": "NO",
      "VR": "CS",
      "Tag": "(0048,0010)",
      "ResolutionStrategy": "IGNORE"
   },
   "ContributingEquipmentSequence": {
     "VR": "SQ",
     "Tag": "(0018,a001)",
     "Sequence": {
       "Manufacturer": {
         "Value": "PixelMed",
         "VR": "LO",
         "Tag": "(0008,0070)"
       },
       "ContributionDateTime": {
         "Value": "20210710234601.105+0000",
         "VR": "DT",
         "Tag": "(0018,a002)"
       }
     }
   },
   "OpticalPathSequence": {
    "VR": "SQ",
    "Tag": "(0048,0105)",
    "Sequence": {
      "IlluminationTypeCodeSequence": {
        "VR": "SQ",
        "Tag": "(0022,0016)",
        "Sequence": {
          "CodeValue": {
            "VR": "SH",
            "Tag": "(0008,0100)",
            "Value": "111743"
          },
          "CodingSchemeDesignator": {
            "VR": "SH",
            "Tag": "(0008,0102)",
            "Value": "DCM"
          },
          "CodeMeaning": {
            "VR": "LO",
            "Tag": "(0008,0104)",
            "Value": "Epifluorescence illumination"
          }
        }
      },
      "IlluminationWaveLength": {
        "VR": "FL",
        "Tag": "(0022,0055)",
        "Value": "488.0"
      },
      "OpticalPathIdentifier": {
        "VR": "SH",
        "Tag": "(0048,0106)",
        "Value": "1"
      },
      "OpticalPathDescription": {
        "VR": "ST",
        "Tag": "(0048,0107)",
        "Value": "replacement channel"
      }
    },
    "ResolutionStrategy": "REPLACE"
   }
}

The expected behavior in this example is:

  • BodyPartExamined is inserted.
  • SpecimenLabelInImage is ignored since it is defined by the writer; remove ResolutionStrategy or change it to REPLACE to see the difference.
  • ContributingEquipmentSequence is inserted, since no matching sequence is defined by the writer.
  • OpticalPathSequence replaces the entire OpticalPathSequence defined by the writer. Change ResolutionStrategy to APPEND to see an additional optical path added to the OpticalPathSequence defined by the writer.

I need to add some unit/integration tests here (which I will do early next week), and then propose to take out of draft status and review for the next minor release.

@dclunie
dclunie commented Aug 28, 2023

The ResolutionStrategy approach looks sound to me ... while it gives the caller the ability to really screw things up if they try, it is also powerful enough to satisfy any use-case.

The one thing that would irritate me is the need to specify the tag number (e.g., (0018,0015) for BodyPartExamined). Can you look these up in the data dictionary for standard data elements? That would save the user having to do that and supply it. Only if a (relatively recently added to the standard) keyword is not recognized (error message) should the tag number be required, or for a private data element (for which the caller would also need to know that they need to add a private creator element as well, but that's OK).

This is a bit lenient as it will ignore case and whitespace,
e.g. "Study Date", "StudyDate", and "sTuDydAte " should all resolve to 0008,0020.

Also fixes a couple of small discrepancies in the dictionary.
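
A minimal sketch of that kind of lenient lookup, in Python. The two-entry `DICTIONARY` and the `normalize_keyword` and `lookup_tag` names are assumptions for illustration; the real data dictionary in Bio-Formats is much larger and is not reproduced here.

```python
# Tiny illustrative dictionary keyed by normalized keyword.
DICTIONARY = {
    "studydate": (0x0008, 0x0020),
    "bodypartexamined": (0x0018, 0x0015),
}


def normalize_keyword(keyword):
    """Drop all whitespace and fold case so spelling variants collapse together."""
    return "".join(keyword.split()).lower()


def lookup_tag(keyword):
    """Return the (group, element) pair for a keyword, or None if unknown."""
    return DICTIONARY.get(normalize_keyword(keyword))
```

With this normalization, "Study Date", "StudyDate", and "sTuDydAte " all map to the same dictionary key; an unknown keyword returns `None`, which is where a caller would need to supply the tag explicitly.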
@fedorov
fedorov commented Aug 28, 2023

The one thing that would irritate me is the need to specify the tag number (e.g., (0018,0015) for BodyPartExamined). Can you look these up in the data dictionary for standard data elements?

Same would apply to VR, right? I think those should also come from the data dictionary.

If a VR is not defined, the default is used.
If a VR other than the default is defined, a warning will be shown
but the user-defined VR will be used.
Also fixes tag/VR lookup for sequences.
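
The VR rule in this commit message can be sketched as follows. This is a Python illustration under stated assumptions: `DEFAULT_VR` is a two-entry stand-in for the data dictionary, `choose_vr` is a hypothetical name, and Python's `warnings` module stands in for whatever logging the writer actually uses.

```python
import warnings

# Illustrative default VRs for two standard tags; not the full dictionary.
DEFAULT_VR = {
    (0x0018, 0x0015): "CS",  # BodyPartExamined
    (0x0008, 0x0020): "DA",  # StudyDate
}


def choose_vr(tag, user_vr=None):
    """Pick the VR for a tag, preferring the user's choice when one is given.

    If the user gives no VR, the dictionary default is used (None if the tag
    is unknown). If the user gives a VR that differs from the default, a
    warning is emitted but the user-defined VR is still returned.
    """
    default = DEFAULT_VR.get(tag)
    if user_vr is None:
        return default
    if default is not None and user_vr != default:
        warnings.warn(
            f"VR {user_vr} for tag {tag} differs from default {default}; "
            "using the user-defined VR"
        )
    return user_vr
```

Warning rather than failing keeps the caller in control, which matches the permissive stance taken for ResolutionStrategy elsewhere in this thread.
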
@melissalinkert
Member Author

Last few commits here add a bunch of unit tests and make Tag and VR optional in the provided JSON. I believe that addresses all comments to date, so removing "draft" status and assigning to @dgault and @sbesson for wider review. As I am on leave the week of September 4, I will address any further comments upon return the following week.

It looks like the Maven macos and ubuntu builds are struggling to start (~45 minutes with no indication that the build has actually started), so these may need to be restarted. The Maven Windows builds did pass though, so I'd be surprised if that's a problem with this PR specifically.

@melissalinkert melissalinkert marked this pull request as ready for review September 1, 2023 18:58
@dgault dgault added this to the 7.1.0 milestone Sep 4, 2023
This prevents test run times from depending on the contents of the temp directory.
Without this change, some builds failed as the tests didn't finish.
@joshmoore
Member

All: very sorry for the slow response here. I’ve at least partially been the holdup, since in discussions about this PR I keep coming back to the issue of not having a second format where this is implemented. I agree that adding the IExtraMetadataWriter interface is fairly low-impact, but without validating it against another format, it’s hard to know whether this interface will be usable elsewhere or whether it is ultimately more of an implementation detail of the DICOM writer itself.

I’ve failed to find time, but I think what would really help us understand the impact of the void setExtraMetadata(String metadataSource); method would be, for example, introducing this into the writing of OME-TIFF. A user has some non-DICOM, non-OME-TIFF data that they would like to convert into both of those formats and attach metadata. I think it’s fair to say that no one would particularly expect the input format to be the same for both.

In the case of OME-TIFF, though, I don’t see how to make use of the new "-extra-metadata" argument to specifically attach metadata to one of possibly many images. Is that also a matter of the input format? And if so, are there any rules on that format? Or is each Writer type intended to bring their own “ExtraMetadataFormat”? If in the case of DICOM that’s already a well-known format, then would an -option dicom.metadata=${filepath} option suffice?

@melissalinkert
Member Author

A user has some non-DICOM, non-OME-TIFF data that they would like to convert into both of those formats and attach metadata. I think it’s fair to say that no one would particularly expect the input format to be the same for both.

It's definitely up to each writer that implements IExtraMetadataWriter to determine which metadata formats are accepted. In the case of OMETiffWriter, I could imagine a lenient approach of attaching any input format other than OME-XML as an unlinked annotation.

In the case of OME-TIFF, though, I don’t see how to make use of the new "-extra-metadata" argument to specifically attach metadata to one of possibly many images. Is that also a matter of the input format?

Yes, that would be down to the input format and the writer implementation.

And if so, are there any rules on that format?

No, it's entirely up to the implementing writer to say what it allows.

Or is each Writer type intended to bring their own “ExtraMetadataFormat”?

Exactly, yes.

If in the case of DICOM that’s already a well-known format, then would an -option dicom.metadata=${filepath} option suffice?

As noted in the PR description, I considered doing exactly this, but thought a more generic bfconvert option would give us the flexibility to use this feature elsewhere if we choose. If making this more DICOM-specific would help to get through review at this point, that's fine; it's not realistic to add and test another writer that implements IExtraMetadataWriter for 7.1.0.

Member

@dgault dgault left a comment

In general I like the concept of being able to attach extra metadata, and the API additions here look to be fairly flexible and low impact in their current state. As Josh mentioned, in an ideal world we would have other writer implementations such as OME-TIFF to really be able to test the wider impact, but that is certainly something we can look at implementing in the future. For now, having the DICOM implementation alone will provide benefit for end users.

Tested using bfconvert with the new option and some of the sample JSON provided in the PR. I tested converting both existing DICOM with new extra metadata and OME-TIFF to DICOM with extra metadata. In both cases the conversion completed successfully, and the resulting file could be read and displayed without any exceptions. Inspecting the metadata showed that the additional metadata values were all present and correct.

The resolution strategy also makes sense to me and provides enough power and flexibility to give the user the desired level of control.

Overall the code changes look good from my side and the new tests look to have good coverage. All other builds and tests have remained green with it included, so it looks good to merge to me. I will open a follow-up docs PR to document the new functionality.

5 participants