BUG: Remove invalid characters from DICOM strings with unsupported character encoding #968

lassoan · 2021-05-09T19:41:37Z

When a DICOM string could not be decoded because it used an unsupported encoding, it was assumed to be encoded as Latin1.
This led to incorrect special characters appearing in the output and, in some cases, even to invalid UTF-8 strings (causing data corruption; see https://discourse.slicer.org/t/re-failure-to-opening-saved-work/17473/5).

As a workaround, to make decoding issues more visible and to avoid random non-ASCII characters appearing in the output, we replace non-ASCII characters with '?'.
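
The replacement rule can be illustrated with a minimal sketch. This is not the code changed in this PR (CTK is C++); it is a Python illustration, and the function name `replace_non_ascii` and its `placeholder` parameter are hypothetical. It assumes the undecodable value is available as raw bytes.

```python
def replace_non_ascii(raw: bytes, placeholder: str = "?") -> str:
    """Illustrative only: keep 7-bit ASCII bytes, replace every other byte.

    Hypothetical helper, not part of CTK or DCMTK.
    """
    # Bytes below 0x80 map directly to ASCII characters; anything else
    # cannot be interpreted without knowing the encoding, so it is masked.
    return "".join(chr(b) if b < 0x80 else placeholder for b in raw)


# "Müller" stored as Latin1 bytes, treated here as an undecodable string
masked = replace_non_ascii(b"M\xfcller")
print(masked)
```

Because the result contains only ASCII characters, it is always valid UTF-8, and the '?' makes it obvious where information was lost.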

As a proper solution, DCMTK could be used to convert from all DICOM standard encodings. However, for this DCMTK must be built with iconv, which seems to be quite complicated (mostly because DCMTK's build system handles third-party dependencies in an unusual way). Since unsupported encodings are very rare (just a few ISO 2022 IR encodings), this is probably not a serious limitation. For example, DCM4CHEE does not support any of the ISO 2022 IR encodings (https://dcm4chee-arc-cs.readthedocs.io/en/latest/charsets.html).

pieper

👍

lassoan requested a review from pieper May 9, 2021 19:41

lassoan self-assigned this May 9, 2021

pieper approved these changes May 9, 2021

pieper merged commit dc2e128 into commontk:master May 9, 2021

lassoan deleted the dicom-remove-invalid-non-decoded-chars branch March 26, 2024 15:39
