BUG: Remove invalid characters from DICOM strings with unsupported character encoding #968

Merged

@lassoan lassoan commented May 9, 2021

When a DICOM string could not be decoded because it used an unsupported encoding, it was assumed to be encoded as Latin1.
This led to incorrect special characters appearing in the output and, in some cases, even to invalid UTF-8 strings (causing data corruption - see https://discourse.slicer.org/t/re-failure-to-opening-saved-work/17473/5).

As a workaround, to make decoding issues more visible and to avoid random non-ASCII characters appearing in the output, we now replace non-ASCII characters in such strings with '?'.
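
The difference between the old Latin1 fallback and the workaround can be illustrated with a minimal Python sketch. This is not the code changed in this pull request (that is C++); the helper name `replace_non_ascii`, the sample name, and the choice of ISO 8859-5 as the "undecodable" encoding are assumptions made purely for illustration.

```python
def replace_non_ascii(raw: bytes, placeholder: str = "?") -> str:
    """Return raw as text, replacing every byte outside 7-bit ASCII."""
    return "".join(chr(b) if b < 0x80 else placeholder for b in raw)


# Hypothetical patient name stored in an encoding the reader cannot decode
# (ISO 8859-5 is used here only as a stand-in for an unsupported encoding).
raw = "Иван^Петров".encode("iso8859_5")

# Old behaviour: guess Latin1. Decoding never fails, but the text is wrong.
latin1_guess = raw.decode("latin_1")

# Workaround: keep the ASCII part and make the decoding failure obvious.
sanitized = replace_non_ascii(raw)

print(latin1_guess)
print(sanitized)  # ????^??????
```

The Latin1 guess produces text that looks like valid accented characters, so the problem is easy to miss; the sanitized string keeps the ASCII separator and clearly signals that the rest could not be decoded.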

As a proper solution, DCMTK could be used to convert from all DICOM standard encodings. However, for this DCMTK must be built with iconv, which seems to be quite complicated (mostly because DCMTK's build system handles third-party dependencies in an unusual way). Since unsupported encodings are very rare (just a few ISO 2022 IR encodings), this is probably not a serious limitation. For example, DCM4CHEE does not support any of the ISO 2022 IR encodings (https://dcm4chee-arc-cs.readthedocs.io/en/latest/charsets.html).
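
The overall strategy (decode with a known encoding when possible, otherwise fall back to '?' replacement) might be sketched as follows. This is a hedged Python illustration, not the implementation in this pull request: the function name `decode_dicom_string` is hypothetical, and the mapping table is a small, partial selection of DICOM Specific Character Set (0008,0005) defined terms, not the list of encodings that the library actually supports.

```python
# Partial, illustrative mapping from DICOM Specific Character Set defined
# terms to Python codec names. Assumption: not the library's actual table.
DICOM_TO_PYTHON_CODEC = {
    "": "ascii",             # attribute absent or empty: default repertoire
    "ISO_IR 6": "ascii",
    "ISO_IR 100": "latin_1",
    "ISO_IR 101": "iso8859_2",
    "ISO_IR 144": "iso8859_5",
    "ISO_IR 192": "utf_8",
    "GB18030": "gb18030",
}


def decode_dicom_string(raw: bytes, charset: str) -> str:
    """Decode raw using charset if known, else replace non-ASCII with '?'."""
    codec = DICOM_TO_PYTHON_CODEC.get(charset.strip())
    if codec is not None:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            pass  # declared encoding does not match the bytes; fall through
    # Unsupported (or mismatched) encoding: keep ASCII, flag everything else.
    return raw.decode("ascii", errors="replace").replace("\ufffd", "?")


print(decode_dicom_string("Müller".encode("latin_1"), "ISO_IR 100"))  # Müller
print(decode_dicom_string(b"M\xfcller", "ISO 2022 IR 149"))           # M?ller
```

With this structure, adding support for another encoding only requires a new table entry, while anything still missing from the table degrades to visibly marked text instead of silently wrong characters.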

@lassoan lassoan requested a review from pieper May 9, 2021 19:41
@lassoan lassoan self-assigned this May 9, 2021

@pieper pieper left a comment

👍

@pieper pieper merged commit dc2e128 into commontk:master May 9, 2021
@lassoan lassoan deleted the dicom-remove-invalid-non-decoded-chars branch March 26, 2024 15:39