
CU-8695ucw9b deid transformers fix #490

Merged 8 commits into master on Oct 7, 2024

Conversation

@mart-r mart-r commented Sep 16, 2024

Since transformers==4.42.0, the tokenizer being loaded is expected to have the split_special_tokens attribute. However, the version we've saved (and load) doesn't have that attribute, so processing fails (an exception is raised).
This failure (along with the exception) is logged, but the overall process is never halted.

What this PR does:

  • Creates a workaround for the underlying issue
    • Adds the split_special_tokens attribute to the tokenizer if required
  • Creates a way for these failures to be more easily noticed and fixed in the future
    • By limiting the TransformersNER class to a pre-defined number (10 by default) of similar consecutive exceptions
    • If more than the specified number of consecutive similar exceptions are caught, the last one is raised instead
    • This way, if this happens again, the process is halted after the 10th (by default) consecutive failure and the issue can be seen and (hopefully) fixed
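The two mechanisms described above could be sketched roughly as follows. This is a minimal illustration, not MedCAT's actual code: the helper names (`ensure_split_special_tokens`, `ConsecutiveExceptionGuard`) and the `False` default are assumptions.

```python
def ensure_split_special_tokens(tokenizer, default=False):
    """Add the attribute that newer transformers versions expect,
    if the saved tokenizer predates it. The default value used here
    is an assumption for illustration."""
    if not hasattr(tokenizer, "split_special_tokens"):
        tokenizer.split_special_tokens = default


class ConsecutiveExceptionGuard:
    """Re-raise after a pre-defined number of similar consecutive
    exceptions, instead of logging them forever."""

    def __init__(self, limit=10):
        self.limit = limit
        self.count = 0
        self.last = None  # (type, str) of the previous exception

    def record(self, exc):
        # Two exceptions count as "similar" when they share the
        # same type and the same string representation.
        key = (type(exc), str(exc))
        if key == self.last:
            self.count += 1
        else:
            self.last = key
            self.count = 1
        if self.count >= self.limit:
            raise exc
```

A guard like this would be called once per failed document; any successful document would reset the counter.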

Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).
… fail upon consistent consecutive exceptions.

The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged
…failure.

Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.
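The similarity condition described here can be expressed in a couple of lines (a sketch of the stated rule; the function name is hypothetical and the real implementation may differ):

```python
def similar(exc_a, exc_b):
    # Two exceptions are "similar" when they have the same type
    # and the same string representation.
    return type(exc_a) is type(exc_b) and str(exc_a) == str(exc_b)
```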
@tomolopolis tomolopolis left a comment


why not just raise the exception and halt the process?


mart-r commented Sep 30, 2024

why not just raise the exception and halt the process?

My best guess is that at some point in the past, something in the try-except block was raising an exception due to something that couldn't be fixed (perhaps some kind of undefined characters that caused spacy to fail? I don't know), and thus all exceptions were being caught so the rest of the pipeline could keep working.

The reason I left in the massive try-except block is that there are potentially users who rely on it.

Though in principle, I agree. The exception should be raised, since otherwise the DeID pipe will most likely simply not do anything, and personal information will be revealed where it shouldn't be.

I will do the following:

  • Look into running some DeID models without the try-except block on various data
  • See if/what part has the possibility of raising an exception (and what kind) that we can recover from

And then I'll report back

  • Either we let the exception be raised as it is
  • Or we handle the few that I find in a reasonable way (where possible)

@mart-r

mart-r commented Sep 30, 2024

Went through the DeID of a document that had some characters in the middle of the target text that could potentially cause issues.

  • Tried
    • Control sequences (\x00 and \x1F)
    • Long character sequences (a and \n repeated 100,000 times)
    • Some random special characters (§, ¿)

And the result was:

  • Everything on either side of the added 'special' part was de-identified just as successfully with or without the additional text
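Stress-test inputs like the ones listed above could be built along these lines. This is a sketch under stated assumptions: the surrounding clinical text (including the placeholder name) is invented for illustration, and the actual documents and DeID model used in the experiment are not shown here.

```python
# Base document with personal information on either side of an
# insertion point. "John Smith" is a made-up illustrative name.
base = "Patient John Smith was seen on 2024-01-01. {} Follow-up with John Smith."

stress_parts = [
    "\x00",            # NUL control character
    "\x1f",            # unit-separator control character
    "a" * 100_000,     # very long run of a single character
    "\n" * 100_000,    # very long run of newlines
    "§¿",              # assorted special characters
]

# One stress document per special part, each with the 'special'
# text in the middle of the target text.
documents = [base.format(part) for part in stress_parts]
```

Each document would then be run through the DeID pipeline, checking that the text on both sides of the inserted part is still de-identified.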

As such, I don't really see any reason for the try-except to exist, and I removed it. So the next time we support a transformers version that doesn't work with the saved model, we'll see the crash - and the reason for it - immediately.

@tomolopolis tomolopolis left a comment


lgtm

@mart-r mart-r merged commit 44db08b into master Oct 7, 2024
8 checks passed
mart-r added a commit to mart-r/MedCAT that referenced this pull request Oct 14, 2024
* CU-8695ucw9b: Fix older DeID models due to changes in transformers.

Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).

* CU-8695ucw9b: Add functionality for transformers NER to spectacularly fail upon consistent consecutive exceptions.

The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged

* CU-8695ucw9b: Add tests for exception raising after a pre-defined number of failed document processes

* CU-8695ucw9b: Change conditions for raising exception on consecutive failure.

Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.

* CU-8695ucw9b: Small additional cleanup on successful TNER processing

* CU-8695ucw9b: Use custom exception when failing due to consecutive exceptions

* CU-8695ucw9b: Remove try-except when processing transformers NER to force immediate raising of exception
mart-r added a commit that referenced this pull request Oct 14, 2024
@mart-r mart-r deleted the CU-8695ucw9b-deid-transformers-fix branch November 18, 2024 16:22