
CU-8695ucw9b deid transformers fix #490

Merged 8 commits into master on Oct 7, 2024

Conversation

@mart-r mart-r commented Sep 16, 2024

Since transformers==4.42.0, the tokenizer being loaded is expected to have the split_special_tokens attribute. However, the version we've saved (and load) doesn't have that attribute, so processing fails (an exception is raised).
This failure (along with the exception) is logged, but the overall process is never halted.

What this PR does:

  • Creates a workaround for the underlying issue
    • Adds the split_special_tokens attribute to the tokenizer if required
  • Creates a way for these failures to be more easily noticed and fixed in the future
    • By limiting the TransformersNER class to a pre-defined number (10 by default) of similar consecutive exceptions
    • If more than the specified number of consecutive similar exceptions are caught, the last one is raised instead
    • This way, if this happens again, the process is halted after the 10th (by default) consecutive failure and the issue can be seen and (hopefully) fixed
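The two mechanisms described above could be sketched roughly as follows. This is a minimal illustration, not MedCAT's actual code: the helper names (`ensure_split_special_tokens`, `ConsecutiveExceptionGuard`) and the `False` default are assumptions.

```python
def ensure_split_special_tokens(tokenizer, default=False):
    """Add the attribute that newer transformers versions expect,
    if the saved tokenizer predates it. The default value used here
    is an assumption for illustration."""
    if not hasattr(tokenizer, "split_special_tokens"):
        tokenizer.split_special_tokens = default


class ConsecutiveExceptionGuard:
    """Re-raise after a pre-defined number of similar consecutive
    exceptions, instead of logging them forever."""

    def __init__(self, limit=10):
        self.limit = limit
        self.count = 0
        self.last = None  # (type, str) of the previous exception

    def record(self, exc):
        # Two exceptions count as "similar" when they share the
        # same type and the same string representation.
        key = (type(exc), str(exc))
        if key == self.last:
            self.count += 1
        else:
            self.last = key
            self.count = 1
        if self.count >= self.limit:
            raise exc
```

A guard like this would be called once per failed document; any successful document would reset the counter.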

Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).
… fail upon consistent consecutive exceptions.

The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged
…failure.

Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.
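The similarity condition described here can be expressed in a couple of lines (a sketch of the stated rule; the function name is hypothetical and the real implementation may differ):

```python
def similar(exc_a, exc_b):
    # Two exceptions are "similar" when they have the same type
    # and the same string representation.
    return type(exc_a) is type(exc_b) and str(exc_a) == str(exc_b)
```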
@tomolopolis tomolopolis left a comment


why not just raise the exception and halt the process?


mart-r commented Sep 30, 2024

why not just raise the exception and halt the process?

My best guess is that at some point in the past, something in the try-except block was raising an exception due to something that couldn't be fixed (perhaps some kind of undefined characters that caused spacy to fail? I don't know), and thus all exceptions were being caught so the rest of the pipeline could keep working.

The reason I left in the massive try-except block is that there are potentially users who rely on it.

Though in principle, I agree. The exception should be raised, since otherwise the DeID pipe will most likely simply not do anything, and personal information will be revealed where it shouldn't be.

I will do the following:

  • Look into running some DeID models without the try-except block on various data
  • See if/what part has the possibility of raising an exception (and what kind) that we can recover from

And then I'll report back

  • Either we let the exception be raised as it is
  • Or we handle the few that I find in a reasonable way (where possible)

@mart-r

mart-r commented Sep 30, 2024

Went through the DeID of a document that had some characters in the middle of the target text that could potentially cause issues.

  • Tried
    • Control sequences (\x00 and \x1F)
    • Long character sequences (a and \n repeated 100,000 times)
    • Some random special characters (§, ¿)

And the result was:

  • Everything on either side of the added 'special' part was de-identified just as successfully with or without the additional text
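Stress-test inputs like the ones listed above could be built along these lines. This is a sketch under stated assumptions: the surrounding clinical text (including the placeholder name) is invented for illustration, and the actual documents and DeID model used in the experiment are not shown here.

```python
# Base document with personal information on either side of an
# insertion point. "John Smith" is a made-up illustrative name.
base = "Patient John Smith was seen on 2024-01-01. {} Follow-up with John Smith."

stress_parts = [
    "\x00",            # NUL control character
    "\x1f",            # unit-separator control character
    "a" * 100_000,     # very long run of a single character
    "\n" * 100_000,    # very long run of newlines
    "§¿",              # assorted special characters
]

# One stress document per special part, each with the 'special'
# text in the middle of the target text.
documents = [base.format(part) for part in stress_parts]
```

Each document would then be run through the DeID pipeline, checking that the text on both sides of the inserted part is still de-identified.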

As such, I don't really see any reason for the try-except to exist, and I removed it. So the next time we support a transformers version that doesn't work with the saved model, we'll see the crash - and the reason for it - immediately.

@tomolopolis tomolopolis left a comment


lgtm

@mart-r mart-r merged commit 44db08b into master Oct 7, 2024
8 checks passed
mart-r added a commit to mart-r/MedCAT that referenced this pull request Oct 14, 2024
* CU-8695ucw9b: Fix older DeID models due to changes in transformers.

Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).

* CU-8695ucw9b: Add functionality for transformers NER to spectacularly fail upon consistent consecutive exceptions.

The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged

* CU-8695ucw9b: Add tests for exception raising after a pre-defined number of failed document processes

* CU-8695ucw9b: Change conditions for raising exception on consecutive failure.

Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.

* CU-8695ucw9b: Small additional cleanup on successful TNER processing

* CU-8695ucw9b: Use custom exception when failing due to consecutive exceptions

* CU-8695ucw9b: Remove try-except when processing transformers NER to force immediate raising of exception
mart-r added a commit that referenced this pull request Oct 14, 2024
@mart-r mart-r deleted the CU-8695ucw9b-deid-transformers-fix branch November 18, 2024 16:22