CU-8695ucw9b deid transformers fix #490
Conversation
Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).
Add functionality for transformers NER to spectacularly fail upon consistent consecutive exceptions. The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged
Add tests for exception raising after a pre-defined number of failed document processes
Change conditions for raising exception on consecutive failure. Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.
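A minimal sketch of that similarity condition (the function name is illustrative, not medcat's actual API):

```python
def _is_similar(prev: BaseException, cur: BaseException) -> bool:
    # Two consecutive failures count as "the same" failure when both
    # the exception type and its string representation match.
    return type(prev) is type(cur) and str(prev) == str(cur)
```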
why not just raise the exception and halt the process?
My best guess is that at some point in the past, something in the try-except block was raising an exception due to something that couldn't be fixed (perhaps some kind of undefined characters that caused spacy to fail? I don't know), and thus all the exceptions were being caught to allow the rest of the pipeline to keep working. The reason I left in the massive try-except block is that there are potentially users who rely on it. Though in principle, I agree: the exception should be raised, since otherwise the DeID pipe will most likely simply not do anything, and personal information will be revealed where it shouldn't. I will do the following:
And then I'll report back
Remove try-except when processing transformers NER to force immediate raising of exception
Went through the DeID of a document that had some characters in the middle of the target text that could potentially cause issues.
And the result was:
As such, I don't really see any reason for the try-except to exist and I removed it. So next time we support a
lgtm
* CU-8695ucw9b: Fix older DeID models due to changes in transformers. Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).
* CU-8695ucw9b: Add functionality for transformers NER to spectacularly fail upon consistent consecutive exceptions. The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged
* CU-8695ucw9b: Add tests for exception raising after a pre-defined number of failed document processes
* CU-8695ucw9b: Change conditions for raising exception on consecutive failure. Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.
* CU-8695ucw9b: Small additional cleanup on successful TNER processing
* CU-8695ucw9b: Use custom exception when failing due to consecutive exceptions
* CU-8695ucw9b: Remove try-except when processing transformers NER to force immediate raising of exception
Since transformers==4.42.0, the tokenizer being loaded is expected to have the split_special_tokens attribute. However, the version we've saved (and load) won't have that attribute, so the processing fails (an exception is raised). This failure (along with the exception) is logged, but the overall process is never halted.
What this PR does:
* Adds the split_special_tokens attribute to the tokenizer if required
* Limits the TransformersNER class to only a pre-defined (10 by default) number of similar consecutive exceptions
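A self-contained sketch of how such a cap on similar consecutive failures could work (FailureTracker and ConsecutiveFailureError are illustrative names, not medcat's actual API; the default of 10 comes from the description above):

```python
class ConsecutiveFailureError(Exception):
    """Raised once too many similar failures occur in a row (illustrative)."""


class FailureTracker:
    def __init__(self, max_consecutive: int = 10):
        self.max_consecutive = max_consecutive
        self._count = 0
        self._last_key = None  # (exception type, str(exception)) of last failure

    def record_success(self) -> None:
        # Any successfully processed document resets the streak.
        self._count = 0
        self._last_key = None

    def record_failure(self, exc: Exception) -> None:
        key = (type(exc), str(exc))
        # Only identical (or similar) consecutive failures extend the streak.
        self._count = self._count + 1 if key == self._last_key else 1
        self._last_key = key
        if self._count >= self.max_consecutive:
            raise ConsecutiveFailureError(
                f"{self._count} consecutive similar failures: {exc!r}"
            ) from exc
```

A per-document processing loop would call record_failure inside its except block and record_success after each document that goes through cleanly, so isolated one-off failures are still only logged while a consistently failing model eventually halts the process.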