v1.14.0 release PR #506

mart-r merged 19 commits into production on Nov 19, 2024

Conversation

mart-r
Collaborator

@mart-r mart-r commented Nov 19, 2024

In preparation for minor release (mainly to disable python 3.8).

mart-r and others added 19 commits August 30, 2024 10:07
* Pushing bug fix for metacat

Two-phase learning for MetaCAT utilises data_undersampled. Fixed a bug in the eval function, which was incorrectly using data_undersampled instead of full_data.

* Pushing change for lazy logging

* Pushing update for lazy logging

* Pushing lint fix
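
For readers unfamiliar with the term, the lazy logging mentioned in the commits above can be sketched as follows. This is a minimal illustration using only the standard `logging` module, not the actual MedCAT change; the logger name and the `ExpensiveRepr` class are hypothetical.

```python
import logging

# Hypothetical logger for the demo; WARNING level means debug calls are skipped.
logger = logging.getLogger("lazy_logging_demo")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.WARNING)


class ExpensiveRepr:
    """Counts how often it is rendered to a string."""

    def __init__(self) -> None:
        self.renders = 0

    def __str__(self) -> str:
        self.renders += 1
        return "expensive"


eager = ExpensiveRepr()
lazy = ExpensiveRepr()

# Eager: the f-string is built before logging checks the level.
logger.debug(f"value: {eager}")
# Lazy: the argument is only formatted if the record is actually emitted.
logger.debug("value: %s", lazy)
```

With the level set to WARNING, the eager call still pays for string formatting, while the lazy call does not.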
* CU-8695uhe5n: Update docs dependency pins

* CU-8695uhe5n: Fix typo in fsspec version pin
* CU-8695pvhfe: Rename a test class

* CU-8695pvhfe: Add tests for multiprocessing usage monitoring

* CU-8695pvhfe: Fix usage monitor for multiprocessing.

When using CAT.multiprocessing_batch_char_size (CAT._multiprocessing_batch and CAT._mp_cons internally), flush the usage monitor at the end of the multiprocessing method.
When using CAT.get_entities_multi_texts or CAT.multiprocessing_batch_docs_size (uses the former internally), add logging of usage to the output.

* CU-8695pvhfe: Fix remaining issues with usage monitor for multiprocessing.

Avoid checking length of (potentially) non-existent strings. Avoid early iteration of generator.
* CU-8695knfbg: Decouple the edit finder methods from the spell checker

* CU-8695knfbg: Add methods for random edit picking and variant estimation to utils; Plus a few tests

* CU-8695knfbg: Add edit distance option and use it in the CLI

* CU-8695knfbg: Allow retaining order of elements in generator when getting edits for run-to-run consistency

* CU-8695knfbg: Add safeguard for name order to be consistent across runs

* CU-8695knfbg: Sort names when getting from CDB to avoid run to run variance

* CU-8695knfbg: Move edit finding methods back to BasicSpellChecker class, but make the 1-distance method a class method

* CU-8695knfbg: Move validation earlier in edit finder

* CU-8695knfbg: Simplify edit finder somewhat
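
The edit-finder commits above revolve around generating strings one edit away from a name, in an order that does not change between runs. A minimal sketch of that idea, assuming a classic delete/transpose/replace/insert scheme, is shown here. The function names `edits1` and `ordered_edits` are hypothetical and this is not the `BasicSpellChecker` code itself.

```python
import string
from typing import Iterable, Iterator, Set


def edits1(word: str, letters: str = string.ascii_lowercase) -> Iterator[str]:
    """Yield every distinct string one edit away from `word`, in a fixed order."""
    # The set is used only for membership checks, so it never affects ordering.
    seen: Set[str] = {word}
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        candidates = []
        if right:
            candidates.append(left + right[1:])  # deletion
        if len(right) > 1:
            candidates.append(left + right[1] + right[0] + right[2:])  # transposition
        if right:
            candidates.extend(left + c + right[1:] for c in letters)  # replacement
        candidates.extend(left + c + right for c in letters)  # insertion
        for cand in candidates:
            if cand not in seen:
                seen.add(cand)
                yield cand


def ordered_edits(names: Iterable[str]) -> Iterator[str]:
    """Sort the names first so the output order is stable from run to run."""
    for name in sorted(names):
        yield from edits1(name)
```

Sorting the input matters when the names come from an unordered container such as a set, where iteration order can differ between interpreter runs.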
* CU-869574kvp: Add pattern-based release version identification for Snomed preprocessing

* CU-869574kvp: Add tests for pattern-based snomed release identification

* CU-869574kvp: Update Snomed preprocessing:

Separate extensions into an Enum.
Do the release/paths check at init to allow for early failures in case of issues

* CU-869574kvp: Simplify mappings somewhat.

Move common avoids to a common location.
Fix UK Drug relationship name

* CU-869574kvp: Simplify mappings somewhat more.

Remove some clutter by separating common prefixes for release types and file names.

* CU-869574kvp: Simplify mappings somewhat more, again.

Remove some clutter by separating common suffixes for release types.

* CU-869574kvp: Update preprocessing.

New abstraction. Use supported extensions, which describe their file formats, along with bundles, which give some further insight and control.

* CU-869574kvp: Fix data class init

* CU-869574kvp: Fix issue with file paths

* CU-869574kvp: Fix a UK Clinical description file path

* CU-869574kvp: Add (optional) 2nd part of folder name to extension.

For AU models, the folder name seems to be 'SnomedCT_Release_AU1000036_20240630T120000Z', so the 1st part is just 'Release' and the 2nd part is indicative of AU.
Add usage of this where relevant.

* CU-869574kvp: Fix preprocessing tests.

Add patch for files/folders where applicable.
Change the paths of attributes where applicable.
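
As an illustration of the pattern-based release identification described above, a folder name such as 'SnomedCT_Release_AU1000036_20240630T120000Z' could be split into its parts with a regular expression. This is only a sketch: the pattern, the group names, and the `parse_release_folder` helper are assumptions made for the example, and folder names for other releases may not follow this shape.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: SnomedCT_<part1>_<part2>_<YYYYMMDD>T<HHMMSS>Z
FOLDER_PATTERN = re.compile(
    r"^SnomedCT_(?P<part1>[A-Za-z0-9]+)_(?P<part2>[A-Za-z0-9]+)_"
    r"(?P<release>\d{8})T\d{6}Z$"
)


def parse_release_folder(name: str) -> Optional[Tuple[str, str, str]]:
    """Return (first part, second part, release date), or None if no match."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        return None
    return match.group("part1"), match.group("part2"), match.group("release")


parts = parse_release_folder("SnomedCT_Release_AU1000036_20240630T120000Z")
```

Returning None for unrecognised names lets a caller fail early, in the spirit of the release/paths check at init mentioned above.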
* CU-8695ucw9b: Fix older DeID models due to changes in transformers.

Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).

* CU-8695ucw9b: Add functionality for transformers NER to spectacularly fail upon consistent consecutive exceptions.

The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged

* CU-8695ucw9b: Add tests for exception raising after a pre-defined number of failed document processes

* CU-8695ucw9b: Change conditions for raising exception on consecutive failure.

Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.

* CU-8695ucw9b: Small additional cleanup on successful TNER processing

* CU-8695ucw9b: Use custom exception when failing due to consecutive exceptions

* CU-8695ucw9b: Remove try-except when processing transformers NER to force immediate raising of exception
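
The intermediate approach in this group of commits (raise only after a number of consecutive, identical failures, compared by exception type and string representation) can be sketched as below. The class names `ConsecutiveFailureError` and `FailureTracker` and the default threshold are hypothetical, and the last commit in the group removes the try-except altogether, so this does not describe the final behaviour.

```python
from typing import Optional, Tuple


class ConsecutiveFailureError(Exception):
    """Hypothetical custom exception raised after repeated identical failures."""


class FailureTracker:
    """Tracks consecutive failures that share a type and string representation."""

    def __init__(self, max_consecutive: int = 3) -> None:
        self.max_consecutive = max_consecutive
        self._last_key: Optional[Tuple[type, str]] = None
        self._count = 0

    def record_success(self) -> None:
        # Any successful document resets the streak.
        self._last_key = None
        self._count = 0

    def record_failure(self, exc: Exception) -> None:
        key = (type(exc), str(exc))
        if key == self._last_key:
            self._count += 1
        else:
            # A different failure starts a new streak.
            self._last_key = key
            self._count = 1
        if self._count >= self.max_consecutive:
            raise ConsecutiveFailureError(
                f"{self._count} consecutive identical failures: {exc!r}"
            ) from exc
```

A caller would invoke `record_failure` inside its `except` block and `record_success` after each document that processes cleanly.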
* MetaCAT fixes and upgrades

Pushing for 3 updates:
1) Removed the check and update for labels with zero data, as this was causing issues during evaluation
2) Resolved an issue where the confusion matrix couldn't be calculated when testing on a single class with an F1 score of 1, as it expected the original number of training classes (3)
3) Updated the attention mask creation to dynamically use the actual pad_idx value instead of assuming it to be 0

* Pushing type fix

* Pushing for type fix

* Fixing type issues

* Pushing change

* Pushing update w/o try except block

For the issue where the confusion matrix couldn't be calculated when testing on a single class with an F1 score of 1, as it expected the original number of training classes (3), pushing an optimized version w/o the try except block
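
Point 3 of the MetaCAT update (building the attention mask from the actual pad_idx rather than assuming 0) can be illustrated with plain Python lists. This is a sketch of the idea only; the `attention_mask` function and the example token ids are hypothetical, and the real code presumably operates on tensors.

```python
from typing import List


def attention_mask(batch: List[List[int]], pad_idx: int) -> List[List[int]]:
    """Return 1 for real tokens and 0 for padding, based on the given pad_idx."""
    return [[0 if token == pad_idx else 1 for token in seq] for seq in batch]


# Here token id 0 is a real token and padding uses id 3.
batch = [[5, 0, 7, 3, 3], [0, 9, 3, 3, 3]]
mask = attention_mask(batch, pad_idx=3)
# Assuming pad_idx == 0 would wrongly mask the real token and keep the padding.
wrong_mask = attention_mask(batch, pad_idx=0)
```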
…497)

* CU-869671bn4: Update requirements (GHA should fail due to mypy)

* CU-869671bn4: Update mypy dev requirement to be less than 1.12
* CU-86967nnra: Remove python 3.8 from GHA

* CU-86967nnra: Remove python 3.8 from classifiers

* CU-86967nnra: Add python version requirements to setup.py (allowing from 3.9 to 3.11)

* CU-86967nnra: Remove upper bound from python requirements.

The upper bound could be lifted as soon as `spacy` releases a compatible version, and that _shouldn't_ require any changes from our side. It currently isn't possible to install on higher Python versions anyway, since no `spacy` release is available for them.
* CU-86964zm4d: Use ignore tag correctly to ignore certain parts of UK release

* CU-86964zm4d: Use OPCS4 later refset ID by default (and switch to older if needed)

* CU-86964zm4d: Fix OPCS4 refset ID tests.

Fix the default value being tested for (i.e. the value that will be shown in the case of an international release).
Add a test for old UK extension.

* CU-86964zm4d: Add note regarding OPCS refset ID relevance only for UK extensions.

* CU-86964zm4d: Fix checking of extension outside loops.

I.e. determine whether a UK release/bundle is used for splitting the OPCS4/ICD10 mappings.
Always return separate refsets for ICD10 and OPCS4 internally, even if the latter is None.
* CU-8695hghww: Add bash script to run backwards compatibility

* CU-8695hghww: Rename backwards compatibility running bash script

* CU-8695hghww: Add new step to workflow to run model backwards compatibility

* CU-8695hghww: Fix model compatibility regression suite path

* CU-8695hghww: Simplify creation and removal of fake model folder
…ecated (#500)

* CU-8696m1mch: Remove versioning utility since all its parts were deprecated

* CU-8696m1mch: Remove tests for versioning utility

* CU-8696m1mch: Remove unused test-specific binary (CDB)
@mart-r mart-r merged commit ceb74b1 into production Nov 19, 2024
7 checks passed