Create your own document de-identifier using docdeid, a simple framework independent of language or domain.
Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.
Grab the latest version from PyPI:
pip install docdeid
import re
from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor
deidentifier = DocDeid()
deidentifier.tokenizers["default"] = WordBoundaryTokenizer()
deidentifier.processors.add_processor(
"name_lookup",
SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)
deidentifier.processors.add_processor(
"name_regexp",
RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)
deidentifier.processors.add_processor(
"redactor",
SimpleRedactor()
)
text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)
Find the relevant info in the Document object:
print(doc.annotations)
AnnotationSet({
Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4),
Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})
print(doc.deidentified_text)
'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
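To make the numbered tags in the output above easier to follow, here is a minimal, library-independent sketch of one way such a redaction step could work. It is not docdeid's implementation: the `Span` class and the `redact` function are hypothetical names introduced for illustration, and the sketch assumes annotations do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation: a tagged character range."""
    text: str
    start_char: int
    end_char: int
    tag: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with [TAG-n], reusing n when the same text recurs."""
    ordered = sorted(spans, key=lambda s: s.start_char)

    # First pass, left to right: give each distinct (tag, text) pair a number.
    ids: dict[tuple[str, str], int] = {}
    counts: dict[str, int] = {}
    for span in ordered:
        key = (span.tag, span.text)
        if key not in ids:
            counts[span.tag] = counts.get(span.tag, 0) + 1
            ids[key] = counts[span.tag]

    # Second pass, right to left, so that earlier character offsets stay valid.
    for span in reversed(ordered):
        label = f"[{span.tag.upper()}-{ids[(span.tag, span.text)]}]"
        text = text[: span.start_char] + label + text[span.end_char :]
    return text


text = "John loves Mary, but Mary loves William."
spans = [
    Span("John", 0, 4, "name"),
    Span("Mary", 11, 15, "name"),
    Span("Mary", 21, 25, "name"),
    Span("William", 32, 39, "name"),
]
print(redact(text, spans))
# [NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].
```

Replacing from right to left avoids recomputing offsets after each substitution, since the replacement label rarely has the same length as the original text.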
Additionally, docdeid features:
- Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
- Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
- Callable from one interface (DocDeid.deidentify())
- String processing and filtering
- Fast lookup based on sets or tries
- Anything you add! PRs welcome.
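As a rough illustration of the difference between set-based and trie-based lookup mentioned in the list above, the following sketch uses plain Python only. The `TokenTrie` class and its methods are hypothetical and written for this example; they are not docdeid's API. A set answers "is this single token known?", while a trie over token sequences can find the longest known multi-token phrase starting at a given position.

```python
class TokenTrie:
    """Hypothetical trie over token sequences (illustration only)."""

    def __init__(self):
        self.children = {}   # token -> TokenTrie
        self.is_end = False  # True if a stored sequence ends at this node

    def add(self, tokens):
        """Store one token sequence in the trie."""
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_end = True

    def longest_match(self, tokens, start=0):
        """Return the length of the longest stored sequence that begins
        at tokens[start], or 0 if none matches."""
        node = self
        best = 0
        for offset, token in enumerate(tokens[start:], start=1):
            node = node.children.get(token)
            if node is None:
                break
            if node.is_end:
                best = offset
        return best


# Set lookup: constant-time membership test for single tokens.
single_names = {"John", "Mary"}

# Trie lookup: multi-token phrases, longest match wins.
trie = TokenTrie()
trie.add(["Mary", "Ann"])
trie.add(["Mary", "Ann", "Smith"])

tokens = ["Mary", "Ann", "Smith", "loves", "John"]
print(trie.longest_match(tokens, start=0))  # 3
print(trie.longest_match(tokens, start=3))  # 0
print(tokens[4] in single_names)            # True
```

A set is usually enough when every lookup value is a single token; a trie becomes useful once values span several tokens and the longest match should be preferred.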
For a more in-depth tutorial, see: docs/tutorial
For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/
For setting up dev environment, see: docs/environment
For contributing, see: docs/contributing
Vincent Menger - Author, maintainer
This project is licensed under the MIT license - see the LICENSE.md file for details.