docdeid

Installation - Getting started - Features - Documentation - Development and contributing - Authors - License

Create your own document de-identifier using docdeid, a simple framework independent of language or domain.

Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.

Installation

Grab the latest version from PyPI:

pip install docdeid

Getting started

import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor

deidentifier = DocDeid()

deidentifier.tokenizers["default"] = WordBoundaryTokenizer()

deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)

deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)

deidentifier.processors.add_processor(
    "redactor", 
    SimpleRedactor()
)

text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)

Find the relevant info in the Document object:

print(doc.annotations)

AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4), 
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})

print(doc.deidentified_text)

'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
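
To make the numbering in the output above concrete, here is a minimal plain-Python sketch of how placeholders of this shape could be derived from character-offset annotations. It does not use docdeid and is not necessarily how SimpleRedactor is implemented; the function name `redact` and the tuple layout `(start_char, end_char, tag)` are assumptions made for illustration.

```python
from collections import defaultdict


def redact(text, annotations):
    """Replace each annotated span with a [TAG-n] placeholder.

    `annotations` is a list of non-overlapping (start_char, end_char, tag)
    tuples (a hypothetical layout, not docdeid's Annotation class).
    Identical surface text under the same tag receives the same number.
    """
    counters = defaultdict(int)  # highest number handed out so far, per tag
    numbering = {}               # (tag, surface text) -> assigned number

    # Assign numbers in reading order, so the first name seen becomes 1.
    for start, end, tag in sorted(annotations):
        key = (tag, text[start:end])
        if key not in numbering:
            counters[tag] += 1
            numbering[key] = counters[tag]

    # Substitute from right to left so earlier offsets stay valid.
    result = text
    for start, end, tag in sorted(annotations, reverse=True):
        number = numbering[(tag, text[start:end])]
        result = result[:start] + f"[{tag.upper()}-{number}]" + result[end:]
    return result


text = "John loves Mary, but Mary loves William."
annotations = [(0, 4, "name"), (11, 15, "name"), (21, 25, "name"), (32, 39, "name")]
print(redact(text, annotations))
```

Both occurrences of "Mary" map to the same placeholder because the number is keyed on the tag and the matched text, not on the position.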

Features

Additionally, docdeid features:

Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
Callable from one interface (DocDeid.deidentify())
String processing and filtering
Fast lookup based on sets or tries
Anything you add! PRs welcome.
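
As a rough illustration of the "sets or tries" item in the list above, the sketch below contrasts the two lookup styles in plain Python. It is not docdeid's own data structure: the class `TokenTrie` and its methods are hypothetical names. A set answers "is this single token known?" in constant time, while a trie keyed on tokens can find the longest multi-token phrase starting at a given position.

```python
class TokenTrie:
    """A trie whose edges are whole tokens (illustrative, not docdeid's API)."""

    def __init__(self):
        self.children = {}
        self.is_terminal = False

    def add(self, tokens):
        """Store one token sequence, e.g. ["New", "York"]."""
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_terminal = True

    def longest_match(self, tokens, start=0):
        """Return the length of the longest stored sequence that begins
        at tokens[start], or 0 if none matches."""
        node = self
        best = 0
        for length, token in enumerate(tokens[start:], start=1):
            node = node.children.get(token)
            if node is None:
                break
            if node.is_terminal:
                best = length
        return best


tokens = ["Mary", "lives", "in", "New", "York", "City"]

# Set-based lookup: one membership test per token.
names = {"John", "Mary"}
single_hits = [token for token in tokens if token in names]

# Trie-based lookup: longest multi-token phrase from a given position.
places = TokenTrie()
places.add(["New", "York"])
places.add(["New", "York", "City"])
match_length = places.longest_match(tokens, start=3)
print(single_hits, tokens[3:3 + match_length])
```

The trie prefers the longer phrase "New York City" over "New York" when both are stored, which is the usual reason to reach for a trie once lookup values span more than one token.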

For a more in-depth tutorial, see: docs/tutorial

Documentation

For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/

Development and contributing

For setting up the dev environment, see: docs/environment

For contributing, see: docs/contributing

Authors

Vincent Menger - Author, maintainer

License

This project is licensed under the MIT license - see the LICENSE.md file for details.
