Duplicate Text Finder

duptextfinder is a python library to detect duplicated zones in text. Primarily meant to detect copy/paste across medical documents. Should be faster than python's built-in difflib algorithm and more robust to whitespace, newlines and other irrelevant characters.

Installation

duptextfinder can be installed through pip:

pip install duptextfinder

Usage

from pathlib import Path
from duptextfinder import CharFingerprintBuilder, DuplicateFinder

# load some text files
texts = [p.read_text() for p in Path("some/dir").glob("*.txt")]

# init fingerprint and duplicate finder
fingerprintBuilder = CharFingerprintBuilder(fingerprintLength=15)
duplicateFinder = DuplicateFinder(fingerprintBuilder, minDuplicateLength=15)

# call findDuplicates() on each file
for i, text in enumerate(texts):
    id = f"D{i}"
    duplicates = duplicateFinder.findDuplicates(id, text)
    for duplicate in duplicates:
        print(
            f"sourceDoc={duplicate.sourceDocId}, "
            f"sourceStart={duplicate.sourceSpan.start}, "
            f"sourceEnd={duplicate.sourceSpan.end}, "
            f"targetStart={duplicate.targetSpan.start}, "
            f"targetEnd={duplicate.targetSpan.end}"
        )
        duplicated_text = text[duplicate.targetSpan.start : duplicate.targetSpan.end]
        print(duplicated_text)

WordFingerprintBuilder can be used instead of CharFingerprintBuilder. For more details, refer to the docstrings of DuplicateFinder, CharFingerprintBuilder and WordFingerprintBuilder.

How to run tests

Install package in editable mode with test and extra dependencies by running pip install -e ".[tests, ncls, intervaltree]" in the repo directory
Launch pytest tests/

About ncls and intervaltree

This tool can be used without any additional dependencies, but performance can be improved when using interval trees. To benefit from this you well need to install either the ncls package or the intervaltree package.

References

Evaluating the Impact of Text Duplications on a Corpus of More than 600,000 Clinical Narratives in a French Hospital. https://www.hal.inserm.fr/hal-02265124/

Name		Name	Last commit message	Last commit date
Latest commit History 136 Commits
.github		.github
demo		demo
duptextfinder		duptextfinder
tests		tests
Dockerfile		Dockerfile
LICENSE.txt		LICENSE.txt
README.md		README.md
build.sh		build.sh
run.sh		run.sh
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Duplicate Text Finder

Installation

Usage

How to run tests

About ncls and intervaltree

References

About

Releases 1

Packages

Contributors 4

Languages

License

equipe22/duplicatedZoneInClinicalText

Folders and files

Latest commit

History

Repository files navigation

Duplicate Text Finder

Installation

Usage

How to run tests

About ncls and intervaltree

References

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 4

Languages

Packages