Termgrep

Termgrep is a library and command-line utility for finding and counting occurences of terms in documents. More specifically, it is designed to look for many terms at the same time, kind of a regex with many |, except it returns a map of {term-> number of occurences} direclty. For more information about the motivation and the mechanisms involved, see this blog post.

Building

This repository has git submodules for dependencies, so remember to clone it with the --recursive option.

git clone --recursive https://github.com/neurovertex/termgrep
mkdir termgrep/build && cd termgrep/build
cmake .. && make

Usage

Termgrep requires a list of terms and a list of documents. The simplest command-line would be: ./termgrep_main document1.txt document2.txt document3.txt --terms terms.txt --output-file=output.json

Terms must be supplied as a newline-separated list of terms, either after the --terms parameter or to the standard input with the --terms-stdin parameter.

The document list can be given as a potitional parameters, or fed to the standard input (better suited for large corpora) as a newline-separated list with --file-list-stdin (obviously mutually exclusive with --terms-stdin). You can also prepend an ID/name and a separator (specified with --fileid-separator) to the path, which will be used in the output to identify the file. The resulting command line would look like:

./termgrep_main --terms=terms.txt --termid-separator=: docA:./path/A/doc.txt docB:./path/B/doc.txt docC:./path/C/doc.txt --output-file=output.json

The JSON output will look like so:

[
    {
        "file": "docA",
        "matches": {
            "foo": 0, "bar": 2, "baz": 1
        }
    },
    {
        "file": "docB",
        "matches": {
            "foo": 1, "bar": 5, "baz": 3
        }
    },
    {
        "file": "docC",
        "matches": {
            "foo": 0, "bar": 3, "baz": 0
        }
    }
]```

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
deps		deps
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Termgrep

Building

Usage

About

Releases

Packages

Contributors 2

Languages

compilin/termgrep

Folders and files

Latest commit

History

Repository files navigation

Termgrep

Building

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages