mortehu/substring-frequencies: C++ program for finding strings that are over-represented in one of two texts

This program takes two files as input and prints strings that are over-represented in one file relative to the other. It does this by concatenating the two files, constructing a suffix array and an LCP (longest common prefix) array over the result, and then counting the number of occurrences of every substring of every length.
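The idea can be illustrated with a minimal sketch. It is not the code in substrings.cc: it uses a naive sort instead of libdivsufsort, joins the inputs with a single NUL separator (an assumption made here so that no match can span both files), and looks up one substring rather than enumerating all of them. The names `BuildSuffixArray`, `BuildLcpArray`, `CountSubstring`, and `Counts` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort all suffix start offsets lexicographically.
// Quadratic in the worst case; for illustration only.
std::vector<std::size_t> BuildSuffixArray(const std::string& text) {
  std::vector<std::size_t> sa(text.size());
  std::iota(sa.begin(), sa.end(), 0);
  std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
    return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
  });
  return sa;
}

// Kasai's algorithm: lcp[i] is the length of the longest common prefix of
// the suffixes at sa[i - 1] and sa[i]; lcp[0] is 0.
std::vector<std::size_t> BuildLcpArray(const std::string& text,
                                       const std::vector<std::size_t>& sa) {
  const std::size_t n = text.size();
  std::vector<std::size_t> rank(n), lcp(n, 0);
  for (std::size_t i = 0; i < n; ++i) rank[sa[i]] = i;
  std::size_t h = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (rank[i] == 0) { h = 0; continue; }
    const std::size_t j = sa[rank[i] - 1];
    while (i + h < n && j + h < n && text[i + h] == text[j + h]) ++h;
    lcp[rank[i]] = h;
    if (h > 0) --h;
  }
  return lcp;
}

struct Counts {
  std::size_t in_first = 0;   // occurrences starting in the first input
  std::size_t in_second = 0;  // occurrences starting in the second input
};

// Counts occurrences of `needle` in each input. All suffixes that start with
// `needle` are adjacent in the suffix array, and the LCP array tells us how
// far that run extends.
Counts CountSubstring(const std::string& first, const std::string& second,
                      const std::string& needle) {
  assert(!needle.empty());
  assert(needle.find('\0') == std::string::npos);
  const std::string text = first + '\0' + second;
  const std::vector<std::size_t> sa = BuildSuffixArray(text);
  const std::vector<std::size_t> lcp = BuildLcpArray(text, sa);

  // First suffix whose leading needle.size() bytes are >= needle.
  const auto it = std::lower_bound(
      sa.begin(), sa.end(), needle,
      [&](std::size_t pos, const std::string& s) {
        return text.compare(pos, s.size(), s) < 0;
      });
  Counts counts;
  if (it == sa.end() || text.compare(*it, needle.size(), needle) != 0)
    return counts;

  const std::size_t lo = static_cast<std::size_t>(it - sa.begin());
  std::size_t hi = lo;
  while (hi + 1 < sa.size() && lcp[hi + 1] >= needle.size()) ++hi;
  for (std::size_t i = lo; i <= hi; ++i) {
    if (sa[i] < first.size()) ++counts.in_first; else ++counts.in_second;
  }
  return counts;
}
```

For example, `CountSubstring("abab", "ab", "ab")` finds two occurrences in the first input and one in the second. Counting every substring of every length follows the same principle: each run of adjacent suffixes sharing a prefix of length k corresponds to one substring of length k, and the run length is its occurrence count.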

The inputs may contain NUL-delimited documents, in which case each substring will be counted only once for each document it occurs in. To enable this behavior, use --document.
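To make the difference between the two counting modes concrete, here is a small sketch that compares raw occurrence counts with per-document counts on a NUL-delimited input. It uses plain string searching rather than the suffix array the program builds, and the function names `CountOccurrences` and `CountDocuments` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Raw count: every (possibly overlapping) occurrence of `needle` in `text`.
std::size_t CountOccurrences(const std::string& text,
                             const std::string& needle) {
  assert(!needle.empty());
  std::size_t n = 0;
  for (std::size_t pos = text.find(needle); pos != std::string::npos;
       pos = text.find(needle, pos + 1)) {
    ++n;
  }
  return n;
}

// Document count: `text` is split on NUL bytes, and each document that
// contains `needle` contributes exactly 1, however often it occurs there.
std::size_t CountDocuments(const std::string& text,
                           const std::string& needle) {
  assert(!needle.empty());
  std::size_t docs = 0;
  std::size_t start = 0;
  while (start <= text.size()) {
    std::size_t end = text.find('\0', start);
    if (end == std::string::npos) end = text.size();
    const auto first = text.begin() + static_cast<std::ptrdiff_t>(start);
    const auto last = text.begin() + static_cast<std::ptrdiff_t>(end);
    if (std::search(first, last, needle.begin(), needle.end()) != last) ++docs;
    start = end + 1;
  }
  return docs;
}
```

With the three documents `abab`, `ab`, and `cd` joined by NUL bytes, the string `ab` occurs three times but in only two documents, so it would count as 3 without document mode and as 2 with it.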

Example run:

$ ./substring-frequencies --threshold-count=0 \
  --threshold=2.5 --skip-prefixes \
  /usr/share/common-licenses/GPL-3 \
  /usr/share/common-licenses/GPL-2
3.178   23      0       Corresponding
3.296   26      0       Source
2.890   17      0        a covered
3.584   35      0        conve
3.638   37      0       red w
2.773   15      0       rial
3.258   25      0       uct
2.565   12      0       the object code
2.674   28      1       he work
2.639   13      0       k i
2.833   16      0       aga
3.157   46      1       vey
2.708   14      0       d work,
2.944   18      0       onal
2.565   12      0       e covered
2.565   12      0       er a
2.773   15      0       eying
2.708   14      0       gate
2.833   16      0       mate
2.996   19      0       produc
2.773   15      0       onal
2.833   16      0       teri
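
The columns are not labelled in the output. Judging only from the sample above, the second and third columns look like the counts in the first and second file, and the first column is numerically consistent with ln((a + 1) / (b + 1)) for those counts a and b. This is an inference from the printed numbers, not something confirmed against the source code. The sketch below checks that reading against a few rows; `SmoothedLogRatio` and `RoundsTo` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumed score: natural log of the add-one smoothed count ratio.
double SmoothedLogRatio(std::size_t count_a, std::size_t count_b) {
  return std::log((static_cast<double>(count_a) + 1.0) /
                  (static_cast<double>(count_b) + 1.0));
}

// True if `value` agrees with `printed` when shown with three decimals.
bool RoundsTo(double value, double printed) {
  return std::fabs(value - printed) < 0.0005;
}

// Rows taken from the example run above: score, count, count.
void CheckAgainstSample() {
  assert(RoundsTo(SmoothedLogRatio(23, 0), 3.178));  // Corresponding
  assert(RoundsTo(SmoothedLogRatio(28, 1), 2.674));  // he work
  assert(RoundsTo(SmoothedLogRatio(46, 1), 3.157));  // vey
}
```

Under this reading, every row in the sample has a score of at least 2.5, which matches the --threshold=2.5 argument used in the example run.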

Building:

$ ./configure
$ make
# make install

