This is a command-line program for searching text for multiple words (or phrases) in a single pass. The runtime is O(n + m + z), where n is the length of the searched text, m is the total length of all the words we are looking for, and z is the total number of occurrences of words we are looking for.
raypereda/multisearch
This is a Java program for searching for a collection of words (or phrases) at the same time. It uses the Aho-Corasick algorithm.
See http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

Ray Pereda
raypereda (at) gmail

    $ java -jar multisearch.jar
    Usage: java -jar msearch.jar -p PATTERNFILENAME FILENAME1 FILENAME2 ...
    Search for a list of fixed patterns in a list target files.
    Example: java -jar multisearch.jar -f patterns.txt newarticle1.txt newsarticle2.txt
    Required:
      must specify the patterns files with -f
      must specify at least one target filename

The usage text names both -p and -f as the patterns-file flag, and both msearch.jar and multisearch.jar as the jar. The worked example further down uses multisearch.jar with -p.

Suppose you have a list of phrases that identify things you're interested in. Put those phrases in a file, one per line. Here's an example file:

    $ cat phrases-of-interests.txt
    chocolate
    laptop
    bicycle
    caveman
    paleo
    simplify
    genomics

Now suppose you have a list of news articles that you want to scan for every match of those phrases. Here are two example news articles.

    $ cat newsarticle1.txt
    This article is about the latest in bicycle races. In here we will review the latest in eliptical gears.

    $ cat newsarticle2.txt
    This article is about Otzi. A caveman that lived about 10,000 years ago. paleo-genomics leverage DNA to piece together Otzi's life.

Here's an example of multisearching for all the phrases in one pass through the news articles:

    $ java -jar multisearch.jar -p phrases-of-interests.txt newsarticle1.txt newsarticle2.txt
    target file: newsarticle1.txt
    location: [ 36, 43] matched: bicycle
    target file: newsarticle2.txt
    location: [ 30, 37] matched: caveman
    location: [ 73, 78] matched: paleo
    location: [ 79, 87] matched: genomics

Each location is a pair of zero-based character offsets into the target file: the start of the match and the position just past its end.
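For readers who want to see how a single-pass multi-pattern search can work, here is a minimal Aho-Corasick sketch in Java. It is an illustration only, not the code in this repository: the class name AhoCorasickSketch, its search method, and the output format are all hypothetical. It builds a trie of the patterns, adds failure links with a breadth-first pass, and then scans the text once, reporting [start, end) character offsets.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration; not the implementation shipped in multisearch.jar.
class AhoCorasickSketch {
    // One trie node: outgoing edges, failure link, and patterns that end here.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> outputs = new ArrayList<>();
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> patterns) {
        // Phase 1: insert every pattern into the trie.
        for (String p : patterns) {
            if (p.isEmpty()) continue; // an empty pattern would match everywhere
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(p);
        }
        // Phase 2: breadth-first pass to set failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                // Follow failure links until some node has an edge on c.
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit patterns that end at the failure target (proper suffixes).
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once; each hit is formatted as "[start, end] pattern".
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            for (String p : node.outputs) {
                int end = i + 1; // end offset is exclusive
                hits.add(String.format("[%d, %d] %s", end - p.length(), end, p));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> phrases = List.of("chocolate", "laptop", "bicycle",
                "caveman", "paleo", "simplify", "genomics");
        // Article text copied from the README example above.
        String article = "This article is about the latest in bicycle races.";
        for (String hit : new AhoCorasickSketch(phrases).search(article)) {
            System.out.println(hit);
        }
    }
}
```

The sketch copies inherited matches into each node to keep the scan loop short; an implementation tuned for large pattern sets would more likely follow output links instead of copying lists.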