RAKE-Java

A Java 8 implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.

The implementation is based on the python one from https://github.com/aneesha/RAKE (however some changes have been made) The source code is released under the GPL V3License.

Add this repository to your POM.XML whether you want to use it with maven

 <repository>
        <id>galan-maven-repo</id>
        <name>galan-maven-repo-releases</name>
        <url>http://galan.ehu.es/artifactory/ext-release-local</url>
 </repository>

This implementation requires a POS tagger to be used in order to work. For example The Illinois POS tagger could be used for English.

http://cogcomp.cs.illinois.edu/page/software_view/POS

For Spanish or other languages:

FreeLing --> http://nlp.lsi.upc.edu/freeling/

or Standford Pos tagger --> http://nlp.stanford.edu/software/tagger.shtml

The implementation is in beta state

TODO:

 - More testing

Then an example parser for english that will provide the required data (using Illinois POS Tagger)

    import LBJ2.nlp.SentenceSplitter;
    import LBJ2.nlp.WordSplitter;
    import LBJ2.nlp.seg.PlainToTokenParser;
    import LBJ2.parse.Parser;
    import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
    import edu.illinois.cs.cogcomp.lbj.pos.POSTagger;
    import edu.ehu.galan.cvalue.model.Token;
     ......

     List<LinkedList<Token>> tokenizedSentenceList;
     List<String> sentenceList;
     POSTagger tagger = new POSTagger();
     Chunker chunker = new Chunker();
     boolean first = true;
     parser = new PlainToTokenParser(new WordSplitter(new SentenceSplitter(pFile)));
     String sentence = "";
     LinkedList<Token> tokenList = null;
     for (LBJ2.nlp.seg.Token word = (LBJ2.nlp.seg.Token) parser.next(); word != null;
            word = (LBJ2.nlp.seg.Token) parser.next()) {
            String chunked = chunker.discreteValue(word);
            tagger.discreteValue(word);
            if (first) {
                tokenList = new LinkedList<>();
                tokenizedSentenceList.add(tokenList);
                first = false;
            }
            tokenList.add(new Token(word.form, word.partOfSpeech, null, chunked));
            sentence = sentence + " " + (word.form);
            if (word.next == null) {
                sentenceList.add(sentence);
                first = true;
                sentence = "";
            }
     }
     parser.reset();

Then RAKE can be processed then.....

    Document doc=new Document(full_path,name);
    doc.setSentenceList(sentences);
    doc.setTokenList(tokenized_sentences); 
    RakeAlgorithm ex = new RakeAlgorithm();
    ex.loadStopWordsList("resources/lite/stopWordLists/RakeStopLists/SmartStopListEn");
    ex.loadPunctStopWord("resources/lite/stopWordLists/RakeStopLists/RakePunctDefaultStopList");
    PlainTextDocumentReaderLBJEn parser = new PlainTextDocumentReaderLBJEn();
    parser.readSource("testCorpus/textAstronomy");
    Document doc = new Document("full_path", "name");
    ex.init(doc);
    ex.runAlgorithm();
    doc.getTermList();

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
src		src
stopLists		stopLists
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAKE-Java

About

Releases

Packages

Languages

License

Neuw84/RAKE-Java

Folders and files

Latest commit

History

Repository files navigation

RAKE-Java

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages