This repository has been archived by the owner on Jun 17, 2022. It is now read-only.

takeshi-yoshimura/nlp

A simple tool for NLP

The primary purpose of this tool is to take the pain out of data management with Mahout and Hadoop. To that end, it wraps Mahout and Hadoop in a simple command-line interface and also provides some utilities.

Requirements

Maven, JDK 1.8 (other JDK versions cause failures), hadoop-2.6.0-cdh5.4.4, mahout-0.9-cdh5.4.4
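
Since other JDK versions are reported to cause failures, a quick check before building may save time. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and it assumes that `mvn`, `java`, `hadoop`, and `mahout` are expected on `PATH`. It only checks the JDK version; the exact Hadoop and Mahout versions are not verified.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with this tool.

# Returns 0 if the given `java -version` output reports a 1.8 JDK.
is_jdk8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints each of the given commands that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: all four commands are expected to be on PATH.
check_requirements() {
    missing=$(missing_tools mvn java hadoop mahout)
    if [ -n "$missing" ]; then
        # Left unquoted on purpose so the names are printed on one line.
        echo "missing from PATH:" $missing >&2
        return 1
    fi
    # `java -version` writes its report to stderr, hence 2>&1.
    if ! is_jdk8 "$(java -version 2>&1)"; then
        echo "JDK 1.8 is required" >&2
        return 1
    fi
    echo "requirements look OK"
}
```

After sourcing the file, `check_requirements && mvn package` would build only when the check passes.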

Build

$ mvn package

Run

$ vi conf.json
$ vi run

Edit both files to configure your environment.

$ su {hadoop user}
$ ./run

If no arguments are given, the available commands are displayed.
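
Because conf.json is edited by hand, a syntax check before `./run` can catch typos early. This is a minimal sketch under two assumptions: `python3` is installed, and checking that the file parses is enough (the keys the tool expects are not validated here, since they are not documented in this README). The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with this tool.

# Returns 0 if the file parses as JSON.
# Assumption: python3 is available; json.tool is part of its standard library.
valid_json() {
    python3 -m json.tool "$1" >/dev/null 2>&1
}

# Checks that the configuration file exists and is well-formed JSON.
check_conf() {
    conf=${1:-conf.json}
    if [ ! -f "$conf" ]; then
        echo "$conf not found" >&2
        return 1
    fi
    if ! valid_json "$conf"; then
        echo "$conf is not valid JSON" >&2
        return 1
    fi
    echo "$conf looks well-formed"
}
```

After sourcing the file, `check_conf && ./run` would start the tool only when the configuration parses.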

Develop with Eclipse

$ mvn eclipse:eclipse

Note: you may encounter jdk.tools warnings in pom.xml if you convert the project to a Maven project.

License

MIT

TODO

  • DeleteJob
    • Delete job results on HDFS
    • Hide HDFS from users further
  • Result decorator for Hive queries
    • Allows users to promptly analyze data produced by Mahout
    • Needs a VectorWritable parser for Hive
  • Better logging
  • Drop the Maven directory layout
    • Move target/ and Eclipse settings out of the tree to keep the repository Git-friendly
    • CMake?
  • Migration to Spark
    • Potentially speeds up everything
    • But high memory pressure needs to be considered
    • Parameter Server?
  • Job history and statistics collection
    • e.g., Hadoop job configuration, task counters (.xml and .jhist files)
    • May be useful in the future
  • Add other data analytics
    • Machine learning, graph, etc.
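
Until DeleteJob exists, job results can be removed by hand with the HDFS shell. The sketch below is only an illustration of the idea and not the planned implementation: `RESULTS_ROOT` and its default path are hypothetical, as are the function name and the `HDFS_CMD` override, which exists so the command can be previewed with `echo`.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with this tool.

# Assumption: each job's results live under one HDFS directory per job.
RESULTS_ROOT=${RESULTS_ROOT:-/user/nlp/results}

# Removes the results directory of one job from HDFS.
# Rejects empty names and names containing "/" so that nothing
# outside RESULTS_ROOT can be deleted by accident.
delete_job() {
    job=$1
    case "$job" in
        ''|*/*|.|..)
            echo "invalid job name: '$job'" >&2
            return 1
            ;;
    esac
    # `hdfs dfs -rm -r` deletes a directory recursively.
    ${HDFS_CMD:-hdfs} dfs -rm -r "$RESULTS_ROOT/$job"
}
```

Setting `HDFS_CMD=echo` before calling `delete_job myjob` prints the arguments that would be passed to `hdfs` instead of deleting anything.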

Author

Takeshi Yoshimura (https://github.com/takeshi-yoshimura)
