Merge pull request #563 from mayhewsw/master
Adding transliteration code as a sub project to cogcomp-nlp
Showing 43 changed files with 5,960 additions and 1 deletion.
@@ -7,3 +7,4 @@
edison/wordnet/
**/.project
**/.classpath
*~

@@ -0,0 +1,60 @@
# Transliteration

This is a Java port of Jeff Pasternack's C# code from [Learning Better Transliterations](http://cogcomp.org/page/publication_view/205).

See examples in [TestTransliteration](src/test/java/edu/illinois/cs/cogcomp/transliteration/TestTransliteration.java)
or [Runner](src/main/java/edu/illinois/cs/cogcomp/transliteration/Runner.java).

## Training data

To train a model, you need pairs of names. A common source is Wikipedia interlanguage links. For example,
see [this data](http://www.clsp.jhu.edu/~anni/data/wikipedia_names)
from [Transliterating From All Languages](http://cis.upenn.edu/~ccb/publications/transliterating-from-all-languages.pdf)
by Anne Irvine et al.

The standard data format is one name pair per line:
```bash
foreign<tab>english
```

The [Utils class](src/main/java/edu/illinois/cs/cogcomp/utils/Utils.java) also has readers for many
different datasets (including Anne Irvine's data).
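
As a rough illustration of the tab-separated format above, the following sketch parses `foreign<tab>english` lines using only the Java standard library. `NamePairReader` and `NamePair` are hypothetical names invented for this example; they are not part of this package, and the project's own readers in `Utils` are the ones to use in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the transliteration package.
final class NamePairReader {

    /** One training pair: a foreign name and its English form. */
    static final class NamePair {
        final String foreign;
        final String english;

        NamePair(String foreign, String english) {
            this.foreign = foreign;
            this.english = english;
        }
    }

    /** Parses lines of the form foreign<tab>english, skipping blank lines. */
    static List<NamePair> parse(List<String> lines) {
        List<NamePair> pairs = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IllegalArgumentException(
                        "Line " + (i + 1) + ": expected 2 tab-separated fields, found " + fields.length);
            }
            pairs.add(new NamePair(fields[0].trim(), fields[1].trim()));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("Moskva\tMoscow", "", "Lisboa\tLisbon");
        for (NamePair pair : parse(sample)) {
            System.out.println(pair.foreign + " -> " + pair.english);
        }
    }
}
```

For a real file, the lines could be obtained with `Files.readAllLines(Paths.get(trainfile), StandardCharsets.UTF_8)` before calling `parse`.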

## Training a model
The standard class is [SPModel](src/main/java/edu/illinois/cs/cogcomp/transliteration/SPModel.java). Use it
as follows:

```java
// Read the name pairs, train the model, and save it.
List<Example> training = Utils.readWikiData(trainfile);
SPModel model = new SPModel(training);
model.Train(10);
model.WriteProbs(modelfile);
```

This trains a model and writes it to the path specified by `modelfile`.

`SPModel` also provides `Probability(source, target)`, which returns the transliteration probability
of a given pair.
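
One way such a pair score might be used is to choose among several candidate targets for a source name. The sketch below assumes nothing about `SPModel` beyond the description above: `PairScorer` is a hypothetical stand-in for a call like `Probability(source, target)`, and the toy scorer exists only so the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; PairScorer stands in for a trained model's
// pair-probability call and is not part of the transliteration package.
final class BestTarget {

    interface PairScorer {
        double score(String source, String target);
    }

    /** Returns the highest-scoring candidate for the source, or null if there are none. */
    static String pickBest(String source, List<String> candidates, PairScorer scorer) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            double score = scorer.score(source, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    /** Toy scorer for the demo only: fraction of positions whose characters match. */
    static final PairScorer TOY = new PairScorer() {
        @Override
        public double score(String source, String target) {
            int shorter = Math.min(source.length(), target.length());
            int longer = Math.max(source.length(), target.length());
            if (longer == 0) {
                return 1.0;
            }
            int same = 0;
            for (int i = 0; i < shorter; i++) {
                if (Character.toLowerCase(source.charAt(i)) == Character.toLowerCase(target.charAt(i))) {
                    same++;
                }
            }
            return (double) same / longer;
        }
    };

    public static void main(String[] args) {
        System.out.println(pickBest("Moskva", Arrays.asList("Madrid", "Moscow", "Oslo"), TOY));
    }
}
```

With a trained model, the scorer would presumably delegate to the model instead of the toy character comparison used here.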

## Annotating
A trained model can be used immediately after training, or you can initialize `SPModel` from a
previously trained and saved `modelfile`.

```java
SPModel model = new SPModel(modelfile);
model.setMaxCandidates(10);
TopList<Double,String> predictions = model.Generate(testexample);
```

Because the maximum number of candidates is set to 10, `predictions` will have at most 10 elements. These
are sorted by score from highest to lowest, so the first element is the best.
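
To make the "at most N, best first" behavior concrete, here is a minimal sketch of a bounded, score-ordered candidate list. `TopCandidates` is a hypothetical class written for this illustration and is not the package's `TopList`; it only mirrors the behavior described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a bounded, best-first result list; not the package's TopList.
final class TopCandidates {

    static final class Scored {
        final double score;
        final String text;

        Scored(double score, String text) {
            this.score = score;
            this.text = text;
        }
    }

    private final int maxSize;
    private final List<Scored> items = new ArrayList<>();

    TopCandidates(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Inserts a candidate, keeping descending score order and at most maxSize entries. */
    void add(double score, String text) {
        int pos = 0;
        while (pos < items.size() && items.get(pos).score >= score) {
            pos++;
        }
        if (pos >= maxSize) {
            return; // ranks below the cutoff of a full list
        }
        items.add(pos, new Scored(score, text));
        if (items.size() > maxSize) {
            items.remove(items.size() - 1); // drop the lowest-scoring entry
        }
    }

    /** Read-only view, highest score first. */
    List<Scored> asList() {
        return Collections.unmodifiableList(items);
    }

    public static void main(String[] args) {
        TopCandidates top = new TopCandidates(2);
        top.add(0.2, "a");
        top.add(0.7, "b");
        top.add(0.5, "c");
        for (Scored s : top.asList()) {
            System.out.println(s.score + "\t" + s.text);
        }
    }
}
```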

## Interactive

Once you have trained a model, it is often helpful to try interacting with it. Use [interactive.sh](scripts/interactive.sh)
for this:
```bash
$ ./scripts/interactive.sh models/modelfile
```

@@ -0,0 +1,3 @@
# Use ResourceManager to read these properties
CuratorHost = trollope.cs.illinois.edu
CuratorPort = 9010
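
The properties above are plain `key = value` pairs. As a minimal sketch, assuming the standard `java.util.Properties` parser as a stand-in for the `ResourceManager` mentioned in the comment, they could be read as follows. `CuratorConfig` is a hypothetical holder class, not something in this commit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical holder for the two settings; java.util.Properties is used here
// as an assumed stand-in for the project's ResourceManager.
final class CuratorConfig {
    final String host;
    final int port;

    CuratorConfig(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Builds a config from already-loaded properties, failing fast on missing keys. */
    static CuratorConfig fromProperties(Properties props) {
        String host = props.getProperty("CuratorHost");
        String port = props.getProperty("CuratorPort");
        if (host == null || port == null) {
            throw new IllegalArgumentException("CuratorHost and CuratorPort must both be set");
        }
        return new CuratorConfig(host.trim(), Integer.parseInt(port.trim()));
    }

    /** Parses properties-file text held in memory. */
    static CuratorConfig parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return fromProperties(props);
    }

    public static void main(String[] args) {
        CuratorConfig config = parse(
                "# Use ResourceManager to read these properties\n"
                + "CuratorHost = trollope.cs.illinois.edu\n"
                + "CuratorPort = 9010\n");
        System.out.println(config.host + ":" + config.port);
    }
}
```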

@@ -0,0 +1,134 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd
         http://www.w3.org/2001/XMLSchema-instance ">
    <parent>
        <artifactId>illinois-cogcomp-nlp</artifactId>
        <groupId>edu.illinois.cs.cogcomp</groupId>
        <version>3.1.33</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>illinois-transliteration</artifactId>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    </properties>

    <repositories>
        <repository>
            <id>CogcompSoftware</id>
            <name>CogcompSoftware</name>
            <url>http://cogcomp.cs.illinois.edu/m2repo/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>edu.illinois.cs.cogcomp</groupId>
            <artifactId>illinois-core-utilities</artifactId>
            <version>3.1.33</version>
        </dependency>

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.4</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>com.belerweb</groupId>
            <artifactId>pinyin4j</artifactId>
            <version>2.5.0</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.13</version>
        </dependency>

        <dependency>
            <groupId>com.ibm.icu</groupId>
            <artifactId>icu4j</artifactId>
            <version>56.1</version>
        </dependency>

        <dependency>
            <groupId>edu.illinois.cs.cogcomp</groupId>
            <artifactId>illinois-abstract-server</artifactId>
            <version>0.1</version>
        </dependency>
        <dependency>
            <groupId>edu.illinois.cs.cogcomp</groupId>
            <artifactId>curator-interfaces</artifactId>
            <version>0.7</version>
        </dependency>

        <dependency>
            <groupId>org.apache.thrift</groupId>
            <artifactId>libthrift</artifactId>
            <version>0.8.0</version>
        </dependency>
        <dependency>
            <groupId>edu.illinois.cs.cogcomp</groupId>
            <artifactId>curator-utils</artifactId>
            <version>0.0.4-SNAPSHOT</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.0.2</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-source-plugin</artifactId>
                <version>2.1.2</version>
                <executions>
                    <execution>
                        <id>attach-sources</id>
                        <goals>
                            <goal>jar</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
            </resource>
        </resources>
        <extensions>
            <extension>
                <groupId>org.apache.maven.wagon</groupId>
                <artifactId>wagon-ssh</artifactId>
                <version>2.4</version>
            </extension>
        </extensions>
    </build>

</project>

@@ -0,0 +1,8 @@
#!/bin/sh

cpath="target/classes:target/dependency/*:config"
MODEL=$1

CMD="java -classpath ${cpath} -Xmx8g edu.illinois.cs.cogcomp.transliteration.Interactive $MODEL"
echo "Running: $CMD"
${CMD}

@@ -0,0 +1,53 @@
#!/bin/bash

if [ "$#" -ne 1 ]; then
    echo "usage: $0 <package-name>"
    exit 1
fi

# Get the current version
VERSION=$(mvn org.apache.maven.plugins:maven-help-plugin:2.1.1:evaluate -Dexpression=project.version | grep -v INFO)

## DON'T FORGET TO CHANGE VERSION IF THIS IS A NEW RELEASE!!!
PACKAGE_NAME="$1"

echo "The script should run the following commands for package: ${PACKAGE_NAME}-${VERSION}"

## Deploy the Maven release
echo "mvn javadoc:jar deploy"

## Update the GitLab repository (also create a tag)
echo "git tag v${VERSION} -m \"Releasing ${PACKAGE_NAME}-${VERSION}\""

echo "git push --tags"


## Generate the distribution package
echo -n "Generating the distribution package ..."

## Create a temporary directory
TEMP_DIR="temp90614"
PACKAGE_DIR="${TEMP_DIR}/${PACKAGE_NAME}-${VERSION}"

mvn dependency:copy-dependencies

mkdir -p "${PACKAGE_DIR}"
mkdir "${PACKAGE_DIR}/lib"
mkdir "${PACKAGE_DIR}/dist"
mkdir -p "${PACKAGE_DIR}/doc/javadoc"
mkdir "${PACKAGE_DIR}/src"
mkdir "${PACKAGE_DIR}/scripts"

mv "target/${PACKAGE_NAME}-${VERSION}.jar" "${PACKAGE_DIR}/dist/"
mv "target/${PACKAGE_NAME}-${VERSION}-sources.jar" "${PACKAGE_DIR}/src/"
unzip "target/${PACKAGE_NAME}-${VERSION}-javadoc.jar" -d "${PACKAGE_DIR}/doc/javadoc"
mv target/dependency/* "${PACKAGE_DIR}/lib/"
cp doc/* "${PACKAGE_DIR}/doc"
cp scripts/* "${PACKAGE_DIR}/scripts"

cd "${TEMP_DIR}" || exit 1
zip -r "../${PACKAGE_NAME}.zip" "${PACKAGE_NAME}-${VERSION}"
cd ..

rm -rf "${TEMP_DIR}"
echo "Distribution package created: ${PACKAGE_NAME}.zip"

@@ -0,0 +1,10 @@
#!/bin/sh

cpath="target/classes:target/dependency/*:config"
DIR="/path/to/transliteration/data"
TRAIN=$DIR/train.data
TEST=$DIR/test.data

CMD="java -classpath ${cpath} -Xmx8g edu.illinois.cs.cogcomp.transliteration.Runner $TRAIN $TEST"
echo "Running: $CMD"
${CMD}