Initial commit for the Dataless Classifier (closes #556) #544

Merged
merged 8 commits into from
Dec 16, 2017
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -33,6 +33,7 @@ Each library contains detailed readme and instructions on how to use it. In addi
| [commasrl](commasrl/README.md) | This software extracts relations that commas participate in. |
| [similarity](similarity/README.md) | This software compares objects (especially Strings) and returns a score indicating how similar they are. |
| [temporal-normalizer](temporal-normalizer/README.md) | A temporal extractor and normalizer. |
| [dataless-classifier](dataless-classifier/README.md) | Classifies text into a user-specified label hierarchy from just the textual label descriptions. |
| [external-annotators](external/README.md) | A collection of useful external annotators. |


@@ -88,6 +88,9 @@ public class ViewNames {

public static final String WIKIFIER = "WIKIFIER";

public static final String DATALESS_ESA = "DATALESS_ESA";
public static final String DATALESS_W2V = "DATALESS_W2V";

/**
* @deprecated Replaced by {@link #CLAUSES_CHARNIAK}, {@link #CLAUSES_BERKELEY},
* {@link #CLAUSES_STANFORD}
@@ -150,6 +153,8 @@ public static ViewTypes getViewType(String viewName) {
case SHALLOW_PARSE:
case QUANTITIES:
case WIKIFIER:
case DATALESS_ESA:
case DATALESS_W2V:
case CLAUSES_CHARNIAK:
case CLAUSES_STANFORD:
case CLAUSES_BERKELEY:
45 changes: 45 additions & 0 deletions dataless-classifier/README.md
@@ -0,0 +1,45 @@
# CogComp-DatalessClassifier
Given a label ontology and textual descriptions of its labels, Dataless-Classifier classifies arbitrary text into that ontology.

It is particularly useful when gathering enough training data for a supervised text classifier is difficult or expensive. Dataless-Classifier uses the semantic meaning of the labels to bypass the need for explicit supervision. For more information, please visit our main project [page](http://cogcomp.org/page/project_view/6).
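
As a rough illustration of the idea, the sketch below embeds a document and each label description in the same vector space and picks the label whose description is most similar to the document. It is a toy under stated assumptions, not the annotators' implementation: plain bag-of-words counts stand in for the ESA and Word2Vec embeddings, the class and method names (`ToyDatalessSketch`, `embed`, `cosine`, `classify`) are hypothetical, and the label hierarchy is ignored. The two label descriptions are taken from the sample 20newsgroups keyword descriptions.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration only: bag-of-words vectors stand in for ESA/Word2Vec embeddings. */
final class ToyDatalessSketch {

    /** Builds a term-count vector from lower-cased, whitespace-separated text. */
    static Map<String, Double> embed(String text) {
        Map<String, Double> vec = new HashMap<>();
        for (String tok : text.toLowerCase().trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                vec.merge(tok, 1.0, Double::sum);
            }
        }
        return vec;
    }

    /** Cosine similarity between two sparse vectors; 0.0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label ID whose description is most similar to the document. */
    static String classify(String document, Map<String, String> labelDescriptions) {
        Map<String, Double> docVec = embed(document);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> e : labelDescriptions.entrySet()) {
            double score = cosine(docVec, embed(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("comp.graphics", "graphics image gif animation tiff");
        labels.put("sci.electronics", "circuit electronics radio signal battery");
        // No labeled training documents are involved: only the label descriptions.
        System.out.println(classify("it writes the graphic out as a tiff file", labels));
    }
}
```

With real embeddings, related words that never co-occur literally (for example "graphic" and "graphics") would also contribute to the similarity, which is what makes good label descriptions sufficient without labeled data.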


Some key points:
- The main classes for the Dataless Annotators are:
  * **ESADatalessAnnotator** for the ESA-based Dataless Annotator
  * **W2VDatalessAnnotator** for the Word2Vec-based Dataless Annotator
- The two annotators add the **DATALESS_ESA** and **DATALESS_W2V** views, respectively, to the input `TextAnnotation`; both require a **TOKENS** view with the end-user's desired tokenization.
- Since labels/topics are inferred at the document level, all topic annotations span the entire document.
- Sample invocations are provided in the `main` function of each annotator.
- Both annotators load embeddings into memory and can therefore easily consume up to **10GB RAM**.


## Label Hierarchy
Dataless Classification requires the end-user to specify a label hierarchy (with label descriptions) to classify into. The label hierarchy must be provided in a specific format:
* **labelNamePath**: Specify your label ID to label name mapping here in the `labelID \t labelName` format
(the label ID can be any ID specific to your system; however, we use the label name itself as the ID in our sample hierarchy for readability).
* **labelHierarchyPath**: The first line of this file should contain a tab-separated list of the top-level nodes in the hierarchy (i.e. the ones directly connected to the root). Every following line should specify the connections in the hierarchy in the `parentLabelID \t childLabelID1 \t childLabelID2 \t ...` format.
* **labelDescPath**: Dataless' performance hinges on good label descriptions, which you specify in this file in the `labelID \t labelDescription` format.
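
The three file formats above can be read with a few lines of Java. The following is a minimal sketch for checking your own hierarchy files, not the parser used by the annotators: the class `HierarchyFormatSketch`, its methods, and the `ROOT` identifier are all hypothetical names introduced for illustration, and the input lines are passed in memory rather than read from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for sanity-checking hierarchy files; not part of the library. */
final class HierarchyFormatSketch {

    /** Assumed placeholder ID for the implicit root node. */
    static final String ROOT_ID = "ROOT";

    /** Splits a line on tabs, dropping empty fields. */
    static List<String> splitTabs(String line) {
        List<String> fields = new ArrayList<>();
        for (String field : line.split("\t")) {
            if (!field.trim().isEmpty()) {
                fields.add(field.trim());
            }
        }
        return fields;
    }

    /** Parses `labelID \t value` lines (the labelNamePath and labelDescPath formats). */
    static Map<String, String> parseTwoColumn(List<String> lines) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 2 tab-separated fields: " + line);
            }
            out.put(parts[0].trim(), parts[1].trim());
        }
        return out;
    }

    /**
     * Parses the labelHierarchyPath format: the first non-empty line lists the
     * top-level nodes, every later line is `parent \t child1 \t child2 ...`.
     */
    static Map<String, List<String>> parseHierarchy(List<String> lines) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        boolean first = true;
        for (String line : lines) {
            List<String> ids = splitTabs(line);
            if (ids.isEmpty()) {
                continue;
            }
            if (first) {
                children.put(ROOT_ID, ids);
                first = false;
            } else {
                children.put(ids.get(0), new ArrayList<>(ids.subList(1, ids.size())));
            }
        }
        return children;
    }

    public static void main(String[] args) {
        List<String> hierarchy = Arrays.asList(
                "politics\tsales",
                "politics\ttalk.politics.guns\ttalk.politics.mideast\ttalk.politics.misc",
                "sales\tmisc.forsale");
        List<String> descriptions = Arrays.asList(
                "talk.politics.guns\tpolitics guns",
                "misc.forsale\tfor sale discount");
        System.out.println(parseHierarchy(hierarchy));
        System.out.println(parseTwoColumn(descriptions));
    }
}
```

Splitting the two-column files with a limit of 2 keeps any further tabs inside the description rather than silently dropping text.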

We provide a sample 20newsgroups hierarchy with label descriptions inside data/hierarchies/20newsgroups, where:
* idToLabelNameMap.txt should be used as labelNamePath
* parentChildIdMap.txt should be used as labelHierarchyPath
* labelDesc\_Kws\_simple.txt should be used as labelDescPath

We also provide improved 20newsgroups label descriptions in *labelDesc\_Kws\_embellished.txt*, which correspond to the label descriptions used in [2], whereas *labelDesc\_Kws\_simple.txt* corresponds to those used in [1].

## Embeddings
ESA and Word2Vec Embeddings are fetched from the DataStore on demand.

## Config
A sample config file with the default values is provided in the config folder: *config/project.properties*
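
To make the key/value format concrete, here is a small sketch that reads the classifier settings from text in the same format as the sample config. It uses the standard `java.util.Properties` class as a stand-in; the sample file itself says to use `ResourceManager`, whose API is not shown here. The class `ConfigSketch` and its helpers are hypothetical, and the fallback values simply mirror the sample file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical stand-in for config loading; the project itself reads these via ResourceManager. */
final class ConfigSketch {

    /** Parses `key = value` text; lines starting with '#' are treated as comments. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Reads an integer property, returning the fallback when the key is absent. */
    static int getInt(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "## Classifier configuration",
                "inferenceBottomUp = True",
                "classifierThreshold = 0.99",
                "classifierLeastK = 1",
                "classifierMaxK = 3",
                "#esaDimension = 100");
        Properties props = load(sample);

        // Boolean.parseBoolean is case-insensitive, so "True" parses as true.
        boolean bottomUp = Boolean.parseBoolean(props.getProperty("inferenceBottomUp", "true").trim());
        double threshold = Double.parseDouble(props.getProperty("classifierThreshold", "0.99").trim());
        int leastK = getInt(props, "classifierLeastK", 1);
        int maxK = getInt(props, "classifierMaxK", 3);

        System.out.println(bottomUp + " " + threshold + " " + leastK + " " + maxK);
        // Commented-out keys are simply absent.
        System.out.println(props.getProperty("esaDimension"));
    }
}
```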

To check that your setup works, run:
* `mvn -Dtest=ESADatalessTest#testPredictions test` to test the ESADatalessAnnotator.
* `mvn -Dtest=W2VDatalessTest#testPredictions test` to test the W2VDatalessAnnotator.

If you use this software for research, please cite the following papers:

[1] Chang, Ming-Wei, et al. "Importance of Semantic Representation: Dataless Classification." AAAI. Vol. 2. 2008.

[2] Song, Yangqiu, and Dan Roth. "On Dataless Hierarchical Text Classification." AAAI. Vol. 7. 2014.
25 changes: 25 additions & 0 deletions dataless-classifier/config/project.properties
@@ -0,0 +1,25 @@
## Use ResourceManager to read these properties
# curatorHost = trollope.cs.illinois.edu
# curatorPort = 9010

## Target Label Hierarchy
labelHierarchyPath = data/hierarchies/20newsgroups/parentChildIdMap.txt
labelNamePath = data/hierarchies/20newsgroups/idToLabelNameMap.txt
labelDescPath = data/hierarchies/20newsgroups/labelDesc_Kws_simple.txt
# labelDescPath = data/hierarchies/20newsgroups/labelDesc_Kws_embellished.txt

## Classifier configuration
inferenceBottomUp = True
classifierThreshold = 0.99
classifierLeastK = 1
classifierMaxK = 3

## ESA Configuration
#esaPath = data/embeddings/esaEmbedding/esa_vectors.txt
#esaMapPath = data/embeddings/esaEmbedding/idToConceptMap.txt
#esaDimension = 100

## W2V Configuration
#w2vPath = data/embeddings/w2vEmbedding-100/w2v_vectors.txt
#w2vDimension = 200

1 change: 1 addition & 0 deletions dataless-classifier/data/electronicsTestDocument.txt
@@ -0,0 +1 @@
yes i know it s nowhere near christmas time but i m gonna loose net access in a few days maybe a week or if i m lucky and wanted to post this for interested people to save till xmas note bell labs is a good place if you have a phd and a good boss i have neither subject xmas light set with levels of brightness another version of a variable brightness xmas light set this set starts with a blinker bulb string diagram orginal way set 0v b b 0rtn modified set for level brightness string 0v 0k w string b 0v rtn note no mods to wiring to the right of this point only one blinker is used note that the blinker would not have as much current thru it as the string bulbs because of the second string of bulbs in parallel with it that s why the use of the 0k w resistor here to add extra current thru the blinker to make up for the current shunted thru the second string while the blinker is glowing and the second string is not glowing when the blinker goes open this resistor has only a slight effect on the brightness of the strings s slightly dimmer s slightly brighter or use a w 0v bulb in place of the 0k resistor if you can get one caution do not replace with a standard c bulb as these draw too much current and burn out the blinker c approx w what you ll see when it s working powerup string will light at full brightness and b will be lit bypassing most of the current from the second string making them not light b will open placing both strings in series making the string that was out to glow at a low brightness and the other string that was on before to glow at reduced brightness be sure to wire and insulate the splices resistor leads and cut wires in a safe manner level brightness xmas light set for easter
1 change: 1 addition & 0 deletions dataless-classifier/data/graphicsTestDocument.txt
@@ -0,0 +1 @@
i m looking for some recommendations for screen capture programs a couple of issues ago pc mag listed as editor s choices both conversion artist and hijaak for windows anyone have any experience with those or some others i m trying to get an alpha manual in the next few days and i m not making much progress with the screen shots i m currently using dodot and i m about to burn it and the disks it rode it on it s got a lot of freaky bugs and oversights that are driving me crazy tonight it decided that for any graphic it writes out as a tiff file that s under a certain arbitrary size it will swap the left and right sides of the picture usually it confines itself to not copying things to the clipboard so i have to save and load pix for editing in paintbrush or crashing every hour or so the one nice thing it has though is it s dither option you d think that this would turn colors into dots which it does if you go from say colors to colors but if you go from or colors to b w you can set a threshold level for which colors turn to black and which turn to white for me this is useful because i can turn light grays on buttons to white and the dark grays to black and thereby preserve the d effect on buttons and other parts of the window if you understood my description can you tell me if another less buggy program can do this as well much thanks for any help signature david delgreco what lies behind us and what lies technically a writer before us are tiny matters compared delgreco rahul net to what lies within us oliver wendell holmes david f delgreco delgreco rahul net recommendation for screen capture program
@@ -0,0 +1,26 @@
politics politics
religion religion
computer computer
autos.sports autos.sports
science science
sales sales
talk.politics.guns talk.politics.guns
talk.politics.mideast talk.politics.mideast
talk.politics.misc talk.politics.misc
alt.atheism alt.atheism
soc.religion.christian soc.religion.christian
talk.religion.misc talk.religion.misc
comp.sys.ibm.pc.hardware comp.sys.ibm.pc.hardware
comp.sys.mac.hardware comp.sys.mac.hardware
comp.graphics comp.graphics
comp.windows.x comp.windows.x
comp.os.ms.windows.misc comp.os.ms.windows.misc
rec.autos rec.autos
rec.motorcycles rec.motorcycles
rec.sport.baseball rec.sport.baseball
rec.sport.hockey rec.sport.hockey
sci.electronics sci.electronics
sci.crypt sci.crypt
sci.med sci.med
sci.space sci.space
misc.forsale misc.forsale
@@ -0,0 +1,26 @@
politics politics gun fbi guns weapon compound israel arab jews jewish muslim gay homosexual sexual
religion religion atheist christian atheism god islamic christian god christ church bible jesus christian morality jesus god religion horus
computer computer bus pc motherboard bios board computer dos mac apple powerbook graphics image gif animation tiff window motif xterm sun windows windows dos microsoft ms driver drivers card printer
autos.sports autos.sports car ford auto toyota honda nissan bmw bike motorcycle yamaha baseball ball hitter hockey wings espn
science science circuit electronics radio signal battery encryption key crypto algorithm security doctor medical disease medicine patient space orbit moon earth sky solar
sales sales sale offer shipping forsale sell price brand obo
talk.politics.guns gun fbi guns weapon compound
talk.politics.mideast israel arab jews jewish muslim
talk.politics.misc gay homosexual sexual
alt.atheism atheist christian atheism god islamic
soc.religion.christian christian god christ church bible jesus
talk.religion.misc christian morality jesus god religion horus
comp.sys.ibm.pc.hardware bus pc motherboard bios board computer dos
comp.sys.mac.hardware mac apple powerbook
comp.graphics graphics image gif animation tiff
comp.windows.x window motif xterm sun windows
comp.os.ms.windows.misc windows dos microsoft ms driver drivers card printer
rec.autos car ford auto toyota honda nissan bmw
rec.motorcycles bike motorcycle yamaha
rec.sport.baseball baseball ball hitter
rec.sport.hockey hockey wings espn
sci.electronics circuit electronics radio signal battery
sci.crypt encryption key crypto algorithm security
sci.med doctor medical disease medicine patient
sci.space space orbit moon earth sky solar
misc.forsale sale offer shipping forsale sell price brand obo
@@ -0,0 +1,26 @@
politics politics politics guns politics mideast politics
religion religion atheism society religion christianity christian religion
computer computer computer systems ibm pc hardware computer systems mac macintosh apple hardware computer graphics computer windows x windowsx computer os operating system microsoft windows
autos.sports autos.sports cars motorcycles baseball hockey
science science science electronics science cryptography medicine science space
sales sales for sale discount
talk.politics.guns politics guns
talk.politics.mideast politics mideast
talk.politics.misc politics
alt.atheism atheism
soc.religion.christian society religion christianity christian
talk.religion.misc religion
comp.sys.ibm.pc.hardware computer systems ibm pc hardware
comp.sys.mac.hardware computer systems mac macintosh apple hardware
comp.graphics computer graphics
comp.windows.x computer windows x windowsx
comp.os.ms.windows.misc computer os operating system microsoft windows
rec.autos cars
rec.motorcycles motorcycles
rec.sport.baseball baseball
rec.sport.hockey hockey
sci.electronics science electronics
sci.crypt science cryptography
sci.med science medicine
sci.space science space
misc.forsale for sale discount
@@ -0,0 +1,7 @@
politics religion computer autos.sports science sales
politics talk.politics.guns talk.politics.mideast talk.politics.misc
religion alt.atheism soc.religion.christian talk.religion.misc
computer comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.graphics comp.windows.x comp.os.ms.windows.misc
autos.sports rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey
science sci.electronics sci.crypt sci.med sci.space
sales misc.forsale
71 changes: 71 additions & 0 deletions dataless-classifier/pom.xml
@@ -0,0 +1,71 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<parent>
<artifactId>illinois-cogcomp-nlp</artifactId>
<groupId>edu.illinois.cs.cogcomp</groupId>
<version>4.0.0</version>
</parent>

<modelVersion>4.0.0</modelVersion>

<artifactId>illinois-datalessclassification</artifactId>
<name>Illinois Dataless Classifier</name>
<description>Classifies Text into the given label hierarchy from just the textual label descriptions</description>

<dependencies>
<dependency>
<groupId>org.cogcomp</groupId>
<artifactId>cogcomp-datastore</artifactId>
<version>1.9.10</version>
</dependency>
<dependency>
<groupId>edu.illinois.cs.cogcomp</groupId>
<artifactId>illinois-core-utilities</artifactId>
<version>4.0.0</version>
</dependency>
<dependency>
<groupId>edu.illinois.cs.cogcomp</groupId>
<artifactId>illinois-tokenizer</artifactId>
<version>4.0.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.12</version>
<optional>true</optional>
</dependency>
<dependency>
<groupId>net.sf.jung</groupId>
<artifactId>jung-api</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>net.sf.jung</groupId>
<artifactId>jung-graph-impl</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
<version>1.4</version>
</dependency>
</dependencies>

<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.20.1</version>
<configuration>
<!--increase the memory requirements if you need more space-->
<argLine>-Xmx15g</argLine>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>


</project>
3 changes: 3 additions & 0 deletions dataless-classifier/script/testESADataless.sh
@@ -0,0 +1,3 @@
#mvn compile
#mvn dependency:copy-dependencies
nice java -Xmx10g -cp "./target/*:./target/dependency/*" edu.illinois.cs.cogcomp.datalessclassification.ta.ESADatalessAnnotator "$@"
3 changes: 3 additions & 0 deletions dataless-classifier/script/testW2VDataless.sh
@@ -0,0 +1,3 @@
#mvn compile
#mvn dependency:copy-dependencies
nice java -Xmx10g -cp "./target/*:./target/dependency/*" edu.illinois.cs.cogcomp.datalessclassification.ta.W2VDatalessAnnotator "$@"