Initial commit for the Dataless Classifier (closes #556) #544

shatu · 2017-09-10T06:11:40Z

Slightly cleaned and mavenized version of the original Dataless Classifier.

danyaljj · 2017-09-10T08:26:11Z

dataless-classifier/src/main/resources/eclipse-java-google-style.xml

@@ -0,0 +1,337 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>


danyaljj · 2017-09-10T08:27:53Z

Can you add the link to the root pom.xml and readme.md?

danyaljj · 2017-09-10T17:02:27Z

dataless-classifier/README.md

@@ -0,0 +1,59 @@
+# Illinois-DatalessClassification


CogComp instead of Illinois

Can you add an introduction paragraph here? For someone who has no idea about the relevant papers, explain in simple words what the input and the corresponding output are. You may include some example inputs/outputs too.
Some of the content in the "Label Hierarchy" section might be useful here too.

danyaljj · 2017-09-10T17:04:06Z

dataless-classifier/README.md

+# Illinois-DatalessClassification
+
+Some key points:
+- The Main Class for the Dataless Annotators are :-


The main classes

shatu · 2017-09-10T17:18:24Z

@danyaljj .. What do you mean by root pom and README? ... and where do you want to add those links?

P.S. Addressed the rest of the requests.

danyaljj · 2017-09-10T17:33:14Z

dataless-classifier/README.md

+
+
+## Label Hierarchy
+Dataless Classification requires the end-user to specifcy a Label hierarchy (with label descriptions), which it classifies into. The Label hierarchy is inputted using a very specific format:


I think the past tense of "input" should be "input" (like "put"), not "inputted".
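
For readers following the thread: the README under review describes the hierarchy file as a first line holding a tab-separated list of top-level nodes, followed by lines in the `parentLabelID \t childLabelID1 \t childLabelID2 \t ...` format. A minimal sketch of reading that format is below. The class name `LabelHierarchySketch` and its fields are hypothetical and are not taken from the PR; the module's actual parser may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of the PR): parses the hierarchy format described in the README. */
class LabelHierarchySketch {
    /** IDs of the nodes directly connected to the root (first line of the file). */
    final List<String> topLevel = new ArrayList<>();
    /** parentLabelID mapped to its child label IDs (every following line). */
    final Map<String, List<String>> children = new LinkedHashMap<>();

    static LabelHierarchySketch parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("Hierarchy file is empty");
        }
        LabelHierarchySketch h = new LabelHierarchySketch();
        h.topLevel.addAll(Arrays.asList(lines.get(0).trim().split("\t")));
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] fields = line.trim().split("\t");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Expected parent and at least one child: " + line);
            }
            h.children.put(fields[0],
                    new ArrayList<>(Arrays.asList(fields).subList(1, fields.length)));
        }
        return h;
    }
}
```

Failing fast on a malformed line, rather than skipping it, keeps a broken hierarchy from silently producing a smaller label set.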

danyaljj · 2017-09-10T17:34:11Z

dataless-classifier/README.md

+* **labelHierarchyPath**: The first line of this file should contain tab-separated list of Top-Level nodes in the hierarchy (i.e. the ones directly connected to the root). Then, every following line should specify the connections in the hierachy in the `parentLabelID \t childLabelID1 \t childLabelID2 \t ...` format.
+* **labelDescPath**: Dataless' performance hinges on good label descriptions, which you specify in this file in the `labelID \t labelDescription` format.
+
+We provide a sample 20newsgroups hierarchy with label descriptions inside data/hierarchy/20newsgroups, where :-


":-" instead of ":" seems to be intentional?

danyaljj · 2017-09-10T17:42:00Z

Small question:
Are any of the resources you use already included in the similarity package? (In which case we should use them as a dependency.)
If not, I think the right way to do this is to move the embeddings to the similarity package and use them here.

What do you think?
I can see there are ESA embeddings here:

cogcomp-nlp/similarity/src/main/java/edu/illinois/cs/cogcomp/sim/WordSim.java

Line 156 in 8193e05

f = ds.getFile("org.cogcomp.wordembedding", "memorybasedESA.txt", 1.5);

And word2vec vectors here:

cogcomp-nlp/similarity/src/main/java/edu/illinois/cs/cogcomp/sim/WordSim.java

Line 136 in 8193e05

f = ds.getFile("org.cogcomp.wordembedding", "word2vec.txt", 1.5);

danyaljj · 2017-09-10T17:42:58Z

dataless-classifier/README.md

+- The Main classes for the Dataless Annotators are :-
+  * **ESADatalessAnnotator** for the ESA-based Dataless Annotator
+  * **W2VDatalessAnnotator** for the Word2Vec-based Dataless Annotator
+- Dataless Annotators add the **ESA-Dataless** and **W2V-Dataless** views to the input TextAnnotation respectively, and it requires the presence of a **TOKENS** view with the end-user's desired Tokenization.


Put `TextAnnotation` in backticks.

danyaljj · 2017-09-10T17:46:02Z

dataless-classifier/README.md

+		<groupId>edu.illinois.cs.cogcomp</groupId>
+		<artifactId>w2vEmbedding-100</artifactId>
+		<version>1.0</version>
+	</dependency>


Ideally we should use Datastore for loading large resources. This would make things lazy (i.e. they would get downloaded (and cached) only when we call them). Example usages are in the links I posted about the similarity package.
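
To illustrate what "lazy" means in this comment, here is a minimal sketch of load-on-first-use with caching. It is a generic stand-in and is not the Datastore API; the class name `LazyResource` is hypothetical. In the real module the loader would presumably wrap a call like the `ds.getFile(...)` lines linked from the similarity package.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for lazy resource loading; this is NOT the Datastore API. */
final class LazyResource<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazyResource(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first call only; later calls return the cached value. */
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }
}
```

With this shape, constructing an annotator costs nothing until the embedding is first requested, which is the behavior the comment asks for.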

@danyaljj, @mssammon .. Some quick background:

Dataless Classification is concerned with Document Embeddings, and thus segmentation of documents and composition of word-level embeddings are important -- For word-level embeddings, you can choose either a Sparse Embedding (like ESA), or a Dense Embedding (like Word2Vec).

The similarity package has more or less copied the code from the legacy Dataless package, and thus the segmentation and compositionality related components are hard-coded. Moreover, it doesn't unify the interface for ESA and Word2Vec embeddings, since one of them is a Sparse Embedding, and the other is a Dense Embedding -- which probably makes sense for a Similarity package.

On the other hand, I had refactored most of the legacy Dataless Classification codebase to make it more friendly for research. For instance:

- I unified the interface for ESA and W2V embeddings since the Dataless Classifier code works only on Sparse Embeddings, and designed it in such a way that most of the segmentation and compositionality related components have been refactored out into separate functions -- thus providing flexibility for user-desired segmentation and composition functions.
- I wrote generic DenseVector and SparseVector implementations, thus allowing embeddings to be associated with arbitrary Document components, for instance Entities, Relations, Types, Paragraphs etc.

So, this is where things get a bit muddy. There definitely is some intersection, both on the code-side and the project-scope side, but there is no clear solution without major refactoring efforts. Some ideas to deal with this, with their cons:

1. Refactor out the Document Embedding computation portion from Dataless-Classifier and incorporate it into the similarity package. Some issues with this approach:
   - The document embedding computation code in Dataless-Classifier was written with Dataless-Classifier in mind, and is not generic enough. For instance, even if the word-level embedding is a Dense Embedding, we emit Document-level Sparse Embeddings since that's what the Dataless Classifier works on -- This might not be desirable for a generic Similarity Package, for it should emit a Document-level Dense Embedding.
   - Dataless-Classifier's AEmbedding class is analogous to the Embedding class in the Similarity package, with the difference that AEmbedding works with Sparse Embeddings, and Embedding works with double arrays (Dense Embedding) .. so, if Embedding is replaced with AEmbedding in the similarity package, most of its code will have to be adapted accordingly. That adaptation will still be fine if the scope of the Similarity package is just to provide Similarities, and not the underlying embeddings themselves.
2. Revert back the Dataless-Classifier to its legacy code, and use just similarities from the similarity package. Some issues with this approach:
   - We end up losing all the work that has gone into making the Dataless package easier for expansion and future research.
3. Let them coexist side-by-side for now.

Please let me know what you think about all this; I've been postponing working on this merge request because of the scale of the changes required to merge it.
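
As background for the discussion above, the core idea (compose word-level embeddings into a document embedding, then compare it against label-description embeddings) can be sketched as follows. This is a simplified, flat illustration under stated assumptions: sparse vectors are plain `Map<String, Double>`, composition is a simple average, and the best label is picked by cosine similarity. The class `DatalessSketch` and its methods are hypothetical; the PR's AEmbedding, SparseVector, and hierarchical classifier are not reproduced here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified illustration; not the PR's implementation. */
class DatalessSketch {
    /** Averages the sparse vectors of the tokens that have an embedding. */
    static Map<String, Double> embed(List<String> tokens,
                                     Map<String, Map<String, Double>> wordVectors) {
        Map<String, Double> sum = new HashMap<>();
        int found = 0;
        for (String token : tokens) {
            Map<String, Double> v = wordVectors.get(token);
            if (v == null) {
                continue; // out-of-vocabulary tokens are skipped in this sketch
            }
            found++;
            for (Map.Entry<String, Double> e : v.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        if (found > 0) {
            final int n = found;
            sum.replaceAll((dimension, weight) -> weight / n);
        }
        return sum;
    }

    /** Cosine similarity between two sparse vectors; 0.0 if either is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double x : b.values()) {
            normB += x * x;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label whose description vector is most similar to the document. */
    static String classify(Map<String, Double> doc,
                           Map<String, Map<String, Double>> labelVectors) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : labelVectors.entrySet()) {
            double score = cosine(doc, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Keeping `embed` separate from `classify` mirrors the point made above about factoring composition out of the classifier, so a different segmentation or composition function can be swapped in.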

Hi @shatu .

Thanks for clarifying the issues.

For this PR, I think we should move on with the transition, without many changes to the similarity package (item 3 above).
I don't like (2). I like most of (1), but I have to study it too so that I completely understand what you're saying, and it is definitely not for this PR.
So my suggestion is: let's move on with (3), and we will get back to (1).

What do you think?

@danyaljj: Sounds good! I'll just address the 2 leftover bullet-points from your previous code-review then!

@danyaljj .. Apart from the DataStore issue that I mentioned (it only uploads files up to 2 GB in size), I'm done on my end. Once that issue is fixed, my tests should pass.

danyaljj · 2017-09-10T17:52:27Z

...-classifier/src/main/java/edu/illinois/cs/cogcomp/datalessclassification/util/Utilities.java

+import java.util.HashSet;
+import java.util.Set;
+
+public class Utilities {


Rename this something that indicates it's specific to the dataless package? Say DatalessUtilities?

danyaljj · 2017-09-10T17:52:42Z

...-classifier/src/main/java/edu/illinois/cs/cogcomp/datalessclassification/util/Utilities.java

+    //
+    // }
+    // return stopWords;
+    // }


Drop the unused comments?

danyaljj · 2017-09-10T17:57:18Z

Overall, things look good except for some small issues:

- Files are missing the license headers; you can add them with `mvn license:format`.
- Move the embeddings to the similarity package and use Datastore.
- Add your annotators to the PipelineFactory: https://github.com/CogComp/cogcomp-nlp/blob/master/pipeline/src/main/java/edu/illinois/cs/cogcomp/pipeline/main/PipelineFactory.java
- Add your view names to ViewNames: https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/datastructures/ViewNames.java and update your code wherever you use the view names.
- Make sure the tests pass CI.

mssammon · 2017-09-10T19:25:16Z

@shatu by "root pom and README" Daniel means the pom.xml and README.md file in the parent directory of this module (i.e., 'cogcomp-nlp')


shatu · 2017-09-17T23:59:07Z

Done with the following:

- License Headers.
- Addition of Dataless-Classifier as a module in the parent pom.
- Addition of Dataless ViewNames to core-utilities.
- Addition of Dataless Annotators to Pipeline.
- Other Documentation change suggestions.

Yet to address the following:

- ~~Migrate Dataless Annotators to use the Annotator Interface/Abstract class.~~
- ~~Consider reusing ESA and W2V embeddings from the Similarity package.~~
- ~~Lazy loading using DataStore.~~
- ~~Modify Test-cases to pass CI.~~

mssammon

Overall, code is nicely structured. Aside from a few specific questions/clarifications, there are a few consistent issues. I've marked a couple in each case, but please fix all instances.

- Every class needs a brief Javadoc description.
- Where files are opened, either use try-with-resources or call close in a 'finally' block.
- Don't allow exceptions to be swallowed -- if a file is missing, throw an exception to the client and, at an appropriate point in the call stack, fail with a helpful error message.

The SparseVector code seems broadly useful; once this is integrated, please open an issue suggesting that it be moved to core-utilities.

mssammon · 2017-11-25T15:46:36Z

README.md

@@ -33,6 +33,7 @@ Each library contains detailed readme and instructions on how to use it. In addi
 | [commasrl](commasrl/README.md) | This software extracts relations that commas participate in. |
 | [similarity](similarity/README.md) | This software compare objects --especially Strings-- and return a score indicating how similar they are. |
 | [temporal-normalizer](temporal-normalizer/README.md) | A temporal extractor and normalizer.  |
+| [dataless-classifier](dataless-classifier/README.md) | Classifies Text into the given label hierarchy from just the textual label descriptions |


"Classifies text into a user-specified label hierarchy" ?

mssammon · 2017-11-25T15:47:38Z

core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/datastructures/ViewNames.java

@@ -88,6 +88,9 @@

    public static final String WIKIFIER = "WIKIFIER";

+    public static final String ESA_DATALESS = "ESA_DATALESS";


nitpick: would prefer "DATALESS_W2V" and "DATALESS_ESA" to follow convention for other views with multiple sources

mssammon · 2017-11-25T15:49:23Z

dataless-classifier/README.md

@@ -0,0 +1,42 @@
+# CogComp-DatalessClassifier
+


Start with a 1-2 sentence description of what this does, i.e. given a label set and a piece of text, assign a label from that set to the text, and a representative task where this would be useful.

mssammon · 2017-11-25T15:59:46Z

dataless-classifier/README.md

+* `mvn -Dtest=W2VDatalessTest#testPredictions test` to test the W2VDatalessAnnotator.
+
+
+References:


If you use this software for research, please cite the following papers:

mssammon · 2017-11-25T16:02:25Z

...src/main/java/edu/illinois/cs/cogcomp/datalessclassification/classifier/AClassifierTree.java

+import edu.illinois.cs.cogcomp.datalessclassification.hierarchy.TreeNode;
+
+/**
+ * @author shashank


add a brief statement of class's purpose for every class.

mssammon · 2017-11-26T15:40:11Z

...n/java/edu/illinois/cs/cogcomp/datalessclassification/representation/w2v/MemoryBasedW2V.java

+
+                }
+
+                bf.close();


as before -- close in 'finally' block or use try-with-resources
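
For reference, the try-with-resources form the reviewer asks for looks roughly like this. It is a generic sketch, not the actual `MemoryBasedW2V` code; the class and method names are hypothetical. The reader is closed automatically when the block exits, including when `readLine` throws.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example of the requested pattern; not the PR's code. */
class ReadLinesSketch {
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        // The reader declared here is closed on every exit path, so no explicit close() is needed.
        try (BufferedReader bf = new BufferedReader(source)) {
            String line;
            while ((line = bf.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Compared with a trailing `bf.close()` call, this avoids leaking the file handle when an exception is thrown in the middle of the loop.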

mssammon · 2017-11-26T15:42:29Z

...fier/src/main/java/edu/illinois/cs/cogcomp/datalessclassification/ta/ADatalessAnnotator.java

+    /**
+     * Call this before trying to annotate the objects Call this only after calling
+     * initializeEmbedding
+     */


This kind of documentation is extremely helpful

mssammon · 2017-11-26T15:44:18Z

.../src/main/java/edu/illinois/cs/cogcomp/datalessclassification/ta/DatalessAnnotatorUtils.java

+            }
+
+            reader.close();
+        } catch (Exception e) {


please don't swallow exceptions. Re-throw up the chain.

mssammon · 2017-11-26T15:47:02Z

.../src/main/java/edu/illinois/cs/cogcomp/datalessclassification/ta/DatalessAnnotatorUtils.java

+
+            reader.close();
+        } catch (Exception e) {
+            e.printStackTrace();


Throw the exception and somewhere at the top of the call stack, fail with a helpful error message.
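
One way to read this request, sketched under assumptions: the low-level method declares `throws IOException` and has no catch block, and only the outermost caller converts the failure into a message the user can act on. The class `LabelDescLoaderSketch` and both method names are hypothetical. A real entry point would more likely log the message and exit with a non-zero status than return a string; the string return is used here only to keep the sketch easy to check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical example of propagating, then reporting, an I/O failure. */
class LabelDescLoaderSketch {
    /** Low level: no catch block, so any failure propagates to the caller. */
    static int countDescriptions(Path labelDescPath) throws IOException {
        int count = 0;
        try (BufferedReader reader = Files.newBufferedReader(labelDescPath)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    /** Top of the call stack: report the failure with enough context to fix it. */
    static String run(String labelDescPath) {
        try {
            int n = countDescriptions(Paths.get(labelDescPath));
            return "Loaded " + n + " label descriptions from " + labelDescPath;
        } catch (NoSuchFileException e) {
            return "Label description file not found: " + labelDescPath;
        } catch (IOException e) {
            return "Could not read label descriptions from " + labelDescPath
                    + ": " + e.getMessage();
        }
    }
}
```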

mssammon · 2017-11-26T15:53:59Z

...assifier/src/main/java/edu/illinois/cs/cogcomp/datalessclassification/util/SparseVector.java

+        updateNorm();
+    }
+
+    // TODO: Decide what to do when the key is not found


for now, document actual behavior
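
A sketch of what "document actual behavior" could look like. The behavior shown (a missing key reads as zero) is an assumption made for illustration, since the thread does not say what the PR's `SparseVector` does in that case; the class `SparseVectorSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sparse vector; the missing-key behavior is an assumption of this sketch. */
class SparseVectorSketch<K> {
    private final Map<K, Double> weights = new HashMap<>();

    void put(K key, double weight) {
        weights.put(key, weight);
    }

    /**
     * Returns the weight stored for {@code key}.
     *
     * <p>A key that was never set is treated as an implicit zero, so this method
     * returns 0.0 for it instead of returning null or throwing.
     */
    double get(K key) {
        return weights.getOrDefault(key, 0.0);
    }
}
```

Whatever the real behavior is, stating it in the Javadoc replaces the open TODO with a contract that callers can rely on.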

shatu · 2017-12-13T02:41:01Z

@danyaljj, @mssammon .. Many thanks for the extensive code-review.
All comments have been addressed; this branch is ready to be merged.

Initial commit for the Dataless Classifier

dfc7645

danyaljj reviewed Sep 10, 2017


minor changes to the README

381a10c

danyaljj reviewed Sep 10, 2017


shatu added 2 commits September 17, 2017 17:53

Added license headers and dataless-classifier as module to the parent pom. Also, registered Dataless' viewnames with core-utilities

2a5d7de

Merge remote-tracking branch 'cogcomp/master' into dataless-classifier

2a56ef2

shatu mentioned this pull request Sep 17, 2017

Make Dataless Annotators extend the Annotator class. #556

Closed

Added Dataless Annotators to the pipeline + some redundant code removal

fac9988

Merge branch 'master' into dataless-classifier

d5e6ad7

shatu force-pushed the dataless-classifier branch from a442426 to 46f2189 Compare November 25, 2017 15:22

shatu changed the title ~~Initial commit for the Dataless Classifier~~ Initial commit for the Dataless Classifier (closes #556) Nov 26, 2017

mssammon reviewed Nov 26, 2017


shatu force-pushed the dataless-classifier branch 3 times, most recently from c86a20c to e5d9397 Compare December 13, 2017 02:04

Dataless-Classifier now uses DataStore and extends Annotator

ab81125

Merge branch 'master' into dataless-classifier

e47d328

shatu force-pushed the dataless-classifier branch from 76ba873 to e47d328 Compare December 14, 2017 05:29

danyaljj merged commit ac4e358 into CogComp:master Dec 16, 2017

shatu deleted the dataless-classifier branch December 17, 2017 05:12
