-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial commit for the Dataless Classifier (closes #556) #544
Conversation
@@ -0,0 +1,337 @@ | |||
<?xml version="1.0" encoding="UTF-8" standalone="no"?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop this?
Can you add the link to the root pom.xml and readme.md? |
dataless-classifier/README.md
Outdated
@@ -0,0 +1,59 @@ | |||
# Illinois-DatalessClassification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CogComp instead of Illinois
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an introduction paragraph here? Like for someone who has no idea about the relevant papers, in simple words, what is the input and the corresponding output? You may put some example input / outputs too.
Some of the content in the "Label Hierarchy" might be useful for here too.
dataless-classifier/README.md
Outdated
# Illinois-DatalessClassification | ||
|
||
Some key points: | ||
- The Main Class for the Dataless Annotators are :- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:- -> :
@danyaljj .. What do you mean by root pom and README? ... and where do you want to add those links? P.S. Addressed the rest of the requests. |
dataless-classifier/README.md
Outdated
|
||
|
||
## Label Hierarchy | ||
Dataless Classification requires the end-user to specifcy a Label hierarchy (with label descriptions), which it classifies into. The Label hierarchy is inputted using a very specific format: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the past form of input (put) should be input (put).
dataless-classifier/README.md
Outdated
* **labelHierarchyPath**: The first line of this file should contain tab-separated list of Top-Level nodes in the hierarchy (i.e. the ones directly connected to the root). Then, every following line should specify the connections in the hierachy in the `parentLabelID \t childLabelID1 \t childLabelID2 \t ...` format. | ||
* **labelDescPath**: Dataless' performance hinges on good label descriptions, which you specify in this file in the `labelID \t labelDescription` format. | ||
|
||
We provide a sample 20newsgroups hierarchy with label descriptions inside data/hierarchy/20newsgroups, where :- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
":-" instead of ":" seems to be intentional?
Small question: What do you think?
And word2vec vectors here:
|
dataless-classifier/README.md
Outdated
- The Main classes for the Dataless Annotators are :- | ||
* **ESADatalessAnnotator** for the ESA-based Dataless Annotator | ||
* **W2VDatalessAnnotator** for the Word2Vec-based Dataless Annotator | ||
- Dataless Annotators add the **ESA-Dataless** and **W2V-Dataless** views to the input TextAnnotation respectively, and it requires the presence of a **TOKENS** view with the end-user's desired Tokenization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextAnnotation
in single quotes "`"
dataless-classifier/README.md
Outdated
<groupId>edu.illinois.cs.cogcomp</groupId> | ||
<artifactId>w2vEmbedding-100</artifactId> | ||
<version>1.0</version> | ||
</dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should use Datastore for loading large resources. This would make things lazy (i.e. they would get downloaded (and cached) only when we call them). Example usages are in the links I posted about the similarity package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danyaljj, @mssammon .. Some quick background:
- Dataless Classification concerns with Document Embeddings, and thus segmentation of documents and composition of word-level embeddings are important -- For word-level embeddings, you can choose either a Sparse Embedding (like ESA), or a Dense Embedding (like Word2Vec).
- The similarity package has more or less copied the code from the legacy Dataless package, and thus the segmentation and compositionality related components are hard-coded. Moreover, it doesn't unify the interface for ESA and Word2Vec embeddings, since one of them is a Sparse Embedding, and the other is a Dense Embedding -- which probably makes sense for a Similarity package.
- On the other hand, I had refactored most of the legacy Dataless Classification codebase to make it more friendly for research. For instance:
- I unified the interface for ESA and W2V embeddings since the Dataless Classifier code works only on Sparse Embeddings, and designed it in such a way that most of the segmentation and compositionality related components have been refactored out into separate functions -- thus providing flexibility for user-desired segmentation and composition functions.
- I wrote generic DenseVector and SparseVector implementations, thus allowing embeddings to be associated with arbitrary Document components, for instance Entities, Relations, Types, Paragraphs etc.
So, this is where things get a bit muddy. There definitely is some intersection, both on the code-side and the project-scope side, but there is no clear solution without major refactoring efforts. Some ideas to deal with this, with their cons:
-
Refactor out the Document Embedding computation portion from Dataless-Classifier and incorporate it into the similarity package. Some issues with this approach:
- The document embedding computation code in Dataless-Classifier was written with Dataless-Classifier in mind, and is not generic enough. For instance, even if the word-level embedding is a Dense Embedding, we emit Document-level Sparse Embeddings since that's what the Dataless Classifier works on -- This might not be desirable for a generic Similarity Package, for it should emit a Document-level Dense Embedding.
- Dataless-Classifier's AEmbedding class is analogous to the Embedding class in the Similarity package, with the difference that AEmbedding works with Sparse Embeddings, and Embedding works with double arrays (Dense Embedding) .. so, if Embedding is replaced with AEmbedding in the similarity package, most of its code will have to be adapted accordingly. That adaptation will still be fine if the scope of the Similarity package is just to provide Similarities, and not the underlying embeddings themselves.
-
Revert back the Dataless-Classifier to its legacy code, and use just similarities from the similarity package. Some issues with this approach:
- We end up losing all the work that has gone into making the Dataless package easier for expansion and future research.
-
Let them coexist side-by-side for now.
Please let me know what you think about all this; I've been postponing working on this merge request because of the scale of the changes required to merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @shatu .
Thanks for clarifying the issues.
For this PR, I think we should move on with the transition, without much changes in the similarity package (item 3 above).
I don't like (2). I like most of 3, but I have to study it too so that I completely understand what you're saying and definitely not for this PR.
So my suggestions is let's move on with (3) and we will get back to (1).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danyaljj: Sounds good! I'll just address the 2 leftover bullet-points from your previous code-review then!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danyaljj .. Apart from the DataStore issue that I mentioned (that it only uploads upto 2GB sized file), I'm done from my end. Once that issue is fixed, my tests should pass.
import java.util.HashSet; | ||
import java.util.Set; | ||
|
||
public class Utilities { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename this something that indicates it's specific to the dataless package? Say DatalessUtilities
?
// | ||
// } | ||
// return stopWords; | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the unused comments?
In overall things looks good except some small issues:
|
@shatu by "root pom and README" Daniel means the pom.xml and README.md file in the parent directory of this module (i.e., 'cogcomp-nlp') |
… pom. Also, registered Dataless' viewnames with core-utilities
Done with the following:
Yet to do address the following:
|
a442426
to
46f2189
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, code is nicely structured. Aside from a few specific questions/clarifications, there are a few consistent issues. I've marked a couple in each case, but please fix all instances.
- Every class needs a brief Javadoc description Where files are opened, either use try-with-resources or call close in a 'finally' block.
- Don't allow exceptions to be swallowed -- if a file is missing, throw an exception to the client and at an appropriate point in the call stack, fail with a helpful error message.
The SparseVector code seems broadly useful; once this is integrated, please open an issue suggesting that it be moved to core-utilities.
README.md
Outdated
@@ -33,6 +33,7 @@ Each library contains detailed readme and instructions on how to use it. In addi | |||
| [commasrl](commasrl/README.md) | This software extracts relations that commas participate in. | | |||
| [similarity](similarity/README.md) | This software compare objects --especially Strings-- and return a score indicating how similar they are. | | |||
| [temporal-normalizer](temporal-normalizer/README.md) | A temporal extractor and normalizer. | | |||
| [dataless-classifier](dataless-classifier/README.md) | Classifies Text into the given label hierarchy from just the textual label descriptions | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Classifies text into a user-specified label hierarchy" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -88,6 +88,9 @@ | |||
|
|||
public static final String WIKIFIER = "WIKIFIER"; | |||
|
|||
public static final String ESA_DATALESS = "ESA_DATALESS"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: would prefer "DATALESS_W2V" and "DATALESS_ESA" to follow convention for other views with multiple sources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dataless-classifier/README.md
Outdated
@@ -0,0 +1,42 @@ | |||
# CogComp-DatalessClassifier | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start with a 1-2 sentence description of what this does, i.e. given a label set and a piece of text, assign a label from that set to the text, and a representative task where this would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dataless-classifier/README.md
Outdated
* `mvn -Dtest=W2VDatalessTest#testPredictions test` to test the W2VDatalessAnnotator. | ||
|
||
|
||
References: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use this software for research, please cite the following papers:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
import edu.illinois.cs.cogcomp.datalessclassification.hierarchy.TreeNode; | ||
|
||
/** | ||
* @author shashank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a brief statement of class's purpose for every class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
} | ||
|
||
bf.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as before -- close in 'finally' block or use try-with-resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/** | ||
* Call this before trying to annotate the objects Call this only after calling | ||
* initializeEmbedding | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of documentation is extremely helpful
} | ||
|
||
reader.close(); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't swallow exceptions. Re-throw up the chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
reader.close(); | ||
} catch (Exception e) { | ||
e.printStackTrace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw the exception and somewhere at the top of the call stack, fail with a helpful error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
updateNorm(); | ||
} | ||
|
||
// TODO: Decide what to do when the key is not found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now, document actual behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
c86a20c
to
e5d9397
Compare
76ba873
to
e47d328
Compare
Slightly cleaned and mavenized version of the original Dataless Classifier.