Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit for the Dataless Classifier (closes #556) #544

Merged
merged 8 commits into from
Dec 16, 2017

Conversation

shatu
Copy link
Member

@shatu shatu commented Sep 10, 2017

Slightly cleaned and mavenized version of the original Dataless Classifier.

@@ -0,0 +1,337 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop this?

@danyaljj
Copy link
Member

Can you add the link to the root pom.xml and readme.md?

@@ -0,0 +1,59 @@
# Illinois-DatalessClassification
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CogComp instead of Illinois

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an introduction paragraph here? Like for someone who has no idea about the relevant papers, in simple words, what is the input and the corresponding output? You may put some example input / outputs too.
Some of the content in the "Label Hierarchy" might be useful for here too.

# Illinois-DatalessClassification

Some key points:
- The Main Class for the Dataless Annotators are :-
Copy link
Member

@danyaljj danyaljj Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main classes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:- -> :

@shatu
Copy link
Member Author

shatu commented Sep 10, 2017

@danyaljj .. What do you mean by root pom and README? ... and where do you want to add those links?

P.S. Addressed the rest of the requests.



## Label Hierarchy
Dataless Classification requires the end-user to specifcy a Label hierarchy (with label descriptions), which it classifies into. The Label hierarchy is inputted using a very specific format:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the past form of input (put) should be input (put).

* **labelHierarchyPath**: The first line of this file should contain tab-separated list of Top-Level nodes in the hierarchy (i.e. the ones directly connected to the root). Then, every following line should specify the connections in the hierachy in the `parentLabelID \t childLabelID1 \t childLabelID2 \t ...` format.
* **labelDescPath**: Dataless' performance hinges on good label descriptions, which you specify in this file in the `labelID \t labelDescription` format.

We provide a sample 20newsgroups hierarchy with label descriptions inside data/hierarchy/20newsgroups, where :-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

":-" instead of ":" seems to be intentional?

@danyaljj
Copy link
Member

Small question:
Any of the resources your use are already included in the similarity package? (in which case we should use them as dependency).
If not, I think the right way to do this is to move the embeddings to the similarity package and use it here.

What do you think?
I can see there is ESA embeddings here:

f = ds.getFile("org.cogcomp.wordembedding", "memorybasedESA.txt", 1.5);

And word2vec vectors here:

f = ds.getFile("org.cogcomp.wordembedding", "word2vec.txt", 1.5);

- The Main classes for the Dataless Annotators are :-
* **ESADatalessAnnotator** for the ESA-based Dataless Annotator
* **W2VDatalessAnnotator** for the Word2Vec-based Dataless Annotator
- Dataless Annotators add the **ESA-Dataless** and **W2V-Dataless** views to the input TextAnnotation respectively, and it requires the presence of a **TOKENS** view with the end-user's desired Tokenization.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextAnnotation in single quotes "`"

<groupId>edu.illinois.cs.cogcomp</groupId>
<artifactId>w2vEmbedding-100</artifactId>
<version>1.0</version>
</dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we should use Datastore for loading large resources. This would make things lazy (i.e. they would get downloaded (and cached) only when we call them). Example usages are in the links I posted about the similarity package.

Copy link
Member Author

@shatu shatu Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danyaljj, @mssammon .. Some quick background:

  1. Dataless Classification concerns with Document Embeddings, and thus segmentation of documents and composition of word-level embeddings are important -- For word-level embeddings, you can choose either a Sparse Embedding (like ESA), or a Dense Embedding (like Word2Vec).
  2. The similarity package has more or less copied the code from the legacy Dataless package, and thus the segmentation and compositionality related components are hard-coded. Moreover, it doesn't unify the interface for ESA and Word2Vec embeddings, since one of them is a Sparse Embedding, and the other is a Dense Embedding -- which probably makes sense for a Similarity package.
  3. On the other hand, I had refactored most of the legacy Dataless Classification codebase to make it more friendly for research. For instance:
    • I unified the interface for ESA and W2V embeddings since the Dataless Classifier code works only on Sparse Embeddings, and designed it in such a way that most of the segmentation and compositionality related components have been refactored out into separate functions -- thus providing flexibility for user-desired segmentation and composition functions.
    • I wrote generic DenseVector and SparseVector implementations, thus allowing embeddings to be associated with arbitrary Document components, for instance Entities, Relations, Types, Paragraphs etc.

So, this is where things get a bit muddy. There definitely is some intersection, both on the code-side and the project-scope side, but there is no clear solution without major refactoring efforts. Some ideas to deal with this, with their cons:

  1. Refactor out the Document Embedding computation portion from Dataless-Classifier and incorporate it into the similarity package. Some issues with this approach:

    • The document embedding computation code in Dataless-Classifier was written with Dataless-Classifier in mind, and is not generic enough. For instance, even if the word-level embedding is a Dense Embedding, we emit Document-level Sparse Embeddings since that's what the Dataless Classifier works on -- This might not be desirable for a generic Similarity Package, for it should emit a Document-level Dense Embedding.
    • Dataless-Classifier's AEmbedding class is analogous to the Embedding class in the Similarity package, with the difference that AEmbedding works with Sparse Embeddings, and Embedding works with double arrays (Dense Embedding) .. so, if Embedding is replaced with AEmbedding in the similarity package, most of its code will have to be adapted accordingly. That adaptation will still be fine if the scope of the Similarity package is just to provide Similarities, and not the underlying embeddings themselves.
  2. Revert back the Dataless-Classifier to its legacy code, and use just similarities from the similarity package. Some issues with this approach:

    • We end up losing all the work that has gone into making the Dataless package easier for expansion and future research.
  3. Let them coexist side-by-side for now.

Please let me know what you think about all this; I've been postponing working on this merge request because of the scale of the changes required to merge it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shatu .

Thanks for clarifying the issues.

For this PR, I think we should move on with the transition, without much changes in the similarity package (item 3 above).
I don't like (2). I like most of 3, but I have to study it too so that I completely understand what you're saying and definitely not for this PR.
So my suggestions is let's move on with (3) and we will get back to (1).

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danyaljj: Sounds good! I'll just address the 2 leftover bullet-points from your previous code-review then!

Copy link
Member Author

@shatu shatu Nov 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danyaljj .. Apart from the DataStore issue that I mentioned (that it only uploads upto 2GB sized file), I'm done from my end. Once that issue is fixed, my tests should pass.

import java.util.HashSet;
import java.util.Set;

public class Utilities {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename this something that indicates it's specific to the dataless package? Say DatalessUtilities?

//
// }
// return stopWords;
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the unused comments?

@danyaljj
Copy link
Member

In overall things looks good except some small issues:

@mssammon
Copy link
Contributor

@shatu by "root pom and README" Daniel means the pom.xml and README.md file in the parent directory of this module (i.e., 'cogcomp-nlp')

@shatu
Copy link
Member Author

shatu commented Sep 17, 2017

Done with the following:

  • License Headers.
  • Addition of Dataless-Classifier as a module in the parent pom.
  • Addition of Dataless ViewNames to core-utilities.
  • Addition of Dataless Annotators to Pipeline.
  • Other Documentation change suggestions.

Yet to do address the following:

  • Migrate Dataless Annotators to use the Annotator Interface/Abstract class.
  • Consider reusing ESA and W2V embeddings from the Similarity package.
  • Lazy loading using DataStore.
  • Modify Test-cases to pass CI.

@shatu shatu force-pushed the dataless-classifier branch from a442426 to 46f2189 Compare November 25, 2017 15:22
@shatu shatu changed the title Initial commit for the Dataless Classifier Initial commit for the Dataless Classifier (closes #556) Nov 26, 2017
Copy link
Contributor

@mssammon mssammon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, code is nicely structured. Aside from a few specific questions/clarifications, there are a few consistent issues. I've marked a couple in each case, but please fix all instances.

  1. Every class needs a brief Javadoc description Where files are opened, either use try-with-resources or call close in a 'finally' block.
  2. Don't allow exceptions to be swallowed -- if a file is missing, throw an exception to the client and at an appropriate point in the call stack, fail with a helpful error message.

The SparseVector code seems broadly useful; once this is integrated, please open an issue suggesting that it be moved to core-utilities.

README.md Outdated
@@ -33,6 +33,7 @@ Each library contains detailed readme and instructions on how to use it. In addi
| [commasrl](commasrl/README.md) | This software extracts relations that commas participate in. |
| [similarity](similarity/README.md) | This software compare objects --especially Strings-- and return a score indicating how similar they are. |
| [temporal-normalizer](temporal-normalizer/README.md) | A temporal extractor and normalizer. |
| [dataless-classifier](dataless-classifier/README.md) | Classifies Text into the given label hierarchy from just the textual label descriptions |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Classifies text into a user-specified label hierarchy" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -88,6 +88,9 @@

public static final String WIKIFIER = "WIKIFIER";

public static final String ESA_DATALESS = "ESA_DATALESS";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: would prefer "DATALESS_W2V" and "DATALESS_ESA" to follow convention for other views with multiple sources

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,42 @@
# CogComp-DatalessClassifier

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start with a 1-2 sentence description of what this does, i.e. given a label set and a piece of text, assign a label from that set to the text, and a representative task where this would be useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* `mvn -Dtest=W2VDatalessTest#testPredictions test` to test the W2VDatalessAnnotator.


References:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use this software for research, please cite the following papers:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

import edu.illinois.cs.cogcomp.datalessclassification.hierarchy.TreeNode;

/**
* @author shashank
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a brief statement of class's purpose for every class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


}

bf.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as before -- close in 'finally' block or use try-with-resources

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/**
* Call this before trying to annotate the objects Call this only after calling
* initializeEmbedding
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of documentation is extremely helpful

}

reader.close();
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't swallow exceptions. Re-throw up the chain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


reader.close();
} catch (Exception e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw the exception and somewhere at the top of the call stack, fail with a helpful error message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

updateNorm();
}

// TODO: Decide what to do when the key is not found
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now, document actual behavior

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@shatu shatu force-pushed the dataless-classifier branch 3 times, most recently from c86a20c to e5d9397 Compare December 13, 2017 02:04
@shatu
Copy link
Member Author

shatu commented Dec 13, 2017

@danyaljj, @mssammon .. Many thanks for the extensive code-review.
All comments have been addressed; this branch is ready to be merged.

@shatu shatu force-pushed the dataless-classifier branch from 76ba873 to e47d328 Compare December 14, 2017 05:29
@danyaljj danyaljj merged commit ac4e358 into CogComp:master Dec 16, 2017
@shatu shatu deleted the dataless-classifier branch December 17, 2017 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants