‼️Crush the Baseline #11
Comments
As of now we are more than 1 percentage point below Shrikant's reported performance. We will now continue with the DS model and then, with the combined models, continue with feature selection and hyperparameter optimization. |
For now I'm going to keep the Run2 parameters as defaults, since they give much better precision. Run1 behaves almost like StubSameSentenceRelationExtraction: it predicts every edge as positive except for 14 negative edges. Run2 predicts 133 edges as negative. |
Note:
In comparison:
That is, relna takes much more time to train, maybe because relna produces many more features. Those features may actually be helpful and perhaps should be added to LocText. See #16 |
Ponder:
Yes, possible problems if I introduce the inner window -- but at the moment I'm not |
Maybe ponder about:
|
|
Other (from Tanya)
|
@goldbergtatyana would it be possible for you to generate the same list as And other model organisms if you think appropriate? |
Hi @juanmirocks, I need to see the file human_localization_all.tab to know what to generate for you. |
? I don’t understand, that’s the file you generated.
As a refresher, the file looks like this:
(Last time you added an extra column with the GO identifiers; that’s exactly
what I need, together with the PubMed ids, which are also included.)
Entry: P04637
Entry name: P53_HUMAN
Protein names: Cellular tumor antigen p53 (Antigen NY-CO-13) (Phosphoprotein p53) (Tumor suppressor p53)
Gene names: TP53 P53
Subcellular location [CC]: SUBCELLULAR LOCATION: Cytoplasm. Nucleus. Nucleus, PML body. Endoplasmic reticulum. Mitochondrion matrix. Note=Interaction with BANP promotes nuclear localization. Recruited into PML bodies together with CHEK2. Translocates to mitochondria upon oxidative stress.; SUBCELLULAR LOCATION: Isoform 1: Nucleus. Cytoplasm. Note=Predominantly nuclear but localizes to the cytoplasm when expressed with isoform 4.; SUBCELLULAR LOCATION: Isoform 2: Nucleus. Cytoplasm. Note=Localized mainly in the nucleus with minor staining in the cytoplasm.; SUBCELLULAR LOCATION: Isoform 3: Nucleus. Cytoplasm. Note=Localized in the nucleus in most cells but found in the cytoplasm in some cells.; SUBCELLULAR LOCATION: Isoform 4: Nucleus. Cytoplasm. Note=Predominantly nuclear but translocates to the cytoplasm following cell stress.; SUBCELLULAR LOCATION: Isoform 7: Nucleus. Cytoplasm. Note=Localized mainly in the nucleus with minor staining in the cytoplasm.; SUBCELLULAR LOCATION: Isoform 8: Nucleus. Cytoplasm. Note=Localized in both nucleus and cytoplasm in most cells. In some cells, forms foci in the nucleus that are different from nucleoli.; SUBCELLULAR LOCATION: Isoform 9: Cytoplasm.
Gene ontology (cellular component): cytoplasm [GO:0005737]; cytosol [GO:0005829]; endoplasmic reticulum [GO:0005783]; mitochondrial matrix [GO:0005759]; mitochondrion [GO:0005739]; nuclear chromatin [GO:0000790]; nuclear matrix [GO:0016363]; nucleolus [GO:0005730]; nucleoplasm [GO:0005654]; nucleus [GO:0005634]; PML body [GO:0016605]; protein complex [GO:0043234]; replication fork [GO:0005657]
On Fri, Feb 3, 2017 at 1:32 PM Tatyana Goldberg wrote:
> Hi @juanmirocks, I need to see the file human_localization_all.tab to know what to generate for you.
|
Seems to me like a simple UniProt search:
|
As for organisms, I'd suggest following the linked annotations article for:
|
@goldbergtatyana oh I see, you generated it this way. Unfortunately, the output from UniProt doesn't normalize the subcellular localizations when they are extracted from citations, as in:
Also, sometimes I don't see the relation between the CC column and the GO column (which should be almost the same, or at least congruent?). For example, in `Q8WZ42 TITIN_HUMAN Titin (EC 2.7.11.1) (Connectin) (Rhabdomyosarcoma antigen MU-RMS-40.14) TTN SUBCELLULAR LOCATION: Cytoplasm {ECO:0000305|PubMed:16410549}. Nucleus {ECO:0000269|PubMed:16410549}. condensed nuclear chromosome [GO:0000794]; cytosol [GO:0005829]; extracellular exosome [GO:0070062]; extracellular region [GO:0005576]; I band [GO:0031674]; M band [GO:0031430]; muscle myosin complex [GO:0005859]; striated muscle thin filament [GO:0005865]; Z disc [GO:0030018]` |
@juanmirocks please upload the original file so that I can reconstruct the logic. Unfortunately, by now I do not remember what was done. Thanks |
The file is in your public_html folder
On Fri, 3 Feb 2017 at 20:47, Tatyana Goldberg wrote:
> @juanmirocks please upload the original file so that I can reconstruct the logic. Unfortunately, by now I do not remember what was done. Thanks
|
organisms:
UniProt query to get the reviewed proteins of all organisms mentioned in the corpus:
that is: with all columns: |
@goldbergtatyana finally I get good performance 😀 |
Goal of the task achieved:
We are now going to concentrate on different sentence models with different distances, see #34 |
Implement Features Anew
Put in dependency parsing edges information
Assert current dependencies are correct as originally planned
Number of tokens left
Number of tokens right
Investigate what the hell is going on
Use scikit-learn instead of LIBSVM for the SVM library
Visualize current features
Scale integer counts: http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use
Review this: ‼️Crush the Baseline #11 (comment)
Set allowed feature keys
Implement rest of features:
[ ] MAYBE Try Tanzeem ideas: ‼️Crush the Baseline #11 (comment)
Set allowed features mapping by name
Test performance for created edges only
Visualize progressive performance with kbest
Try to do kbest faster
Visualize the mistakes
Check again false positives & false negatives and implement new sensible features:
Definitely have check for enzymes, ases, or prefixes, somehow:
# review all treatment or enzymes
# phosphatidylinositol-specific phospholipase C (PI-PLC) -- treatment
[ ] Hormones? -- hemagglutinin (HA) -- know they are couples / related
Receptors? `peroxisome proliferator-activated receptor γ (PPARγ)` and transporter
[ ] Cytokine? `peroxisome proliferator-activated receptor γ (PPARγ)`
Is in negation
main verb(s)
[ ] Do Jumps as in chunk parsing (constituency parsing, shallow parsing) http://www.clips.ua.ac.be/pages/mbsp-tags
Get background info from SwissProt:
uniprot_id
is related to GO
Check for non i.i.d. groups
[ ] Speed up sklsvm features matrix #creation
Review all existing features
Do recursive selection of kbest features
Fix selected features
Grid search hyperparameters:
[ ] AdaBoost: didn't check
[ ] Random Forest: so far gave mediocre results compared to SVC (feat_sel from LinearSVC l1)
Sentence Features
All Tokens Features
Selected Tokens Features
Token features are extracted for tokens that are part of the entities and for tokens that are in a linear dependency with the entities.
Linear Context and Dependency Chain
Token features are also extracted for tokens that are present in the linear and dependency context. A linear context of length 3 is considered, i.e., features are extracted for the next and previous 3 tokens relative to the token under consideration.
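The linear context described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names, the feature-key format, and the example sentence are all assumptions.

```python
def linear_context(tokens, index, window=3):
    """Return (previous, following) tokens within `window` positions of tokens[index]."""
    start = max(0, index - window)
    previous = tokens[start:index]
    following = tokens[index + 1:index + 1 + window]
    return previous, following


def linear_context_features(tokens, index, window=3):
    """Build binary features keyed by relative offset and lowercased token text.

    The key format ("linear_-1_word") is a hypothetical naming scheme.
    """
    previous, following = linear_context(tokens, index, window)
    features = {}
    # Offsets count outwards from the token under consideration.
    for offset, token in enumerate(reversed(previous), start=1):
        features[f"linear_-{offset}_{token.lower()}"] = 1
    for offset, token in enumerate(following, start=1):
        features[f"linear_+{offset}_{token.lower()}"] = 1
    return features


# Hypothetical example sentence, already tokenized.
tokens = ["COP1", "is", "retained", "in", "the", "cytoplasmic", "compartment"]
context = linear_context(tokens, 3)
features = linear_context_features(tokens, 3)
```

Near the sentence boundaries the window is simply truncated, so a token at position 0 has no previous context.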
A dependency chain of length 3 is considered for the dependency context. Incoming and outgoing dependencies are considered for dependency-related features. For example, in the case of incoming dependencies, features are extracted for the source/from token and its incoming and outgoing dependencies are considered too. This goes on up to a dependency depth of 3. In addition to the token features, the dependency edge types are also considered while extracting dependency chain features.
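A minimal sketch of such a depth-limited dependency chain is shown below. It assumes edges are given as hypothetical `(source, target, label)` triples and that token strings are unique within the sentence (real code would more likely use token indices); the example parse is made up.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Index (source, target, label) dependency edges by both endpoints."""
    incoming = defaultdict(list)  # target -> [(source, label)]
    outgoing = defaultdict(list)  # source -> [(target, label)]
    for source, target, label in edges:
        outgoing[source].append((target, label))
        incoming[target].append((source, label))
    return incoming, outgoing


def dependency_chain(token, edges, max_depth=3):
    """Collect (depth, direction, label, neighbor) tuples up to max_depth hops away."""
    incoming, outgoing = build_adjacency(edges)
    chain = []
    visited = {token}
    frontier = [token]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for current in frontier:
            # Both incoming and outgoing dependencies are followed at every level.
            neighbors = [("incoming", label, src) for src, label in incoming[current]]
            neighbors += [("outgoing", label, tgt) for tgt, label in outgoing[current]]
            for direction, label, neighbor in neighbors:
                if neighbor in visited:
                    continue
                visited.add(neighbor)
                chain.append((depth, direction, label, neighbor))
                next_frontier.append(neighbor)
        frontier = next_frontier
    return chain


# Hypothetical parse of "COP1 accumulates in the cytoplasmic compartment".
edges = [
    ("accumulates", "COP1", "nsubj"),
    ("accumulates", "compartment", "nmod"),
    ("compartment", "in", "case"),
    ("compartment", "the", "det"),
    ("compartment", "cytoplasmic", "amod"),
]
chain = dependency_chain("COP1", edges)
```

Each tuple keeps the edge type and direction, so both token features and dependency edge types can be derived from the same traversal.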
Dependency Features
Many features are extracted depending on the dependency graph. Using the Floyd-Warshall algorithm, the shortest path between a protein entity and a location entity in a potential PL relation is calculated. For the purpose of extracting the shortest path, an undirected graph of dependencies is considered. Figure 4.6 shows the shortest path from protein entity "COP1" to location entity "cytoplasmic". The path is shown in bold. Note that an undirected graph is considered only for the purpose of extracting the shortest path. Most of the dependency-based features depend on this shortest path. While extracting the features, the original direction of the edges in the shortest path is also taken into consideration. The length of the shortest path contributes an integer-valued feature in addition to the binary feature for each length.
Token features are extracted for the terminal tokens of the shortest path, which are the head tokens of the entities. Other path-related features include token features of the internal tokens in the path, features for every edge in the path, features for internal edges in the path, etc.
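As a rough illustration of the shortest-path step, the sketch below runs Floyd-Warshall over the undirected view of a small dependency graph and then recovers the original edge directions along the path. The graph, the `(head, dependent)` edge representation, and the feature names are assumptions made for this example only.

```python
import math


def floyd_warshall(nodes, directed_edges):
    """All-pairs shortest paths over the undirected view of a dependency graph."""
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}
    for head, dependent in directed_edges:
        # Direction is ignored for path finding: add the edge both ways.
        dist[head][dependent] = dist[dependent][head] = 1
        nxt[head][dependent] = dependent
        nxt[dependent][head] = head
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def shortest_path(nxt, start, end):
    """Reconstruct one shortest path from the successor table."""
    if start == end:
        return [start]
    if nxt[start][end] is None:
        return []
    path = [start]
    while path[-1] != end:
        path.append(nxt[path[-1]][end])
    return path


def path_directions(path, directed_edges):
    """Recover the original direction of each edge along the path."""
    edge_set = set(directed_edges)
    return ["forward" if (a, b) in edge_set else "reverse"
            for a, b in zip(path, path[1:])]


def path_length_features(length):
    """Integer-valued length plus a binary indicator for that length (hypothetical keys)."""
    return {"path_length": length, f"path_length_{length}": 1}


# Hypothetical (head, dependent) edges; not the parse shown in Figure 4.6.
nodes = ["COP1", "accumulates", "compartment", "in", "the", "cytoplasmic"]
directed_edges = [
    ("accumulates", "COP1"),
    ("accumulates", "compartment"),
    ("compartment", "in"),
    ("compartment", "the"),
    ("compartment", "cytoplasmic"),
]
dist, nxt = floyd_warshall(nodes, directed_edges)
path = shortest_path(nxt, "COP1", "cytoplasmic")
directions = path_directions(path, directed_edges)
length_features = path_length_features(dist["COP1"]["cytoplasmic"])
```

Floyd-Warshall is cubic in the number of tokens, which is usually acceptable for single sentences; a breadth-first search from the protein head token would give the same path lengths on an unweighted graph.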
N-gram Dependency Features
The protein and location entities are represented by their respective head tokens, so the shortest path between two entities is actually the shortest path between their head tokens. However, there need not be a single shortest path between two entities: several paths can share the same minimum distance. All such minimum-distance paths are computed and features are extracted for each one of them.
For every path in the set of minimum-distance paths, parts of the path are considered for N-gram features. For example, every run of 2 consecutive tokens is considered for 2-gram features, every run of 3 consecutive tokens for 3-gram features, and so on. 2-, 3- and 4-gram features are extracted from all such paths. These features also include the token features of the tokens in each run, the dependencies in the run, the directions of those dependencies, etc.
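One possible way to enumerate all minimum-distance paths and cut them into N-grams is sketched below. It uses a breadth-first search rather than Floyd-Warshall, purely for brevity; the graph, the token names, and the feature-key format are hypothetical.

```python
from collections import defaultdict, deque


def all_shortest_paths(undirected_edges, start, end):
    """Enumerate every minimum-length path between start and end using BFS."""
    graph = defaultdict(set)
    for a, b in undirected_edges:
        graph[a].add(b)
        graph[b].add(a)
    # For each node, record its distance from start and all predecessors
    # that lie on some shortest path.
    distance = {start: 0}
    parents = defaultdict(list)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(graph[node]):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
            if distance[neighbor] == distance[node] + 1:
                parents[neighbor].append(node)
    if end not in distance:
        return []

    def unwind(node):
        if node == start:
            return [[start]]
        return [path + [node] for parent in parents[node] for path in unwind(parent)]

    return unwind(end)


def path_ngrams(path, n):
    """All runs of n consecutive tokens along a path."""
    return [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]


def ngram_features(paths, sizes=(2, 3, 4)):
    """Count token N-grams over all minimum-distance paths (hypothetical key format)."""
    features = defaultdict(int)
    for path in paths:
        for n in sizes:
            for gram in path_ngrams(path, n):
                features[f"{n}gram_" + "_".join(gram)] += 1
    return dict(features)


# Hypothetical graph with two equally short paths between the head tokens.
undirected_edges = [
    ("COP1", "binds"),
    ("COP1", "moves"),
    ("binds", "cytoplasm"),
    ("moves", "cytoplasm"),
]
paths = all_shortest_paths(undirected_edges, "COP1", "cytoplasm")
features = ngram_features(paths)
```

This sketch covers only the token N-grams; dependency labels and edge directions for each run would be added in the same loop.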
Other Features
Features Specific to DSModel
Some features are specific to the DSModel since it involves processing a pair of sentences as a combined sentence along with extra links. Some of those features include bag of words/stem/POS of tokens in individual sentences, binary tests like the presence of an entity in the first sentence or second sentence, etc. Importantly, the DSModel also uses the predictions of the SSModel. The features depending on SSModel predictions include binary tests like whether the entities considered in the potential relations have a predicted same-sentence relation or not. The intuition behind using same-sentence predictions is that entities that already have a same-sentence relation are unlikely to have a different-sentence relation in most cases.
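A few of these DSModel features could look roughly like the sketch below. The function name, the feature keys, and the representation of same-sentence predictions as a set of `(protein, location)` pairs are assumptions for illustration, not the actual DSModel implementation.

```python
def ds_features(protein, location, sentence_1, sentence_2, ss_relations):
    """Sketch of different-sentence features for one candidate pair.

    sentence_1, sentence_2: token lists of the two combined sentences.
    ss_relations: set of (protein, location) pairs predicted by the
    same-sentence model (assumed representation).
    """
    features = {}
    # Bag of words, kept separate per sentence.
    for name, sentence in (("s1", sentence_1), ("s2", sentence_2)):
        for token in sentence:
            features[f"bow_{name}_{token.lower()}"] = 1
    # Binary tests: which sentence contains each entity.
    features["protein_in_s1"] = int(protein in sentence_1)
    features["protein_in_s2"] = int(protein in sentence_2)
    features["location_in_s1"] = int(location in sentence_1)
    features["location_in_s2"] = int(location in sentence_2)
    # Binary tests: does either entity already have a same-sentence relation?
    features["protein_has_ss_relation"] = int(any(p == protein for p, _ in ss_relations))
    features["location_has_ss_relation"] = int(any(l == location for _, l in ss_relations))
    return features


# Hypothetical sentence pair and same-sentence predictions.
sentence_1 = ["COP1", "is", "a", "ligase"]
sentence_2 = ["It", "accumulates", "in", "the", "nucleus"]
ss_relations = {("HY5", "nucleus")}
features = ds_features("COP1", "nucleus", sentence_1, sentence_2, ss_relations)
```

Here the location already takes part in a predicted same-sentence relation, which under the intuition above would count as evidence against the different-sentence candidate.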
'dependency_to'
, ...) See: Match reported performances relna#21
svm_hyperparameter_c
edges.py
), or other things
Reported:
P 66.57%, R 67.61%, F 66.73% (with DS)
--63.22 (just SS)
August 4th 2016
F 58.05 %
November 10th 2016