
‼️Crush the Baseline #11

Closed · 54 of 74 tasks · juanmirocks opened this issue Oct 23, 2016 · 19 comments

@juanmirocks (Collaborator) commented Oct 23, 2016

Implement Features Anew

  • Put in dependency parsing edges information

  • Assert current dependencies are correct as originally planned

  • Number of tokens left

  • Number of tokens right

  • Investigate what the hell is going on

  • Use scikit-learn (LIBSVM-backed) as the SVM library instead

  • Visualize current features

  • Scale integer counts: http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use (a minimal scaling sketch appears after this list)

    • --> No clear improvement so far
  • Review this: ‼️Crush the Baseline #11 (comment)

  • Set allowed feature keys

  • Implement rest of features:

  • Set allowed features mapping by name

  • Test performance for created edges only

  • Visualize progressive performance with kbest

  • Try to do kbest faster

  • Visualize the mistakes

  • Check again false positives & false negatives and implement new sensible features:

    • Definitely add a check for enzymes, "-ase" suffixes, or prefixes, somehow:
      # review all treatment or enzymes
      # phosphatidylinositol-specific phospholipase C (PI-PLC) -- treatment

    • [ ] Hormones? -- hemagglutinin (HA) -- know they are coupled / related

    • Receptors? peroxisome proliferator-activated receptor γ (PPARγ) and transporter

    • [ ] Cytokine? peroxisome proliferator-activated receptor γ (PPARγ)

    • Is in negation

    • main verb(s)

    • [ ] Do Jumps as in chunk parsing (constituency parsing, shallow parsing) http://www.clips.ua.ac.be/pages/mbsp-tags

    • Get background info from SwissProt:

  • Check for non i.i.d. groups
    • [ ] Speed up sklsvm features matrix creation

  • Review all existing features

    • Review synonyms
    • Get entity count
  • Do recursive selection of kbest features

  • Fix selected features

  • Grid search hyperparameters (a minimal GridSearchCV sketch appears after this list):

    • [ ] AdaBoost: not checked yet
    • [ ] Random Forest so far gave mediocre results compared to SVC (feat_sel from LinearSVC l1)
    • Search C & class_weight with linear kernel
    • Search C & gamma & class_weight with RBF kernel
    • remove correlation between features, examples:
"DependencyFeatureGenerator::18_LD_bow_N_gram_LD_2_<treatment ~~ with>_[0]",  # 264
"DependencyFeatureGenerator::22_PD_bow_N_gram_PD_2_<treatment ~~ with>_[0]",  # 805

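A minimal sketch of scaling the integer-count features, following the scikit-learn SVM practical-use tips linked in the list above; the toy matrix and variable names are illustrative, not the project's actual data.

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# rows = candidate relation edges, columns = integer-count features
X = csr_matrix([[4, 0, 10],
                [0, 2, 1],
                [2, 1, 16]], dtype=float)

scaler = MaxAbsScaler()             # preserves sparsity; maps each column into [-1, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled.toarray())
```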

Sentence Features

  • frequency features like the number of protein entities in the sentence (see the sketch after this list)
  • number of location entities in the sentence
  • number of organism entities in the sentence
  • count of the bag of words
    • ❓ semi-neutral
  • etc.?
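
A minimal sketch of the sentence-level frequency features above; the sentence object, its attributes, and the entity class names are assumptions for illustration, not the project's actual API.

```python
def sentence_count_features(sentence):
    counts = {"protein": 0, "location": 0, "organism": 0}
    for entity in sentence.entities:            # assumed attribute
        if entity.class_id in counts:           # assumed attribute
            counts[entity.class_id] += 1
    return {
        "protein_entities_count": counts["protein"],
        "location_entities_count": counts["location"],
        "organism_entities_count": counts["organism"],
        "tokens_count": len(sentence.tokens),   # assumed attribute
    }
```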

All Tokens Features

  • bag of words (BOW) of tokens in the sentence
    • ❌ worsens (implemented as lemmas)
  • stem of tokens in the sentence
    • ❌ worsens (implemented as lemmas), same as before

Selected Tokens Features

Token features are extracted for tokens that are part of the entities and for tokens that are in a linear or dependency relation with the entities.

  • token text
  • masked token text
  • stem
  • POS tag
  • There are features for binary tests on a token (see the sketch after this list), e.g.:
    • whether the first letter of the token is capitalized (capitalization test)
    • whether some letter in the middle is capitalized
    • etc.?
    • presence of a hyphen
    • presence of a forward slash
    • presence of a backslash
    • presence of digits
    • etc.?
  • Character bigrams and trigrams of the tokens
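
A minimal sketch of the binary token tests and character n-grams listed above; function and feature names are illustrative.

```python
import re

def token_binary_features(text):
    return {
        "starts_with_capital": text[:1].isupper(),
        "has_internal_capital": any(c.isupper() for c in text[1:]),
        "has_hyphen": "-" in text,
        "has_forward_slash": "/" in text,
        "has_backslash": "\\" in text,
        "has_digit": bool(re.search(r"\d", text)),
    }

def char_ngrams(text, n):
    """Character n-grams of a token, e.g. bigrams (n=2) and trigrams (n=3)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```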

Linear Context and Dependency Chain

Token features are also extracted for tokens that are present in the linear and dependency context. A linear context of length 3 is considered, i.e., features are extracted for the next and previous 3 tokens relative to the token under consideration.
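
A minimal sketch of the ±3-token linear context just described; the token list, index, and per-token feature extractor are illustrative assumptions.

```python
def linear_context_features(tokens, index, extract_token_features, window=3):
    features = {}
    for offset in range(1, window + 1):
        for direction, position in (("prev", index - offset), ("next", index + offset)):
            if 0 <= position < len(tokens):
                for name, value in extract_token_features(tokens[position]).items():
                    features[f"{direction}{offset}_{name}"] = value
    return features
```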

A dependency chain of length 3 is considered for the dependency context. Both incoming and outgoing dependencies are considered for dependency-related features. For example, for an incoming dependency, features are extracted for the source (from) token, and that token's own incoming and outgoing dependencies are then considered in turn, up to a dependency depth of 3. In addition to the token features, the dependency edge types are also included while extracting dependency chain features.

Dependency Features

Many features are extracted from the dependency graph. Using the Floyd-Warshall algorithm, the shortest path between a protein entity and a location entity in a potential PL relation is calculated. For the purpose of extracting the shortest path, an undirected graph of dependencies is considered. Figure 4.6 shows the shortest path from the protein entity "COP1" to the location entity "cytoplasmic"; the path is shown in bold. Note that the undirected graph is used only for finding the shortest path. Most of the dependency-based features depend on this shortest path, and while extracting them, the original direction of the edges in the path is also taken into consideration. The length of the shortest path contributes an integer-valued feature in addition to a binary feature for each possible length.

Token features are extracted for the terminal nodes of the shortest path, which are the head tokens of the entities. Other path-related features include token features of the internal tokens in the path, features for every edge in the path, features for the internal edges only, etc.
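
A minimal sketch of the shortest-path extraction described above, using networkx instead of a hand-rolled Floyd-Warshall; the edge representation and feature names are illustrative assumptions.

```python
import networkx as nx

def shortest_dependency_path(dependency_edges, head_token_a, head_token_b):
    """dependency_edges: iterable of (from_token, to_token, dep_label) tuples."""
    graph = nx.Graph()                      # undirected, only for path finding
    for source, target, label in dependency_edges:
        graph.add_edge(source, target, label=label)
    path = nx.shortest_path(graph, head_token_a, head_token_b)
    length = len(path) - 1
    # integer-valued length feature plus a binary indicator for this length
    features = {"shortest_path_length": length,
                f"shortest_path_length_is_{length}": 1}
    return path, features
```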

N-gram Dependency Features

The protein and location entities are represented by their respective head tokens, so the shortest path between two entities is actually a shortest path between their head tokens. However, there need not be a single shortest path between two entities: there can be multiple paths with the same minimum distance. All such minimum-distance paths are computed and features are extracted for each one of them.

For every such minimum-distance path, contiguous parts of the path are considered for N-gram features: every pair of consecutive tokens yields a 2-gram feature, every triple of consecutive tokens a 3-gram feature, and so on. 2-, 3- and 4-gram features are extracted from all such paths. These features also include the token features of the tokens in the corresponding window, the dependencies within the window, the directions of those dependencies, etc.
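
A minimal sketch of the N-gram features over one minimum-distance path; `path` is assumed to be the list of token texts along the path, and the feature-name scheme only mimics the "<a ~~ b>" names shown earlier.

```python
def path_ngram_features(path, sizes=(2, 3, 4)):
    features = {}
    for n in sizes:
        for i in range(len(path) - n + 1):
            features["{}_gram_<{}>".format(n, " ~~ ".join(path[i:i + n]))] = 1
    return features

# e.g. path_ngram_features(["treatment", "with", "PI-PLC"])
```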

Other Features

  • Other features include the relative order of the entities
    • ❌ worsens
  • the features of the tokens that lie between the entities in a potential relation
  • features depending on the presence of the word "protein"
  • presence of other special words/phrases like "found into" and "localized" between the two entities
  • etc.?

Features Specific to DSModel

Some features are specific to the DSModel since it involves processing a pair of sentences as a combined sentence along with extra links. Some of those features include bag of words/stem/POS of tokens in individual sentences, binary tests like the presence of an entity in the first sentence or second sentence, etc. Importantly, the DSModel also uses the predictions of the SSModel. The features depending on SSModel predictions include binary tests like whether the entities considered in the potential relations have a predicted same-sentence relation or not. The intuition behind using same-sentence predictions is that entities that already have a same-sentence relation are unlikely to have a different-sentence relation in most cases.
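
A minimal sketch of the DSModel feature that tests whether a candidate pair already has a predicted same-sentence relation; the id attributes and the prediction container are illustrative assumptions, not the project's actual API.

```python
def same_sentence_prediction_feature(protein_entity, location_entity, ss_predicted_pairs):
    """ss_predicted_pairs: set of (protein_id, location_id) pairs predicted by the SSModel."""
    has_ss = (protein_entity.id, location_entity.id) in ss_predicted_pairs
    return {"has_same_sentence_prediction": int(has_ss)}
```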



Reported:

  • (validation set): P 66.57%, R 67.61%, F 66.73% (with DS) -- 63.22 (just SS)

August 4th 2016

  • (using relna features and no DS model, same as before): F 58.05 %

November 10th 2016

  • Run1:

logs/training/1478708868440160526/loctext_id1478708868440160526_m1_u0.30_c0.0085.log:Computation(precision=0.6157205240174672, precision_SE=0.002796955995324513, recall=0.6238938053097345, recall_SE=0.004141255122679181, f_measure=0.6197802197802198, f_measure_SE=0.0028700805881961096)

  • Run2:

logs/training/1478708868440160526/loctext_id1478708868440160526_m1_u0.90_c0.0080.log:Computation(precision=0.6624365482233503, precision_SE=0.0028419058353217194, recall=0.5787139689578714, recall_SE=0.004008789406529725, f_measure=0.6177514792899409, f_measure_SE=0.002730519497460233)

@juanmirocks (Collaborator, Author)

As of now we are >1 percentage point below Shrikant's reported performance. We will continue now with the DS model, and then (with the combined models) continue with feature selection & hyperparameter optimization.

@MadhukarSP @shpendm

@juanmirocks (Collaborator, Author)

For now I'm going to leave Run2's parameters as the defaults since it has much better precision. Run1 is really almost like StubSameSentenceRelationExtraction, predicting everything as positive except for 14 negative edges. Run2 predicts 133 edges as negative.

@juanmirocks (Collaborator, Author) commented Nov 11, 2016

Note:

  • 10% of the relna documents (140 * 0.1) takes 28s for training. The 100% corpus (140 docs) takes ~1m45s

In comparison:

  • 40% of LocText documents (100 * 0.4) takes 18s. The 100% corpus (100 docs) takes ~35s

That is, relna takes much more time for training --> maybe because relna produces many more features that are actually helpful and perhaps should be added to LocText

See #16

juanmirocks changed the title from "Match Shrikant's reported performance" to "‼️Match Shrikant's reported performance" on Dec 1, 2016
@juanmirocks (Collaborator, Author) commented Jan 2, 2017

Ponder:

Let ' s introduce *Juan Miguel* : is *awesome* !

OW1 = introduce s ' Let
IW1 = : is awesome !

OW2 = !
IW2 = is : Miguel Juan

LD = : is

Yes, possible problems if I introduce the inner window -- but at the moment I'm not

@juanmirocks (Collaborator, Author)

Maybe ponder about:

(BLR1, OW1, B) | (are, OW1, B) | (receptors, OW1, B) | (novel, OW1, B)
(BLR1, IW1, F) | (are, IW1, F) | (receptors, IW1, F) | (novel, IW1, F)

...

(BLR1, LD, F) | (are, LD, F) | (receptors, LD, F) | (novel, , )

@juanmirocks (Collaborator, Author)

16:57:30|LocText$ grep tokens_count_before run.log
Feature map: 6 == SentenceFeatureGenerator::8_tokens_count_before_[0] -- _1st_ value: 4
Feature map: 6 == SentenceFeatureGenerator::8_tokens_count_before_[0] -- _1st_ value: 0
Feature map: 6 == SentenceFeatureGenerator::8_tokens_count_before_[0] -- _1st_ value: 0
Feature map: 6 == SentenceFeatureGenerator::8_tokens_count_before_[0] -- _1st_ value: 2
Feature map: 6 == SentenceFeatureGenerator::8_tokens_count_before_[0] -- _1st_ value: 2

16:58:21|LocText$ grep tokens_count_after run.log
Feature map: 7 == SentenceFeatureGenerator::9_tokens_count_after_[0] -- _1st_ value: 10
Feature map: 7 == SentenceFeatureGenerator::9_tokens_count_after_[0] -- _1st_ value: 1
Feature map: 7 == SentenceFeatureGenerator::9_tokens_count_after_[0] -- _1st_ value: 1
Feature map: 7 == SentenceFeatureGenerator::9_tokens_count_after_[0] -- _1st_ value: 1
Feature map: 7 == SentenceFeatureGenerator::9_tokens_count_after_[0] -- _1st_ value: 16

@juanmirocks (Collaborator, Author)

Other (from Tanya)

weka Ranker, big variation
pca / 10


[20170112, 14:14:38] Tatyana Goldberg: Search:weka.attributeSelection.RankSearch -S 1 -R 0 -A weka.attributeSelection.GainRatioAttributeEval --
[20170112, 14:14:48] Tatyana Goldberg: Evaluator:    weka.attributeSelection.CfsSubsetEval

juanmirocks mentioned this issue on Jan 29, 2017
juanmirocks changed the title from "‼️Match Shrikant's reported performance" to "‼️Crush the Baseline" on Jan 29, 2017
juanmirocks added a commit that referenced this issue on Jan 30, 2017
@juanmirocks (Collaborator, Author)

@goldbergtatyana would it be possible for you to generate the same list as human_localization_all.tab but also for the organisms: {Arabidopsis, Saccharomyces cerevisiae (yeast)}?

And other model organisms if you think appropriate?

@goldbergtatyana (Collaborator)

Hi @juanmirocks, I need to see the file human_localization_all.tab to know what to generate for you.

@juanmirocks (Collaborator, Author) commented Feb 3, 2017 via email

@goldbergtatyana (Collaborator)

Seems to me like a simple UniProt search (a programmatic equivalent is sketched after these steps):

  1. go to http://www.uniprot.org/
  2. click on Advanced next to the search bar
  3. select Organism [OS] in the left drop-down and type in the organism name you're interested in (e.g. human)
  4. in the result window, select Filter By "Reviewed"
  5. then click on Columns, a button shown above the results table
  6. then select entry name, protein name, gene name (all in the Names & Taxonomy section), Subcellular location [CC] (in the Subcellular Location section) and Gene ontology (GO) (in the Gene Ontology (GO) section); unselect everything else
  7. in the result view you can then download the results in a tab-separated format
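
A minimal sketch of a programmatic equivalent of the steps above, against the (legacy) uniprot.org REST interface with format=tab; the column identifiers and output file name are assumptions and may need adjusting.

```python
import requests

params = {
    "query": 'organism:"Homo sapiens (Human) [9606]" AND reviewed:yes',
    "format": "tab",
    "columns": "entry name,protein names,genes,comment(SUBCELLULAR LOCATION),go",
}
response = requests.get("http://www.uniprot.org/uniprot/", params=params)
response.raise_for_status()
with open("human_localization_all.tab", "w") as out:
    out.write(response.text)
```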

@goldbergtatyana (Collaborator)

As for organisms, I'd suggest going as in the linked annotations article with:

  • human (Homo sapiens)
  • baker's yeast (Saccharomyces cerevisiae)
  • Arabidopsis thaliana

@juanmirocks (Collaborator, Author)

@goldbergtatyana oh I see -- you generated it this way

Unfortunately the output from UniProt doesn't normalize the subcellular localizations when they are extracted from citations, as in:

Cytoplasm {ECO:0000305|PubMed:16410549}. Nucleus {ECO:0000269|PubMed:16410549}.

Also, sometimes I don't see the relation between the CC column and the GO column (which should be almost the same, or at least congruent?). For example, in

`Q8WZ42 TITIN_HUMAN Titin (EC 2.7.11.1) (Connectin) (Rhabdomyosarcoma antigen MU-RMS-40.14) TTN SUBCELLULAR LOCATION: Cytoplasm {ECO:0000305|PubMed:16410549}. Nucleus {ECO:0000269|PubMed:16410549}. condensed nuclear chromosome [GO:0000794]; cytosol [GO:0005829]; extracellular exosome [GO:0070062]; extracellular region [GO:0005576]; I band [GO:0031674]; M band [GO:0031430]; muscle myosin complex [GO:0005859]; striated muscle thin filament [GO:0005865]; Z disc [GO:0030018]`

@goldbergtatyana (Collaborator)

@juanmirocks please upload the original file for me to be able to reconstruct the logic then. Unfortunately, by now I do not remember what was done. Thanks

@juanmirocks (Collaborator, Author) commented Feb 3, 2017 via email

@juanmirocks (Collaborator, Author) commented Feb 4, 2017

organisms (NCBI taxonomy ID: count):

"4679": 1,
"7955": 1,
"9913": 2,
"562": 5,
"3888": 5,
"10116": 6,
"4097": 7,
"7227": 7,
"4577": 14,
"10090": 44,
"3702": 179,
"9606": 222,
"4932": 302,

UniProt query to get the reviewed proteins of all organisms mentioned in the corpus:

(organism:human OR organism:yeast OR organism:arabidopsis OR (organism:"Allium cepa (Onion) [4679]" OR organism:"Danio rerio (Zebrafish) (Brachydanio rerio) [7955]" OR organism:"Bos taurus (Bovine) [9913]" OR organism:"Escherichia coli [562]" OR organism:"Pisum sativum (Garden pea) [3888]" OR organism:"Rattus norvegicus (Rat) [10116]" OR organism:"Nicotiana tabacum (Common tobacco) [4097]" OR organism:"Drosophila melanogaster (Fruit fly) [7227]" OR organism:"Zea mays (Maize) [4577]" OR organism:"Mus musculus (Mouse) [10090]")) AND reviewed:yes

that is:

http://www.uniprot.org/uniprot/?query=%28organism%3Ahuman+OR+organism%3Ayeast+OR+organism%3Aarabidopsis+OR+%28organism%3A%22Allium+cepa+%28Onion%29+%5B4679%5D%22+OR+organism%3A%22Danio+rerio+%28Zebrafish%29+%28Brachydanio+rerio%29+%5B7955%5D%22+OR+organism%3A%22Bos+taurus+%28Bovine%29+%5B9913%5D%22+OR+organism%3A%22Escherichia+coli+%5B562%5D%22+OR+organism%3A%22Pisum+sativum+%28Garden+pea%29+%5B3888%5D%22+OR+organism%3A%22Rattus+norvegicus+%28Rat%29+%5B10116%5D%22+OR+organism%3A%22Nicotiana+tabacum+%28Common+tobacco%29+%5B4097%5D%22+OR+organism%3A%22Drosophila+melanogaster+%28Fruit+fly%29+%5B7227%5D%22+OR+organism%3A%22Zea+mays+%28Maize%29+%5B4577%5D%22+OR+organism%3A%22Mus+musculus+%28Mouse%29+%5B10090%5D%22%29%29+AND+reviewed%3Ayes&sort=score

with all columns:

http://www.uniprot.org/uniprot/?query=%28organism%3Ahuman+OR+organism%3Ayeast+OR+organism%3Aarabidopsis+OR+%28organism%3A%22Allium+cepa+%28Onion%29+%5B4679%5D%22+OR+organism%3A%22Danio+rerio+%28Zebrafish%29+%28Brachydanio+rerio%29+%5B7955%5D%22+OR+organism%3A%22Bos+taurus+%28Bovine%29+%5B9913%5D%22+OR+organism%3A%22Escherichia+coli+%5B562%5D%22+OR+organism%3A%22Pisum+sativum+%28Garden+pea%29+%5B3888%5D%22+OR+organism%3A%22Rattus+norvegicus+%28Rat%29+%5B10116%5D%22+OR+organism%3A%22Nicotiana+tabacum+%28Common+tobacco%29+%5B4097%5D%22+OR+organism%3A%22Drosophila+melanogaster+%28Fruit+fly%29+%5B7227%5D%22+OR+organism%3A%22Zea+mays+%28Maize%29+%5B4577%5D%22+OR+organism%3A%22Mus+musculus+%28Mouse%29+%5B10090%5D%22%29%29+AND+reviewed%3Ayes&sort=score

@juanmirocks (Collaborator, Author)

@goldbergtatyana finally I get good performance 😀

@juanmirocks (Collaborator, Author)

Goal of the task achieved:

  • F1(Baseline) = 71
  • F1(LocText SS) = 79 -- SS == Same Sentence model, i.e., only relations within the same sentence are checked

We are now going to concentrate on different-sentence models with different distances, see #34
