Summary notebook comments #4
Corpora
Metadata
Tagging
Phenotype Inference
Candidate Generation/Candidate Annotation/Labeling Functions
As discussed in person yesterday, it's probably best to quantify how well our "tokenize and discard non-surface-marker tokens" strategy works by comparing it to the performance of the NormCo-inspired "embed all tokens and add the embedded vectors" strategy for mapping entity mentions to CL terms. A good chance to make use of the cool work in #2 (comment)!
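For concreteness, here is a rough sketch (illustrative only, not project code) of the two mention-to-CL-term strategies being compared: filtering a mention down to surface-marker-like tokens and scoring by token overlap, versus the NormCo-inspired approach of embedding every token, adding the vectors, and scoring by cosine similarity against a CL term embedding. `MARKER_RE`, `embed`, and both scoring functions are assumed placeholders.

```python
import re
import numpy as np

# Hypothetical heuristic for "surface marker" tokens such as "CD4+" or "Foxp3+".
MARKER_RE = re.compile(r"^\w+[+-]$")

def filter_and_match(mention_tokens, term_tokens):
    """Strategy 1: discard non-marker tokens, then score by token overlap with a CL term."""
    kept = {t for t in mention_tokens if MARKER_RE.match(t)}
    return len(kept & set(term_tokens)) / max(len(kept), 1)

def sum_embeddings_and_match(mention_tokens, term_vector, embed):
    """Strategy 2 (NormCo-inspired): embed every token, add the vectors, score by cosine."""
    v = np.sum([embed(t) for t in mention_tokens], axis=0)
    denom = np.linalg.norm(v) * np.linalg.norm(term_vector) + 1e-9
    return float(v @ term_vector) / denom
```

Comparing the two would then amount to running both scorers over the same hand-labeled mention/term pairs and reporting accuracy for each.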
To your comments (anything omitted went on to a TODO list in the summary report as-is):

Corpora
It returns 124,720 results (as of ~March). My hope was that sorting by relevance through the Entrez API would help give me the top ~20k results that would require the least amount of filtering to find docs worth annotating.
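For reference, a minimal Biopython sketch of a relevance-sorted Entrez search like the one described here; the query string and `retmax` are illustrative, not the actual corpus query.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address; placeholder here

query = '"cell type" AND "flow cytometry"'  # illustrative query only
handle = Entrez.esearch(db="pubmed", term=query, sort="relevance", retmax=1000)
record = Entrez.read(handle)
handle.close()

print(record["Count"])       # total number of matching documents
print(record["IdList"][:5])  # top relevance-ranked PMIDs
```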
I was doing that initially but wasn't able to get very large result sets. I added some of my queries + result count experiments to the summary but I think this difference demonstrates the issue I was having:
The queries I added to the summary show what happens as you make the query less specific, but even at the

Metadata
There are at least a few surface forms like "CDw198" (CCR8) that I think make sense to catch. Is there some set of CD* molecules you think would be problematic to include?
I think the best way they would capture that is with GO annotations, and for example CD14 (chicken) has the cell surface GO annotation, yet CD4 (human) (and its parent CD4 molecule) and a few others I checked for humans do not, despite having the string "surface" in several of their synonyms. I also don't see any other GO annotations common to all of them, which suggests that filtering by annotation wouldn't work very well.

Phenotype Inference
A better way to break it down -- of all 100% of the entity mentions from the JNLPBA tagger:
By 66% I mean (5.9% / (5.9% + 3.6%)) ~= 2/3, implying that the "CD4+CD25+Foxp3+ T cells" phrasing is more common after removing entity mentions that are either impossible to classify or are very easy to classify.
That sounds right to me, and I think the analysis I did will at least go a long way towards knowing what to throw out. I'm not sure what's more likely to constitute a new CL term, though. Do you think they're more likely to look like

Candidate Generation/Candidate Annotation/Labeling Functions
Yea, more or less. I thought it would be worth explaining how tags and sentences are combined to create candidates. The process is pretty simple: every pairwise combination of tags (of the appropriate types) within any one sentence results in a new candidate. It took me a bit to confirm as much in the Snorkel code, though.
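To make that concrete, here is a simplified stand-in for the mechanic (these are not Snorkel classes; `Tag`, `Candidate`, and the type names are placeholders): within a single sentence, every pairwise combination of tags of the two relevant types becomes a candidate.

```python
from collections import namedtuple
from itertools import product

# Stand-in containers; not Snorkel classes.
Tag = namedtuple("Tag", ["text", "type", "char_span"])
Candidate = namedtuple("Candidate", ["first", "second"])

def candidates_for_sentence(tags, type_a, type_b):
    """Pair every tag of type_a with every tag of type_b found in the same sentence."""
    a_tags = [t for t in tags if t.type == type_a]
    b_tags = [t for t in tags if t.type == type_b]
    return [Candidate(a, b) for a, b in product(a_tags, b_tags)]

# Two cell-type tags and one protein tag in one sentence -> two candidates.
tags = [
    Tag("CD4+CD25+ T cells", "cell_type", (0, 17)),
    Tag("Foxp3", "protein", (31, 36)),
    Tag("regulatory T cells", "cell_type", (48, 66)),
]
assert len(candidates_for_sentence(tags, "cell_type", "protein")) == 2
```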
I agree! Particularly when it comes to building a lot of regex-based patterns using synonyms -- there seems to be a lot of room for tooling like that.
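As a sketch of that kind of tooling (the synonym list, term ID, and label values here are all illustrative, and the functions would still need to be wrapped for whichever Snorkel version is in use), one could generate regex labeling functions directly from ontology synonym lists:

```python
import re

POSITIVE, ABSTAIN = 1, -1  # illustrative label conventions

def make_synonym_lf(term_id, synonyms):
    """Build a labeling function that votes POSITIVE when any synonym appears in the text."""
    pattern = re.compile("|".join(re.escape(s) for s in synonyms), re.IGNORECASE)

    def lf(candidate_text):
        return POSITIVE if pattern.search(candidate_text) else ABSTAIN

    lf.__name__ = "lf_syn_" + term_id.replace(":", "_")
    return lf

# Hypothetical synonyms; real ones would come from the CL ontology.
lf_treg = make_synonym_lf("CL:0000815", ["regulatory T cell", "Treg", "T regulatory cell"])
```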
I'm working on building classifiers as labeling functions using only the manually annotated dev data. I think they'll make for decent labeling functions and, more importantly, a good baseline, since the "discriminative model" (the one trained on the probabilistic labels from their generative model) should be able to perform better when evaluated on the same validation + test datasets I created, which are also hand labeled.
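A rough sketch of that baseline idea, under assumed names (raw-text TF-IDF features, the probability threshold, and the label conventions are placeholders; the real classifiers may well use candidate-specific features instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1  # illustrative label conventions

def train_dev_classifier(dev_texts, dev_labels):
    """Fit a simple text classifier on the hand-labeled dev split."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(dev_texts, dev_labels)
    return clf

def make_classifier_lf(clf, threshold=0.7):
    """Expose the classifier as a labeling function that abstains when it is unsure."""
    def lf(candidate_text):
        p = clf.predict_proba([candidate_text])[0, 1]
        if p >= threshold:
            return POSITIVE
        if p <= 1 - threshold:
            return NEGATIVE
        return ABSTAIN
    return lf
```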
Both! I guess it depends on the frequency of the entity mention. Figuring out where to position it within the CL ontology will be fun, as will determining whether any of these new types are synonyms of one another. If we wanted to get fancy, we could also weight by credibility using things like the quality of the journal, the citation count of the paper, and the track record of the authors reporting the new type. I suspect we'll need to do some manual inspection of the entity mentions left over after the entity linking step, similar to the manual inspection you did following NormCo-style embedding.
Well that's disappointing. It's also disappointing that the GO Annotations team only accepts suggested improvements via email. I wish they had a discussion forum! The EMBL-EBI needs a single Discourse instance for all of their open data projects...
Hey @hammer, I updated the summary and tied it to a release that I think is a solid improvement on my first loop through the whole process where I had no real evaluation criteria other than correlation with iX. I added "Strong Supervision" and "Weak Supervision" sections to the summary that include some performance numbers on RE for each task now that I have separate validation and test sets. My takeaways from this iteration are:
My next steps before trying to run another iteration of this loop will be:
An alternative I see to this: if that performance is at least close to good enough (an F1 between 60 and 70), then I can let ontology matching remain an orthogonal task and collect just enough new evaluation data to be comfortable picking and applying a final model. Results from applying that final model could then easily be merged with any progress made on the ontology matching front, without needing to re-run the very expensive modeling loop. I think I could manage doing that and writing it up by mid-August, but the full list above would likely take longer. What do you think?