SciBERT - NER #8
Sina and I finally got our data reformatted after a couple of hours, in mar12_NER/20210326_set_up_NER_runs_with_dividers.ipynb. The data was saved to data/ner/chemprot_sub_enzyme/clean/{dev, train, test}.txt. We ran it yesterday but keep getting low F1 scores, so I'm going to start looking into whether we can use bits and pieces of the SciBERT model to include class_weights. More coming.
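As a quick check on whether class imbalance explains the low F1 scores, something like the following could count tag frequencies in those files. This is a hypothetical sanity check, assuming the reformatted data follows a CoNLL-2003-style layout (one token per line with its NER tag in the last column, blank lines between sentences), which is what AllenNLP's NER readers typically expect:

```python
# Hypothetical sanity check: count NER tag frequencies in the reformatted data.
# Assumes CoNLL-2003-style lines ("token POS chunk NER-tag"); adjust the column
# index if the files only have two columns (token and tag).
from collections import Counter

def tag_counts(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                continue  # skip sentence breaks and document markers
            counts[line.split()[-1]] += 1  # NER tag is the last column
    return counts

print(tag_counts("data/ner/chemprot_sub_enzyme/clean/train.txt"))
# A very large O-to-entity ratio here would be consistent with the low F1s.
```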
How we ran it for testing (we didn't want to use compute hours):
Creating a new kernel, 6:20pm: hit an issue with the IProgress module, so ran these:
Okay, my plan:
Ugh, we need to modify the loss if we want the model to actually LEARN these weights though. Train:
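If this were an ordinary cross-entropy-trained tagger, the change would be straightforward: pass per-class weights into the loss. A minimal PyTorch sketch, with made-up weights and tag order:

```python
import torch
import torch.nn as nn

# Hypothetical per-class weights: downweight the dominant "O" tag so the rare
# entity tags matter more. Tag order here is assumed to be
# [O, B-SUBSTRATE, I-SUBSTRATE, B-ENZYME, I-ENZYME].
class_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(2, 7, 5)         # (batch, seq_len, num_tags) from the model
labels = torch.randint(0, 5, (2, 7))  # gold tag ids
loss = loss_fn(logits.view(-1, 5), labels.view(-1))
```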
Oof, okay, switching to local to make changes to AllenNLP. I'll try to set up a similar file structure on Savio and sync it to GitHub. Ah, sike: we realized it's not bert_text_classifier that's used for the NER set, but rather the bert_crf_tagger.py file. We'll try to see if we can modify that to use class weights instead!
kmkurn/pytorch-crf#47 is helpful, and the files to modify include ner_finetune.json, the AllenNLP CRF class, and the bert_crf_tagger.py file.
Did some more digging into how people have fixed imbalanced-data issues in AllenNLP before. It seems there is no generalized solution, according to this thread. Mrunali's and my experiments with directly modifying the weights haven't made a big difference to performance so far; we might be missing something, though.
Looking into modifying CRFs to be weighted. In allenai/allennlp#4619 someone said, "I mean, I believe it can work in practice, but their theoretical motivation is not correct. If this is the case, we could do it with a much simpler approach (like weighted emission scores)", which is what we did: tensorflow/addons#817. Okay, I'm just going to keep a running list of updates in this comment as I find other comments/potential implementations. (In any case, can you tell how much fun I'm having with GitHub issues? lmao)
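A minimal sketch of that weighted-emission idea, using the kmkurn/pytorch-crf library cited above. The weights are illustrative, and, as the quote notes, this is a heuristic without a clean probabilistic justification:

```python
# Sketch of "weighted emission scores": scale each tag's emission score before
# it enters the CRF, so rare entity tags contribute more to the log-likelihood.
# Weights are made up for illustration, not tuned.
import torch
from torchcrf import CRF  # pip install pytorch-crf

num_tags = 5
tag_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0])  # downweight "O"

crf = CRF(num_tags, batch_first=True)
emissions = torch.randn(2, 7, num_tags)   # (batch, seq_len, num_tags) from BERT
tags = torch.randint(0, num_tags, (2, 7))
mask = torch.ones(2, 7, dtype=torch.bool)

weighted_emissions = emissions * tag_weights    # broadcast over batch and seq
loss = -crf(weighted_emissions, tags, mask=mask)  # negative log-likelihood
```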
This textbook chapter from my NLP class actually goes over what we have concluded is a good approach to this problem, which I thought was validating (i.e., NER/relation extraction + a semi-supervised approach): https://web.stanford.edu/~jurafsky/slp3/17.pdf
Is the semi-supervised approach the one you're/they're thinking of? It does seem really cool, and it seems to have a decent track record, though we'd probably need to rewrite a lot of code. Do you think this is something worth pursuing?
Yeah, take a look at 17.2.4 in there (distant supervision for relation extraction). It sounds very similar to the pattern-recognition technique we've been talking about, except that it learns non-regex patterns for features (or aggregates data to be fed into a NN directly, without extracting features beforehand). The problem is that it generally has low precision, which was also true of the other pattern-matching paper we read, so I'm not sure what the best solution is for us.
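As a toy illustration of the distant-supervision idea (the knowledge base, entity pairs, and relation name here are all made up):

```python
# Distant supervision (SLP3 17.2.4), toy version: given a small knowledge base
# of known (enzyme, substrate) pairs, any sentence mentioning both entities of
# a known pair becomes a noisy positive training example, with no hand labeling.
known_pairs = {("trypsin", "casein"), ("lactase", "lactose")}  # made-up KB

def distant_label(sentences):
    labeled = []
    for sent in sentences:
        text = sent.lower()
        for enzyme, substrate in known_pairs:
            if enzyme in text and substrate in text:
                labeled.append((sent, (enzyme, substrate), "catalyzes"))
    return labeled

corpus = ["Trypsin readily hydrolyzes casein in vitro.",
          "Casein micelles are stable at low pH."]
print(distant_label(corpus))  # only the first sentence is (noisily) labeled
```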
Trying to rebalance the data (with …
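The comment above is cut off, so the exact strategy isn't recorded; one plausible rebalancing approach, offered only as a guess, is to oversample sentences that contain entity tags:

```python
# One guess at a rebalancing strategy: oversample sentences with at least one
# entity tag so the model sees fewer all-"O" sequences. "sentences" is assumed
# to be a list of (tokens, tags) pairs read from the CoNLL-style files.
import random

def oversample_entity_sentences(sentences, factor=3, seed=0):
    rng = random.Random(seed)
    with_entities = [s for s in sentences if any(t != "O" for t in s[1])]
    rebalanced = list(sentences) + with_entities * (factor - 1)
    rng.shuffle(rebalanced)
    return rebalanced
```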
Praise Ivan, who modified a Hugging Face implementation (in his scratch folder, …
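Ivan's actual scratch code isn't shown here, but one common way to add class weights to a Hugging Face token-classification setup is to subclass Trainer and override compute_loss; a hedged sketch, with placeholder weights and label count:

```python
# Guess at the general shape of a weighted Hugging Face fine-tuning setup,
# not Ivan's actual code: a Trainer subclass with weighted cross-entropy.
import torch
from torch import nn
from transformers import Trainer

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # (batch, seq_len, num_labels)
        weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0], device=logits.device)
        loss_fn = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```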
Revised TODOs: …
I'm working on this at #26.
Overview
We are doing this to compare SciBERT's performance on NER relative to its performance on text classification. SciBERT didn't provide a chemprot dataset for NER, so we are taking the chemprot dataset straight from its source (link here?) and reformatting it to fit the model's NER task.
Attempt (ongoing)
We are in the middle of converting the source chemprot dataset: doing part-of-speech tagging on each word and marking the relevant entities (substrate, product, and enzyme). A rough sketch of this step is below.
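This sketch assumes chemprot provides character-level entity spans per abstract and uses spaCy for tokenization and POS tags; the function and field names are illustrative, not the actual notebook code:

```python
# Rough sketch of the conversion: map character-level entity spans onto
# spaCy tokens as IOB2 labels, emitting CoNLL-2003-style 4-column lines
# (token, POS, chunk placeholder, NER tag).
import spacy

nlp = spacy.load("en_core_web_sm")

def to_conll(text, spans):
    # spans: list of (start_char, end_char, label), with label in
    # {"SUBSTRATE", "PRODUCT", "ENZYME"}
    doc = nlp(text)
    lines = []
    for token in doc:
        tag = "O"
        for start, end, label in spans:
            if token.idx >= start and token.idx + len(token) <= end:
                tag = ("B-" if token.idx == start else "I-") + label
                break
        lines.append(f"{token.text} {token.tag_} O {tag}")
    return "\n".join(lines)
```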
Plans
We will do the full 75-epoch training run on this dataset and see how it performs.