Training a converted model #10
Hi @shensmobile, there are two ways to use it:
The second way is better but requires more resources. If your model is converted to 4096 tokens, you can fit sequences of up to 4096 tokens without problem. In practice, I recommend starting with local attention only if your inputs are not that long, e.g. for classification.
Hi, thanks for the reply! I'll start with those parameters. When I convert a model (with the pip package in Python) that's already been fine-tuned on a classification task, I get this error:
Do I need to specify not to use global tokens, or is this ignorable? Also, when I directly convert my language model and attempt to train immediately, I get this error during the training cycle:
Do I need to save the model and tokenizer first, then reload so I can specify trust_remote_code=True? Sorry for these really basic questions; I appreciate your help.
The first warning is ignorable. It should work out of the box with this code:
from lsg_converter import LSGConverter
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# To convert a model
model_path = "myroberta_model" # or whatever model
converter = LSGConverter(max_sequence_length=4096)
# Simple conversion
model, tokenizer = converter.convert_from_pretrained(model_path, block_size=128, sparse_block_size=0)
# If you need to change the architecture, example: MaskedLM to SequenceClassification
# Useful if you load a "roberta-base" model
model, tokenizer = converter.convert_from_pretrained(model_path, block_size=128, sparse_block_size=0, architecture="RobertaForSequenceClassification")
# Some training logic here (fine-tune as you would any classification model)
# Save after training
model.save_pretrained("my_lsg_model")
tokenizer.save_pretrained("my_lsg_model")
# To reload
model = AutoModelForSequenceClassification.from_pretrained("my_lsg_model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("my_lsg_model", trust_remote_code=True)

If you have a problem, try to convert, save the model, and reload to make sure everything is fine.
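As a quick sanity check after reloading, something like this should run end to end (a sketch only: the sample text, the 4096-token cap, and the use of argmax for a single-label classification head are assumptions based on the conversion above):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("my_lsg_model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("my_lsg_model", trust_remote_code=True)

# A hypothetical document longer than the usual 512-token limit
text = "some long document " * 500
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")

# Forward pass without gradients, just to confirm long sequences go through
with torch.no_grad():
    logits = model(**inputs).logits
print(inputs["input_ids"].shape, logits.argmax(dim=-1).item())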
It was transformers. I was on 4.30.2; updating to the latest version appears to have gotten it working. Now I get to tinker around with values to get it to fit on my GPU and not take half a day. Thanks for the help! I'm curious to see how this stacks up against Longformer for my data. Not sure if it's valuable, but I'll report back if there are significant improvements in real-world applications!
@ccdv-ai Early results have been good! I've run into some memory limits with (block_size=256, sparse_block_size=128, sparsity_factor=8, sparsity_type="bos_pooling"). Given that most of my documents are under 512 tokens, only 20-30% are 512-1024, and maybe 1-2% are 1024+, does it make sense to use such a large sparse_block_size and sparsity_factor? Would I still be able to achieve good multiclass text classification results with a smaller sparse block size?
@shensmobile For a given token, the maximum context is determined by block_size, sparse_block_size, and sparsity_factor. It's better to use the same size for blocks and sparse blocks for efficiency reasons. You can remove dropout on attention to reduce memory even more; check your model config to get the name of the param.
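For a RoBERTa-style model, a minimal sketch of that is to pass the override at load time so the attention layers are built without dropout (attention_probs_dropout_prob is the usual BERT/RoBERTa config name, but that is an assumption here; print model.config to confirm the name for your architecture):

from transformers import AutoModelForSequenceClassification

# Config values passed as kwargs to from_pretrained override the saved config,
# so the model is instantiated with attention dropout disabled.
model = AutoModelForSequenceClassification.from_pretrained(
    "my_lsg_model",                     # path from the conversion example above
    trust_remote_code=True,
    attention_probs_dropout_prob=0.0,   # assumed RoBERTa-style name; check model.config
)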
Thanks! I appreciate the advice. I'm going to experiment with 256/0/0 and with 128/128/8, given the added context that matching block sizes is better for efficiency. I'm currently re-implementing gradient accumulation and the BnB 8-bit Adam optimizer; I had issues implementing gradient checkpointing. Would removing dropout on attention be a better first step than either of the aforementioned options?

Edit: The 8-bit Adam optimizer actually had the opposite impact; it took up more memory instead of saving it, so I opted for just using gradient accumulation. Removing dropout on attention did save some memory, but not enough to let me increase my batch size. I could increase my gradient accumulation, but training speed did not notably improve. I'm currently training a 256/0/0 model and may revisit dropout/gradient accumulation when I try training the 128/128/8 model. Out of curiosity, if I wanted to try 128/128/4, I should switch to norm or pooling, right? The docs say that the best sparsity type is task dependent; would pooling still be the best for text classification?
You can also try using fp16 instead of fp32. Gradient accumulation is fine. The best hyperparameter choice is very task/data specific. If you are short on memory, just remove sparse connections.
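One way to wire that up, e.g. with the Hugging Face Trainer (a sketch only: the output path, dataset variables, epochs, and learning rate are placeholders, and the batch size / accumulation steps simply mirror the numbers discussed in this thread):

from transformers import Trainer, TrainingArguments

# Mixed precision + gradient accumulation to fit long sequences in memory
training_args = TrainingArguments(
    output_dir="lsg_classifier",        # hypothetical output path
    fp16=True,                          # fp16 instead of fp32
    per_device_train_batch_size=2,      # small physical batch
    gradient_accumulation_steps=4,      # effective batch size of 8
    num_train_epochs=3,                 # placeholder
    learning_rate=2e-5,                 # placeholder
)

# Optional: gradient checkpointing trades compute for memory
# model.gradient_checkpointing_enable()

trainer = Trainer(
    model=model,                        # the converted LSG model
    args=training_args,
    train_dataset=train_dataset,        # assumed pre-tokenized datasets
    eval_dataset=eval_dataset,
)
trainer.train()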
I'm currently using fp16 and gradient accumulation, and I went back to Adam since 8-bit Adam didn't seem to be saving any memory, but I could try SGD/Adafactor as well. It still barely fits on my 4090 with a batch size of 2 and gradient accumulation of 4. Removing sparse connections helped, but I do like the idea of additional context being available. I'm testing 256/0/0 and 128/128/4 to see which would be best for my application. It just takes a while to train :)

Edit: Actually, I'm training 128/128/4 now and it takes up less memory than 256/0/0.

Edit 2: Huzzah! Training on 128/128/4/pooling was excellent, my best results yet. I was able to eke out an additional 3% F1 (91% to 94%) on my test set over both RoBERTa with chunking and Longformer. Thanks for all of the help!
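For reference, the conversion call for that final 128/128/4 pooling setup would look roughly like this (a sketch: passing sparsity_factor and sparsity_type as keyword arguments is inferred from the settings quoted earlier in the thread, so check the lsg_converter README for the exact names in your version):

from lsg_converter import LSGConverter

converter = LSGConverter(max_sequence_length=4096)

# Local blocks of 128, sparse blocks of 128, sparsity factor 4, pooling-based sparse selection
model, tokenizer = converter.convert_from_pretrained(
    "myroberta_model",                  # same example path as above
    architecture="RobertaForSequenceClassification",
    block_size=128,
    sparse_block_size=128,
    sparsity_factor=4,                  # assumed kwarg names, per the settings discussed above
    sparsity_type="pooling",
)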
I've been going back and forth between RoBERTa and Longformer for classification. My document lengths are quite sporadic: most of my documents are around 300 tokens, but occasionally I get massive 1000-token inputs (probably split 70:30). Longformer is definitely superior to splitting/chunking/pooling the documents with RoBERTa for long documents, but I often find that RoBERTa is more accurate for shorter documents.
I'm interested in trying out LSG, but I'm curious about the training process. I like that with Longformer I can train on long documents instead of having to truncate or strip out the middle of documents to fit under the 512-token window, so the model learns from the full context of each document. With LSG, am I able to train after the conversion to take advantage of the longer context during training?
Edit: There is mention of training memory requirements in the main repo, but I just wanted to confirm that I can train the LSG-converted model just like I would the original RoBERTa model (albeit with longer context). If so, are there any best practices? For example, would it be better to start with the base RoBERTa model or with my fine-tuned RoBERTa model that was already trained on the classification task (using truncated inputs)? And what are some optimal local/sparse block and sparsity settings for text classification?