huggingface
Use this page to discuss how to add HuggingFace support. Feel free to link to forum threads, issues, etc.
- Transform/ItemTransform support for dictionaries in addition to tensors (or lists of tensors)
- Learner.summary support for dictionaries
- Support for Masked Language Modeling (MLM)
- Support for the various MLM denoising objectives
- Integration with the huggingface nlp library (useful for any NLU task, not just transformers)
Currently, fastai presupposes that a "thing" is represented by a single tensor or a list of tensors. In huggingface, however, a "thing" (a sequence of text) is represented by multiple tensors (e.g., input_ids, attention_mask, token_type_ids, etc.) encapsulated in a dictionary. fastai has no problem returning such a dictionary from the encodes method of Transform or ItemTransform instances for modeling purposes, but it has trouble dealing with it when it comes to Learner.summary and the various show methods like show_batch and show_results.
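For concreteness, here is a minimal sketch of the situation (the class name HFDictTransform and the checkpoint are illustrative, not blurr's or fastai's API): a Transform whose encodes simply returns the dictionary produced by a HuggingFace tokenizer.

```python
from fastcore.transform import Transform
from transformers import AutoTokenizer

class HFDictTransform(Transform):
    "Hypothetical transform: encodes a string into the dict a HF tokenizer returns."
    def __init__(self, pretrained_name='bert-base-uncased', max_len=128):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
        self.max_len = max_len

    def encodes(self, text: str):
        # Returns {'input_ids': ..., 'attention_mask': ..., 'token_type_ids': ...},
        # each a tensor of shape (1, max_len) -- a dict, not a single tensor.
        return self.tokenizer(text, padding='max_length', truncation=True,
                              max_length=self.max_len, return_tensors='pt')

tfm = HFDictTransform()
enc = tfm("fastai meets huggingface")
print({k: v.shape for k, v in enc.items()})  # every value is a tensor; show_batch has no single tensor to display
```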
One attempt to solve this comes from the blurr library, but it requires a custom batch transform pretty much just for the purpose of working with the dictionary returned from its HF_TokenizerTransform object. Possible approaches include:
- Use a convention whereby, if the type is a dictionary of tensors, the FIRST one is used to represent the thing. In HF that is input_ids, which is exactly what you want to use for showing (see the sketch after this list).
- Implement something like an optional transforms_repr method that returns a single-tensor representation of your thing. If the method exists on your Transform, it is used by the show/summary methods to get a single tensor that works with them.
- Pass a "key" to the transform that defines which item in the dictionary should represent the thing for summary/showing purposes.
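As a sketch of the first two ideas (repr_tensor is a made-up helper name, not an actual fastai or blurr API), the show/summary machinery could reduce a dict to one representative tensor like so:

```python
import torch

def repr_tensor(o, key=None):
    "Hypothetical helper: reduce a dict of tensors to one representative tensor."
    if isinstance(o, dict):
        # explicit key if given, otherwise the FIRST entry; HF tokenizers put
        # input_ids first and Python dicts preserve insertion order
        return o[key] if key is not None else next(iter(o.values()))
    return o

batch = {'input_ids':      torch.tensor([[101, 2023, 2003, 102]]),
         'attention_mask': torch.tensor([[1, 1, 1, 1]])}
print(repr_tensor(batch).shape)                   # torch.Size([1, 4]) -> usable by show_batch/summary
print(repr_tensor(batch, key='input_ids').shape)  # same tensor, selected explicitly
```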
This should be a fairly straightforward change once the Transform and ItemTransform classes are updated (see above). What the blurr library does now is define its own blurr_summary methods that essentially change only a single line to make summary work with a dictionary. See def blurr_summary(self:Learner) as well as the @patched method for nn.Module. The only real change that needed to be made is on line 66:
inp_sz = _print_shapes(apply(lambda x:x.shape, xb[0]['input_ids']), bs)
Unfortunately, this currently requires completely overriding both of those methods.
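A hedged sketch of what a dict-aware version of that shape lookup could look like, added as a separately patched method rather than an override of summary itself (the names _first_shape and dict_input_shape are made up for illustration):

```python
from fastcore.all import patch
from fastai.learner import Learner

def _first_shape(x):
    "Shape of `x`; if `x` is a dict of tensors (a HF batch), use its first entry (input_ids)."
    if isinstance(x, dict): x = next(iter(x.values()))
    return tuple(x.shape)

@patch
def dict_input_shape(self: Learner):
    "Hypothetical patched method: report the input shape of one batch, dict or tensor alike."
    xb = self.dls.one_batch()[0]
    return _first_shape(xb)
```

The point is that if the shape computation inside summary went through something like _first_shape, the dictionary case would be a one-line change rather than a full override of both methods.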
There is currently no support for MLM in fastai; the only support is for the causal LM objective used by ULMFiT and transformer models like GPT-2. Since most transformer models use an MLM pre-training objective, it would be nice to have support for it in fastai so that, where possible, folks can fine-tune those LMs much as they fine-tune ULMFiT today. This may be prohibitive for certain transformers whose models are simply too big.
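For reference, the standard token-masking objective is already available on the huggingface side via DataCollatorForLanguageModeling (current transformers API); a fastai integration would need something equivalent in its data pipeline. The snippet below only shows the HF collator, not a fastai API:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Tokenize a couple of sentences, then let the collator mask ~15% of tokens and
# build labels (-100 everywhere except the masked positions).
examples = [tokenizer(t) for t in ["fastai meets huggingface", "masked language modeling"]]
batch = collator(examples)
print(batch['input_ids'].shape, batch['labels'].shape)
```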
The MLM objectives vary across transformers. The T5 and BART papers include nice descriptions of most of them. They include:
- Token Masking
- Token Deletion
- Token Infilling
- Sentence Permutation
- Document Rotation
See section 2.2 in the BART paper and section 3.14 in the T5 paper for visuals and detailed explanations for each.
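To make the first of these concrete, here is a bare-bones sketch of Token Masking written from scratch (it ignores special tokens and BERT's 80/10/10 replacement split):

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    "Replace ~mask_prob of the tokens with [MASK]; compute the loss only on those positions."
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob   # Bernoulli(mask_prob) per position
    labels[~masked] = ignore_index                     # non-masked positions are ignored by the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id                  # the model must reconstruct these tokens
    return corrupted, labels

ids = torch.randint(1000, 2000, (2, 16))               # fake batch of token ids
x, y = mask_tokens(ids, mask_token_id=103)             # 103 is [MASK] for bert-base-uncased
```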
Integration with the huggingface nlp library: TBD
- https://forums.fast.ai/t/fasthugs-fastai-v2-and-huggingface-transformers/63681
- https://forums.fast.ai/t/update-to-blurr-library-huggingface-fastai-integration-for-developers/70619
- https://forums.fast.ai/t/pretrain-finetune-mlm-6-reproduce-glue-finetuning-results/72875
- https://forums.fast.ai/t/speedtest-huggingface-nlp-datasets-lib-vs-fastai-textdataloaders/75291
- blurr (https://ohmeow.github.io/blurr/)
- fasthugs (https://github.com/morganmcg1/fasthugs)
- fasthugs MLM specific notebook (https://github.com/morganmcg1/fasthugs/blob/master/fasthugs_language_model.ipynb)
- fastai causal LM tutorial (https://dev.fast.ai/tutorial.transformers)