-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Added DataCollator
for dynamic operations for each batch.
#5221
Conversation
Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Could you fix the failing test, and also add a test for the LanguageModelingDataCollator
?
CHANGELOG.md
Outdated
@@ -6,6 +6,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), | |||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). | |||
|
|||
## Unreleased | |||
### Added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's already an "Added" section below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wlhgtc, can you move this to the "Added" section below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The collator seems like a really weird place to do things like this, but if that's how huggingface does it, we shouldn't be different just for the sake of being different.
I'll leave final approval to @epwalsh.
Co-authored-by: Pete <epwalsh10@gmail.com>
It's the only way that could do some dynamic operations for different batch. |
It would make more sense to me to write a data loader that has the ability to modify instances before they are being batched. We could give the data loader a callback to do just that. That makes more sense than abusing a function that's meant to just construct batches. But like I said, let's just stick to what huggingface does, so people aren't surprised when they encounter our API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks for doing it! I just have some minor requests about how to use Registrable
.
CHANGELOG.md
Outdated
@@ -6,6 +6,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), | |||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). | |||
|
|||
## Unreleased | |||
### Added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wlhgtc, can you move this to the "Added" section below?
Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>
We should be using underscores for everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
…5221) * ADD: add from_pretrained method for vocab * MOD: test format * MOD: format file * MOD: update changelog * MOD: fix bug * MOD: fix bug * MOD: fix typo * MOD: make the mothod in class * MOD: fix bug * MOD: change to instance method * MOD: fix typo * MOD: fix bug * MOD: change oov to avoid bug * Update allennlp/data/vocabulary.py * Update allennlp/data/vocabulary.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * Update allennlp/data/vocabulary.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * Update allennlp/data/vocabulary.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * MOD: fix formate * MOD: add test case * Update CHANGELOG.md * MOD: fix worker info bug * ADD: update changelog * MOD: fix format * Update allennlp/data/data_loaders/multitask_data_loader.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * Update CHANGELOG.md Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * MOD: add demo code * MOD: align code * MOD: fix bug * MOD: fix bug * MOD: fix bug * MOD: formate code * Update allennlp/data/data_loaders/data_collator.py Co-authored-by: Pete <epwalsh10@gmail.com> * fix error * MOD: add test code * mod: change tokenizer * mod: fix tokenizer * MOD: fix bug * MOD: fix bug * MOD: fix bug * Update allennlp/data/data_loaders/data_collator.py Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com> * MOD: update changelog * MOD: update change log * Update allennlp/data/data_loaders/data_collator.py We should be using underscores for everything. * Formatting Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> Co-authored-by: Dirk Groeneveld <dirkg@allenai.org> Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>
related #5218