Guided alignments à la Garg et al. #1054
Comments
That's an interesting idea! I'm not aware of any current work to implement this in Sockeye. The paper links to a PyTorch/Fairseq implementation that could potentially be ported. If you or your colleagues are interested in working on this, we would welcome a pull request.
Cheers!
Hi @mjdenkowski! We have now arrived at the point where we need to figure out the best way of packing alignment data into Sockeye's data handling workflows. This turns out to be a bit of a challenge: the source and target data files are assumed to be token-parallel, so the code can take advantage of identical sequence lengths when packing data. The alignment data, however, has a variable length of anywhere between 0 and N×M source-to-target alignment entries. It therefore seems that accommodating alignment data will require changes to all code that prepares, loads, saves, batches, and shards data. Where the code previously handled source and target data streams, it will now need a third one for alignment data. Before we dive into that - maybe you have some ideas for a more elegant solution?
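To make the batching problem concrete, here is a minimal, framework-free sketch (the function name and pair format are illustrative assumptions, not Sockeye's actual API) of padding variable-length alignment lists to a rectangular batch so they can travel alongside the token-parallel source/target streams:

```python
def pad_alignments(alignments, pad_value=-1):
    """Pad a list of variable-length alignment lists to a rectangular batch.

    Each example is a list of (source_pos, target_pos) pairs; an example
    may carry anywhere from 0 to N*M links, independent of sequence length.
    """
    max_len = max((len(a) for a in alignments), default=0)
    batch = []
    for pairs in alignments:
        # Fill with sentinel pairs so every row has the same length.
        padded = list(pairs) + [(pad_value, pad_value)] * (max_len - len(pairs))
        batch.append(padded)
    return batch

# Example: three sentences with 2, 0, and 3 alignment links respectively.
batch = pad_alignments([[(0, 0), (1, 2)], [], [(0, 1), (1, 0), (2, 2)]])
```

Downstream code would then mask out the sentinel entries, much like padding tokens are masked in the source/target streams.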
Hi Toms, That sounds like a great MT Marathon project! The scenario you're describing is similar to something we're working on in a branch: adding support for sequence-level metadata. In addition to source and target sequences, each training example can include a metadata dictionary. Entries encode pairs of identifiers (tags, feature names, etc.) and weights as described by Schioppa et al. (2021). Any example can have any number of dictionary entries regardless of the source or target sequence length. The branch currently supports preparing data and running training with additional metadata files. During training, the metadata entries for each sequence are available as part of each batch. They are not yet used for anything. One option for adding alignment support would be following the changes between the current
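The metadata mechanism described above might look roughly like the following sketch (the field names and encoding function are assumptions for illustration, not the branch's actual format): each training example carries a dictionary of identifier-to-weight entries whose size is independent of the source or target length.

```python
# One training example with sequence-level metadata in the style of
# Schioppa et al. (2021): tag identifiers paired with weights.
example = {
    "source": "das ist ein Test".split(),
    "target": "this is a test".split(),
    "metadata": {"domain:news": 1.0, "style:formal": 0.5},
}

def encode_metadata(metadata, tag_vocab):
    """Map metadata tags to integer ids, keeping their weights, so a batch
    can carry parallel variable-length (ids, weights) arrays per example."""
    ids = [tag_vocab[tag] for tag in metadata]
    weights = [metadata[tag] for tag in metadata]
    return ids, weights

tag_vocab = {"domain:news": 0, "style:formal": 1}
ids, weights = encode_metadata(example["metadata"], tag_vocab)
```

Since alignment links are also per-example data of variable length, they could plausibly ride along in the same per-example dictionary rather than as a fourth top-level stream.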
Most of the changes are bookkeeping and no individual step should be too difficult. Feel free to follow up if you have any more questions. Best,
Thanks, Michael!
As of commit 26d689f, the […]. If any modules need to check for zero-size tensors on each call (e.g., some batches contain alignment tensors but others don't), they can be scripted before the larger model is traced [1, 2].
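The script-before-trace pattern mentioned above can be sketched as follows (the module names are made up for illustration; only the `torch.jit.script`/`torch.jit.trace` usage reflects the actual PyTorch API). Tracing records a single execution path, so a zero-size check would be baked in; scripting the submodule first preserves its control flow inside the traced model:

```python
import torch

class AlignmentHandler(torch.nn.Module):
    # Data-dependent control flow: tracing would freeze one branch,
    # so this submodule is scripted to keep the zero-size check.
    def forward(self, alignments: torch.Tensor) -> torch.Tensor:
        if alignments.numel() == 0:
            return torch.zeros(1)
        return alignments.float().sum().unsqueeze(0)

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Script the branching submodule before the whole model is traced.
        self.handler = torch.jit.script(AlignmentHandler())

    def forward(self, x):
        return self.handler(x)

# Tracing the larger model inlines the scripted submodule with its
# control flow intact, so both empty and non-empty batches work.
traced = torch.jit.trace(Model(), torch.tensor([[0, 0], [1, 2]]))
```

This mix-and-match of scripting and tracing is the documented TorchScript approach for models that are mostly trace-friendly but contain a few data-dependent branches.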
Closing for inactivity. Please feel free to reopen if there are any updates. |
Hi all,
I just finished reading the Sockeye 3 paper. Nicely done, congratulations!
Have you considered implementing guided alignments [1] in Sockeye 3?
It is handy for formatted document translation, handling of non-translatable entities and placeholders, and variations of automatic post-editing. Marian and Fairseq already have this feature, but they come with their own limitations, especially compared to the latest version of Sockeye.
Are there any plans for development in this direction?
[1] Jointly Learning to Align and Translate with Transformer Models
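For readers unfamiliar with [1], here is a minimal, framework-free sketch of the core idea: one cross-attention head's distribution over source positions is supervised toward an externally supplied alignment via a cross-entropy term. The attention values below are made up for illustration; the real loss operates on a chosen head inside the Transformer decoder.

```python
import math

def guided_alignment_loss(attention, alignment_links):
    """attention: one row per target position, each row a probability
    distribution over source positions. alignment_links: (src, tgt) pairs
    from an external aligner. Returns the mean negative log probability
    the supervised head assigns to the gold links."""
    if not alignment_links:
        return 0.0
    nll = [-math.log(attention[tgt][src]) for src, tgt in alignment_links]
    return sum(nll) / len(nll)

# Toy attention for a 2-token target over a 3-token source.
attention = [
    [0.8, 0.1, 0.1],  # target position 0 attends mostly to source 0
    [0.1, 0.7, 0.2],  # target position 1 attends mostly to source 1
]
loss = guided_alignment_loss(attention, [(0, 0), (1, 1)])
```

During training this term is added to the usual translation cross-entropy with a weighting factor, so the model learns to translate and to expose alignments through the supervised head at the same time.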