
Guided alignments à la Garg et al. #1054

Closed
tomsbergmanis opened this issue Jul 26, 2022 · 7 comments
Comments

@tomsbergmanis

Hi all,
I just finished reading the Sockeye 3 paper. Nicely done, congratulations!
Have you considered implementing guided alignments [1] in Sockeye 3?
They are handy for formatted-document translation, handling non-translatable entities and placeholders, and variations of automatic post-editing. Marian and Fairseq already offer this feature, but both have their own limitations, especially compared to the latest version of Sockeye.

Are there any plans for development in this direction?

[1] Jointly Learning to Align and Translate with Transformer Models
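
For concreteness, the objective in [1] supervises one cross-attention head with a cross-entropy loss against external alignments. A minimal sketch of that loss (shapes and names are illustrative, not any library's actual API):

```python
import torch

def guided_alignment_loss(attn_probs: torch.Tensor,
                          alignment: torch.Tensor,
                          eps: float = 1e-9) -> torch.Tensor:
    """Cross-entropy between one attention head and a 0/1 alignment matrix.

    attn_probs: (tgt_len, src_len) attention distribution per target token.
    alignment:  (tgt_len, src_len) binary matrix, alignment[t, s] = 1 if
                target token t is aligned to source token s.
    """
    # Normalize rows so multiply-aligned target tokens share probability mass.
    row_sums = alignment.sum(dim=-1, keepdim=True).clamp(min=1.0)
    target_dist = alignment / row_sums
    # Negative log-likelihood of the supervised head matching the alignment.
    loss = -(target_dist * torch.log(attn_probs + eps)).sum(dim=-1)
    # Average only over target tokens that have at least one alignment point.
    aligned = (alignment.sum(dim=-1) > 0).float()
    return (loss * aligned).sum() / aligned.sum().clamp(min=1.0)
```

In training this term would be added, with a weight, to the usual cross-entropy over target tokens.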

@mjdenkowski
Contributor

That's an interesting idea!

I'm not aware of any current work to implement this in Sockeye.

The paper links to a PyTorch/Fairseq implementation that could potentially be ported.

If you or your colleagues are interested in working on this, we would welcome a pull request.

@tomsbergmanis
Author

Cheers!
I will let you know if anything is set in motion!

@tomsbergmanis
Author

Hi @mjdenkowski !
An update on guided alignments: this week, my colleague and I started implementing them in Sockeye as part of our Machine Translation Marathon project in Prague.

However, we have now arrived at the point where we need to figure out the best way of fitting alignment data into Sockeye's data handling workflows. This turns out to be a bit of a challenge: the source and target data files are assumed to be token-parallel, so the packing code can take advantage of identical sequence lengths. The alignment data, however, has a variable length of between 0 and NxM source-to-target alignment entries per sentence pair. Accommodating it therefore seems to require changes to all of the code for preparing, loading, saving, batching, and sharding data: wherever code previously handled source and target data streams, there will now be a third stream for alignment data.

Before we dive into doing that: maybe you have some ideas for a more elegant solution?
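
For context, one direction we have considered is a CSR-style packing: store all alignment pairs in one flat array plus per-sentence offsets, so the variable-length alignments can sit alongside the fixed-length source/target streams. A rough sketch (helper names are ours, nothing from Sockeye):

```python
import numpy as np

def pack_alignments(alignments):
    """Pack variable-length alignment lists into a flat (P, 2) array plus
    per-sentence offsets (CSR-style), so they can be saved and sharded
    alongside token-parallel source/target data.

    alignments: list of lists of (src_idx, tgt_idx) pairs, possibly empty.
    """
    offsets = np.zeros(len(alignments) + 1, dtype=np.int64)
    pairs = []
    for i, sent_pairs in enumerate(alignments):
        pairs.extend(sent_pairs)
        offsets[i + 1] = len(pairs)
    flat = np.asarray(pairs, dtype=np.int64).reshape(-1, 2)
    return flat, offsets

def unpack_alignment(flat, offsets, i):
    """Recover the alignment pairs for sentence i."""
    return [tuple(p) for p in flat[offsets[i]:offsets[i + 1]]]
```

Batching would then slice the flat array by offsets rather than assuming one entry per token.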

@mjdenkowski
Contributor

Hi Toms,

That sounds like a great MT Marathon project!

The scenario you're describing is similar to something we're working on in a branch: adding support for sequence-level metadata. In addition to source and target sequences, each training example can include a metadata dictionary. Entries encode pairs of identifiers (tags, feature names, etc.) and weights as described by Schioppa et al. (2021). Any example can have any number of dictionary entries regardless of the source or target sequence length.

The branch currently supports preparing data and running training with additional metadata files. During training, the metadata entries for each sequence are available as part of each batch. They are not yet used for anything. One option for adding alignment support would be following the changes between the current main and metadata branches. I recommend forking the metadata branch from commit 4ee4d01. Running git diff 7caa6b9 4ee4d01 will show the changes from main. The main pieces are:

  • Making many functions/classes aware of optional metadata in addition to source and target data. This should be similar for alignments.
  • Adding a MetadataReader class that reads JSON dictionary inputs. An AlignmentReader class could read alignment lines and wouldn't require a vocabulary.
  • Adding a MetadataBucket class that stores sequences of different lengths in a packed format and provides methods for different data operations (getting batches, permuting the data, etc.). An AlignmentBucket class could use a similar approach. With some refactoring, parts of MetadataBucket and AlignmentBucket could be shared.
  • Saving/loading prepared data using a new dictionary-based format. A new key could be added for alignments.
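
To illustrate the reader bullet, an AlignmentReader could be as simple as the following sketch, assuming the common fast_align/GIZA++ "i-j" text format, one sentence per line (the class name and details here are hypothetical, not code from the branch):

```python
class AlignmentReader:
    """Hypothetical reader for alignment files in the fast_align text
    format, e.g. "0-0 1-2 2-1" per line. Unlike a token reader, it needs
    no vocabulary: the integer indices are used directly.
    """

    def __init__(self, path: str):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                pairs = []
                for token in line.split():
                    src, tgt = token.split("-")
                    pairs.append((int(src), int(tgt)))
                # An empty line yields an empty alignment.
                yield pairs
```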

Most of the changes are bookkeeping and no individual step should be too difficult. Feel free to follow up if you have any more questions.

Best,
Michael

@tomsbergmanis
Author

Thanks, Michael!

@mjdenkowski
Contributor

As of commit 26d689f, the metadata branch supports training models that add metadata embeddings to encoder representations. This includes passing optional metadata tensors to SockeyeModel.forward with zero-size tensors as default values. A similar approach could be used for optional alignment tensors.
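
A toy sketch of that zero-size-tensor convention (module and method names are made up, not Sockeye's actual API): the forward signature stays fixed, and callers without alignments simply pass an empty tensor.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Illustrative only: an optional input modeled as a default
    zero-size tensor, keeping forward's signature fixed for tracing
    instead of using Optional[Tensor]."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self,
                source: torch.Tensor,
                alignments: torch.Tensor = torch.empty(0)) -> torch.Tensor:
        out = self.proj(source)
        if alignments.numel() > 0:
            # Alignment tensor present: use it (here, a per-row scale
            # stands in for whatever the real computation would be).
            out = out * alignments.unsqueeze(-1)
        return out
```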

If any modules need to check for zero-size tensors on each call (e.g., some batches contain alignment tensors but others don't), they can be scripted before the larger model is traced [1, 2].
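
A minimal sketch of that script-then-trace pattern (illustrative names only): tracing alone would bake in whichever branch the example inputs took, while the scripted submodule keeps its control flow.

```python
import torch
import torch.nn as nn

class MaybeAlign(nn.Module):
    """Submodule with data-dependent control flow: it branches on whether
    the alignment tensor is zero-size."""

    def forward(self, x: torch.Tensor, align: torch.Tensor) -> torch.Tensor:
        if align.numel() > 0:
            return x * align.unsqueeze(-1)
        return x

class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        # Script the branching submodule first; tracing the wrapper later
        # preserves the scripted branch instead of specializing it away.
        self.maybe_align = torch.jit.script(MaybeAlign())

    def forward(self, x: torch.Tensor, align: torch.Tensor) -> torch.Tensor:
        return self.maybe_align(x, align)
```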

@mjdenkowski
Contributor

Closing for inactivity. Please feel free to reopen if there are any updates.
