Issue #82 – Constrained Decoding using Levenshtein Transformer
Author: Raj Patel, Machine Translation Scientist @ Iconic
In constrained decoding, we force in-domain terminology to appear in the final translation. We have discussed constrained decoding in earlier blog posts (#7, #9, #79). In this post, we look at a simple and effective algorithm for incorporating lexical constraints in Neural Machine Translation (NMT), proposed by Susanto et al. (2020), and try to understand how it improves on existing techniques.
Levenshtein Transformer (LevT)
The Levenshtein Transformer is based on an encoder-decoder framework using Transformer blocks. Unlike token generation in a typical Transformer model, the LevT decoder is based on a Markov Decision Process (MDP) that iteratively refines the generated tokens with a sequence of insertion and deletion operations. The deletion and insertion operations are performed via three classifiers that run sequentially:
- Deletion Classifier, which predicts for each token position whether the token should be “kept” or “deleted”,
- Placeholder Classifier, which predicts the number of tokens to be inserted between every two consecutive tokens and then inserts the corresponding number of placeholder [PLH] tokens,
- Token Classifier, which predicts for each [PLH] token an actual target token.
The above predictions are conditioned on the source text and the current target text. Decoding stops when the current target text does not change, or a maximum number of refinement iterations has been reached.
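The refinement loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' or FAIRSEQ's implementation: the function name `refine`, the three classifier callables, and the toy classifiers are all hypothetical stand-ins for the trained networks, and protecting the boundary tokens from deletion is an assumption of this sketch.

```python
from typing import Callable, List

BOS, EOS, PLH = "<s>", "</s>", "[PLH]"
Tokens = List[str]


def refine(
    source: Tokens,
    target: Tokens,
    predict_deletions: Callable[[Tokens, Tokens], List[bool]],
    predict_placeholders: Callable[[Tokens, Tokens], List[int]],
    predict_tokens: Callable[[Tokens, Tokens], Tokens],
    max_iters: int = 10,
) -> Tokens:
    """Run deletion -> placeholder -> token steps until nothing changes."""
    for _ in range(max_iters):
        previous = list(target)
        # 1. Deletion: one keep/delete decision per position.
        #    Assumption: boundary tokens are never deleted.
        keep = predict_deletions(source, target)
        target = [t for t, k in zip(target, keep) if k or t in (BOS, EOS)]
        # 2. Placeholder: one count per gap between consecutive tokens.
        counts = predict_placeholders(source, target)
        with_plh = [target[0]]
        for count, tok in zip(counts, target[1:]):
            with_plh.extend([PLH] * count)
            with_plh.append(tok)
        # 3. Token: one predicted token per [PLH], in left-to-right order.
        fills = iter(predict_tokens(source, with_plh))
        target = [next(fills) if t == PLH else t for t in with_plh]
        # Stop early once an iteration leaves the target unchanged.
        if target == previous:
            break
    return target


# Toy classifiers (hypothetical) that steer towards a fixed reference.
REFERENCE = ["hallo", "welt"]


def toy_deletions(source: Tokens, target: Tokens) -> List[bool]:
    return [tok in REFERENCE for tok in target]


def toy_placeholders(source: Tokens, target: Tokens) -> List[int]:
    counts = [0] * (len(target) - 1)
    missing = len(REFERENCE) - len(target[1:-1])
    counts[-1] = max(missing, 0)  # open all missing slots before </s>
    return counts


def toy_tokens(source: Tokens, target_with_plh: Tokens) -> Tokens:
    known = [t for t in target_with_plh[1:-1] if t != PLH]
    n_plh = target_with_plh.count(PLH)
    return REFERENCE[len(known):len(known) + n_plh]


result = refine(["hello", "world"], [BOS, EOS],
                toy_deletions, toy_placeholders, toy_tokens)
print(result)
```

In a real LevT model each callable would be a classifier head on top of the Transformer decoder, conditioned on the encoded source; here they are plain functions so the control flow is easy to follow.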
Incorporating Lexical Constraints
Previous approaches integrated lexical constraints in NMT via either constrained training or constrained decoding. The proposed method injects terminology constraints at inference time without any impact on decoding speed. It also requires no modification to the training procedure and can easily be applied at run-time with custom dictionaries.
For sequence generation, the LevT decoder typically starts the first iteration of the decoding process with only the sentence boundary tokens, y0 = <s></s>. To incorporate lexical constraints, we populate the y0 sequence with the target constraints before the first deletion operation, as shown in Figure 1. The initial target sequence then passes through the deletion, placeholder, and token classifiers sequentially, and the modified sequence is refined over several iterations.
We note that this step happens only at inference; during training, the original LevT training routine is carried out without the constraint insertion. We refer readers to Gu et al. (2019) for a detailed description of the LevT model and training routine.
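The constraint insertion described above amounts to changing only the initial sequence y0. A minimal sketch, assuming constraints arrive as already tokenized target phrases and are placed in the order given (the helper name `initial_target` and this ordering are assumptions, not taken from the paper or from FAIRSEQ):

```python
from typing import List

BOS, EOS = "<s>", "</s>"


def initial_target(constraints: List[List[str]]) -> List[str]:
    """Build y0: boundary tokens with the constraint tokens in between."""
    y0 = [BOS]
    for phrase in constraints:
        # Each constraint is a list of target-side tokens (assumed pre-tokenized).
        y0.extend(phrase)
    y0.append(EOS)
    return y0


# Without constraints we recover the usual LevT starting point.
print(initial_target([]))
# With constraints, decoding starts from a sequence that already contains them.
print(initial_target([["Haus"], ["grüner", "Baum"]]))
```

The resulting sequence is then handed to the usual refinement iterations, so nothing in the training routine needs to change.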
Is it effective?
The authors evaluated the proposed method on the WMT’14 English-German (EN-DE) news translation task using EN-DE dictionaries extracted from Wiktionary and IATE. They evaluated the systems using BLEU and term usage rate (Term%), defined as the number of constraints generated in the output divided by the total number of given constraints. They compared the proposed approach with the current best methods of constrained decoding (Post and Vilar, 2018; Dinu et al., 2019). As shown in the results table, despite starting from a stronger baseline, the proposed method obtained higher absolute BLEU score improvements (0.96 and 1.16 BLEU on Wiktionary and IATE, respectively) and achieved 100% term usage.
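The Term% metric defined above is straightforward to compute. The sketch below is only an illustration: the function name `term_usage_rate` is hypothetical, and matching a constraint by plain substring search in the output string is a simplifying assumption (the paper's exact matching procedure is not described here).

```python
from typing import List


def term_usage_rate(outputs: List[str], constraints: List[List[str]]) -> float:
    """Percentage of given constraints that appear in the matching output."""
    total = sum(len(terms) for terms in constraints)
    if total == 0:
        return 0.0
    # Assumption: a constraint counts as used if it occurs as a substring.
    used = sum(
        1
        for output, terms in zip(outputs, constraints)
        for term in terms
        if term in output
    )
    return 100.0 * used / total


outputs = ["das ist ein Haus", "ein grüner Baum"]
constraints = [["Haus"], ["Baum", "Wald"]]
# Two of the three constraints appear in the outputs.
print(round(term_usage_rate(outputs, constraints), 2))
```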
Unlike existing methods, where the constraints are imposed after hypothesis generation, the proposed method starts the decoding process from the terminology constraints themselves. It gives control over constraint terms in the target translation while decoding as fast as a baseline Levenshtein Transformer model, which itself achieves significantly higher decoding speed than traditional beam search. As the implementation is available in FAIRSEQ, it is convenient to use in projects where we need to enforce a client’s terminology.