Author: Raj Patel, Machine Translation Scientist @ Iconic
In general, Neural MT weights are randomly initialized and then trained on a parallel corpus. In this post, we will discuss a Neural MT architecture proposed by Garg et al. (2020), based on the echo state network (ESN) and named echo state neural machine translation (ESNMT). In ESNMT, the encoder-decoder weights of the model are randomly generated and then kept fixed throughout the training process.
Echo State Neural MT
ESN (Jaeger, 2010) is a special type of recurrent neural network (RNN) in which the recurrent matrix (known as the "reservoir") and the input transformation are randomly generated and then fixed; the only trainable component is the output layer (known as the "readout"). A very basic ESN has the following formulation: the hidden state is updated as h_t = tanh(W_res·h_{t-1} + W_in·x_t), and the output is y_t = W_out·h_t, where the reservoir W_res and the input transformation W_in are random and fixed, and only the readout W_out is trained.
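As a concrete illustration, here is a minimal NumPy sketch of this basic ESN. The dimensions and weight scales are hypothetical choices for illustration; the paper's actual encoder-decoder is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions for illustration
d_in, d_h, d_out = 4, 16, 3

# Randomly generated and then fixed: input transformation and reservoir
W_in = rng.normal(scale=0.5, size=(d_h, d_in))   # input transformation (fixed)
W_res = rng.normal(scale=0.5, size=(d_h, d_h))   # recurrent "reservoir" (fixed)

# The only trainable component: the readout layer (initialized to zero here)
W_out = np.zeros((d_out, d_h))

def esn_step(h, x):
    """One recurrent update: h_t = tanh(W_res h_{t-1} + W_in x_t)."""
    return np.tanh(W_res @ h + W_in @ x)

def readout(h):
    """Output y_t = W_out h_t -- the only part that is trained."""
    return W_out @ h

# Drive the fixed reservoir with a short random input sequence
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = esn_step(h, x)
y = readout(h)
```

In practice the reservoir is usually rescaled so its spectral radius is below 1, which gives the network the "echo state" property (past inputs fade from the state over time).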
Training: The trainable parameters are optimized with gradient descent as usual. Note that since the recurrent layer weights are fixed and no gradient is computed for them, the exploding/vanishing gradient problem of RNNs is alleviated. LSTM gating was designed largely to combat exactly this problem, so we expect no significant difference in quality between ESNMT and ESNMT-LSTM.
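To make the "only the readout is trained" idea concrete, here is a hedged toy sketch in NumPy: a fixed reservoir is driven by an input sequence and plain gradient descent on a squared-error loss updates only the readout matrix. The task, dimensions, and learning rate are all illustrative assumptions, not the paper's setup (which trains the readout on translation with the usual NMT loss).

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, d_out = 4, 16, 2  # hypothetical toy dimensions

# Fixed random components (never updated during training)
W_in = rng.normal(scale=0.5, size=(d_h, d_in))
W_res = rng.normal(scale=0.5, size=(d_h, d_h))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # spectral radius < 1

def run_reservoir(xs):
    """Drive the fixed reservoir with a sequence; return the final state."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(W_res @ h + W_in @ x)
    return h

# Toy regression task: map the final reservoir state to a random target
xs = rng.normal(size=(10, d_in))
target = rng.normal(size=d_out)

h = run_reservoir(xs)           # fixed weights, so the state is computed once
W_out = np.zeros((d_out, d_h))  # the only trainable parameter
lr = 0.05
for _ in range(500):
    y = W_out @ h
    W_out -= lr * np.outer(y - target, h)  # squared-error gradient, readout only
```

Because no gradient ever flows back through the recurrent matrix, there is nothing to explode or vanish over time steps, which is the intuition behind expecting simple RNN and LSTM cells to perform similarly here.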
Model Compression: Since the randomized components of ESNMT can be deterministically regenerated from a fixed random seed, storing the model offline only requires the seed together with the remaining trainable parameters.
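A small sketch of this compression trick, under the assumption that the fixed weights are drawn from a seeded pseudo-random generator (the function and dimensions below are hypothetical):

```python
import numpy as np

d_in, d_h = 4, 16
SEED = 42  # storing this one integer replaces storing the random matrices

def build_fixed_weights(seed):
    """Deterministically regenerate the fixed random components from a seed."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.5, size=(d_h, d_in))
    W_res = rng.normal(scale=0.5, size=(d_h, d_h))
    return W_in, W_res

# "Saving" the model: keep only the seed and the trained readout weights.
W_in_a, W_res_a = build_fixed_weights(SEED)

# "Loading" the model: regenerating from the same seed gives identical weights.
W_in_b, W_res_b = build_fixed_weights(SEED)
```

Only the readout parameters (a small fraction of the full model) plus one seed need to be written to disk; the encoder-decoder weights are rebuilt bit-for-bit at load time.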
Experimental Setup and Results
The authors evaluated the proposed ESNMT architecture on the WMT'14 English→French, English→German and WMT'16 English→Romanian datasets. The baseline model is fully trainable. Table 1 compares BLEU scores for all language pairs across the different models. The results show that ESNMT reaches 70-80% of the BLEU scores yielded by fully trainable baselines across all settings. Moreover, using LSTM cells yields roughly the same performance as simple RNN cells. This supports the hypothesis that an LSTM cell offers no particular advantage over a simple RNN cell in the ESN setting.
In summary, ESNMT achieves 70-80% of the BLEU of fully trainable models despite its encoder-decoder weights never being trained. These surprising findings encourage a rethink of the nature of encoding and decoding in NMT, and the design of potentially more economical model architectures and training procedures.