NLP Deep Learning Training on Downstream tasks using Pytorch Lightning — Translation on English to Romanian WMT16 data — Part 7 of 7

Narayana Swamy
6 min read · Jul 23, 2021

This is Part 7 and the final part of the Series. Please go to the Intro article that talks about the motivation for this Series. We will look at the various sections of the Translation task on the WMT16 English to Romanian public data in the Colab Notebook and make appropriate comments for each of the sections.

  • Translation task — A Trained Translation model will be able to translate text in the source language to text in the target language. This will require Sequence to Sequence DL Models instead of the Encoder Models we have dealt with in the past. The Seq2Seq Models are Transformer models that have both an Encoder Transformer and a Decoder Transformer. The Decoder decodes (or maps) the source text’s hidden representation to the target Translated text.
  • Download and Import the Libraries — A library called sacrebleu is installed. The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence. Other than that, the regular Pytorch, transformers and Pytorch Lightning libraries are installed and imported.
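As a quick illustration (not code from the notebook), sacrebleu scores a list of hypothesis sentences against one or more reference streams:

import sacrebleu

# One hypothesis and one reference stream (a list containing one list of references).
hypotheses = ["Componenţa Parlamentului: a se vedea procesul-verbal"]
references = [["Componenţa Parlamentului: a se vedea procesul-verbal"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 100.0 for an exact match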
  • Download the Data — The WMT16 English-Romanian Translation dataset is pre-processed and available for download from the Transformers Datasets Library, so it is used here instead of pulling the data from the public repository. There are 610,320 Training samples, 1,999 Validation samples and 1,999 Test samples in the dataset. Each row of data looks like below.
{'translation': {'en': 'Membership of Parliament: see Minutes',   'ro': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}
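The download itself is a one-liner with the datasets library; a minimal sketch:

from datasets import load_dataset

# 'ro-en' is the WMT16 configuration name for the Romanian-English pair.
dataset = load_dataset("wmt16", "ro-en")
print(dataset)              # train: 610,320 rows; validation and test: 1,999 rows each
print(dataset["train"][0])  # {'translation': {'en': ..., 'ro': ...}}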
  • Define the Pre-Trained Model — The Pre-Trained Model used here is the MarianModel. Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is currently the engine behind the Microsoft Translator Neural Machine Translation services. Since Marian models are smaller than many other translation models available in the library, they can be useful for fine-tuning experiments and integration tests. There are lots of pre-trained models available in the Transformers model hub, and all the pre-trained model names use the format Helsinki-NLP/opus-mt-{src}-{tgt}. In this Notebook, we fine-tune Helsinki-NLP/opus-mt-en-ro on the WMT16 English to Romanian dataset. Initially I had chosen the English to German dataset, but that has 4.5 million training rows and cannot be trained on the free Colab account due to the huge size of the training data — the notebook will time out during Training.
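A sketch of pulling this checkpoint and its tokenizer with transformers (the notebook builds its own head on top of the base MarianModel, as described further below; the ready-made MarianMTModel is loaded here only for illustration):

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-ro"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)  # Marian encoder-decoder with a translation head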
  • Define the Pre-Process function or Dataset Class — Here we define the Pre-Process function that creates the train, val and test data in the Dataset format needed by the DataLoader (Pytorch uses a DataLoader class to build the data into mini-batches). The data is tokenized in this function using the pre-trained tokenizer: the source-language ‘en’ text and the target-language ‘ro’ text are tokenized separately, with the ‘ro’ text tokenized using the target tokenizer of the tokenizer class.
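A minimal sketch of such a pre-process step, assuming the tokenizer’s as_target_tokenizer context manager and an illustrative max_length of 128 (names and defaults here are assumptions, not the notebook’s exact code):

# Illustrative pre-process function: tokenize the 'en' source and 'ro' target separately.
def preprocess(examples, tokenizer, max_length=128):
    sources = [ex["en"] for ex in examples["translation"]]
    targets = [ex["ro"] for ex in examples["translation"]]

    model_inputs = tokenizer(sources, max_length=max_length, truncation=True)

    # The target text is tokenized in target-tokenizer mode.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs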
  • Define the DataModule Class — This is a Pytorch Lightning defined Class that contains all the code necessary to prepare the mini-batches of the data using the DataLoaders. At the start of the training, the Trainer class calls the prepare_data and setup functions first. There is a collate function here that pads each mini-batch only to the length of the longest sequence within that mini-batch, which gives faster training and lower memory usage. The ‘labels’ key needs to be padded manually as the tokenizer.pad method will not pad the target column. A ‘decoder_input_ids’ column is generated using the model’s ‘prepare_decoder_input_ids_from_labels’ method if the model has such a method; the Decoder expects ‘decoder_input_ids’ — similar to the tokenized ‘input_ids’ column for the encoder. If the model doesn’t have this method, the ‘decoder_input_ids’ is generated by right-shifting the labels column. The ‘decoder_input_ids’ is used for teacher forcing in the training of the decoder — giving the decoder information about the past tokens of the target text to help the decoder train faster.
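A hedged sketch of such a collate function — padding the labels with -100 so that the padded positions can be ignored by the loss is an assumption of this sketch:

import torch

# Illustrative collate function: dynamic padding to the longest sequence in the mini-batch.
def collate_fn(batch, tokenizer, model):
    labels = [example.pop("labels") for example in batch]

    # tokenizer.pad handles 'input_ids' and 'attention_mask' but not the target column.
    padded = tokenizer.pad(batch, padding=True, return_tensors="pt")

    # Pad the labels manually to the longest label length in the mini-batch.
    max_len = max(len(l) for l in labels)
    padded["labels"] = torch.tensor([l + [-100] * (max_len - len(l)) for l in labels])

    # Teacher forcing: right-shift the labels into decoder_input_ids if the model supports it.
    if hasattr(model, "prepare_decoder_input_ids_from_labels"):
        padded["decoder_input_ids"] = model.prepare_decoder_input_ids_from_labels(
            labels=padded["labels"]
        )
    return padded

In the DataModule, this would be wired into each DataLoader via its collate_fn argument (for example with functools.partial to bind the tokenizer and model).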
  • Define the Model Class — the forward function of the DL Model is defined here. The ‘input_ids’, ‘attention_mask’ and ‘decoder_input_ids’ are given as inputs to the model. The ‘decoder_input_ids’ is the same size as the Target Labels during Training — it is generated by right-shifting the target labels. During Inference, it is generated by right-shifting the tokens that have already been predicted; remember Inference is done token by token by the Model in a while loop (in greedy search or beam search methods). During Inference, the encoder_outputs are generated by sending the Input tokens through the encoder method of the model. Since Inference is done token by token, the pre-generated encoder_outputs save compute time during Inference. A linear layer is applied to the decoder outputs to generate logits with a shape of decoder_input_ids length × the vocab size of the pre-trained model. The forward function is taken from the MarianMTModel of HuggingFace, which has the Translation logits head on top of the base MarianModel. It is not clear to me what the final_logits_bias variable does — it is initialized to zeros and I couldn’t find where it gets changed during Training.
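A rough sketch of that forward pass, assuming the base MarianModel is held in self.model and the linear head in self.lm_head (class and attribute names here are illustrative):

import torch.nn as nn

class TranslationModel(nn.Module):
    def __init__(self, marian_base, vocab_size):
        super().__init__()
        self.model = marian_base                              # base MarianModel (encoder + decoder)
        self.lm_head = nn.Linear(marian_base.config.d_model, vocab_size, bias=False)

    def forward(self, input_ids, attention_mask, decoder_input_ids, encoder_outputs=None):
        # During Inference, pre-computed encoder_outputs can be reused across decoding steps.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            encoder_outputs=encoder_outputs,
        )
        # Project the decoder hidden states to vocabulary-sized logits.
        return self.lm_head(outputs.last_hidden_state)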
  • Define the Pytorch Lightning Module Class — This is where the training, validation and test step functions are defined, along with the optimizers and schedulers. The model loss and accuracy are calculated in the step functions. The loss is calculated using a regular CrossEntropyLoss, while the validation and test metrics are calculated using the BLEU metric package in the compute_metrics functions.
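A hedged sketch of the LightningModule around the model above — the learning rate, optimizer choice and ignore_index are assumptions of this sketch, not the notebook’s exact settings:

import pytorch_lightning as pl
import torch
import torch.nn as nn

class TranslationLightningModule(pl.LightningModule):
    def __init__(self, model, lr=2e-5):
        super().__init__()
        self.model = model                                      # e.g. the TranslationModel sketched above
        self.lr = lr
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # ignore the padded label positions

    def training_step(self, batch, batch_idx):
        logits = self.model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            decoder_input_ids=batch["decoder_input_ids"],
        )
        # Flatten (batch, seq_len, vocab) logits and (batch, seq_len) labels for the loss.
        loss = self.loss_fn(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

The validation and test steps would additionally decode the predictions and score them with sacrebleu in the compute_metrics functions.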
  • Define the Trainer Parameters — All the required Trainer parameters and Trainer callbacks are defined here. We have defined 3 different callbacks — EarlyStopping, LearningRateMonitor and Checkpointing. Instead of using argparse to define the parameters, the latest Pytorch Lightning update allows the parameters to be defined in a .yaml file that can be provided as an argument to a python .py file in a CLI run; this way the Trainer parameters can be maintained separately from the Training code. Since we are using a Colab Notebook for demo purposes, we stick with the argparse way.
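A sketch of the three callbacks and the Trainer wiring — the monitored metric, patience and epoch count are assumptions:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor="val_loss", patience=3, mode="min"),      # assumed monitor and patience
    LearningRateMonitor(logging_interval="step"),
    ModelCheckpoint(monitor="val_loss", save_top_k=1, mode="min"),
]

trainer = pl.Trainer(
    gpus=1,            # single Colab GPU
    max_epochs=1,
    callbacks=callbacks,
)
# trainer.fit(lightning_module, datamodule=data_module)   # as described in the next section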
  • Train the Model — This is done using the Trainer.fit() method. A profiler can be defined in the Trainer parameters to give more information on the Training run timings. It was taking 2 hrs 50 mins to Train the model on the entire dataset, and Colab sometimes times out the GPU session in an unpredictable way, so the model was trained on 50% of the training dataset to finish the Training in 1 hr 24 mins.
  • Evaluate Model Performance — After 1 Epoch of Training, we get the Test metrics below.
{'test_bleu': 11.841598510742188,
 'test_genlen': 67.36799621582031,
 'test_loss': 2.1486706733703613}
  • These are not strong scores and will need training for additional Epochs. The Train loss was lower at 1.16, so the model has started overfitting. The example notebook of Transformers using their Trainer shows a Training loss of 0.74 after 1 Epoch with a running time of 1 hr 20 mins — this should be possible using 100% of the Training dataset. The running time of the Transformers notebook is lower than that of this notebook using Pytorch Lightning. Is the Lightning framework introducing delays in the Training step? I am not sure about that and have to research it.
  • The SOTA score of 35.3 BLEU was achieved on the Test data using a pre-trained Cross-Lingual Language model without any additional training data — it is not clear what type of Transformer model they used. Larger models like T5-base could be tried outside of the free Colab version to get closer to the SOTA scores.
  • Run Inference on the Trained Model — A sample Inference code is shown here. Inference using a Seq2Seq model is more involved than inference using an Encoder model: the output is generated token by token and limited by the maximum-length parameter. There are a few techniques to generate the output text, such as greedy search, beam search, sampling and top-k sampling. The code for greedy search and beam search is shown in the notebook — it took quite a number of hours to understand and unravel the code from the transformers generation utils.
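For comparison, the ready-made MarianMTModel from the earlier loading sketch wraps the same greedy/beam search logic behind the high-level generate API, which the notebook unpacks explicitly. A minimal sketch with an assumed beam width of 4:

# Minimal inference sketch using the transformers generate API.
text = "Membership of Parliament: see Minutes"
inputs = tokenizer(text, return_tensors="pt")

generated_ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))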
  • The Inference is not bad even at this level of loss. The Beam search almost got it right. The Input was the English sentence and the output was the Romanian translated sentence.
  • TensorBoard Logs Data — This will open TensorBoard within the Colab notebook and let you look at the various TensorBoard logs. Pytorch Lightning logs to TensorBoard by default, and this can be changed by passing a different Logger to the Trainer.
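In Colab, the logs can be displayed inline with the TensorBoard notebook extension; the default Lightning log directory is lightning_logs:

# Load the TensorBoard notebook extension and point it at the Lightning log directory.
%load_ext tensorboard
%tensorboard --logdir lightning_logs/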

