NLP Deep Learning Training on Downstream tasks using Pytorch Lightning — Summarization on XSum data — Part 6 of 7

Narayana Swamy
6 min read · Jul 23, 2021

This is Part 6 and a continuation of the Series. Please go to the Intro article that talks about the motivation for this Series. We will look at the various sections of the Abstractive Summarization Colab Notebook on the XSum public data and make appropriate comments on each section.

  • There are two types of Summarization: Extractive and Abstractive. Extractive Summarization summarizes the source text by finding key sentences within the text and can be trained in a way similar to Question and Answer training. Abstractive Summarization generates a summary whose exact sentences won’t be found within the source text. This is a harder problem and requires Sequence-to-Sequence (Seq2Seq) DL models instead of the Encoder-only models we have dealt with so far. Seq2Seq models are Transformer models that have both an Encoder Transformer and a Decoder Transformer. The Decoder decodes (or maps) the source text’s hidden representation to the target text. Seq2Seq models can also be used for the Translation use case, which we will see in Part 7.
  • Download and Import the Libraries — A library called rouge_score is installed; it computes n-gram overlap and the longest common subsequence (LCS) between two pieces of text and provides the ROUGE metrics used to evaluate Summarization predictions. Other than that, the regular Pytorch, transformers and Pytorch Lightning libraries are installed and imported.
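As a quick illustration of what rouge_score computes (a minimal sketch; the two example strings are made up for illustration), the package scores a candidate summary against a reference:

from rouge_score import rouge_scorer

# ROUGE-1/2 use n-gram overlap, ROUGE-L uses the longest common subsequence (LCS)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "A suspicious package left outside an office has been declared a hoax.",  # reference
    "A suspect package outside an office was declared a hoax.")               # prediction
print(scores["rougeL"].fmeasure)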
  • Download the Data — The XSum dataset is pre-processed and available for download from the Hugging Face Datasets library. This dataset is used instead of pulling the data from the public repository. There are 204,045 Training samples, 11,332 Validation samples and 11,334 Test samples in the dataset. Each row of data looks like below; a sketch of loading the dataset with the Datasets library follows the sample row.
{'document': 'Army explosives experts were called out to deal with a suspect package at the offices on the Newtownards Road on Friday night.\nRoads were sealed off and traffic diverted as a controlled explosion was carried out.\nThe premises, used by East Belfast MP Naomi Long, have been targeted a number of times.\nMost recently, petrol bomb attacks were carried out on the offices on consecutive nights in April and May.\nThe attacks began following a Belfast City Council vote in December 2012 restricting the flying of the union flag at the City Hall.\nCondemning the latest hoax, Alliance MLA Chris Lyttle said: "It is a serious incident for the local area, it causes serious disruption, it puts people\'s lives at risk, it can prevent emergency services reaching the area.\n"Ultimately we need people with information to share that with the police in order for them to do their job and bring these people to justice."',  
'id': '28381580',
'summary': 'A suspicious package left outside an Alliance Party office in east Belfast has been declared a hoax.'}
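Loading the pre-processed dataset from the Datasets library looks roughly like the sketch below:

from datasets import load_dataset

# downloads and caches the XSum train / validation / test splits
raw_datasets = load_dataset("xsum")
print(raw_datasets)                # shows the split sizes
print(raw_datasets["train"][0])    # one row: document, summary, id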
  • Define the Pre-Trained Model — The Pre-Trained Model used here is the T5-small Model. T5 (Text-to-Text Transfer Transformer) is Google’s state-of-the-art Seq2Seq model. Once you have trained successfully with this, other pre-trained models can be tried by changing the model_checkpoint variable. Some of the other Seq2Seq models are MBART (Translation) and MarianMT (Translation). The T5 model can perform 8 different categories of tasks (like summarization, translation, mnli, stsb, cola etc.) and needs the input properly prefixed so it can identify the task at hand. For the Summarization task, we specify the prefix “summarize:” before each input text.
  • Define the Pre-Process function or Dataset Class — Here we define the Pre-Process function that creates the train, val and test data in the Dataset format needed by the DataLoader. Pytorch uses a DataLoader class to build the data into mini-batches. The data is tokenized in this function using the pre-trained tokenizer. The document text and the output summary text are tokenized separately; the summary text is tokenized inside the tokenizer’s as_target_tokenizer context so it is treated as the target sequence.
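A minimal sketch of such a pre-processing function, assuming the T5 “summarize: ” prefix and the tokenizer’s as_target_tokenizer context (the max_input_length and max_target_length values are illustrative, not the notebook’s exact settings):

from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
max_input_length, max_target_length = 512, 64   # illustrative limits

def preprocess_function(examples):
    # T5 needs a task prefix so it knows which task to perform
    inputs = ["summarize: " + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # tokenize the summaries as the target sequence
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)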
  • Define the DataModule Class — This is a Pytorch Lightning defined Class that contains all the code necessary to prepare the mini-batches of the data using the DataLoaders. At the start of the training, the Trainer class will call the prepare_data and setup functions first. There is a collate function here that does the padding of the mini-batches. The collate function pads the input data of a mini-batch only to the longest sequence length within that mini-batch, which provides for faster training and less memory usage. The ‘labels’ key needs to be padded manually as the tokenizer.pad method will not pad the target column. A ‘decoder_input_ids’ column is generated using the model’s ‘prepare_decoder_input_ids_from_labels’ method; the Decoder expects ‘decoder_input_ids’ just as the Encoder expects the tokenized ‘input_ids’ column. If the model doesn’t have this method, the ‘decoder_input_ids’ is generated by right-shifting the labels column. The ‘decoder_input_ids’ is used for teacher forcing in the training of the decoder: it gives the decoder information about the past tokens of the summary text to help the decoder train faster.
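A sketch of such a collate function, assuming the labels are padded manually with -100 (so the padded positions are ignored by the loss) and that the model exposes prepare_decoder_input_ids_from_labels; the function name collate_fn is illustrative:

import torch

def collate_fn(batch, tokenizer, model):
    # pull the labels out so tokenizer.pad only sees input_ids / attention_mask
    labels = [example.pop("labels") for example in batch]
    # pad the inputs to the longest sequence within this mini-batch
    features = tokenizer.pad(batch, padding="longest", return_tensors="pt")

    # the tokenizer will not pad the target column, so pad the labels manually with -100
    max_len = max(len(l) for l in labels)
    features["labels"] = torch.tensor(
        [l + [-100] * (max_len - len(l)) for l in labels])

    # decoder_input_ids: a right-shifted copy of the labels, used for teacher forcing
    if hasattr(model, "prepare_decoder_input_ids_from_labels"):
        features["decoder_input_ids"] = model.prepare_decoder_input_ids_from_labels(
            labels=features["labels"])
    return features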
  • Define the Model Class — The forward function of the DL Model is defined here. The ‘input_ids’, ‘attention_mask’ and ‘decoder_input_ids’ are given as inputs to the model. The ‘decoder_input_ids’ has the same length as the labels. During inference, the encoder_outputs are also given. A linear layer is applied to the decoder outputs to generate logits with a shape of (decoder sequence length) x 32,128 (the vocab size of the pre-trained model).
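Conceptually, the forward pass through the wrapped Seq2Seq model looks roughly like this (a sketch using the Hugging Face model class rather than the notebook’s exact wrapper):

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def forward(batch):
    # the Encoder consumes the document tokens, the Decoder consumes decoder_input_ids
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    decoder_input_ids=batch["decoder_input_ids"])
    # logits shape: (batch size, decoder sequence length, vocab size of 32,128)
    return outputs.logits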
  • Define the Pytorch Lightning Module Class — This is where the training, validation and test step functions are defined. The model loss and accuracy are calculated in the step functions. The optimizers and schedulers are defined here as well. The loss is calculated using a regular CrossEntropyLoss. The validation and test metrics are calculated using the Rouge metric package in the compute_metrics function. The Rouge package reports a few different scores such as Rouge1, Rouge2, RougeL and RougeLsum; the generated summary length (gen_len) is also logged.
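A minimal sketch of the loss and metric calculations, assuming the padded label positions were set to -100 so CrossEntropyLoss ignores them (decoded_preds and decoded_labels stand for the de-tokenized generated and reference summaries):

import torch.nn as nn
from datasets import load_metric

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

def step_loss(logits, labels):
    # flatten (batch, seq_len, vocab) -> (batch * seq_len, vocab) for CrossEntropyLoss
    return loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

rouge = load_metric("rouge")
# result = rouge.compute(predictions=decoded_preds, references=decoded_labels,
#                        use_stemmer=True)   # rouge1 / rouge2 / rougeL / rougeLsum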
  • Define the Trainer Parameters — All the required Trainer parameters and Trainer callbacks are defined here. We have defined three callbacks: EarlyStopping, LearningRateMonitor and ModelCheckpoint. Instead of using argparse to define the parameters, the latest Pytorch Lightning update allows the parameters to be defined in a .yaml file; this .yaml file can be provided as an argument to a python .py file in a CLI run. That way the Trainer parameters can be maintained separately from the training code. Since we are using a Colab Notebook for demo purposes, we stick with the argparse way.
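The three callbacks can be set up roughly as below (the monitored metric name "val_loss" and the patience value are illustrative):

from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor="val_loss", mode="min", patience=2),
    LearningRateMonitor(logging_interval="step"),
    ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1),
]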
  • Train the Model — This is done using the Trainer.fit() method. A profiler can be defined in the Trainer parameters to give more information on the Training run timings.
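Putting it together, the Trainer is constructed with the callbacks and an optional profiler, and fit is called with the LightningModule and DataModule (the variable names lightning_module and data_module are illustrative):

import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,                 # single Colab GPU
    max_epochs=1,
    callbacks=callbacks,    # the list from the previous sketch
    profiler="simple",      # prints timing information for the training run
)
# trainer.fit(lightning_module, datamodule=data_module)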
  • Evaluate Model Performance — After 1 Epoch of Training, we get Test metrics as below:
TEST RESULTS 
{'gen_len': 47.58674240112305,
'rouge1': 21.525144577026367,
'rouge2': 4.095888137817383,
'rougeL': 19.751005172729492,
'rougeLsum': 19.937564849853516,
'test_loss': 4.333302974700928}
  • These are not strong scores and the model will need training for additional Epochs. A single Epoch took 2 hrs 8 mins to run. The test loss is close to the training loss. The example notebook of Transformers using their Trainer shows a training loss of 2.72 after 1 Epoch with a running time of 1 hr 22 mins. Both the loss and the running time are lower than for this notebook using Pytorch Lightning. Is the Lightning framework introducing delays in the training step? It is not clear why this notebook is not able to achieve the same loss score as the Transformers notebook. The higher training loss means that the model is not able to generate good summary text at inference time. There will be a big difference in the quality of the inference summaries between a training loss of 4.33 and 2.72.
  • The SOTA score of 45.1 Rouge1 was achieved on the Test data using a BART model, which is a much larger model compared to the T5-small model. It is not possible to fine-tune a BART model on the free version of Colab as it will run out of memory, but it should be possible to reproduce the SOTA result by using a larger GPU machine.
  • Run Inference on the Trained Model — A sample Inference code is shown here. Inference using a Seq2Seq model is more involved than inference using an Encoder model. The inference output is generated token by token and is limited by the maximum-length parameter. There are a few techniques to generate the output text, such as greedy search, beam search, sampling and top-k sampling. The code for greedy search and beam search is shown in the notebook; it took quite a number of hours to understand and unravel the code from the transformers generation utils.
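For comparison, the same token-by-token decoding can be delegated to the model’s built-in generate method instead of the unravelled code from the generation utils; the sketch below shows greedy search (num_beams=1) and beam search, with the output limited by max_length:

# greedy search: pick the highest-probability token at each step
generated = model.generate(input_ids, attention_mask=attention_mask,
                           max_length=64, num_beams=1)

# beam search: keep the 4 most likely partial summaries at each step
generated = model.generate(input_ids, attention_mask=attention_mask,
                           max_length=64, num_beams=4, early_stopping=True)

summary = tokenizer.decode(generated[0], skip_special_tokens=True)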
  • The generated Summary is not good and the model will need additional training to improve its Summarization capabilities.
  • TensorBoard Logs Data — This will open TensorBoard within the Colab notebook and let you look at the various TensorBoard logs. Pytorch Lightning logs to TensorBoard by default, and this can be changed by passing a different Logger to the Trainer.
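In Colab, TensorBoard can be opened with the notebook magics, and the default logger can be swapped by passing a different logger to the Trainer (the run name "xsum-t5-small" is illustrative):

# inside the Colab notebook:
# %load_ext tensorboard
# %tensorboard --logdir lightning_logs/

from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(save_dir="lightning_logs", name="xsum-t5-small")
# trainer = pl.Trainer(logger=logger, ...)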

Next we will take a look at the Translation task training in Part 7 of this Series.


Narayana Swamy

Over 17 years of diverse global experience in Data Science, Finance and Operations. Passionate about using data to unlock Business value.