Sequence Classification using Pytorch Lightning with BERT on IMDB data
This subject isn’t new. There are umpteen articles on sequence classification using Bert models. The Transformers library from huggingface.co has a bunch of pre-trained Bert models built specifically for sequence classification (such as BertForSequenceClassification and DistilBertForSequenceClassification) that have the proper classification head on top of the Bert layers to handle any multi-class use case. They also have a Trainer class that is optimized for training their Transformer models on your own dataset; it can be used to finetune a Bert model in just a few lines of code, as shown in this notebook: https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM. The problem with all these approaches is that they work very well within the scope of the pre-defined classes, but they can’t be used to experiment with changes to the model architecture, to change model parameters midway through an epoch, or to apply other advanced tuning techniques.
The purpose of this article is to show a generalized way of training deep learning models without getting muddled up writing the training and eval code in raw Pytorch through loops and if-then statements. Pytorch Lightning provides an easy and standardized approach to thinking about and writing code based on what happens during a training/eval batch, at batch end, at epoch end, etc. The Pytorch Lightning website also has many examples showcasing its abilities (https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pl_examples). Most of those examples, however, use datasets that come already prepared through Pytorch or TensorFlow datasets. They don’t show the entire pipeline: preparing the dataset from raw data, building a DL model architecture from pre-trained and user-defined forward classes, using different loggers, using different learning rate schedulers, training on multiple GPUs, etc. That is what this article tries to accomplish by walking through all the important steps to getting a deep learning model working. The IMDB data used for training is almost a trivial dataset by now, but it is still a very good sample dataset for sentence classification problems, much like the Digits or CIFAR-10 datasets are for computer vision problems.
The relevant sections of the code are quoted here to draw attention to what they do. An average accuracy of 0.9238 was achieved on the test IMDB dataset after one epoch of training, a respectable accuracy for a single epoch. The entire code can be seen here: https://github.com/kswamy15/pytorch-lightning-imdb-bert/blob/master/Bert_NLP_Pytorch_IMDB_v3.ipynb
Preparing the Data Section:
Once the individual text files from the IMDB data are combined into one large file, it is easy to load them into a pandas dataframe, apply pre-processing, and tokenize the data so that it is ready for the DL model.
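A minimal sketch of this step might look like the following, assuming the standard aclImdb folder layout (the paths and column names are illustrative, not taken from the original notebook):
# sketch: combine the individual review files into one pandas dataframe;
# folder layout, paths and column names are assumptions for illustration
import os
import pandas as pd

def load_imdb_split(split_dir):
    rows = []
    for label_name, label in [('pos', 1), ('neg', 0)]:
        folder = os.path.join(split_dir, label_name)
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), encoding='utf-8') as f:
                rows.append({'review': f.read(), 'sentiment': label})
    return pd.DataFrame(rows)

train_df = load_imdb_split('aclImdb/train')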
Defining the Dataset Class:
# custom dataset uses Bert Tokenizer to create the Pytorch Dataset
import torch
from torch.utils.data import Dataset

class ImdbDataset(Dataset):
    def __init__(self, notes, targets, tokenizer, max_len):
        self.notes = notes
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.notes)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        note = str(self.notes[idx])
        target = self.targets[idx]
        encoding = self.tokenizer.encode_plus(
            note,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            truncation=True,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            #'text': note,
            'label': torch.tensor(target, dtype=torch.long),
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding['token_type_ids'].flatten()
        }
The Bert Transformer models expect inputs in specific formats such as input_ids and attention_mask; token_type_ids are used mostly in question-answering style Bert models. The Transformers library provides many different tokenizers for tokenizing text. The beauty of using Bert-like models is that you don’t necessarily have to clean up the sentences by removing stop words or stemming/lemmatizing words. The tokenizer will have seen most of the raw words before, since the Bert model was trained on a large corpus, and it can break up words it doesn’t recognize into sub-words to produce a meaningful tokenization.
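As a hedged example of wiring the tokenizer and dataset together (the tokenizer name, max_len and batch_size below are illustrative choices, not values from the notebook):
# sketch: instantiate a Bert tokenizer and wrap the custom dataset in a DataLoader;
# 'bert-base-uncased', max_len=256 and batch_size=32 are illustrative choices
from transformers import BertTokenizer
from torch.utils.data import DataLoader

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = ImdbDataset(
    notes=train_df['review'].values,
    targets=train_df['sentiment'].values,
    tokenizer=tokenizer,
    max_len=256,
)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)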
Pytorch Lightning Module (only part of it is shown here for brevity):
This is no different from constructing a Pytorch training module, but what makes Pytorch Lightning good is that it will take care of a lot of the inner workings of the training/eval loop once the init and forward functions are defined. No special code needs to be written to train the model on a GPU: just specify the gpus parameter when calling the Pytorch Lightning Trainer, and it will take care of loading the data and the model onto cuda.
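As a rough sketch of what such a module’s __init__ and forward could look like (the bert-base-uncased backbone and the simple linear classification head here are assumptions for illustration, not necessarily the notebook’s exact architecture):
# illustrative skeleton of the LightningModule; the backbone and linear head are assumptions
import torch
import pytorch_lightning as pl
from transformers import BertModel

class ImdbModel(pl.LightningModule):
    def __init__(self, num_labels=2):
        super().__init__()
        self.num_labels = num_labels
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, label=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # use the pooled [CLS] representation for classification
        pooled_output = outputs[1]
        logits = self.classifier(pooled_output)
        return logits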
Training Step function:
def training_step(self, batch, batch_nb):
    # batch
    input_ids = batch['input_ids']
    label = batch['label']
    attention_mask = batch['attention_mask']
    #token_type_ids = batch['token_type_ids']
    # fwd
    y_hat = self(input_ids, attention_mask, label)
    # loss
    loss_fct = torch.nn.CrossEntropyLoss()
    loss = loss_fct(y_hat.view(-1, self.num_labels), label.view(-1))
    #loss = F.cross_entropy(y_hat, label)
    # logs
    tensorboard_logs = {'train_loss': loss, 'learn_rate': self.optim.param_groups[0]['lr']}
    return {'loss': loss, 'log': tensorboard_logs}
The training step is constructed by defining a training_step function. The loss, along with any other values to be logged, is returned from this function. Similar functions are defined for validation_step and test_step, as sketched below.
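A matching validation_step might look like the following sketch; it mirrors the training_step above and adds an accuracy metric, but it is not the exact code from the notebook:
# sketch of a matching validation_step; structure mirrors training_step above
def validation_step(self, batch, batch_nb):
    input_ids = batch['input_ids']
    label = batch['label']
    attention_mask = batch['attention_mask']
    y_hat = self(input_ids, attention_mask, label)
    loss_fct = torch.nn.CrossEntropyLoss()
    val_loss = loss_fct(y_hat.view(-1, self.num_labels), label.view(-1))
    # track accuracy as well for reporting at epoch end
    preds = torch.argmax(y_hat, dim=1)
    val_acc = (preds == label).float().mean()
    return {'val_loss': val_loss, 'val_acc': val_acc}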
Changing Learning rate after every batch:
def on_batch_end(self):
    # This is needed to use the One Cycle learning rate that needs the learning rate to change after every batch
    # Without this, the learning rate will only change after every epoch
    if self.sched is not None:
        self.sched.step()
The learning rate can be changed after every batch by calling scheduler.step() in the on_batch_end hook. This is actually key to training on the IMDB data: the level of accuracy reached after one epoch can’t be matched by using a constant learning rate throughout the epoch.
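The scheduler itself would typically be created in configure_optimizers. A minimal sketch, assuming AdamW and torch’s OneCycleLR (the learning rate, steps_per_epoch and epochs values here are illustrative, not the notebook’s settings):
# sketch of configure_optimizers pairing AdamW with a OneCycleLR scheduler;
# lr, steps_per_epoch and epochs are illustrative values
def configure_optimizers(self):
    self.optim = torch.optim.AdamW(self.parameters(), lr=2e-5)
    self.sched = torch.optim.lr_scheduler.OneCycleLR(
        self.optim,
        max_lr=2e-5,
        steps_per_epoch=782,  # roughly len(train_loader) for 25,000 reviews at batch size 32 (assumed)
        epochs=1,
    )
    # the scheduler is stepped manually in on_batch_end, so only the optimizer is returned here
    return [self.optim]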
Defining a CLI function:
def run_cli():
    # ------------------------
    # TRAINING ARGUMENTS
    # ------------------------
    # these are project-wide arguments
    #root_dir = os.path.dirname(os.path.realpath(__file__))
    root_dir = os.getcwd()
    parent_parser = ArgumentParser(add_help=False)
    parent_parser = pl.Trainer.add_argparse_args(parent_parser)
    # each LightningModule defines arguments relevant to it
    parser = ImdbModel.add_model_specific_args(parent_parser, root_dir)
    parser.set_defaults(
        profiler=True,
        deterministic=True,
        max_epochs=1,
        gpus=1,
        distributed_backend=None,
        fast_dev_run=False,
        model_load=False,
        model_name='best_model',
    )
    #args = parser.parse_args()
    args, extra = parser.parse_known_args()
    # ---------------------
    # RUN TRAINING
    # ---------------------
    main(args)
The run_cli() function is declared here to enable running this Jupyter notebook as a python script. Pytorch Lightning models can’t be run on multiple GPUs from within a Jupyter notebook. To run on multiple GPUs within a single machine, the distributed_backend needs to be set to ‘ddp’. The ‘dp’ backend won’t work even though their docs claim it does.
As per their website: “Unfortunately any ddp_ is not supported in jupyter notebooks. Please use dp for multiple GPUs. This is a known Jupyter issue. If you feel like taking a stab at adding this support, feel free to submit a PR!”
The run_cli() call can be placed inside an if __name__ == '__main__': block in the python script. If one wants to continue training a checkpointed model for more epochs, the checkpointed model can be specified via model_name.
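For completeness, a minimal sketch of how main(args) and the script entry point could be wired together is shown below; the body of main() is an assumption for illustration, since only run_cli() is quoted above, and it presumes the dataloaders are defined inside the LightningModule:
# sketch of a main() and entry point to pair with run_cli();
# the body of main() is illustrative, since only run_cli() is quoted in the article
def main(args):
    model = ImdbModel()
    # build the Trainer directly from the parsed command-line arguments
    trainer = pl.Trainer.from_argparse_args(args)
    trainer.fit(model)

if __name__ == '__main__':
    run_cli()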