NLP Deep Learning Training using SageMaker and Pytorch Lightning — IMDB Classification

Narayana Swamy
Feb 7, 2022


I have written two articles in the past on NLP training using the Pytorch Lightning framework and the pre-trained BERT models published by Huggingface, on the IMDB movie review dataset. Both of those articles used the GPU on Google Colab to train a model to classify a movie review as either positive or negative — a simple NLP classification exercise. In the current article, I discuss how to train the model using a Sagemaker GPU instance and share my experiences in getting the Pytorch Lightning model to work on Sagemaker. AWS Sagemaker opens up the possibility of using multiple GPUs to train more complex models than what Google Colab offers. The code repo can be found here.

The current work was built on the Sagemaker example found here — Bring your own Container. That AWS link also gives an introduction to using a custom container on Sagemaker to train a scikit-learn decision tree model. I will walk through the basic steps to get the BERT model trained using a GPU instance on Sagemaker and a Pytorch Lightning docker container.

1. Docker file — A Docker file based on a recent Pytorch GPU image needs to be created (in my repo, it is nvcr.io/nvidia/pytorch:21.07-py3). We can't just use the pytorchlightning docker image, as it does not contain a CUDA-based Pytorch build, and a GPU Sagemaker instance requires a Docker image with the CUDA drivers built in. This brings up a problem: a docker image with CUDA cannot be built in a Windows Docker environment; a Linux or Mac environment is needed to build it locally. Because of that, I had to build the image on my Macbook instead of my Windows PC. It took a few days to figure this out, as you get strange errors when trying to build the docker image on a Windows PC.
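The Docker file in the repo follows this general shape — a minimal sketch, not the exact file: the base image is the one named above, while the install step and the copy/workdir paths are my assumptions based on the description in step 2 below.

```dockerfile
# Base image: CUDA-enabled Pytorch from the NVIDIA registry (named above)
FROM nvcr.io/nvidia/pytorch:21.07-py3

# Install Pytorch Lightning on top of the base image
RUN pip install pytorch-lightning

# Copy the training code and make it the working directory
# (the destination path here is illustrative)
COPY imdb /opt/ml/code/imdb
WORKDIR /opt/ml/code/imdb
```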
2. Docker build — The Docker file essentially builds the Pytorch image, installs pytorch lightning, copies the imdb folder (which contains all the Pytorch Lightning training code) and makes the imdb folder the working directory. This docker image can be built with the command (the trailing dot is the build context):
### Build the docker image
docker build -t pylightning-sagemaker-imdb .

3. ECR upload — There is a build_and_push.sh shell script that takes the local Docker image (specified as an argument to the script) and pushes it to the AWS ECR repository. You will need to install the AWS CLI version 2 and configure an access key ID and secret access key for this to work. It is easy to install the CLI using the GUI installer; refer here to configure the CLI. The Pytorch Docker image is about 15 GB and took about an hour to upload to AWS ECR on a 10 Mbps upload connection. A docker upload to ECR must be done every time code changes are made to the python files, but those later uploads are quick, as only a few docker layers need to be updated.
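Under the hood, pushing to ECR amounts to tagging the local image with a fully qualified registry URI and pushing it. A sketch of how that URI is assembled — the account ID and region below are placeholders, and the helper name is mine, not from the repo's script:

```python
def ecr_image_uri(account_id: str, region: str, repo: str, tag: str = "latest") -> str:
    """Build the fully qualified ECR image URI that the push script tags and pushes."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"

# The shell script then effectively runs (AWS CLI v2 + docker):
#   aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <registry>
#   docker tag pylightning-sagemaker-imdb <uri>
#   docker push <uri>
uri = ecr_image_uri("123456789012", "us-east-1", "pylightning-sagemaker-imdb")
print(uri)
```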

4. Training files — There are three important python files in this example — train, imdb.py and IMDBClassifier.py. The train file is the starting point for the training: it calls imdb.py, passing the hyperparameters and default parameters as arguments via a subprocess call. By default, the Sagemaker session invokes the train file after the docker image is downloaded to the instance; this can be overridden with ENTRYPOINT information in the docker image. The train file must be executable and must not have a .py extension.

## Called by Sagemaker session by default
docker run image train
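Inside the container, Sagemaker places the hyperparameters it was given in a JSON file at /opt/ml/input/config/hyperparameters.json (the standard Sagemaker training path, with every value serialized as a string). A sketch of the pattern the train file follows — the helper names are illustrative, not necessarily those in the repo:

```python
import json
import subprocess
from pathlib import Path

# Sagemaker's standard location for the hyperparameters passed to the job.
HYPERPARAM_PATH = Path("/opt/ml/input/config/hyperparameters.json")

def build_argv(hyperparams: dict) -> list:
    """Turn the hyperparameter dict into CLI arguments for imdb.py."""
    argv = ["python", "imdb.py"]
    for name, value in sorted(hyperparams.items()):
        argv += [f"--{name}", str(value)]
    return argv

def main():
    # Sagemaker serializes every hyperparameter value as a string in the JSON file.
    hyperparams = json.loads(HYPERPARAM_PATH.read_text()) if HYPERPARAM_PATH.exists() else {}
    subprocess.run(build_argv(hyperparams), check=True)

# The executable train file would simply end with a call to main().
```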

5. imdb and IMDBClassifier — The imdb.py file sets up the Pytorch Lightning Trainer with all the necessary arguments and the checkpoint and early-stopping callbacks, and it invokes the trainer.fit() and trainer.test() calls. IMDBClassifier.py has all the Pytorch Lightning code to download the data and the pre-trained BERT model, add the classification head, and set up the necessary training, validation and testing steps. Refer to my earlier articles for more details on the Pytorch Lightning training code.
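For imdb.py to receive the arguments the train file forwards, it needs matching argument parsing; with argparse, that wiring looks roughly like this (the argument names and defaults here are illustrative — the repo's may differ), after which the parsed values feed the Trainer and callback setup:

```python
import argparse

def parse_args(argv=None):
    """Parse the hyperparameters forwarded by the train file."""
    parser = argparse.ArgumentParser(description="IMDB BERT fine-tuning")
    parser.add_argument("--epochs", type=int, default=1)       # max_epochs for the Trainer
    parser.add_argument("--batch_size", type=int, default=32)  # per-GPU batch size
    parser.add_argument("--lr", type=float, default=2e-5)      # fine-tuning learning rate
    return parser.parse_args(argv)

# The parsed args then configure the Lightning pieces, e.g.:
#   trainer = pl.Trainer(max_epochs=args.epochs, gpus=1,
#                        callbacks=[checkpoint_callback, early_stopping_callback])
#   trainer.fit(model); trainer.test(model)
```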

6. Sagemaker Execution Role — A Sagemaker execution role needs to be created following this AWS documentation. This allows the Sagemaker session to create a training instance and gives it access to the S3 buckets used for inputs and outputs.
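At its core, the execution role's trust policy is what lets the Sagemaker service assume the role (the S3 and other permissions are attached separately as policies); the standard trust relationship looks like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```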

7. Training — Training on Sagemaker can be started either from the Training Notebook or by running the main.py python file in the imdb folder of the repo. The Sagemaker role in those files needs to be changed to your own.
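The launch code boils down to creating an Estimator that points at the ECR image and calling fit(). This sketch uses the sagemaker SDK's Estimator class, but the image URI, role ARN, instance type and hyperparameters are placeholders — match them to your own setup and to what the repo's main.py actually does:

```python
def launch_training():
    # Imported inside the function so the sketch can be read (and the module
    # loaded) without the sagemaker SDK installed.
    import sagemaker
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/pylightning-sagemaker-imdb:latest",
        role="arn:aws:iam::123456789012:role/MySagemakerExecutionRole",  # your execution role
        instance_count=1,
        instance_type="ml.p3.2xlarge",  # single-GPU instance
        hyperparameters={"epochs": 1, "batch_size": 32},
        sagemaker_session=sagemaker.Session(),
    )
    estimator.fit()  # blocks until the training job finishes
```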

One thing I noticed while training on Sagemaker is that it ran 2 epochs even when I set the epochs argument to 1. When I set the epochs argument to 0, it still ran 2 epochs (epochs 0 and 1). I am not sure if this is a bug within Sagemaker. There was also one time the Sagemaker instance crashed with an internal server error (1 out of 8 tries) — instances can crash!

8. Local Testing — The model training script can be tested locally by running the train_local.sh script in the local_test folder. This uses the local docker image to run the train file. Always do this before spinning up a Sagemaker instance, to make sure there aren't any syntax or logic errors in the code.

9. Sagemaker Logs — Sagemaker saves all the output from the training into a CloudWatch log. I have found this to be the quickest way to see how the training went — training loss, test loss, test accuracy and so on. Go to Training jobs in the Sagemaker section of the AWS console, choose the particular training job and click on View logs. This takes you to the corresponding CloudWatch log, whose text can easily be searched for, say, test accuracy.

10. Model artifacts — The Sagemaker Training job page also has a link to the model artifacts. Sagemaker saves the model and the lightning logs in an S3 bucket as a model.tar.gz and an output.tar.gz. The output.tar.gz contains the best checkpoint model and the various model metrics tracked in the Pytorch Lightning code.
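After downloading an artifact from S3, a quick stdlib-only way to look inside it — the member names you see will be whatever your Lightning run produced:

```python
import tarfile

def list_artifact(path: str) -> list:
    """List the file names inside a Sagemaker artifact such as output.tar.gz."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers() if member.isfile())

# Usage, after e.g. `aws s3 cp s3://<bucket>/<job>/output/output.tar.gz .`:
#   list_artifact("output.tar.gz")  # checkpoint files, lightning_logs metrics, etc.
```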

11. Reuse — The same code can be reused to train other deep learning use cases by replacing IMDBClassifier.py with your own custom Pytorch Lightning code.

12. Serving — The repo doesn't have a serving or prediction script. Serving and prediction can be done with a serverless framework that creates a model-serving endpoint. I will post another article soon on serving the model on AWS using a Sagemaker GPU endpoint.

Do post comments if you run into any errors using the repo code. I initially used MNIST training with a three-layer neural network to get familiar with Sagemaker training and sort out the errors that came up. MNIST training runs very fast and can even be done without a GPU.


Narayana Swamy

Over 17 years of diverse global experience in Data Science, Finance and Operations. Passionate about using data to unlock Business value.