The training data is downloaded into the Docker container when SageMaker spins up the container and runs train.py. train.py calls imdb.py, which in turn uses the ImdbDataModule class contained in IMDBClassifier.py. If you look at IMDBClassifier.py, you will see that the ImdbData class pulls the IMDB data from data_source_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
So the training data source is embedded in the Python training scripts.
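To make the idea of an embedded data source concrete, here is a minimal sketch of what such a download step could look like. The function name download_and_extract and the archive file name are hypothetical, chosen for illustration; the actual code in IMDBClassifier.py may be organized differently. Only the URL comes from the script itself.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the training script; everything else here is illustrative.
DATA_SOURCE_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download_and_extract(url: str, dest_dir: str) -> Path:
    """Download a .tar.gz archive from `url` and unpack it into `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Hypothetical local file name for the downloaded archive.
    archive_path = dest / "dataset.tar.gz"
    urllib.request.urlretrieve(url, str(archive_path))

    # Unpack the archive next to the downloaded file.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return dest


if __name__ == "__main__":
    # Inside the container this would run at training time.
    download_and_extract(DATA_SOURCE_URL, "data")
```

Because the download happens inside the script, the container needs outbound internet access every time a training job starts.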
You can also give SageMaker the location of the data in the input configuration, as shown below. That will require some changes to the training script code.
tree.fit(
    {
        'train': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/train',
        'test': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/test',
    }
)
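As a rough sketch of the script-side change: when channels named 'train' and 'test' are passed to fit, SageMaker copies the S3 data into the container (by default under /opt/ml/input/data/<channel>) and exposes the paths through the SM_CHANNEL_TRAIN and SM_CHANNEL_TEST environment variables. The helper name parse_data_dirs and the --train/--test flags are hypothetical; how the paths are then wired into the data module depends on the existing code.

```python
import argparse
import os


def parse_data_dirs(argv=None):
    """Return (train_dir, test_dir) from CLI flags or SageMaker channel variables."""
    parser = argparse.ArgumentParser()
    # SageMaker sets SM_CHANNEL_<NAME> for each channel passed to fit().
    # The fallbacks are the default channel mount points in the container.
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--test",
        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"),
    )
    # Ignore any other arguments (for example hyperparameters) passed to the script.
    args, _ = parser.parse_known_args(argv)
    return args.train, args.test


if __name__ == "__main__":
    train_dir, test_dir = parse_data_dirs()
    print(f"train data: {train_dir}")
    print(f"test data: {test_dir}")
```

The training script would then read the dataset from these directories instead of downloading it from data_source_url.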