The training data is downloaded into the Docker container when SageMaker spins up the container and runs train.py. The train.py file calls imdb.py, which in turn uses the ImdbDataModule class defined in IMDBClassifier.py. If you look at IMDBClassifier.py, you will see that the ImdbData class pulls the IMDB data from data_source_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz".

So the training data source is embedded in the Python training scripts.
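
To illustrate what "embedded in the training scripts" means, here is a minimal sketch of a script fetching and unpacking the archive on its own. The function name download_and_extract and the file name archive.tar.gz are hypothetical; the actual ImdbData class may do this differently.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL quoted above from IMDBClassifier.py
DATA_SOURCE_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download_and_extract(url: str, dest_dir: str) -> Path:
    """Download a .tar.gz archive from url and unpack it under dest_dir.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Save the archive locally, then unpack it next to the download.
    archive_path = dest / "archive.tar.gz"
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


# Example call (downloads the full dataset, so it is left commented out):
# download_and_extract(DATA_SOURCE_URL, "/tmp/imdb")
```

Because the download happens inside the container at run time, the container needs outbound internet access for this approach to work.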

You can also give SageMaker the location of the data in the input configuration, as shown below. That will require some changes to the training script code.

tree.fit(
    {
        'train': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/train',
        'test': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/test',
    }
)
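
On the script side, the change could look roughly like the sketch below. It assumes the default File input mode, where each channel passed to fit() is copied into /opt/ml/input/data/<channel name> inside the container, and that the SageMaker training toolkit exposes those paths as SM_CHANNEL_TRAIN and SM_CHANNEL_TEST environment variables (a custom container may only have the directories). The function name parse_channels is hypothetical.

```python
import argparse
import os


def parse_channels(argv=None):
    """Resolve the local directories of the 'train' and 'test' channels.

    Hypothetical helper: prefers the SM_CHANNEL_* environment variables
    and falls back to the default File-mode mount points.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--test",
        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"),
    )
    # Ignore any other arguments (e.g. hyperparameters) passed to the script.
    args, _ = parser.parse_known_args(argv)
    return args


# In the training script, read the data from these directories
# instead of downloading it from data_source_url:
# args = parse_channels()
# train_dir, test_dir = args.train, args.test
```

The data module would then load files from these directories rather than pulling the archive from the Stanford URL.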

Written by Narayana Swamy

Over 17 years of diverse global experience in Data Science, Finance and Operations. Passionate about using data to unlock Business value.
