How to Read Data Files on S3 from Amazon SageMaker | by Mikhail Klassen | Nov, 2020


Keeping your data science workflow in the cloud

Amazon SageMaker is a powerful, cloud-hosted Jupyter Notebook service offered by Amazon Web Services (AWS). It’s used to create, train, and deploy machine learning models, but it’s also great for doing exploratory data analysis and prototyping.

While it may not be quite as beginner-friendly as some alternatives, such as Google Colab or Kaggle Kernels, there are some good reasons why you may want to do your data science work within Amazon SageMaker.

Let’s discuss a few.

Machine learning models must be trained on data. If you’re working with private data, then special care must be taken when accessing this data for model training. Downloading the entire data set to your laptop may be against your company’s policy or may simply be imprudent. Imagine having your laptop lost or stolen, knowing that it contains sensitive data. As a side note, this is another reason why you should always use disk encryption.

The data hosted in the cloud may also be too large to fit on your personal computer’s disk, so it’s easier to keep it there and access it directly.

Working in the cloud means you can access powerful compute instances. AWS or your preferred cloud services provider will usually allow you to select and configure your compute instances. Perhaps you need high CPU or high memory, more than what you have available on your personal machine. Or perhaps you need to train your models on GPUs. Cloud providers have a host of different instance types on offer.

How to deploy ML models directly from SageMaker is a topic for another article, but AWS gives you this option. You won’t need to build a complex deployment architecture. SageMaker will spin up a managed compute instance hosting a Dockerized version of your trained ML model behind an API for performing inference tasks.

Now let’s move on to the main topic of this article. I will show you how to load data saved as files in an S3 bucket using Python. The example data are pickled Python dictionaries that I’d like to load into my SageMaker notebook.

The process for loading other data types (such as CSV or JSON) would be similar, but may require additional libraries.
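As a rough illustration of that point, here is a minimal sketch of parsing JSON and CSV content once its raw bytes have been read from S3 (the reading step itself is covered further down). It uses only the standard library and assumes the files are UTF-8 encoded; the function names are hypothetical and not part of any AWS library.

```python
import csv
import io
import json


def parse_json_bytes(raw: bytes):
    """Decode UTF-8 JSON bytes into Python objects."""
    return json.loads(raw.decode("utf-8"))


def parse_csv_bytes(raw: bytes):
    """Decode UTF-8 CSV bytes into a list of dicts keyed by the header row."""
    text = io.StringIO(raw.decode("utf-8"))
    return list(csv.DictReader(text))
```

For larger tabular files, a library such as pandas is usually the more convenient choice.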

You will need to know the name of the S3 bucket. Objects in an S3 bucket are identified by “keys”, but semantically I find it easier just to think in terms of files and folders.

Let’s define the location of our files:

bucket = 'my-bucket'
subfolder = ''
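To make the link between "folders" and keys concrete, here is a small sketch that joins a subfolder and a file name into a full object key. The helper `s3_key` and the file name `data.pkl` are hypothetical, chosen only for illustration.

```python
import posixpath


def s3_key(subfolder: str, filename: str) -> str:
    """Join a "folder" prefix and a file name into an S3 object key."""
    # S3 keys use forward slashes on every platform, so posixpath is
    # used here rather than os.path.
    return posixpath.join(subfolder.strip("/"), filename)


# With an empty subfolder the key is just the file name.
key = s3_key("", "data.pkl")
```

An empty `subfolder`, as defined above, simply means the files sit at the top level of the bucket.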

SageMaker and S3 are separate services offered by AWS, and for one service to act on another, the appropriate permissions must be set. Thankfully, it’s expected that SageMaker users will be reading files from S3, so the standard permissions are fine.

Still, you’ll need to import the necessary execution role, which isn’t hard.

from sagemaker import get_execution_role
role = get_execution_role()

The boto3 Python library is designed to help users perform actions on AWS programmatically. It will facilitate the connection between the SageMaker notebook and the S3 bucket.

The code below lists all of the files contained within a specific subfolder on an S3 bucket. This is useful for checking what files exist.

You may adapt this code to create a list object in Python if you will be iterating over many files.
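A minimal sketch of such a listing might look like the following. It assumes a boto3 S3 client is passed in, and uses the `list_objects_v2` paginator so that subfolders with more than 1,000 objects are handled too. The function name `list_keys` is my own, not part of boto3.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return every object key in `bucket` that starts with `prefix`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage inside a SageMaker notebook (needs boto3 and AWS credentials):
#   import boto3
#   for key in list_keys(boto3.client("s3"), bucket, subfolder):
#       print(key)
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real bucket.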

The pickle library in Python is useful for saving Python data structures to a file so that you can load them later.

In the example below, I want to load a Python dictionary and assign it to the data variable.

This requires using boto3 to get the specific file object (the pickle) on S3 that I want to load. Notice how in the example the boto3 client returns a response that contains a data stream. We must read the data stream with the pickle library into the data object.

This behavior is a bit different compared to how you would use pickle to load a local file.
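Here is a minimal sketch of those steps, assuming a boto3 S3 client is passed in. The function name `load_pickle_from_s3` and the key `data.pkl` in the usage comment are hypothetical.

```python
import pickle


def load_pickle_from_s3(s3_client, bucket, key):
    """Fetch one object from S3 and unpickle its contents."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a stream; read() returns the object's raw bytes.
    raw = response["Body"].read()
    # Only unpickle data you trust: pickle can run arbitrary code on load.
    return pickle.loads(raw)


# Usage inside a SageMaker notebook (needs boto3 and AWS credentials):
#   import boto3
#   data = load_pickle_from_s3(boto3.client("s3"), bucket, "data.pkl")
```

With a local file you would pass an open file handle to `pickle.load`; here the bytes come from the response stream, so `pickle.loads` is applied to them instead.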

Since this is something I always forget how to do right, I’ve compiled the steps into this tutorial so that others might benefit.

There are times you may want to download a file from S3 programmatically. Perhaps you want to download files to your local machine or to storage attached to your SageMaker instance.

To do this, the code is a bit different:
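A minimal sketch of a download, assuming a boto3 S3 client is passed in and using its `download_file` method. The helper name `download_from_s3` and the default destination directory are assumptions of mine.

```python
import os
import posixpath


def download_from_s3(s3_client, bucket, key, dest_dir="."):
    """Download one S3 object into `dest_dir` and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    # Keep only the final part of the key as the local file name.
    local_path = os.path.join(dest_dir, posixpath.basename(key))
    s3_client.download_file(bucket, key, local_path)
    return local_path


# Usage inside a SageMaker notebook (needs boto3 and AWS credentials):
#   import boto3
#   path = download_from_s3(boto3.client("s3"), bucket, "data.pkl", "downloads")
```

Unlike the previous approach, this writes the object to disk rather than holding its contents in memory.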

I have focussed on Amazon SageMaker in this article, but if you have the boto3 SDK set up correctly on your local machine, you can also read or download files from S3 there. Since much of my own data science work is done via SageMaker, where you need to remember to set the correct access permissions, I wanted to provide a resource for others (and my future self).

Obviously SageMaker is not the only game in town. There are a variety of different cloud-hosted data science notebook environments on offer today, a huge leap forward from five years ago (2015) when I was completing my Ph.D.

One consideration that I did not mention is cost: SageMaker is not free, but is billed by usage. Remember to shut down your notebook instances when you’re finished.

