AWS launches major new features for Amazon SageMaker to simplify development of machine learning models

Machine learning continues to grow on AWS, and AWS is putting serious effort into paving the way for customers’ machine learning development journey on the AWS cloud. The Andy Jassy keynote at AWS re:Invent was a fiesta for data scientists, with the newly launched Amazon SageMaker features including Experiments, Debugger, Model Monitor, AutoPilot and Studio.

AWS really aims to make the whole development life cycle of machine learning models as simple as possible for data scientists. With the newly launched features, they are addressing common, effort-demanding problems: monitoring your data validity from your model’s perspective and monitoring your model performance (Model Monitor), experimenting with multiple machine learning approaches in parallel for your problem (Experiments), enabling cost-efficient heavy model training with automatic rules (Debugger) and following these processes in a visual interface (Studio). These processes can even be orchestrated for you with AutoPilot, which, unlike many services, is not a black box machine learning solution but provides all the generated code to you. The announced features also included an SSO-integrated login to SageMaker Studio and SageMaker Notebooks, and the possibility to share notebooks with other data scientists with one click, including the needed runtime dependencies and libraries (preview).

Compare and try out different models with SageMaker Experiments

Building a model is an iterative process of trials with different hyperparameters and seeing how they affect the performance of the model. SageMaker Experiments aims to simplify this process. With Experiments, one can create trial runs with different parameters and compare them. It provides information about the hyperparameters and performance for each trial run, regardless of whether a data scientist has run training multiple times, used automated hyperparameter tuning or used AutoPilot. It is especially helpful when automating some steps or the whole process, because the number of training jobs run is typically much higher than with a traditional approach.

Experiments makes it easy to compare trials, see which hyperparameters were used and monitor the performance of the models, without having to set up the versioning manually. It makes it easy to choose and deploy the best model to production, but you can also always come back and look at the artifacts of your model when facing problems in production. It also provides more transparency, for example to automated hyperparameter tuning and to the new SageMaker AutoPilot. Additionally, SageMaker Experiments has its own SDK, so it is possible to call the API with Python to get the best model programmatically and deploy an endpoint for it.
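As a minimal sketch (the experiment name is a placeholder, and the exact metric column names depend on what your training jobs log), the trials of an experiment can be pulled into a pandas DataFrame with the SageMaker Python SDK:

from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(experiment_name="mnist-experiment")   # placeholder name
trials_df = analytics.dataframe()          # one row per trial component

# Sort by a logged metric; the column name depends on the metrics your jobs emit
best = trials_df.sort_values("validation:accuracy - Max", ascending=False).head(1)
print(best["TrialComponentName"])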

Track issues in model training with SageMaker Debugger

During the training process of your model, many issues may occur that might prevent your model from finishing correctly or from learning patterns. You might have, for example, initialized parameters inappropriately or used inefficient hyperparameters during the training process. SageMaker Debugger aims to help track issues related to your model training (unlike the name indicates, SageMaker Debugger does not debug your code semantics).

When you enable Debugger in your training job, it starts to save the internal model state into an S3 bucket during the training process. A state consists of, for example, neural network weights, accuracies and losses, the output of each layer and optimization parameters. These debug outputs are analyzed against a collection of rules while the training job is executed. When you enable Debugger while running your training job, SageMaker will start two jobs: a training job and a debug processing job (powered by Amazon SageMaker Processing Jobs), which run in parallel; the latter analyzes the state data to check if the rule conditions are met. If you have, for example, an expensive and time-consuming training job, you can set up a Debugger rule and configure a CloudWatch alarm for it that kills the job once your rules trigger, e.g. when the loss has converged enough.

For now, the debugging framework of saving internal model states supports only TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost. You can configure your own rules that analyse model states during the training, or use preconfigured ones such as loss not changing or exploding/vanishing gradients. You can use the debug feature visually through SageMaker Studio or alternatively through the SDK and configure everything yourself.
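As a minimal sketch of enabling built-in rules (the training script, instance types and S3 paths are placeholders, and the exact parameter names depend on the SDK version):

import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.tensorflow import TensorFlow

# Built-in rules evaluated by the debug processing job while training runs
rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
]

estimator = TensorFlow(
    entry_point="train.py",                 # hypothetical training script
    role=sagemaker.get_execution_role(),
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    framework_version="1.15",
    py_version="py3",
    rules=rules,
)
estimator.fit("s3://my-bucket/training-data/")   # placeholder S3 location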

Keep your model up-to-date with SageMaker Model Monitor

Drifts in data might have a big impact on your model’s performance in production, if your training data and the data your model sees in production start to have different statistical properties. Detecting those drifts requires effort, like setting up jobs that calculate statistical properties of your data and keeping those up to date, so that your model does not get outdated. SageMaker Model Monitor aims to solve this problem by tracking the statistics of incoming data, and in that way to ensure that machine learning models age well.

Model Monitor forms a baseline from the training data used for creating a model. The baseline information includes statistics of the data and basic information such as the names and data types of the features. The baseline is formed automatically, but it can be changed if needed. Model Monitor then continuously collects input data from the deployed endpoint and puts it into an S3 bucket. Data scientists can then create their own rules or use ready-made validations for the data and schedule validation jobs. They can also configure alarms for deviations from the baseline. These alarms and validations can indicate that the deployed model is actually outdated and should be re-trained.

SageMaker Model Monitor makes monitoring the model quality very easy, but at the same time data scientists keep control and can customize the rules, scheduling and alarms. The monitoring is attached to an endpoint deployed with SageMaker, so if inference is implemented in some other way, Model Monitor cannot be used. SageMaker endpoints are always on, so they can be an expensive solution for cases where predictions are not needed continuously.
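As a minimal sketch (bucket names and instance types are placeholders, and model and role are assumed to exist already), capturing endpoint traffic and suggesting a baseline could look roughly like this:

from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Capture the endpoint's input data when deploying the model
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/monitor/captured",
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=capture_config,
)

# 2. Form a baseline from the training data
monitor = DefaultModelMonitor(role=role, instance_count=1, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)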

Start from scratch with SageMaker AutoPilot

SageMaker AutoPilot is an AutoML solution for SageMaker. SageMaker has had automatic hyperparameter tuning for a while already, but in addition to that, AutoPilot takes care of preprocessing the data and selecting an appropriate algorithm for the problem. This saves a lot of data preprocessing time and enables building models even if you’re not sure which algorithm to use. AutoPilot supports linear learner, factorization machines, KNN and XGBoost at first, but other algorithms will be added later.

Running an AutoPilot job is as easy as specifying a CSV file and the response variable present in the file. AWS considers models trained by SageMaker AutoPilot to be white box models instead of black box ones, because it provides the generated source code for training the model, and with Experiments it is easy to view the trials AutoPilot has run.
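As a minimal sketch with the SageMaker Python SDK (bucket and column names are placeholders), launching an AutoPilot job could look roughly like this:

import sagemaker
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role=sagemaker.get_execution_role(),
    target_attribute_name="label",               # the response variable in the CSV
    output_path="s3://my-bucket/autopilot-output/",
    max_candidates=10,
    sagemaker_session=sagemaker.Session(),
)
automl.fit(inputs="s3://my-bucket/autopilot-input/data.csv", wait=False)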

SageMaker AutoPilot automates machine learning model development completely. It is yet to be seen whether it improves the models, but it is a good sign that it provides information about the process. Unfortunately, the description of the process can only be viewed in SageMaker Studio (only available in Ohio at the moment). The supported algorithms are currently quite limited as well, so AutoPilot might not provide the best possible performance for some problems. In practice, running AutoPilot jobs takes a long time, so the costs of using AutoPilot might be quite high; that time is of course saved from the data scientist’s working time. One possibility is, for example, when approaching a completely new data set and problem, to start by launching AutoPilot and get a few models and all the code as a template. That could serve as a kick start for iterating on your problem, starting from tuning the generated code and continuing development from there, saving time on the initial setup.

SageMaker Studio – IDE for data science development

The launched SageMaker Studio (available in Ohio) is a fully integrated development environment (IDE) for ML, built on top of JupyterLab. It pulls together the ML workflow steps in a visual interface, with its goal being to simplify the iterative nature of ML development. In Studio one can move between steps, compare results and adjust inputs and parameters. It also simplifies the comparison of models and runs side by side in one visual interface.

Studio seems to tie the newly launched features (Experiments, Debugger, Model Monitor and AutoPilot) nicely into a single web-based user interface. While these new features are all usable through SDKs, using them through the visual interface will be more insightful for a data scientist.

Conclusion

These new features enable more organized development of machine learning models, moving from notebooks to controlled monitoring and deployment and transparent workflows. Of course, several of the capabilities enabled by these features could be implemented elsewhere (e.g. training job debugging, or data quality control with some scheduled smoke tests), but that requires more coding and setting up of infrastructure. The whole public cloud journey of AWS has been about simplifying development and taking load away by providing reusable components and libraries, and these launches go well with that agenda.

AWS Redshift breaks bond between compute and storage

AWS Redshift took a huge leap forward with the new releases. Decoupling storage and compute is the first step towards a modern cloud data warehouse.

AWS Redshift is the world’s most popular data warehouse, but it has faced some tough competition in the market. AWS Redshift has compute and storage coupled, meaning that with a specific type of instance you get a set amount of storage, which can sometimes be limiting. At the Andy Jassy keynote AWS released a new managed storage model for Redshift that allows you to scale compute decoupled from storage.

The storage model uses SSDs and S3 for the storage behind the scenes and utilises architectural improvements of the infrastructure. This allows users to keep the hot data on SSD and also query historical data stored in S3 seamlessly from Redshift. On top of this, you only pay for the SSD you use. It also comes with new Nitro-based compute instances. In Ireland an RA3 instance has a price of $15.578 per node/hour, but you get 48 vCPUs, 384 GB of memory and up to 64 TB of storage. You can cluster these up to 128 instances. AWS promises 3x the performance of any other cloud data warehouse service, and Redshift Dense Storage (DS2) users are promised twice the performance and twice the storage at the same cost. RA3 is available now in Europe in EU (Frankfurt), EU (Ireland), and EU (London).

Related to the decoupling of compute and storage, AWS announced AWS AQUA. The Advanced Query Accelerator promises 10 times better query performance. AQUA sits on top of S3 and is Redshift compatible. For this feature we have to wait until mid-2020 to get hands-on.

While AWS Redshift is the world’s most popular data warehouse, it is not practical to load all kinds of data there. Sometimes data lakes are more suitable places for data, especially for unstructured data. Amazon S3 is the most popular choice for cloud data lakes. The new Redshift features help to tie structured and unstructured data together to enable even better and more comprehensive insight.

With the Federated Query feature (preview), it is now possible to query data in Amazon RDS PostgreSQL and Amazon Aurora PostgreSQL from a Redshift cluster. The queried data can then be combined with data in the Redshift cluster and in Amazon S3. Federated queries allow data ingestion into Redshift without any other ETL tool, by extracting data from the above-mentioned data sources, transforming it on the fly and loading it into Redshift. Data can also be unloaded from Redshift to S3 in Apache Parquet format using the Data Lake Export feature. With this feature you are able to build some nice lifecycle features into your design.
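As a rough sketch of the Data Lake Export side (connection details, table names and the IAM role ARN are placeholders), an UNLOAD to Parquet could be run from Python for example with psycopg2:

import psycopg2

# Placeholder connection details
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="awsuser", password="***",
)
conn.autocommit = True

unload_sql = """
    UNLOAD ('SELECT * FROM sales WHERE sale_date < ''2019-01-01''')
    TO 's3://my-data-lake/sales-archive/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS PARQUET;
"""
with conn.cursor() as cur:
    cur.execute(unload_sql)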

“One should use the best tool for the job”, reminded Andy Jassy at the keynote. With the long-awaited decoupling of storage and compute and big improvements to the core, Redshift took a huge leap forward. It is extremely interesting to start designing new solutions with these features.


Creating an HTTP proxy server in the cloud – A hobby project

Introducing an approach to reroute web traffic through a virtual machine in the cloud. This was a personal competence development project on which Solita allows its employees to spend some working hours.

Why did I want to create a proxy server in the cloud?

While I was on a business trip in Sweden I had some lazy time to watch a documentary from a video streaming service. Unfortunately the web service was available only in Finland, so I had to come up with other ways to spend my evening. The obvious choice was to try whether it would be theoretically possible to hack my location to watch such programs abroad.

To stay on the brighter side of the law, I decided to only validate the approach at a conceptual level. I never tried the hack on the actual service, and neither should you.

And yes, there are tons of software products that make this easier. Rather, my goal was to learn new skills and sharpen my developer competence. The focus was more on functionality than on cyber security.

Disclaimer: use this article only for legally valid business purposes, such as rerouting traffic in your own web service. Always read the rules of the web services that you are using.

Choosing the cloud provider for the proxy server

I needed a virtual server located physically in Finland. Initially I planned to use Google Cloud virtual machines, as Google has a data center on Finnish soil. Microsoft Azure and Amazon Web Services do not have data centers in Finland.

Well, the first attempt failed quickly, because Google Cloud did not assign the IP address of the proxy server to Finland. So I switched the focus to creating a general purpose HTTP proxy server on AWS to fake the location for web services.

Next come the instructions to replicate my approach. The examples are primarily for Windows users.

Creating a virtual server in AWS for the HTTP proxy

Create an AWS account if you don’t have one. Log in to the AWS console from the browser.

From Services select EC2. Select the preferred region from the top right corner. I usually choose Ireland because it has one of the most comprehensive service selections in Europe. With this choice the traffic will be rerouted through Ireland.

Click Launch instance.

Select Ubuntu 18.04 LTS as the image for the EC2 instance.

Click next until you are prompted to create an SSH key. Name the key as you wish, download it and launch the instance.

Go back to the EC2 instance view and note the IP address. In my case it was 18.203.111.131. It is safe to publish this info here, as the virtual machine has already been destroyed.

Connect to AWS EC2 instance and create a tunnel

You need to have PuTTY Key Generator and PuTTY installed. The key file was downloaded in pem format from AWS. Convert the pem file to ppk using PuTTY Key Generator: load the pem file and click Save private key.

Normally you would never want to show the private key to anyone. The key and the EC2 instance for this tutorial have already been destroyed.

Go to PuTTY and give the username and IP address of the remote machine. For an AWS EC2 Ubuntu instance the default user is ubuntu.

Go to Connection > SSH > Auth and browse to the ppk file that you just saved.

Create a tunnel that routes all traffic from your local machine’s port 8080 to port 3128 of the remote EC2 Ubuntu instance. 3128 is the default port of the squid proxy tool on Linux, which we will install soon.

Click Open from the bottom of PuTTY. The terminal window appears.

Install squid in the virtual machine to make it an http proxy server

Install squid on the remote Ubuntu machine.

sudo apt update
sudo apt install squid

Find the line in the squid configuration file where http access has been denied by default.

grep -n 'http_access deny all' /etc/squid/squid.conf

Open the squid configuration file for editing.

sudo nano /etc/squid/squid.conf

Simultaneously press CTRL and _ (underscore) to enter the line number and jump to the correct line.

Change the value to:

http_access allow localhost

Press CTRL and x to exit, choose y to save the changes, and hit Enter to overwrite the file.

Finally restart the squid service.

sudo service squid restart

The proxy server is now successfully configured.

Configure Google Chrome browser for http proxy

Go to the Google Chrome settings on your laptop. Find the settings by searching with the keyword proxy.

Click Open proxy settings. The browser opens the Windows Internet Properties dialog. Click LAN settings.

Route all traffic through your local machine’s (127.0.0.1) port 8080. That port, in turn, is tunneled to port 3128 of the AWS EC2 instance, where the squid proxy server is running on top of the Ubuntu operating system.

Click Ok to see the magic happening.

Checking if the http proxy server works in the browser

I went to a page which detects the client’s IP address and shows its geographical location. The web service thinks I’m in Ireland, where the AWS data center is located.

Once I switch off the browser proxy in Chrome/Windows, my IP address points to my actual location. At the moment of writing I was in Gothenburg, Sweden.


PySpark execution logic and code optimization

PySpark looks like regular Python code. In reality the distributed nature of the execution requires a whole new way of thinking to optimize the PySpark code.

This article focuses on understanding PySpark execution logic and performance optimization. PySpark DataFrames play an important role in this.

To try out PySpark in practice, get your hands dirty with this tutorial: Spark and Python tutorial for data developers in AWS.

DataFrames in pandas as a PySpark prerequisite

PySpark needs a totally different kind of engineering compared to regular Python code.

If you are going to work with PySpark DataFrames it is likely that you are familiar with the pandas Python library and its DataFrame class.

Here comes the first source of potential confusion: despite their similar names, PySpark DataFrames and pandas DataFrames behave very differently. It is also easy to confuse them in your code. You might want to use a suffix like _pDF for pandas DataFrames and _sDF for Spark DataFrames.

The pandas DataFrame object stores all the data represented by the data frame within the memory space of the Python interpreter. All of the data is easily and immediately accessible. The operations on the data are executed immediately when the code is executed, line by line. It is easy to print intermediate results to debug the code.

However, these advantages are offset by the fact that you are limited by the local computer’s memory and processing power constraints – you can only handle data which fits into the local memory. But since the operations are done in memory, a basic data processing task rarely takes more than a few minutes.

PySpark DataFrames and their execution logic

The PySpark DataFrame object is an interface to Spark’s DataFrame API and represents a Spark DataFrame within a Spark application. The data in the DataFrame is very likely to be somewhere other than the computer running the Python interpreter – e.g. on a remote Spark cluster running in the cloud.

There are two distinct kinds of operations on Spark DataFrames: transformations and actions. Transformations describe operations on the data, e.g. filtering a column by value, joining two DataFrames by key columns, or sorting data. Actions are operations which take DataFrame(s) as input and output something else. Some examples of actions would be showing the contents of a DataFrame or writing a DataFrame to a file system.

The key point in understanding how Spark works is that transformations are lazy. Executing a Python command which describes a transformation of a PySpark DataFrame into another does not actually require any calculations to take place. Ordering by a column or calculating aggregate values that return another PySpark DataFrame are such transformations. Rather, the operation is added to the graph describing what Spark should eventually do.

When an action is requested – e.g. return the contents of this Spark DataFrame as a pandas DataFrame – Spark looks at the processing graph and then optimizes the tasks which need to be done. This is the job of the Catalyst optimizer, and it enables Spark to optimize the operations to a very high degree.
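As a small illustration (assuming an existing SparkSession named spark and a CSV file in S3; the names are made up), the first lines below only build the plan, and nothing is computed until the action on the last line:

df = spark.read.csv("s3://my-bucket/events.csv", header=True, inferSchema=True)
grouped = df.groupBy("country").count()          # transformation, lazy
top = grouped.orderBy("count", ascending=False)  # transformation, lazy

top.show(10)                                     # action, triggers the whole computation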

Also, the actual computation tasks run on the Spark cluster, meaning that you can have huge amounts of memory and processing cores available for the actual computation, even without resorting to the top-of-the-line virtual machines offered by cloud providers.

Consider caching to speed up PySpark

However, the highly optimized and parallelized execution comes at a cost: it is not as easy to see what is going on at each step. Looking at the data after some transformations means that you have to gather the data, or its subset, to a single computer. This is an action, so Spark has to determine the computation graph, optimize it, and execute it.

If your dataset is large, this may take quite some time. This is especially true if caching is not enabled and Spark has to start by reading the input data from a remote source – such as a database cluster or cloud object storage like S3.

You can alleviate this by caching the DataFrame at some suitable point. Caching causes the DataFrame partitions to be retained on the executors and not be removed from memory or disk unless there is a pressing need. In practice this means that the cached version of the DataFrame is available quickly for further calculations. However, playing around with the data is still not as easy or quick as with pandas DataFrames.
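As a minimal sketch (assuming a DataFrame df with amount and date columns), caching an intermediate result that several actions reuse looks like this:

cleaned = df.filter(df.amount > 0).withColumn("year", df.date.substr(1, 4))
cleaned.cache()                                 # keep the partitions on the executors

cleaned.count()                                 # first action materialises and caches the data
cleaned.groupBy("year").sum("amount").show()    # reuses the cached partitions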

Use small scripts and multiple environments in PySpark

As a rule of thumb, one PySpark script should perform just one well defined task. This is due to the fact that any action triggers the transformation plan execution from the beginning. Managing and debugging becomes a pain if the code has lots of actions.

The normal flow is to read the data, transform the data and write the data. Often the write stage is the only place where you need to execute an action. Instead of debugging in the middle of the code, you can review the output of the whole PySpark job.

With large amounts of data this approach would be slow. You would have to wait a long time to see the results after each job.

My suggestion is to create environments that have different sizes of data. In the environment with little data you test the business logic and syntax. The test cycle is rapid as there’s no need to process gigabytes of data. Running the PySpark script with the full dataset then reveals any performance problems.

This goes well together with the traditional dev, test, prod environment split.

Favor DataFrame over RDD with structured data

RDD (Resilient Distributed Dataset) can be any set of items. For example, a shopping list.

["apple", "milk", "bread"]

RDD is the low-level data representation in Spark, and in earlier versions of Spark it was also the only way to access and manipulate data. However, the DataFrame API was introduced as an abstraction on top of the RDD API. As a rule of thumb, unless you are doing something very involved (and you really know what you are doing!), stick with the DataFrame API.

DataFrame is a tabular structure: a collection of Columns, each of which has a well-defined data type. If you have a description and an amount for each item in the shopping list, then a DataFrame would do better:

+-------+-----------+------+
|product|description|amount|
+-------+-----------+------+
|apple  |green      |5     |
|milk   |skimmed    |2     |
|bread  |rye        |1     |
+-------+-----------+------+

This is also a very intuitive representation for structured data, something that could be found in a database table. PySpark DataFrames have their own methods for data manipulation, just like pandas DataFrames do.
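For illustration, the shopping list DataFrame above could be created like this (assuming an existing SparkSession named spark):

shopping_sdf = spark.createDataFrame(
    [("apple", "green", 5), ("milk", "skimmed", 2), ("bread", "rye", 1)],
    schema=["product", "description", "amount"],
)
shopping_sdf.printSchema()
shopping_sdf.show()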

Avoid User Defined Functions in PySpark

As a beginner I thought PySpark DataFrames would integrate seamlessly with Python. That’s why I chose to use UDFs (User Defined Functions) to transform the data.

A UDF is simply a Python function which has been registered to Spark using PySpark’s spark.udf.register method.

With the small sample dataset it was relatively easy to get started with UDF functions. When running the PySpark script with more data, Spark threw an OutOfMemory error.

Investigating the issue revealed that the code could not be optimized when using UDFs. To Spark’s Catalyst optimizer, a UDF is a black box. This means that Spark may have to read in all of the input data, even though the data actually used by the UDF comes from only small fragments of the input. In other words, filtering the data at the read step close to the data, i.e. predicate pushdown, cannot be used.

Additionally, there is a performance penalty: on the Spark executors, where the actual computations take place, the data has to be serialized in the Spark JVM to a format Python can read, a Python interpreter spun up, the data deserialized in the Python interpreter, the UDF executed, and the result serialized and deserialized again to the Spark JVM. All of this takes significant amounts of time!

The recommendation is to stick to native PySpark DataFrame functions whenever possible, since they are translated directly to native Scala functions running on Spark.

If you absolutely, positively need to do something with UDFs in PySpark, consider using the pandas vectorized UDFs introduced in Spark 2.3 – the UDFs are still a black box to the optimizer, but at least the performance penalty of moving data between the JVM and the Python interpreter is a lot smaller.
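As a rough comparison (assuming Spark 2.3+ with pyarrow installed and a DataFrame df with an amount column), a plain Python UDF and a vectorized pandas UDF computing the same thing could look like this:

from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def plain_vat(amount):                  # called row by row in the Python interpreter
    return float(amount) * 1.24

@pandas_udf(DoubleType())
def vectorized_vat(amount):             # receives and returns a whole pandas Series
    return amount * 1.24

df.withColumn("with_vat", vectorized_vat("amount")).show()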

Number of partitions and partition size in PySpark

In order to process data in a parallel fashion on multiple compute nodes, Spark splits the data into partitions, smaller data chunks. A DataFrame of 1,000,000 rows could be partitioned into 10 partitions having 100,000 rows each. Additionally, the computation jobs Spark runs are split into tasks, each task acting on a single data partition. A Spark cluster has a driver that distributes the tasks to multiple executors. This means that the datasets can be much larger than what fits into the memory of a single computer – as long as the partitions fit into the memory of the computers running the executors.

In one of the projects our team encountered an out-of-memory error that we spent a long time figuring out. Finally we found out that the problem was a result of too large partitions. The data in a partition simply could not fit into the memory of a single executor node.

Too few partitions also make the execution inefficient. If there are not as many partitions as there are available cores (or, technically, available slots), some of the executor cores idle while others are working at full steam.

However, having a large number of small partitions is not optimal either – shuffling the data in small partitions is inefficient. Also, reading and writing to disk (not to mention a network destination) in small chunks potentially increases the total execution time.

The Spark programming guide recommends a 128 MB partition size as the default. For 128 GB of data this would mean 1000 partitions. Without going too deep into the details, consider partitioning a crucial part of the optimization toolbox. If your partitions are too large or too small, you can use the coalesce() and repartition() methods of the DataFrame to instruct Spark to modify the partition distribution. The number of partitions in a DataFrame sDF can be checked with sDF.rdd.getNumPartitions().
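As a minimal sketch (the DataFrame and output path are made up):

print(sales_sdf.rdd.getNumPartitions())          # current number of partitions

# Increase parallelism with a full shuffle ...
sales_sdf = sales_sdf.repartition(200)

# ... or reduce the number of partitions without a full shuffle, e.g. before writing
sales_sdf.coalesce(10).write.parquet("s3://my-bucket/sales-aggregates/")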

Summary – PySpark basics and optimization

PySpark offers a versatile interface for using powerful Spark clusters, but it requires a completely different way of thinking and an awareness of the differences between the local and distributed execution models. The functionality offered by the core PySpark interface can be extended by creating User-Defined Functions (UDFs), but as a tradeoff their performance is not as good as that of native PySpark functions due to a lesser degree of optimization. Partitioning the data correctly and with a reasonable partition size is crucial for efficient execution – and as always, good planning is the key to success.


AWS Glue tutorial with Spark and Python for data developers

This AWS Glue tutorial is a hands-on introduction to creating a data transformation script with Spark and Python. Basic Glue concepts such as database, table, crawler and job will be introduced.

In this tutorial you will create an AWS Glue job using Python and Spark. You can read the previous article for a high level Glue introduction.

In the context of this tutorial Glue could be defined as “A managed service to run Spark scripts”.

In some parts of the tutorial I refer to this GitHub code repository.

Create a data source for AWS Glue

Glue can read data either from a database or an S3 bucket. For this tutorial I created an S3 bucket called glue-blog-tutorial-bucket. You have to come up with another name in your own AWS account.

Create two folders in the S3 console called read and write.

The S3 bucket has two folders. In AWS a folder is actually just a prefix for the file name.

 

Upload this movie dataset to the read folder of the S3 bucket.

The data for this Python and Spark tutorial in Glue contains just 10 rows of data. Source: IMDB.

Crawl the data source to the data catalog

Glue has a concept of a crawler. A crawler sniffs metadata from the data source such as file format, column names, column data types and row count. The metadata makes it easy for others to find the needed datasets. The Glue catalog enables easy access to the data sources from the data transformation scripts.

The crawler will catalog all files in the specified S3 bucket and prefix. All the files should have the same schema.

In Glue crawler terminology the file format is known as a classifier. The crawler identifies the most common classifiers automatically, including CSV, JSON and Parquet. It is also possible to create custom classifiers where the schema is defined in grok patterns, which are close relatives of regular expressions.

Our sample file is in the CSV format and will be recognized automatically.

Instructions to create a Glue crawler:

  1. In the left panel of the Glue management console click Crawlers.
  2. Click the blue Add crawler button.
  3. Give the crawler a name such as glue-blog-tutorial-crawler.
  4. In Add a data store menu choose S3 and select the bucket you created. Drill down to select the read folder.
  5. In Choose an IAM role create a new one. Name the role, for example, glue-blog-tutorial-iam-role.
  6. In Configure the crawler’s output add a database called glue-blog-tutorial-db.

 

Summary of the AWS Glue crawler configuration.

 

When you are back in the list of all crawlers, tick the crawler that you created. Click Run crawler.

Note: If your CSV data needs to be quoted, read this.

The crawled metadata in Glue tables

Once the data has been crawled, the crawler creates a metadata table from it. You find the results in the Tables section of the Glue console. The database that you created during the crawler setup is just an arbitrary way of grouping the tables.

Metadata for the Glue table. You can see properties as well as column names and data types from this view.

 

Glue tables don’t contain the data, only the instructions for how to access the data.

Note: For large CSV datasets the row count seems to be just an estimation.

AWS Glue jobs for data transformations

From the Glue console left panel go to Jobs and click the blue Add job button.

Follow these instructions to create the Glue job:

  1. Name the job as glue-blog-tutorial-job.
  2. Choose the same IAM role that you created for the crawler. It can read and write to the S3 bucket.
  3. Type: Spark.
  4. Glue version: Spark 2.4, Python 3.
  5. This job runs: A new script to be authored by you.
  6. Security configuration, script libraries, and job parameters:
    1. Maximum capacity: 2. This is the minimum and costs about $0.15 per run.
    2. Job timeout: 10. Prevents the job from running longer than expected.
  7. Click Next and then Save job and edit the script.

Editing the Glue script to transform the data with Python and Spark

Copy this code from Github to the Glue script editor.

Remember to change the bucket name for the s3_write_path variable.

Save the code in the editor and click Run job.

The Glue editor to modify the Python-flavored Spark code.

 

The detailed explanations are commented in the code. Here is the high-level description (a rough sketch of the same steps follows the list below):

  1. Read the movie data from S3
  2. Get movie count and rating average for each decade
  3. Write aggregated data back to S3
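The actual script is in the GitHub repository mentioned above. As a rough, simplified sketch of the same steps (the table name and the year and rating column names are assumptions about the crawled movie dataset), a Glue job of this shape could look like the following:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import avg, col, count, floor

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 1. Read the movie data from the Glue catalog created by the crawler
movies = glue_context.create_dynamic_frame.from_catalog(
    database="glue-blog-tutorial-db", table_name="read").toDF()

# 2. Movie count and rating average for each decade
per_decade = (movies
    .withColumn("decade", floor(col("year") / 10) * 10)
    .groupBy("decade")
    .agg(count("*").alias("movie_count"), avg("rating").alias("rating_avg")))

# 3. Write the aggregated data back to S3 (remember to change the bucket name)
s3_write_path = "s3://glue-blog-tutorial-bucket/write"
per_decade.coalesce(1).write.mode("overwrite").csv(s3_write_path, header=True)

job.commit()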

The execution time with 2 Data Processing Units (DPU) was around 40 seconds. The relatively long duration is explained by the start-up overhead.

The data transformation script creates summarized movie data. For example, the 1990s had 4 movies in the IMDB top 10 with an average score of 8.95.

 

You can download the result file from the write folder of your S3 bucket. Another way to investigate the job would be to take a look at the CloudWatch logs.

The data is stored back to S3 as a CSV in the “write” prefix. The number of partitions equals the number of output files.

Speeding up Spark development with Glue dev endpoint

Developing Glue transformation scripts is slow if you just run one job after another. Provisioning the computation cluster takes minutes and you don’t want to wait after each change.

Glue has a dev endpoint functionality where you launch a temporary environment that is constantly available. For development and testing it’s both faster and cheaper.

The dev endpoint provides the processing power, but a notebook server is needed to write your code. The easiest way to get started is to create a new SageMaker notebook by clicking Notebooks under the Dev endpoints section of the left panel.

About Glue performance

In the code example we first read the data into Glue’s DynamicFrame and then converted it to a native PySpark DataFrame. This method makes it possible to take advantage of the Glue catalog but at the same time use native PySpark functions.

However, our team has noticed Glue performance to be extremely poor when converting from a DynamicFrame to a DataFrame. This applies especially when you have one large file instead of multiple smaller ones. If the execution time and data reading become the bottleneck, consider using the native PySpark read function to fetch the data from S3.
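As a minimal sketch (assuming a GlueContext named glue_context as in the sketch above, and this tutorial’s bucket name), the plain Spark reader can be used like this:

spark = glue_context.spark_session
movies = spark.read.csv(
    "s3://glue-blog-tutorial-bucket/read/", header=True, inferSchema=True)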

Summary about the Glue tutorial with Python and Spark

Getting started with Glue jobs can take some time with all the menus and options. Hopefully this tutorial gave you some idea of the roles of the database, table, job and crawler.

The focus of this tutorial was on a single script, but Glue also provides tools to manage larger groups of jobs. You can schedule jobs with triggers or orchestrate relationships between triggers, jobs and crawlers with workflows.

Learning the Glue console is one thing, but the actual logic lies in the Spark scripts. Tuning the code significantly impacts the execution performance. That will be the topic of the next blog post.


Introduction to AWS Glue for big data ETL

AWS Glue works well for big data processing. This is a brief introduction to Glue including use cases, pricing and a detailed example.

AWS Glue is a serverless ETL tool in the cloud. In brief, ETL means extracting data from a source system, transforming it for analysis and other applications, and then loading it back to, for example, a data warehouse.

In this blog post I will introduce the basic idea behind AWS Glue and present potential use cases.

The emphasis is on big data processing. You can read more about Glue catalogs here and data catalogs in general here.

Why to use AWS Glue?

Replacing Hadoop. Hadoop can be expensive and a pain to configure. AWS Glue is simple. Some say that Glue is expensive, but it depends on what you compare it to. Because of on-demand pricing you only pay for what you use. This can make AWS Glue significantly cheaper than a fixed-size on-premise Hadoop cluster.

AWS Lambda cannot be used. A wise man said: use Lambda functions in AWS whenever possible. Lambdas are simple, scalable and cost efficient. They can also be triggered by events. For big data, Lambda functions are not suitable because of the 3 GB memory limitation and the 15-minute timeout. AWS Glue is specifically built to process large datasets.

Apply DataOps practices. Drag and drop ETL tools are easy for users, but from the DataOps perspective code-based development is a superior approach. With AWS Glue both code and configuration can be stored in version control. The data development becomes similar to any other software development. For example, the data transformation scripts written in Scala or Python are not limited to the AWS cloud. Environment setup is easy to automate and parameterize when the code is scripted.

An example use case for AWS Glue

Now for a practical example of how AWS Glue works in practice.

A production machine in a factory produces multiple data files daily. Each file is 10 GB in size. The server in the factory pushes the files to AWS S3 once a day.

The factory data is needed to predict machine breakdowns. For that, the raw data should be pre-processed for the data science team.

Lambda is not an option for the pre-processing because of the memory and timeout limitations. Glue seems to be a reasonable option when work hours and costs are compared to alternative tools.

The simplest way to get started with the ETL process is to create a new Glue job and write the code in the editor. The script can be written in either the Scala or Python programming language.

Extract. The script first reads all the files from the specified S3 bucket into a single data frame. You can think of a data frame as a table in Excel. The reading can be just a one-liner.

Transform. This is where most of the code goes. Let’s say that the original data had 100 records per second. The data science team wants the data to be aggregated per minute with a specific logic. This could be just tens of lines of code if the logic is simple.

Load. Write the data back to another S3 bucket for the data science team. It’s possible that a single line of code will do.

The code runs on top of the Spark framework, which is configured automatically in Glue. Thanks to Spark, the data will be divided into small chunks and processed in parallel on multiple machines simultaneously.
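As a rough sketch of the three steps (the bucket names and the timestamp and sensor_value columns are made up for illustration, and the aggregation logic is simplified):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, window

spark = SparkSession.builder.appName("factory-preprocessing").getOrCreate()

# Extract: read all daily files from the raw bucket into one data frame
raw = spark.read.csv("s3://factory-raw-data/", header=True, inferSchema=True)

# Transform: aggregate the ~100 records per second into 1-minute averages
# (assumes the timestamp column is parsed as a timestamp type)
per_minute = (raw
    .groupBy(window("timestamp", "1 minute"))
    .agg(avg("sensor_value").alias("sensor_value_avg")))

# Load: write the result for the data science team
per_minute.write.mode("overwrite").parquet("s3://factory-preprocessed-data/")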

What makes AWS Glue serverless?

Serverless means you don’t have machines to configure. AWS provisions and allocates the resources automatically.

The processing power is adjusted by the number of data processing units (DPU). You can do additional configuration, but it’s likely that a proof of concept works out of the box.

In an on-premise environment you would have to make a decision about the computation cluster size. A big cluster is expensive but fast. A small cluster would be cheaper but slow to run.

With AWS Glue your bill is the result of the following equation:

[ETL job price] = [Processing time] * [Number of DPUs] * [Price per DPU-hour]

 

The on-demand pricing means that an increase in processing power does not necessarily increase the price of the ETL job: doubling the DPUs should roughly halve the processing time. At least in theory, as too many DPUs might cause overhead in processing time.

When is AWS Glue a wrong choice?

This is not an advertisement, so let’s give some critique for Glue as well.

Lots of small ETL jobs. Glue has a minimum billing of 10 minutes and 2 DPUs. With a price of $0.44 per DPU-hour, the minimum cost for a run becomes around $0.15. The starting price makes Glue an unappealing alternative for processing small amounts of data often.

Specific networking requirements. If you spin up a standard EC2 virtual machine, an IP address is attached to it. The serverless nature of Glue means you have to put more effort into network planning in some cases. One such scenario would be whitelisting a Glue job in a firewall to extract data from an external system.

Summary about AWS Glue

The most common argument against Glue is “It’s expensive”. True, in the sense that even the first few test runs can cost a few dollars. In a nutshell, Glue is cost efficient for infrequent big data workloads.

In the big picture AWS Glue saves a lot of time and unnecessary hardware engineering. The costs should be compared against alternative options such as an on-premise Hadoop cluster or the development hours required for a custom solution.

As the Glue pricing model is predictable, the business cases are straightforward to calculate. It might be enough to test just the critical parts of the ETL pipeline to become confident about the performance and costs.

I feel that optimizing the code for distributed computing has been more of a challenge than the Glue service itself. The next blog post will focus on how data developers get started with Glue using Python and Spark.

Building machine learning models with AWS SageMaker

A small group of Solita employees visited the AWS London office last November and participated in a workshop. There we got to know the AWS service called SageMaker. SageMaker turned out to be easy to learn and use, and in this blog post I'm going to tell you more about it and demonstrate how it works with short code snippets.

AWS SageMaker

SageMaker is an Amazon service that was designed to build, train and deploy machine learning models easily. For each step there are tools and functions that make the development process faster. All the work can be done in Jupyter Notebook, which has pre-installed packages and libraries such as TensorFlow and pandas. One can easily access data in their S3 buckets from SageMaker notebooks, too. SageMaker provides multiple example notebooks, so getting started is very easy. I will introduce the different parts of SageMaker in this blog post, and the picture below summarises how they work together with different AWS services.

Picture of how SageMaker interacts with other AWS services during the build, train and deploy phases

Dataset

In the example snippets I use the Sign Language MNIST dataset, which contains labeled pictures of sign language alphabets. They are 28×28 grey-scale pictures, which means each pixel is represented as an integer value between 0-255. The training data contains 27,455 pictures and the test data 7,127 pictures, and they’re stored in S3.

For importing and exploring the dataset I simply use the pandas library. Pandas is able to read data from an S3 bucket:

import pandas as pd

bucket = ''
file_name = 'data-file.csv'

data_location = 's3://{}/{}'.format(bucket, file_name)

df = pd.read_csv(data_location)

From the dataset I can see that its first column is the label for the picture, and the remaining 784 columns are the pixels. By reshaping the first row I can get the first image:

from matplotlib import pyplot as plt
pic=df.head(1).values[0][1:].reshape((28,28))

plt.imshow(pic, cmap='gray')

plt.show()

Image with alphabet d in sign language

Build

The build phase in AWS SageMaker means exploring and cleaning the data. Keeping it in CSV format would require some changes to the data if we’d like to use SageMaker’s built-in algorithms. Instead, we’ll convert the data into RecordIO protobuf format, which makes the built-in algorithms more efficient and simpler to train the model with. This can be done with the following code, and it should be done for both the training and the test data:

from sagemaker.amazon.common import write_numpy_to_dense_tensor
import boto3
import io

def convert_and_upload(pixs, labels, bucket_name, data_file):
	buf = io.BytesIO()
	write_numpy_to_dense_tensor(buf, pixs, labels)
	buf.seek(0)

	boto3.resource('s3').Bucket(bucket_name).Object(data_file).upload_fileobj(buf)

pixels_train=df.drop('label', axis=1).values
labels_train=df['label'].values

convert_and_upload(pixels_train, labels_train, bucket, 'sign_mnist_train_rec')

Of course, in this case the data is very clean already, and usually a lot more work is needed to explore and clean it properly before it can be used to train a model. The data can also be uploaded back to S3 after the cleaning phase, for example if cleaning and training are kept in separate notebooks. Unfortunately, SageMaker doesn’t provide tools for exploring and cleaning data, but pandas is very useful for that.

Train

Now that the data is cleaned, we can either use SageMaker’s built-in algorithms or our own, provided by for example sklearn. When using something other than SageMaker’s built-in algorithms you have to provide a Docker container for the training and validation tasks. More information about this can be found in the SageMaker documentation. In this case, as we want to recognise alphabets from the pictures, we use the k-Nearest Neighbors algorithm, which is a simple and fast algorithm for classification tasks. It is one of the built-in algorithms in SageMaker and can be used with very few lines of code:

import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

knn=sagemaker.estimator.Estimator(get_image_uri(
	boto3.Session().region_name, "knn"),
	get_execution_role(),
	train_instance_count=1,
	train_instance_type='ml.m4.xlarge',
	output_path='s3://{}/output'.format(bucket),
	sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(**{
	'k': 10,
	'predictor_type': 'classifier',
	'feature_dim': 784,
	'sample_size': 27455
})

in_config_test = sagemaker.s3_input(
	   s3_data='s3://{}/{}'.format(bucket,'sign_mnist_test_rec'))

in_config_train = sagemaker.s3_input(
	   s3_data='s3://{}/{}'.format(bucket,'sign_mnist_train_rec'))

knn.fit({'train':in_config_train, 'test': in_config_test})

So let’s get into what happens there. Estimator is an interface for creating training tasks in SageMaker. We simply tell it which algorithm we want to use, how many ML instances we want for training, which type of instances they should be and where the trained model should be stored.

Next we define the hyperparameters for the algorithm, in this case the k-Nearest Neighbors classifier. Instead of the classifier we could have a regressor for some other type of machine learning task. The four parameters shown in the snippet are mandatory, and the training job will fail without them. By tuning the hyperparameters the accuracy of the model can be improved. SageMaker also provides automated hyperparameter tuning, but we won’t be using it in this example.

Finally we need to define the path to the training data. We do it by using channels, which are just named input sources for training algorithms. In this case, as our data is in S3, we use the s3_input class. Only the train channel is required, but if a test channel is given too, the training job also measures the accuracy of the resulting model. In this case I provided both.

For the kNN algorithm the only allowed data types are the RecordIO protobuf and CSV formats. If we were to use the CSV format, we would need to define it in the configuration by defining the named parameter content_type and assigning ‘text/csv;label_size=0’ as the value. As we use the RecordIO protobuf type, only the s3_data parameter is mandatory. There are also optional parameters, for example for shuffling the data and for defining whether the whole dataset should be replicated in every instance. When the fit function is called, SageMaker creates a new training job and logs the training process and duration into the notebook. Past training jobs with their details can be found by selecting Training jobs in the SageMaker side panel. There you can find the given training/test data locations and information about model accuracy, as well as the logs of the training job.

Deploy

The last step on our way to getting predictions from the trained model is to set up an endpoint for it. This means that we automatically set up an endpoint for real-time predictions and deploy the trained model to it. This will create a new EC2 instance which takes data as input and provides a prediction as a response. The following code is all that is needed for creating an endpoint and deploying the model to it:

import time

def get_predictor(knn_estimator, estimator_name, instance_type, endpoint_name=None): 
    knn_predictor = knn_estimator.deploy(initial_instance_count=1, instance_type=instance_type,
                                        endpoint_name=endpoint_name)
    knn_predictor.content_type = 'text/csv'
    return knn_predictor


instance_type = 'ml.m5.xlarge'
model_name = 'knn_%s'% instance_type
endpoint_name = 'knn-ml-%s'% (str(time.time()).replace('.','-'))
predictor = get_predictor(knn, model_name, instance_type, endpoint_name=endpoint_name)

and it can be called for example in the following way:

file = open("path_to_test_file.csv","rb")

predictor.predict(file)

which would return the following response:

b'{"predictions": [{"predicted_label": 6.0}, {"predicted_label": 3.0}, {"predicted_label": 21.0}, {"predicted_label": 0.0}, {"predicted_label": 3.0}]}'

In that case we got five predictions, because the input file contains five pictures. In a real-life case we could use API Gateway and Lambda functions to provide an interface for real-time predictions. The Lambda function can use the boto3 library to connect to the created endpoint and fetch a prediction. In API Gateway we can set up an API that calls the Lambda function once it gets a POST request and returns the prediction in the response.
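A minimal sketch of such a Lambda handler could look like this (the endpoint name is a placeholder, and the event format assumes an API Gateway proxy integration):

import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # The CSV payload is expected in the request body (API Gateway proxy integration)
    response = runtime.invoke_endpoint(
        EndpointName="knn-ml-endpoint",          # placeholder endpoint name
        ContentType="text/csv",
        Body=event["body"],
    )
    return {"statusCode": 200, "body": response["Body"].read().decode("utf-8")}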

Conclusions

AWS SageMaker is a very promising service that allows reading data, training a model and deploying an endpoint with less than a hundred lines of code. It provides many good functions for training but also allows using Docker for custom training jobs. Jupyter Notebook is a familiar tool to data scientists, so it’s very nice that it is used in SageMaker. SageMaker also integrates very easily with other AWS services, and allocating resources for training and endpoints is very easy. The machine learning algorithms are optimised for AWS, so their performance is very high.

The amount of code needed for training a model is not the biggest challenge in a data scientist’s everyday job, though. There are already very good libraries for that purpose, and one of the most time-consuming parts is usually cleaning and altering the data so that it can be used for training. For that, SageMaker doesn’t provide help.

All in all, optimised algorithms, automated hyperparameter tuning, easy integration and interaction with other AWS services save a lot of time and trouble for data scientists. Trying out SageMaker is definitely worthwhile.