AWS SageMaker Pipelines – Making MLOps easier for the Data Scientist

SageMaker Pipelines is a machine learning pipeline creation SDK designed to make deploying machine learning models to production fast and easy. I recently got to use the service in an edge ML project, and here are my thoughts about its pros and cons. (For more about the project, refer to the Solita data blog series about IIoT and connected factories: https://data.solita.fi/factory-floor-and-edge-computing/)

Example pipeline

Why do we need MLOps?

First there were statistics; then came the emperor’s new clothes – machine learning, a rebranding of old methods accompanied by new ones. Fast forward to today and we talk about this thing called “AI” all the time. The hype is real, and it’s palpable because of products like Siri and Amazon Alexa.

But from a Data Scientist’s point of view, what does it take to develop such a model? Or even a simpler one, say a binary classifier? The amount of work is quite large, and that is only the tip of the iceberg. How much more work is needed to put that model into a continuous development and delivery cycle?

For a Data Scientist, it can be hard to visualize what kind of systems you need to automate everything your model needs to perform its task: data ETL, feature engineering, model training, inference, hyperparameter optimization, performance monitoring and so on. Sounds like a lot to automate?

(Hidden technical debt in machine learning https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf)

 

This is where MLOps comes into the picture, bridging DevOps CI/CD practices to the data science world and bringing in some new aspects as well. You can find more information about MLOps in previous Solita content, such as https://www.solita.fi/en/events/webinar-what-is-mlops-and-how-to-benefit-from-it/

Building an MLOps infrastructure is one thing, but learning to use it fluently is a task of its own. For a Data Scientist at the beginning of their career, it can seem like too much to learn how to use cloud infrastructure while also learning to develop Python code that is production ready. A Jupyter notebook outputting predictions to a CSV file simply isn’t enough at this stage of the machine learning revolution.

(The “first” standard on MLOps, Uber Michelangelo Platform https://eng.uber.com/michelangelo-machine-learning-platform/)

 


Usually, companies that have a long track record of Data Science projects have a few DevOps and Data Engineer/Machine Learning Engineer roles working closely with their Data Science teams to distribute the different tasks of production machine learning deployment. Maybe they have even built the tooling and infrastructure needed to deploy models into production more easily. But there are still quite a few Data Science teams and data-driven companies figuring out how to do this MLOps thing.

Why should you try SageMaker Pipelines?

AWS is the biggest cloud provider at the moment, so it has all the tooling imaginable that you’d need to build a system like this. They are also heavily invested in Data Science with their SageMaker product, and new features are popping up constantly. The problem so far has been that there are perhaps too many different ways of building such a system.

AWS tries to tackle some of the technical debt involved in production machine learning with their SageMaker Pipelines product. I’ve recently been involved in a project building and deploying an MLOps pipeline for edge devices using SageMaker Pipelines, and I’ll try to provide some insight into why it is good and what it lacks compared to a completely custom-built MLOps pipeline.

The SageMaker Pipelines approach is an ambitious one. What if, instead of having to learn to use this complex cloud infrastructure, Data Scientists could deploy to production just by learning how to use a single Python SDK (https://github.com/aws/sagemaker-python-sdk)? You don’t even need the AWS cloud to get started: it also runs locally (up to a point).

SageMaker Pipelines aims at making MLOps easy for Data Scientists. You can define your whole MLOps pipeline in, for example, a Jupyter Notebook and automate the whole process. There are a lot of prebuilt containers for data engineering, model training and model monitoring that have been custom-built for AWS. If these are not enough, you can use your own containers, enabling you to do anything that is not supported out of the box. There are also a couple of very niche features, like out-of-network training, where your model is trained on an instance that has no access to the internet, mitigating the risk of somebody on the outside trying to influence your model training with, for example, altered training data.
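To give an idea of what this looks like in practice, here is a minimal sketch of a pipeline with one processing step and one training step, written against the SageMaker Python SDK. The bucket names, script name and container image URI are made up for illustration:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

role = sagemaker.get_execution_role()

# Pipeline parameters can be overridden per execution, e.g. to point at new data.
input_data = ParameterString(name="InputData", default_value="s3://my-bucket/raw/")

# Feature engineering on a prebuilt scikit-learn container.
processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
process_step = ProcessingStep(
    name="PreprocessData",
    processor=processor,
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
    code="preprocess.py",  # hypothetical feature engineering script
)

# Training on a prebuilt or custom container; network isolation is the
# out-of-network training feature mentioned above.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-bucket/models/",
    enable_network_isolation=True,
)
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=process_step.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri
        )
    },
)

pipeline = Pipeline(
    name="MyMLOpsPipeline", parameters=[input_data], steps=[process_step, train_step]
)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # kick off an execution
```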

You can version your models via the model registry. If you have multiple use cases for the same model architecture, with the differences being in the datasets used for training, it’s easy to select the suitable version from the SageMaker UI or the Python SDK and refactor the pipeline to suit your needs. With this approach, the aim is that each MLOps pipeline has a lot of components that are reusable in the next project. This enables faster development cycles, and the time to production is reduced.
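Continuing the sketch above, registering the trained model as a new version in the registry could look roughly like this (the model package group name is hypothetical):

```python
from sagemaker.workflow.step_collections import RegisterModel

register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="my-model-group",  # hypothetical group name
    approval_status="PendingManualApproval",    # gate deployment behind a manual check
)
```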

SageMaker Pipelines automatically logs every step of the workflow, from training instance sizes to model hyperparameters. You can seamlessly deploy your model to a SageMaker Endpoint (a separate service), and after deployment you can also automatically monitor your model for concept drift in the data or, for example, latencies in your API. You can even deploy multiple versions of your models and do A/B testing to select which one is proving to be the best.
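A/B testing, for example, boils down to putting two production variants with traffic weights behind a single endpoint. A hedged boto3 sketch, assuming the two model versions already exist under hypothetical names:

```python
import boto3

sm = boto3.client("sagemaker")

# Two variants behind one endpoint, each getting half of the traffic.
sm.create_endpoint_config(
    EndpointConfigName="my-model-ab-config",
    ProductionVariants=[
        {
            "VariantName": "VariantA",
            "ModelName": "my-model-v1",  # hypothetical, e.g. from the registry
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.5,
        },
        {
            "VariantName": "VariantB",
            "ModelName": "my-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.5,
        },
    ],
)
sm.create_endpoint(EndpointName="my-model-ab", EndpointConfigName="my-model-ab-config")
```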

And if you want to deploy your model to the edge, be it a fleet of Raspberry Pi 4s or something else, SageMaker provides tooling for that too, and it integrates seamlessly with Pipelines.

You can recompile your models for a specific device type using SageMaker Neo compilation jobs (basically, if you’re deploying to an ARM or similar device, you need to do certain conversions for everything to work as it should) and deploy to your fleet using SageMaker fleet management.
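A compilation job can be started through boto3 as well. A rough sketch, with hypothetical names, paths and input shape, targeting a Raspberry Pi 4:

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_compilation_job(
    CompilationJobName="my-model-neo-rasp4",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    InputConfig={
        "S3Uri": "s3://my-bucket/models/model.tar.gz",
        "DataInputConfig": '{"input": [1, 3, 224, 224]}',  # model input shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "rasp4b",  # ARM target, Raspberry Pi 4
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```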

Considerations before choosing SageMaker Pipelines

By combining all of these features into a single service usable through an SDK and a UI, Amazon has managed to automate a lot of the CI/CD work needed for deploying machine learning models into production at scale with agile project development methodologies. You can also leverage all of the other SageMaker-adjacent products, for example Feature Store or Forecast, if you happen to need them. If you’re already invested in using AWS, you should give this a try.

While it is a great product for getting started with machine learning pipelines, it isn’t without its flaws. It is quite capable in batch learning settings, but there is no support as of yet for streaming/online learning tasks.

And for the so-called Citizen Data Scientist, this is not the right product, since you need to be somewhat fluent in Python. Citizen Data Scientists are better off with BI products like Tableau or Qlik (which use SageMaker Autopilot as their ML backend) or perhaps with products like DataRobot.

And in a time when software products are expected to handle high availability and high usage, the SageMaker Endpoints deployment scenario, where you have to decide in advance the number of machines serving your model, isn’t quite enough.

In e-commerce applications, you could run into situations where your API receives so much traffic that it can’t handle all the requests, because you didn’t select a big enough cluster to serve the model. The only way to increase the cluster size in SageMaker Pipelines is to redeploy a new revision on a bigger cluster. If you want to be able to keep serving your model as traffic to the API increases, it is pretty much a no-brainer to use a Kubernetes cluster with horizontal scaling instead.

Overall, it is a very nicely packaged product with a lot of good features. The problem with MLOps in AWS has been that there are too many ways of doing the same thing; SageMaker Pipelines is an effort to streamline and package all those different methodologies together for machine learning pipeline creation.

It’s a great fit if you work with batch learning models and want to create machine learning pipelines really fast. If you’re working with online learning or reinforcement learning models, you’ll need a custom solution. And if you are adamant that you need autoscaling, then you need to do the API deployments yourself; SageMaker Endpoints aren’t quite there yet. For a reference “complete” architecture, see the AWS blog: https://aws.amazon.com/blogs/machine-learning/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected/

 

Factory Floor and Edge computing

What happened last time

In the first part of this blog series I discussed the Industry 4.0 phenomenon of the Smart and Connected Factory: what benefits it brings, what IT/OT convergence is, and gave a short intro to Solita’s Connected Factory Kickstart.

This part focuses on the data on the factory floor and how AWS services can help in ingesting it from factory machinery.

Access the data and gain benefits from Edge computing

So what is the data on the factory floor? It is generated by machinery systems using many sensors and actuators. See the following picture: on the left there is the traditional ISA-95 pyramid for factory data, integrating each layer with the next. The right side represents the new thinking, where we can ingest data from every layer and take advantage of IT/OT convergence using AWS edge and cloud services.

A PLC (Programmable Logic Controller) typically has dedicated modules for inputs and dedicated modules for outputs. An input module can detect the status of input signals like switches, and an output module controls devices such as relays and motors.

Sensors are typically connected to PLCs. To access the data and use it in other systems, PLCs can be connected to an OPC-UA server, which can then provide access to the data. One traditional use case is to connect a PLC to a factory SCADA system for high-level supervision of machines and processes. OPC-UA defines a generic object model in which each object can be associated with a data type, timestamp, data quality and current value, and objects can form hierarchies. Every kind of device, function and system information can be described using this meta model.
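As a concrete illustration, reading a single value from an OPC-UA server can be done with the open-source asyncua Python library. The endpoint URL and node id below are hypothetical:

```python
import asyncio
from asyncua import Client

async def read_node():
    # Hypothetical OPC-UA server endpoint on the factory network.
    async with Client(url="opc.tcp://192.168.1.10:4840") as client:
        # Nodes are addressed by namespace and identifier; each node carries
        # a value, a timestamp and a quality/status code.
        node = client.get_node("ns=2;i=1234")
        value = await node.read_value()
        print("current value:", value)

asyncio.run(read_node())
```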

 

AWS services that ease data access at the factory

AWS Greengrass is open-source edge software that integrates with the AWS Cloud. It enables local processing, messaging, Machine Learning (ML) inference and device mesh, and offers many pre-baked software components to speed up application development.

AWS Sitewise is a cloud service for collecting and analyzing data from factory environments. It provides Greengrass-compatible edge components, for example for collecting data from an OPC-UA server and for streaming that data to AWS Sitewise. Sitewise has a built-in time-series database, data modeling capabilities, an API layer and a portal, which can be deployed and run at the edge as well (which is amazing!).

AWS Sitewise asset and data modeling is for making a virtual representation of industrial equipment or a process. The data model supports hierarchies, metrics and real-time calculations, for example for calculating OEE (Overall Equipment Effectiveness). Each asset is enforced to use a data model that validates incoming data and its schema.
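A minimal asset model can be created through boto3; here is a hedged sketch with a hypothetical measurement and a tumbling-window metric computed from it:

```python
import boto3

sitewise = boto3.client("iotsitewise")
sitewise.create_asset_model(
    assetModelName="WeldingStation",  # hypothetical equipment type
    assetModelProperties=[
        {
            # Raw telemetry streamed from the edge.
            "name": "Temperature",
            "dataType": "DOUBLE",
            "unit": "Celsius",
            "type": {"measurement": {}},
        },
        {
            # A metric recomputed every five minutes from the measurement.
            "name": "AvgTemperature",
            "dataType": "DOUBLE",
            "unit": "Celsius",
            "type": {
                "metric": {
                    "expression": "avg(temp)",
                    "variables": [
                        {"name": "temp", "value": {"propertyId": "Temperature"}}
                    ],
                    "window": {"tumbling": {"interval": "5m"}},
                }
            },
        },
    ],
)
```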

Why industrial use cases with AWS?

I prefer hands-on work to reading Gartner papers; anyhow, AWS has been named a Leader for the eleventh consecutive year and has secured the highest and furthest position on the ability-to-execute and completeness-of-vision axes in the 2021 Magic Quadrant for Cloud Infrastructure and Platform Services. It’s very nice to see how AWS is taking industrial solutions seriously and packaging them into a model that is easy to take into use when building digital services for the factory floor and the cloud.

1. AWS Sitewise – The power of data model, ingest, analyze and visualize

Sitewise packages many nice features; the ones I feel are the greatest are the data and asset modeling, near-real-time metric calculations (even at the edge), visualization and the built-in time-series database. Sitewise is nicely supported by CloudFormation, so you can automate the deployment and even build data models automatically according to your OPC-UA data model (a metadata-driven, industry-standard data model). The fact that edge processing and monitoring capabilities are available together with a portal makes Sitewise a really competitive package.

2. AWS Greengrass – Edge computing and secure cloud integration

Greengrass speeds up edge application development with public components, like the OPC-UA collector, StreamManager and the Kinesis Firehose publisher. The latest Greengrass version 2.x has evolved a lot and has plenty of great features. You can provision and run a solution on real hardware, or simulate it on an EC2 instance or in Docker, as you wish. One way to provision Greengrass devices to the AWS cloud is IoT Fleet Provisioning, where certificates for the device are created on the first connection attempt to the cloud. Applications are easy to deploy from IoT Core in the cloud to Greengrass instances at the edge. You can also run serverless AWS Lambdas at the edge, which is really superb! All in all, the complete Greengrass 2.x package will speed up development.
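As an illustration, deploying components to a thing group of Greengrass devices is a single boto3 call. The target ARN, the custom component name and the pinned versions below are made up:

```python
import boto3

gg = boto3.client("greengrassv2")
gg.create_deployment(
    # Hypothetical thing group containing the factory edge devices.
    targetArn="arn:aws:iot:eu-west-1:123456789012:thinggroup/FactoryEdge",
    deploymentName="edge-data-ingestion",
    components={
        # A public AWS-provided component and a custom one; versions are made up.
        "aws.greengrass.StreamManager": {"componentVersion": "2.0.0"},
        "com.example.OpcuaCollector": {"componentVersion": "1.0.0"},
    },
)
```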

3. Cloud and Edge – Extra layer of Security

Sitewise and Greengrass use AWS IoT Core security features, like certificate-based authentication, IoT policies, TLS 1.2 on the transport layer and Device Defender, which bring security to a new level. It’s also possible to use custom Certificate Authorities (CAs) to issue edge device certificates; custom CAs can be stored in AWS CloudHSM and AWS Certificate Manager. Now I can really say that security is our best friend.

4. Agile integration to other solutions

An easy way to integrate data with other solutions is to use the Sitewise edge and cloud APIs. If you deploy Sitewise to the edge, the API is usable there as well, and you can use the data in other factory systems, like MES (Manufacturing Execution System). At least I think this will combine the IT and OT worlds like never before.
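For example, fetching the latest value of an asset property through the API is a short boto3 call (the asset and property ids below are placeholders):

```python
import boto3

sitewise = boto3.client("iotsitewise")
response = sitewise.get_asset_property_value(
    assetId="11111111-2222-3333-4444-555555555555",    # placeholder ids
    propertyId="66666666-7777-8888-9999-000000000000",
)
# The response carries the value, timestamp and quality of the latest datapoint.
value = response["propertyValue"]["value"]["doubleValue"]
timestamp = response["propertyValue"]["timestamp"]["timeInSeconds"]
print(f"latest value: {value} at {timestamp}")
```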

5. AutoML for Edge computing

AutoML is for people like me and Citizen Data Scientists: something that will speed up business insights, since creating a lot of notebooks or Python code is not needed anymore.

These AutoML services are used to organize, track and compare Machine Learning training runs. When auto-deploy is turned on, the best model from the experiment is deployed to an endpoint, and the best model is automatically selected using a bandit algorithm. Besides this, Amazon SageMaker Model Monitor will continuously monitor the quality of your machine learning models in real time, so I can focus on talking with people and not only with machines.
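As a sketch of how little code this needs, launching an Autopilot experiment through the SageMaker SDK and deploying its best candidate could look roughly like this (bucket, target column and job name are hypothetical):

```python
import sagemaker
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role=sagemaker.get_execution_role(),
    target_attribute_name="failure_within_24h",  # hypothetical label column
    max_candidates=10,  # cap the number of candidate models explored
)
automl.fit(
    inputs="s3://my-bucket/sensor-training-data/",
    job_name="predictive-maintenance-automl",
)

# Deploy the best candidate from the experiment to a real-time endpoint.
predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```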

 

Stay tuned for more

I think that AWS is making it easier to combine cloud workloads with edge computing. Stay tuned for the next blog post, where we dig more into the cloud side of this, including Sitewise asset and data models, visualization and alarms. And please take a look at the “Predictive maintenance data kickstart” if you haven’t yet:

https://www.solita.fi/en/solita-connected/

 

Smart and Connected Factories

Smart connected factories are a phenomenon of the fourth industrial revolution, Industry 4.0.

What is a connected factory?

A connected factory utilizes machinery automation systems and additional sensors to collect data from manufacturing devices and processes. The data can be analyzed and processed on site at the factory before being sent to cloud platforms for historical and real-time data analysis. A connected factory enables a holistic view of data across all customer factories. Connectivity is a key enabler of IT/OT convergence.

Operational Technology (OT) consists of software and hardware systems that control and execute processes at the factory floor. Typically these are MES (Manufacturing Execution System), SCADA (Supervisory Control And Data Acquisition) and PLCs (Programmable Logic Controllers) at manufacturing factories. 

Information Technology (IT), in turn, refers to the information infrastructure covering the network, software and hardware components for storing, processing, securing and exchanging data. IT consists of laptops and servers, software, enterprise systems like ERPs and CRMs, inventory management programs, and other business-related tools.

Historically, OT has been separated from IT. In recent years, industrial digitalization, connectivity and cloud computing have made it possible for OT and IT systems to join up and share data with each other. In IT/OT data convergence, factory-floor OT data is combined with IT data:

IT/OT Convergence

 

When IT and OT collide, we need to align on things like how to handle the different networks and control the boundaries between them. IT and OT networks exist for totally different purposes, and they have different security, availability and maintainability principles. IT/OT convergence can definitely be beneficial for a company, but at the same time it may raise new challenges for the traditional OT world, like “How often and what kind of data should we upload to the cloud?” and “What are the key attributes for combining different data assets?”. Here are a few examples of converged IT/OT:

  • Welding station monitoring with laboratory data: by combining the OT data with IT data, we can improve customer-specific welding quality.
  • By getting OT data from equipment and merging it with customer contract data, we can start upselling predictive maintenance solutions.
  • With real-time metrics it is also possible to create subscription-based billing. For this we need basic asset information and CRM customer contract information.
  • Creating a digital service book is easy when you have full traceability based on OT data joined to IT product lifecycle data.

I think that combining IT and OT is nowadays much easier than just a few years ago, thanks to hyperscalers like AWS and others. Now we can see in action how the cloud can enable smart manufacturing using purpose-built components like AWS Greengrass and Sitewise. Stay tuned for the next blog posts, where I will explain the basics of edge computing in a harsh factory environment.

 

Kickstart towards smart and connected factory

Solita has made a kickstart so that companies can start a risk-free journey. We package pre-baked components for edge data ingestion, edge ML, AWS Sitewise data modeling, visualization, a data integration API and MLOps into one deliverable, in only four weeks.

Check it out from https://www.solita.fi/en/solita-connected/ and let’s connect!