MLOps on Databricks: Streamlining Machine Learning Workflows

Databricks is a cloud-based platform that seamlessly integrates data engineering, machine learning, and analytics to simplify the process of building, training, and deploying machine learning models. With its unified platform built on top of the Lakehouse architecture, Databricks empowers data scientists and ML engineers to unleash their full potential, providing a collaborative workspace and comprehensive tooling that streamlines the entire ML process, from DevOps-style practices for model development through to deployment and management.

MLOps in a nutshell 

While many companies and businesses are investing in AI and machine learning to stay competitive and capture untapped business opportunities, they are not reaping the benefits of those investments, as their journey of operationalizing machine learning gets stuck at the level of a Jupyter-notebook data science project. And that’s where MLOps comes to the rescue.

MLOps is a set of tools and practices for the development of machine learning systems. It aims to enhance the reliability, efficiency, and speed of productionizing machine learning, while adhering to governance requirements. MLOps facilitates collaboration among data scientists, ML engineers, and other stakeholders and automates processes for a quicker production cycle of machine learning models. MLOps takes a few pages out of the DevOps book, a methodology of modern software development, but differs in asset management: it involves managing source code, data, and machine learning models together for version control, model comparison and model reproducibility. In essence, MLOps involves jointly managing source code (DevOps), data (DataOps) and machine learning models (ModelOps), while also continuously monitoring both the software system and the machine learning models to detect performance degradation.

 

                                   MLOps = DevOps + DataOps + ModelOps

 

MLOps on Databricks

Recently, I had a chance to test and try out the Databricks platform. And in this blog post, I will attempt to summarise what Databricks has to offer in terms of MLOps capability. 

First of all, what is Databricks?

Databricks is a web-based multi-cloud platform that aims to unify data engineering, machine learning, and analytics solutions under a single service. The distinguishing aspect of Databricks is its Lakehouse architecture, which brings data warehousing capabilities to a data lake. As a result, the Databricks lakehouse eliminates the data silos that come from pushing data into multiple data warehouses or data lakes, thereby providing data teams with a single source of data.

Databricks aims to consolidate, streamline and standardise the productionization of machine learning with the Databricks Machine Learning service. With an MLOps approach built on the Lakehouse architecture, Databricks provides a suite of tools to manage the entire ML lifecycle, from data preparation to model deployment.

The MLOps approach on Databricks is built on the Lakehouse Platform and involves jointly managing code, data, and models. (Figure: Databricks)


For the DevOps part of MLOps, Databricks provides the capability to integrate with various git providers; DataOps is built on Delta Lake; and for ModelOps, Databricks comes integrated with MLflow, an open-source platform for managing the machine learning model lifecycle.

 

DevOps 

Databricks provides Repos, which support git integration with various git providers such as GitHub, Bitbucket, Azure DevOps, AWS CodeCommit and GitLab, along with their associated CI/CD tools. Databricks Repos also support common git operations such as cloning a repository, committing and pushing, pulling, branch management, and visual comparison of diffs when committing, helping to keep notebooks and source code in sync with Databricks workspaces.

 

DataOps

DataOps is built on top of Delta Lake. Databricks manages all types of data related to the ML system (raw data, logs, features, predictions, monitoring data, etc.) with Delta Lake. As feature tables can be written as Delta tables, all data written to Delta Lake is automatically versioned. And because Delta Lake is equipped with time travel capability, we can access any historical version of the data with a version number or a timestamp.
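As a hedged PySpark sketch of how time travel looks in practice (the table name, version number and timestamp below are hypothetical):

# Hedged sketch of Delta Lake time travel; table name and version/timestamp values are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latest version of the feature table
current_df = spark.read.format("delta").table("ml.customer_features")

# The same table as it was at version 3 ...
v3_df = spark.read.format("delta").option("versionAsOf", 3).table("ml.customer_features")

# ... or as it was at a given point in time
old_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2023-01-01")
    .table("ml.customer_features")
)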

In addition, Databricks provides a nice feature called Feature Store. The Feature Store is a centralised repository for storing, sharing, and discovering features across teams. There are a number of benefits to adding a feature store to the machine learning development cycle. First, a centralised feature store brings consistency in feature inputs between model training and inference, eliminating online/offline skew and thereby increasing model accuracy in production. It also eliminates separate feature engineering pipelines for training and inference, reducing the team’s technical debt. As the Feature Store integrates with other services in Databricks, features are reusable and discoverable to other teams as well; analytics and BI teams, for example, can use the same set of features without needing to recreate them. Databricks’ Feature Store also allows versioning and lineage tracking of features, such as who created a feature and which services or models are using it, making it easier to apply governance such as access control lists over them.
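As a rough sketch of how this looks with the Feature Store Python client available on Databricks ML runtimes (the table name, key column and example data here are hypothetical):

# Hedged sketch of the Databricks Feature Store Python API; table name, key and data are made up.
from databricks.feature_store import FeatureStoreClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small example feature DataFrame, computed from raw data elsewhere
features_df = spark.createDataFrame(
    [(1, 3, 120.0), (2, 7, 45.5)],
    ["customer_id", "orders_last_30d", "avg_basket_eur"],
)

fs = FeatureStoreClient()

# Register the feature table once so it becomes discoverable and reusable across teams
fs.create_table(
    name="ml.customer_features",
    primary_keys=["customer_id"],
    df=features_df,
    description="Aggregated customer behaviour features",
)

# Subsequent runs update the same table instead of maintaining a separate inference pipeline
fs.write_table(name="ml.customer_features", df=features_df, mode="merge")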

 

ModelOps

The ModelOps capability in Databricks is built on the popular open-source framework MLflow. MLflow provides various components and APIs to track and log machine learning experiments and to manage a model’s lifecycle stage transitions.

Two of the main components of MLflow are MLflow Tracking and the MLflow Model Registry.

The MLflow Tracking component provides an API to log and query experiments, and an intuitive UI to view parameters, metrics, tags, source code versions and artefacts related to machine learning experiments, where an experiment is an aggregation of runs and a run is a single execution of code. This capability to track and query experiments helps in understanding how different models perform and how their performance depends on the input data, hyperparameters and so on.
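For illustration, a hedged sketch of what logging a run with MLflow Tracking can look like (the dataset, model and hyperparameters below are placeholders, not part of any real project):

# Hedged example of MLflow Tracking; the dataset, model and hyperparameters are illustrative only.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf_baseline"):
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestRegressor(**params).fit(X_train, y_train)

    mlflow.log_params(params)                       # hyperparameters of this run
    mlflow.log_metric("mae", mean_absolute_error(y_test, model.predict(X_test)))
    mlflow.set_tag("team", "data-science")          # free-form tag on the run
    mlflow.sklearn.log_model(model, "model")        # model stored as a run artifact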

Another core component of MLflow is the Model Registry: a collaborative model hub which lets you manage MLflow models and their lifecycle centrally. The Model Registry is designed to take a model from tracking, through staging, and into production. It manages model versioning, model staging (assigning “Staging” and “Production” stages to represent the lifecycle of a model version), model lineage (which MLflow experiment and run produced the model) and model annotation (e.g. tags and comments). The Model Registry provides webhooks and an API to integrate with continuous delivery systems.
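A hedged sketch of the corresponding API calls (the model name “churn_classifier” and the run id placeholder are hypothetical):

# Hedged sketch of the MLflow Model Registry API; model name and run id are placeholders.
import mlflow
from mlflow.tracking import MlflowClient

# Register a model that was logged in an earlier tracked run
model_version = mlflow.register_model("runs:/<run_id>/model", "churn_classifier")

client = MlflowClient()

# Promote the new version through the lifecycle stages
client.transition_model_version_stage(
    name="churn_classifier",
    version=model_version.version,
    stage="Staging",            # later "Production"
)

# Annotate the version so other team members know what it is
client.update_model_version(
    name="churn_classifier",
    version=model_version.version,
    description="Baseline model trained on the 2023 dataset",
)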

The MLflow Model Registry enables versioning of a registered model and seamless stage transitions between those versions.

 


Databricks also supports deploying the Model Registry’s production model in multiple modes: as batch or streaming jobs, or as a low-latency REST API, making it easy to meet the specific requirements of an organisation.
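For example, batch scoring with the registered production model through a Spark UDF could look roughly like this hedged sketch (model name and table names are hypothetical):

# Hedged sketch of batch scoring with the Model Registry's "Production" model;
# model name and table names are hypothetical.
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the current Production version straight from the Model Registry
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_classifier/Production")

input_df = spark.table("ml.scoring_input")
scored_df = input_df.withColumn("prediction", predict_udf(*input_df.columns))

# Persist inputs and predictions to a Delta table, which also supports later monitoring
scored_df.write.format("delta").mode("append").saveAsTable("ml.churn_predictions")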

For model monitoring, Databricks allows logging the input queries and predictions of any deployed model to Delta tables.

Conclusion

MLOps is a relatively nascent field and there is a myriad of tools and MLOps platforms out there to choose from. An apples-to-apples comparison of those platforms can be difficult, as the best MLOps tool for one case might not be the best for another. After all, choosing the fitting MLOps tools depends heavily on various factors such as business needs, the current setup and the resources at your disposal.

However, having used a few other platforms, I personally find Databricks the most comprehensive of them all. I believe Databricks makes it easy for organisations to streamline their ML operations at scale. The platform’s collaboration and sharing capabilities should make it easy for teams to work together on data projects using multiple technologies in parallel. One particular tool I found pleasing to work with is the Databricks notebook. It is a code development tool which supports multiple programming languages (R, SQL, Python, Scala) in a single notebook, while also supporting real-time co-editing and commenting. In addition, as the entire project can be version controlled with a tool of choice and integrates well with the associated CI/CD tools, it adds flexibility to manage, automate and execute the different pipelines.

To sum up, Databricks’ strength lies in its collaborative, comprehensive and integrated environment for running any kind of data workload, whether it is data engineering, data science or machine learning, on top of the Lakehouse architecture. While many cloud-based tools come tightly coupled with a single cloud provider’s services, Databricks is cloud agnostic, making it easy to set up on whichever major cloud provider (AWS, Azure or Google Cloud) your enterprise is already running on.

Finally, if you would like to hear more about Databricks as a unified analytics, data and machine learning platform and learn how to leverage Databricks services in your data journey, please don’t hesitate to contact our Business Lead – Data Science, AI & Analytics, Mikael Ruohonen, at +358414516808 or mikael.ruohonen@solita.fi, or me at jyotiprasad.bartaula@solita.fi.

Positioning data build tool, dbt, in the data tooling landscape

dbt has gained a lot of traction, and we’ve seen some projects and offers where dbt is already chosen as the tool. But what actually is dbt? Where can you position it in the data tooling landscape? What does it actually do, what does it sort-of do, and what doesn’t it do?

There’s a lot of buzz going around the data build tool, or dbt. It has gained a lot of traction, and we’ve seen some projects and offers where dbt is already chosen as the tool to be used. But what actually is dbt? Where can you position it in the data tooling landscape? What does it actually do, what does it sort-of do, and what doesn’t it do? This blog post focuses on how to position dbt, what its strengths and weaknesses are, and how you should compare it to other tools in the market.

What is data build tool?

Data build tool, dbt for short, is a tool to manage and, to an extent, orchestrate the transformations inside a database. It’s an open source command line tool written in Python. Every entity is defined as a SELECT statement in SQL, with the possibility to use Jinja templating. Configurations are written in yml files. It’s a “unixey” command line tool designed to do one task, the “T” (transformations) of “ETL”, do it well, and provide easy integrations to other parts of your data toolbox.

dbt fully supports the modern cloud data warehouses: Snowflake, Redshift and BigQuery. Azure Synapse (and Azure SQL Server for that matter) is supported by community-created plugins. The full list of supported databases is available at: https://docs.getdbt.com/docs/supported-databases/

dbt comes in two flavors: “dbt core”, which is the open source CLI tool, and the paid “dbt cloud”. We will focus on the free open source CLI version. dbt cloud offers, for example, CI/CD pipelines and a browser-based IDE, but runs on the same dbt core. We’ll create our own pipelines, environments and a production-ready setup in a later post.

 

dbt transform

https://www.getdbt.com/product/

What dbt does well

Data build tool really focuses on doing one thing and doing it well, straight from the command line. In that sense, it can be seen as a sort of “unixey” tool for doing transformations inside a data warehouse. But there’s much more that dbt does well than just being a nice CLI tool.

The already mentioned data transformations are at the heart of dbt. Data is transformed step by step simply by writing the required SQL statements as “SELECT nnn, from <a reference syntax to another dbt entity>”. No DDL is needed, as dbt takes care of that based on configuration. The different SQL statements are stored in a project structure, one file per database entity. The project and file structure is pretty much freeform, and there obviously are some design considerations to make here. More on the project structure in a later post about creating a production-worthy dbt environment.

As mentioned, dbt is a command line interface tool. Its setup on your own machine is really as simple as brew install dbt. Of course, you might want to run it in a virtual environment, but that’s not the point. The point is that getting dbt up and running is really simple and fast:

  • Install dbt
  • Setup project structure
  • Setup your database connection in yml profile
  • Write familiar SQL in files
  • dbt run
  • git commit

In addition to the entities’ .sql files in the project, there are .yml configurations. These provide a great way to have wide yet granular configuration of all aspects of your project. It is easy to tell staging entities to be tables in a “staging” schema, transformations to be views in a “transform” schema, and the publish layer to be incremental loads in a “publish” schema. And then, as the project progresses, you can just as easily make exceptions to these.

To make things better, the SQL used to write the transformations is spiced with the Jinja templating language. You use special syntax to reference your source tables, you can do for-loops, and, what makes it especially useful, you can create macros. Macros and different packages really extend what basic dbt can do out-of-the-box. More about packages in dbt (data build tool) – Explore Packages (getdbt.com). As an example, there’s a really nice small macro to do a “select * except columns a and b”, which evaluates to include all the named columns without a and b. On the other end of the spectrum, there’s a package called dbtvault, which contains plenty of macros to aid in creating a Data Vault in Snowflake using dbt.

Besides transformation, dbt does data testing. As with the transformations, you can write your own SQL or use ready-made statements like “unique” or “not_null” to test for those properties in any table/column.

The last thing dbt does really well is documentation. There’s built-in documentation generation, which documents the current project and SQL as a static webpage, which can then be viewed locally or put up on simple Blob Storage or S3 (since it’s a static webpage).

dbt documentation

Documentation | dbt Docs (getdbt.com)

To put it short, dbt really shines in doing the transformations. The transformation capabilities are greatly enhanced with Jinja templating, macros and hooks (different times to fire a macro or SQL, like “before a table” or “at the end of a run”). To top it off, built-in testing capabilities and the documentation generation make it really easy to start building a data warehouse.

What dbt sort-of does

In addition to the transformations, there are some other capabilities data build tool has, but you might find them lacking quite quickly: orchestration and SQL generation.

The orchestration here is a bit of a mixed bag. dbt can do “runs” against a target database, running either the whole project or just the transformations you tell it to run, using a pretty simple syntax. But any more fine-grained workflow orchestration requires a lot more work and moving parts than just dbt. Say you want to run one source system’s data on an hourly schedule and the rest daily, and do the runs going “run stage -> test stage -> run transform -> test transform -> run publish -> test publish -> switch to new publish”. That is a pretty basic need for a larger setup, but pretty difficult, if not impossible, to achieve with (the current version of) dbt. That is not to say you can’t use dbt to orchestrate your runs; you can, and probably should, start with it. But you should be aware of the limitations when the business requirements come.

The other lacking part is SQL generation. This has partly to do with the SQL, Jinja, macros and packages. They do expand the capabilities, but out-of-the-box dbt doesn’t contain any SQL generation capabilities besides DDL. Say you want to have a common structure and metadata columns for all staging and publish tables: you either need to find a package for it, or create your own macro. Of course this is a double-edged sword: dbt is really expandable in this way, but limited out-of-the-box.

What dbt doesn’t do

Data build tool is not an integration tool. Its only focus is managing things inside a data warehouse. What this means is that you should manage your data loads from storage to the database somewhere else, and you should also account for your persistent staging somehow.

dbt is built on the assumption that you can always completely rebuild your data warehouse from the source data. This comes up here and there when you get to the details of running a dbt project, but perhaps the biggest thing to note here, together with the lack of integrations, is how to account for persistent staging.

This means that not only do you need to have persistent staging available for dbt to rebuild the database, you also need to manage it outside dbt, since you really can’t run integrations with it. This comes down to managing landing tables and populating those tables with some other tool, and possibly integrating that other tool with dbt for seamless scheduling.

As stated earlier, dbt is a CLI tool using a file-based project structure to manage your data warehouse. What this means is that dbt in itself doesn’t contain any way to manage deployments to different environments. For this, you need to build your own version control and CI/CD pipelines. This, again, is something for the next part of this blog.

Things to consider

Then there are things you need to consider in your project and in the environment in which you develop. Not the dev/qa/prod environments, but the size, number and capabilities of developers and so on. As a “unixey” tool with pretty much no guidance, there are some aspects which come into play in different-sized projects and different environments.

For larger projects and large and/or multiple teams, using dbt can have its own problems. For larger projects, you will need to design your project structure in a way that supports your development. Having a huge all-in-one dbt project has its own pros and cons: a consistent way of using dbt and common packages and macros, versus managing a huge repository, deploying consistently working versions, and dbt performance (as of writing this blog post). Splitting a data warehouse project into multiple dbt projects again has its own problems: not having all of the entities available for the simple ref syntax, managing packages and macros among projects, and deploying consistently working versions, to name a few.

Another thing to consider is the fact that dbt works straight on top of the target database. There are no abstraction layers in between. At first this might not seem like a big deal, but it means you need to have some understanding of the underlying database when using dbt: even if you are writing the SQL with Jinja, what you type is what you get. There are no abstractions, no SQL generators based on the target database engine, out-of-the-box. Again, you can create your own macros and use packages. Your SQL files are the entities; there are no more abstract models and relations between them: it’s “select nn from aa”.

So when to use dbt?

As presented above, dbt shines in doing (mostly) one thing and one thing only: managing the SQL transformations on your data. It’s fast to set up, it excels at creating transformations, and data-lineage-based load ordering is easy. You can write data quality tests. And you even have a way to execute those runs and tests.

Equally important are the parts that dbt doesn’t do anything about: it doesn’t do integrations. It doesn’t do load orchestration.

And you should also know the pitfalls: you should plan and divide your dbt project in advance, and manage testing and releasing new builds. How do you keep scaling and make sure that everything isn’t behind “the one dev that set it all up and knows everything about dbt”?

So in conclusion, dbt really is a “unixey” tool. It does transformations and their management from the command line and provides integration points for the tools before and after it in the load chain. To have a working DataOps toolbox, in addition to just the CLI dbt you would need at least:

  • Version control and the processes surrounding it
  • Managing different versions in different target environments
  • Source data integrations and data loads
  • Creating and managing data load workflows

But this is not all bad. As with all projects and business cases, you want to slice them up into smaller and more manageable parts: in a sense, dbt does just that by focusing on being only one tool in the toolbox.

This also means that you can’t easily compare just dbt to tools like Fivetran, Matillion, WhereScape, Airflow, Azure Data Factory or Solita Agile Data Engine. You need to know what capabilities are missing from your data project toolbox, what capabilities different tools bring to the table, and what the benefits and drawbacks of each are. And if you don’t know what capabilities you should have, go check out Vesa’s nice post about DataOps platforms: What to look for in a DataOps platform? – Solita Data

DataOps machine

What to look for in a DataOps platform?

DataOps series is back at last. This time we are going through what to look for when choosing your technical approach for DataOps.

Building a resilient data capability can be a tricky task, as there is a vast number of viable options for how to proceed. Price and marketing materials don’t tell it all, and you need to decide what is valuable for your organization when building a data platform.

Before going into the actual features and such, I want to give a few thoughts on cloud computing platforms and their native services and components. Is it always best to just stay platform native? All those services are “seamlessly integrated” and are easy for you to take into use. But to what extent do those components support building a resilient data platform? You could say that depends on how you define a “data platform”.

My argument is that a data warehouse is still a focal component in an enterprise-wide data platform, and when building one, the requirements for the platform are broader than what the cloud platforms themselves can offer at the moment.

But let’s return to this argument later and first go through the requirements you might have for DataOps tooling. There are some basic features like version control and control over the environments as code (Infrastructure as Code), but let’s concentrate on the more interesting ones.

Model your data and let the computer generate the rest

The beef of classic data warehousing. Even if you call it a data platform instead of a data warehouse, build it in the cloud and use development methods that originate from software development, you still need to model your data and also integrate and harmonize it across sources. There is sometimes a misconception that if you build a data platform in the cloud and utilize some kind of data lake in it, you would not need to care about data models.

This tends to lead to different kinds of “data swamps”, which can be painful to use, and the burden of creating a data model is inherited by the applications built on top of the lake.

Sure, there are different schema-on-read types of services that sit on top of data lakes, but they tend to have some shortcomings when compared to real analytical databases (in access control, schema changes, handling deletes and updates, performance of the query optimizer, concurrent query limits, etc.).

To minimize the manual work related to data modelling, you only want the developers to do the required logical data modelling and to derive the physical model and data loads from that model. This saves a lot of time, as developers can concentrate on the big picture instead of technical details. Generating the physical model also keeps it more consistent, because there won’t be different personal ways of implementing things: developers don’t write the code by hand, it is automatically generated based on the modelling and commonly shared rules.

Deploy all the things

First of all, use several separate environments. Developers need an environment where it is safe to try out things without the fear of breaking things or hampering the functionality of the production system. You also need an environment where production ready features can be tested with production-like data. And of course you need the production environment which is there only to serve the actual needs of the business. Some can cope with two and some prefer four environments but the separation of environments and automation regarding deployments are the key.

In the past, it has been a hassle to move data models and pipelines from one environment to another but it should not be like that anymore.

Continuous integration and deployment are the heart of a modern data platform. The process can be defined once, and automation handles changes between environments happily ever after.

It would also be good if your development tooling supports “late building” meaning that you can do the development in a DBMS (Database Management System) agnostic way and your target environment is taken into account only in the deployment phase. This means that you are able to change the target database engine to another without too much overhead and you potentially save a lot of money in the future. To put it short, by utilizing late building, you can avoid database lock-in.

Automated orchestration

Handling workflow orchestration can be a heck of a task if done manually: when can a certain run start, what dependencies does it have, and so on. The DataOps way of doing orchestration is to fully automate it. Orchestrations can be populated based on the logical data model and the references and dependencies it contains.

In this approach the developer can concentrate on the data model itself, and automation optimizes the workflow orchestration to fully utilize the parallel performance of the used database. If you scale the performance of your database, orchestrations can be regenerated to take this into account.

The orchestrations should be populated in a way that concurrency is controlled and fully utilized, so that things that can be run in parallel are run so. When you add a step to the pipeline, changes to orchestrations should automatically be deployed. Sounds like a small thing, but in a large environment it is something that really saves time and nerves.

Assuring the quality

It’s important that you can control the data quality. At best, you could integrate data testing into your workflows so that the quality measures are triggered every time your flow runs and possible problems are reacted to.

Data lineage can help you understand what happens with the data before it ends up in the use of end users.

It can also be a tool for the end users to better understand what the data is and where it comes from. You could also use tags for elements in your warehouse so that it’s easier to find everything that is related to for example personally identifiable information (PII).

Cloud is great but could it be more?

So, about the cloud computing platforms. Many of the previously mentioned features can be found in or built with native cloud platform components. But how about keeping at least the parts you invest heavily in platform agnostic, in case at some point, for some reason, you have to make a move regarding the platform (reasons can be changes in pricing, corporate mergers & acquisitions, etc.)? For big corporations it’s also more common to utilize more than one cloud platform, and common services and tooling across the clouds can be a guiding factor as well.

Data modelling especially is an integral part of data platform development, but it still is not something that’s integrated into the cloud platforms themselves at a sufficient level.

In the past, data professionals were used to a quite seamless developer experience with a limited amount of software required to do the development. On cloud platforms this changed, as you needed more services, most having a slightly different user experience. This has changed a bit since, as cloud vendors have started to “package” their services (such as AWS SageMaker or Azure DevOps), but we are still in the early phases of packaging that kind of tooling.

If the DataOps capabilities are something you would like to see out-of-the-box in your data platform, go check out our DataOps Platform called Agile Data Engine. It enables you to significantly reduce time to value from business perspective, minimize total cost of ownership and make your data platform future-proof.

Also check previous posts in DataOps series

DataOps – new kid on the data block

Why and how to enable DataOps in an organization?

Snowflake external functions, Part 2 – How to do Natural Language Processing and analyze product reviews stored in Snowflake

This is a hands-on tutorial for Snowflake external functions that shows how you can translate and categorize your Snowflake data using the Amazon Translate and Amazon Comprehend services. We will translate Finnish product comments into English and analyze them.

This is the second part of the external functions blog post series and teaches how you can trigger Amazon services like Translate and Comprehend using Snowflake external functions. I will also explain and go through the limitations of external functions in this blog post.

In the first blog post, I taught how you can set up your first Snowflake external function and trigger simple Python code inside AWS Lambda. In case external functions are a new concept for you, I suggest reading the first blog post before diving into this.

External functions limitations

Previously I left the limitations of external functions purposely out, but now that we are building something actually useful with them, you should understand the playground you have and what its boundaries are.

Let’s start with the basics. As all the traffic goes through AWS API Gateway, you’re bound to the limitations of API Gateway. The max payload size for AWS API Gateway is 10MB and that can’t be increased. Assuming that you will call AWS Lambda through API Gateway, you will also face the Lambda limits, which max out at 6MB per synchronous request. Depending on the use case or the data pipeline you’re building, there are workarounds, for example ingesting the raw data directly through S3.

Snowflake also sets limitations; for example, the remote service at a cloud provider, in our case AWS, must be callable from a proxy service. The limitations also include that external functions must be scalar functions, meaning they return a single value for each input row. That doesn’t mean, though, that only one row can be processed at a time: the data is sent as a JSON body which can contain multiple “rows”. Additional limitations set by Snowflake are that the Snowflake optimizer can’t be used with external functions, external functions can’t be shared through Secure Data Sharing, and you can’t use them in the DEFAULT clause of a CREATE TABLE statement or with COPY transformations.

Things to consider

The cloud infrastructure, AWS in this case, also sets limitations, or rather things to consider. How does the underlying infrastructure handle your requests? You should think about how your function scales, how it acts in concurrency cases and how reliable it is. A simple function called by a single developer usually works without any issues, but you must design your function in a way that it works with multiple developers calling it numerous times within an hour, in contrast to the single call which you made during development.

Concurrency is an even more important issue, as Snowflake can parallelize external function calls. Can the function you wrote handle multiple parallel calls at once, or does it crash? Another thing to understand is that with multiple parallel calls, you end up in a situation where the functions are in different states. This means that you should not write a function where the result depends upon the order in which the user’s rows are processed.

Error handling is also a topic which should not be forgotten. Snowflake external functions understand only the HTTP 200 status code; all other status codes are considered errors. This means that you need to build the error handling into the function itself. External functions also give rather poor error messages, which means that you need to log all those “other than 200” status codes somewhere for later analysis.

Money-wise you’re also partly blind: calling a Snowflake SQL function hides all the costs related to the underlying AWS services. An external function which is implemented poorly can lead to high costs.

Example data format

External functions call the resources by issuing an HTTP POST request. This means that the sent data must be in a specific format to work, and the returned data must also conform to a specific format. Due to this, the data sent and returned might look unusual. For example, we must always send an integer value with the data; this value acts as a 0-based row number. The actual data is converted to JSON data types, i.e.

  • Numbers are serialized as JSON numbers.
  • Booleans are serialized as JSON booleans.
  • Strings are serialized as JSON strings.
  • Variants are serialized as JSON objects.
  • All other supported data types are serialized as JSON strings.

It’s also essential to recognise that this means that dates, times, and timestamps are serialized as strings. Each row of data is a JSON array of one or more columns, and the sent data can be compressed using the COMPRESSION syntax of the CREATE EXTERNAL FUNCTION SQL clause. It’s good to understand, though, that Amazon API Gateway automatically compresses/decompresses requests.
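To make the format concrete, here is a hedged illustration of what a request sent by Snowflake and the matching response could look like, shown as Python dictionaries (the rows are invented):

# Hedged illustration of the external function request/response format; the rows are invented.
import json

# What Snowflake POSTs to the proxy service: each row starts with a 0-based row number,
# followed by the column values of that row.
request_body = {
    "data": [
        [0, "Tuote oli erinomainen"],
        [1, "Toimitus myöhästyi"],
    ]
}

# What the remote service must send back with HTTP status 200: the same row numbers,
# each followed by a single result value, wrapped under the "data" key.
response_body = {
    "data": [
        [0, "The product was excellent"],
        [1, "The delivery was late"],
    ]
}

print(json.dumps(request_body, ensure_ascii=False))
print(json.dumps(response_body, ensure_ascii=False))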

What are Amazon Translate and Amazon Comprehend?

As Amazon advertises, Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. What does that truly mean? It means that Amazon Translate is a fully automated service where you transmit text data to the Translate API and get the text back translated into the chosen language. Under the hood, Amazon uses its own deep learning API to do the translation. In our use case, Amazon Translate is easy to use, as the Translate API can guess the source language. Normally you would force the API to translate text from, say, French to English, but with the Translate API we can set the source language to ‘auto’ and Amazon Translate will guess which language we’re sending it. This means that we only need minimal configuration to get the Translate API to work.

Amazon Translate can even detect and translate the Finnish language, which is sometimes considered a hard language to learn.


For demo purposes Translate’s billing is also a great fit, as you can translate 2M characters monthly during your first 12 months, starting from your first translation. After that period the price is $15.00 per million characters.

Amazon Comprehend is also a fully managed natural language processing (NLP) service that uses machine learning to find insights and relationships in text. You can use your own models or the built-in models to recognize entities, extract key phrases, detect dominant languages, analyze syntax, or determine sentiment. Like Amazon Translate, the service is called through an API.

As Finnish is not a supported language for Amazon Comprehend, the translated text is run through the Comprehend API to get insights.

Example – Translating product comments made in Finnish to English with Amazon Translate and Snowflake external functions

As we have previously learned how to create the connection between Snowflake and AWS, we can focus in this example on the Python code and the external function itself, which is going to trigger the Amazon Translate API. As with all Amazon services, calling the Translate API is really easy. You only need to import the boto3 library and use a client session to call the Translate service.

After setting up the Translate client, you call the service with a few mandatory parameters and you’re good to go. In this example, we are going to build on the Python code which was used in the previous blog post.

Instead of doing a simple addition of a string to the input, we’re going to pass the input to the Translate API, translate the text to English and get the result back in JSON format for later use. You can follow the instructions in the previous example and replace the Python code with the new code stored in my GitHub account.
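The exact code lives in the linked GitHub repository, but as a hedged sketch, a Lambda handler that passes the incoming rows to Amazon Translate could look roughly like this (names are my own, and it assumes one text column per row):

# Hedged sketch of a Lambda handler calling Amazon Translate; not the exact code from the
# linked GitHub repository, and it assumes each row contains a single text column.
import json
import boto3

translate = boto3.client("translate")

def lambda_handler(event, context):
    rows = json.loads(event["body"])["data"]
    results = []
    for row_number, text in rows:
        response = translate.translate_text(
            Text=text,
            SourceLanguageCode="auto",   # let Amazon Translate guess the source language
            TargetLanguageCode="en",
        )
        # Return the full JSON answer; the translated string itself is in "TranslatedText"
        results.append([row_number, response])
    return {
        "statusCode": 200,
        "body": json.dumps({"data": results}, default=str),
    }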

After changing the Python code, we can try it right away, because the external function does not need any changes and data input is done in the same way as previously. In this case, I have created a completely new external function, but it works in a similar way as before. I have named my Lambda function translate and I’m calling it with my Snowflake function lambda_translate_function as shown.

Calling the function is easy, as we have previously seen, but when we call the Translate API directly we will get the full JSON answer, which contains a lot of data that we do not need.


Because of this, we need to parse the data and only fetch the translated text.

Snowflake external functions

As you can see, creating functions which do more than simple calculations is easy with external functions. We could gather a list of product comments in multiple languages and translate them into one single language for better analysis, e.g. understanding in this case that the Finnish comment means that the sneakers sold are rubbish in quality.

Example – Categorizing product comments with Amazon Comprehend and Snowflake external functions

Extending the previous example, we have now translated the Finnish product comment into English. We can extend this further by doing sentiment analysis on the comment using Amazon Comprehend. This is also a straightforward job and only requires you to either create a new Python function which calls the Comprehend API, or modify the existing Python code for demo purposes.

The only changes needed are to the Python code and to the IAM role which the Lambda function uses. The Python code is again really similar to the previous one. This time we call the Comprehend service using the same boto3 library.

Snowflake external functions

To detect sentiment, we use the similarly named operation and provide the input source language and the text to analyze. You can use the same test data which was used with the Translate demo and with the first blog post. Comprehend will, though, give a NEUTRAL analysis for the test data.
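As a hedged sketch (again not the exact code from the GitHub repository), the sentiment detection part could look roughly like this:

# Hedged sketch of a Lambda handler calling Amazon Comprehend; names are illustrative.
import json
import boto3

comprehend = boto3.client("comprehend")

def lambda_handler(event, context):
    rows = json.loads(event["body"])["data"]
    results = []
    for row_number, text in rows:
        analysis = comprehend.detect_sentiment(Text=text, LanguageCode="en")
        # Sentiment is one of POSITIVE, NEGATIVE, NEUTRAL or MIXED
        results.append([row_number, analysis["Sentiment"]])
    return {
        "statusCode": 200,
        "body": json.dumps({"data": results}),
    }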

Before heading to Snowflake and trying out the new Lambda function, go to the IAM console and grant enough rights to the role that the Lambda function uses. The rights shown are for demo purposes; ideally, read rights for the DetectSentiment action alone are enough.

These are just example rights and contain more than is needed.

Once you have updated the IAM role rights, jump into the Snowflake console and try out the function in action. As we want to stick with the previous demo data, I will analyze the outcome of the previous translation. For demo purposes, I have left out single apostrophes, as those are used by Snowflake as string delimiters.

Snowflake external functions

As you can see, the instant analysis we got for the text was right. To be sure that we are getting correct results, we can test with new test data, i.e. a positive product comment.

Snowflake external functions

As you can see, with Snowflake external functions it’s really easy to leverage machine learning, natural language processing or AI services together with Snowflake. External functions are a new feature, which means this capability will only grow in the future. Adding Azure and Google compatibility is already on the roadmap, but in the meantime you can start your DataOps and MLOps journey with Snowflake and Amazon.

Snowflake external functions, Part 1 – Hello World tutorial for triggering AWS Lambda

This tutorial is a hands-on Hello World introduction to external functions in Snowflake and shows how to trigger basic Python code inside AWS Lambda.

External functions are new functionality published by Snowflake, already available for all accounts as a preview feature. With external functions, it is now possible to trigger, for example, Python, C# or Node.js code or native cloud services as part of your data pipeline using simple SQL.

I will publish two blog posts explaining what external functions are in Snowflake: first showing how to trigger basic Hello World Python code in AWS Lambda with the result showing up in Snowflake, and then showing how you can trigger Amazon services like Translate and Comprehend using external functions to enable concrete use cases.

In this first blog post, I will focus on showing how you can set up your first external function and trigger Python code which echoes your input back to Snowflake.

What are external functions?

In their simplest form, external functions are scalar functions which return values based on their input. Under the hood, they are much more. Compared to a traditional scalar SQL function, where you are limited to SQL, external functions open up the possibility to use, for example, Python, C# or Go as part of your data pipelines. You can also leverage third-party services and call, for example, Amazon services if they meet the set requirements. To pass the requirements, the external function must be able to accept a JSON payload and return JSON output. The external function must also be accessed through an HTTPS endpoint.

Example – How to trigger an AWS Lambda function

This example follows the instructions from the Snowflake site and shows you in more detail how you can trigger Python code running on AWS Lambda using external functions, as illustrated below.

Snowflake External Functions

To complete this example, you will need an AWS account where you have the necessary rights to create AWS IAM (Identity and Access Management) roles, API Gateway endpoints and Lambda functions. You will also need Snowflake ACCOUNTADMIN privileges or a role which has CREATE INTEGRATION rights.

These instructions consist of the following chapters.

  • Creating a remote service (Lambda Function on AWS)
  • Creating an IAM role for Snowflake use
  • Creating a proxy service on AWS API Gateway
  • Securing the AWS API Gateway proxy
  • Creating an API Integration in Snowflake
  • Setting up trust between Snowflake and the IAM role
  • Creating an external function in Snowflake
  • Calling the external function

These instructions are written for a person who has some AWS knowledge, as they will not explain the use of the services in detail. We will use the same template as the Snowflake instructions to record authentication-related information. Having already done a few external function integrations, I highly recommend using this template.

Cloud Platform (IAM) Account Id: _____________________________________________
Lambda Function Name...........: _____________________________________________
New IAM Role Name..............: _____________________________________________
Cloud Platform (IAM) Role ARN..: _____________________________________________
Proxy Service Resource Name....: _____________________________________________
Resource Invocation URL........: _____________________________________________
Method Request ARN.............: _____________________________________________
API_AWS_IAM_USER_ARN...........: _____________________________________________
API_AWS_EXTERNAL_ID............: _____________________________________________

Creating a remote service (Lambda Function on AWS)

Before we create the Lambda function, we need to obtain our AWS platform id. The easiest way to do this is to open the AWS console and open “Support Center” under “Support” on the far right.

This will open a new window which will show your AWS platform id.

Record this 12-digit number into the template shared previously. Now we will create a basic Lambda function for our use. From the main console, search for Lambda.

Once you have started Lambda, create a new function called snowflake_test using the Python 3.7 runtime. For the execution role, select the option where you create a new role with basic Lambda permissions.

After pressing the “Create function” button, you should be greeted with the following view, where you can paste the example code. The example code will echo the input provided and add text to confirm that the Snowflake-to-AWS connection is working. You can consider this a Hello World type of example which can be leveraged later on.

Snowflake External Functions

Copy-paste the following Python code from my GitHub account into the Function code view. We can test the Python code with the following test data, which should create the following end result:
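The exact code is in the linked GitHub repository; as a hedged sketch, an echo handler of the kind described above could look roughly like this:

# Hedged sketch of an echo-style Lambda handler for a Snowflake external function;
# this is an illustration, not the exact code from the linked GitHub repository.
import json

def lambda_handler(event, context):
    rows = json.loads(event["body"])["data"]
    results = []
    for row in rows:
        row_number, values = row[0], row[1:]
        # Echo the input back to confirm that the Snowflake-to-AWS connection works
        results.append([row_number, f"Hello from Lambda, you sent: {values}"])
    return {
        "statusCode": 200,
        "body": json.dumps({"data": results}),
    }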

After testing the Lambda function, we can move on to creating an IAM role which is going to be used by Snowflake.

Creating an IAM role for Snowflake use

Creating an IAM role for Snowflake use is a straightforward job. Open up the Identity and Access Management (IAM) console, select “Roles” on the right and press “Create role”.

You should be greeted with a new view where you can define which kind of role you want to create. Create a role which has Another AWS account as a trusted entity. In the box for Account ID, give the same account id which was recorded earlier in the instructions.

Name the new role as snowflake_role and record the role name into the template. Record also the role ARN.

Creating a proxy service on AWS API Gateway

Create an API Gateway endpoint to be used. Snowflake will use this API endpoint to contact the Lambda service which we created earlier. To create this, choose the API Gateway service from the AWS console and select “Create API”. Call this new API snowflake_test_api and remember to select “Regional” as the endpoint type, as it is currently the only supported type.

Create a Resource for the new API. Call the resource snowflake and record it in the template as the Proxy Service Resource Name.

Create a Method for the new API from the “Actions” menu: choose POST and press the grey checkmark to create it.

During the creation, choose Lambda Function as the integration type and select “Use Lambda Proxy Integration”. Finally, choose the Lambda function created earlier.

Save your API and deploy your API to a stage.

Creating a new stage can be done at the same time as the deploy happens.

Once deployed, record the Invoke URL from POST.

Now we’re done creating the API Gateway. The next step is to secure the API Gateway so that only your Snowflake account can access it.

Securing AWS API Gateway Proxy

In the API Gateway console, go to your API method and choose Method Request.

Inside Method Request, choose “AWS_IAM” as the Authorization mode.

Record the Method Request ARN to the template to be used later on. You can get the value underneath the Method Request.

Once done, go to Resource Policy and deploy the following policy from my GitHub account. You can also copy the policy from the Snowflake example. In AWS Principal, replace the <12-digit number> and <external_function_role> with your AWS platform id and the IAM role created earlier. In AWS Resource, replace the resource with the Method Request ARN recorded earlier. Save the policy once done and deploy the API again.

Creating an API Integration in Snowflake

The next steps will happen in the Snowflake console, so open it up with a user who has the necessary rights to create the integration.

With the necessary rights, type in the following SQL, where <cloud_platform_role_ARN> is the ARN of the IAM role created previously and api_allowed_prefixes is the resource invocation URL.

CREATE OR REPLACE API INTEGRATION snowflake_test_api
api_provider = aws_api_gateway
api_aws_role_arn = '<cloud_platform_role_ARN>'
enabled = true
api_allowed_prefixes = ('https://')
;

The end result should look something like this:

When done, obtain the API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID values by describing the API integration.

Setting up trust between Snowflake and the IAM role

The next steps are done in the AWS console using the values obtained from Snowflake.

In the IAM console, choose the previously created role and select “Edit trust relationships” from the “Trust relationships” tab.

In Edit Trust Relationships modify the Statement.Principal.AWS field and replace the value (not the key) with the API_AWS_IAM_USER_ARN that you saved earlier.

In the Statement.Condition field, paste "StringEquals": { "sts:ExternalId": "xxx" } between the curly brackets. Replace the xxx with the API_AWS_EXTERNAL_ID. The final result should look something like this.

Update the policy when done and go back to the Snowflake console.

Creating an external function in Snowflake

In Snowflake, create the external function as follows. The <api_integration_name> is the same one we created previously in the Snowflake console. The <invocation_url> is the resource invocation URL saved before; include also the resource name this time.

CREATE EXTERNAL FUNCTION my_external_function(n INTEGER, v VARCHAR)
RETURNS VARIANT
API_INTEGRATION = <api_integration_name>
AS '<invocation_url>'
;

The end result should look something like this:

Calling the external function

You can now finally test that the connection is working by calling the function with an integer value and any given string. The output should be as shown in the image. As you can see, this example is really basic and shows only that the connection is working.

If you face errors during the execution, check the troubleshooting page at Snowflake for possible solutions. I can say from experience that you must follow the instructions really carefully and remember to redeploy the API at AWS often to reflect your changes.

This blog post has now covered the basics of external functions, e.g. how you can trigger basic Python code running inside AWS Lambda. Next time I will show how you can build something concrete using the same tools and Amazon services.

Are we there yet?

External functions are currently a preview feature open to all accounts, but they support only services behind Amazon AWS API Gateway.

MLOps: from data scientist’s computer to production

MLOps refers to the concept of automating the lifecycle of machine learning models, from data preparation and model building to production deployment and maintenance. MLOps is not just a machine learning platform or technology; instead, it requires an entire change in the mindset of developing machine learning models towards the best practises of software development. In this blog post we introduce this concept and its benefits for anyone having, or planning to have, machine learning models running in production.

Operationalizing data platforms, DataOps, has been among the hottest topics during the past few years. Recently, MLOps has also become one of the hottest topics in the field of data science and machine learning. Building operational data platforms has made data available for analytics purposes and enabled the development of machine learning models on a completely new scale. While the development of machine learning models has expanded, the processes of maintaining and managing the models have not kept the same pace. This is where the concept of MLOps becomes relevant.

What is MLOps?

Machine learning operations, or MLOps, is a similar concept to DevOps (or DataOps), but specifically tailored to the needs of data science and, more specifically, machine learning. DevOps was introduced to software development over a decade ago. DevOps practices aim to improve application delivery by combining the entire life cycle of the application – development, testing and delivery – into one process, instead of having a separate development team handing over the developed solution for the operations team to deploy. The definite benefits of DevOps are shorter development cycles, increased deployment velocity, and dependable releases.

Similarly to how DevOps aims to improve application delivery, MLOps aims to productionalize machine learning models in a simple and automated way.

As for any software service running in production, automating the build and deployment of ML models is equally important. Additionally, machine learning models benefit from versioning and monitoring, and the ability to retrain and deploy new versions of the model, not only to be more reliable when data is updated but also from the transparency and AI ethics perspective.

Why do you need MLOps?

Data scientists’ work is research and development, and essentially requires skills in statistics and mathematics, as well as programming. It is iterative work of building and training various models. Many teams have data scientists who can build state-of-the-art models, but their process for building and deploying those models can be entirely manual. It might happen locally, on a personal laptop with copies of data, and the end product might be a csv file or powerpoint slides. These types of experiments don’t usually create much business value if they never go live to production. And that’s where data scientists in many cases struggle the most, since engineering and operations skills are often not data scientists’ core competences.

In the best case, in this type of development the model ends up in production by a data scientist handing over the trained model artifacts to the ops team to deploy, whereas the ops team might lack knowledge on how to best integrate machine learning models into their existing systems. After deployment, the model’s predictions and actions might not be tracked, and model performance degradation and other model behavioural drifts cannot be detected. At best, your data scientist monitors model performance manually and manually retrains the model with new data, again with a manual handover for deployment.

The described process might work for a short time when you only have a few models and a few data scientists, but it is not scalable in the long term. The disconnection between development and operations is what DevOps originally was developed to solve, and the lack of monitoring and re-deployment is where MLOps comes in.

ML model development lifecycle. The process consists of development, training, packaging and deploying, automating and managing and monitoring.

 

How can MLOps help?

Instead of going back-and-forth between the data scientists and operations team, by integrating MLOps into the development process one could enable quicker cycles of deployment and optimization of algorithms, without always requiring a huge effort when adding new algorithms to production or updating existing ones.

MLOps can be divided into multiple practices: automated infrastructure building, versioning important parts of data science experiments and models, deployments (packaging, continuous integration and continuous delivery), security and monitoring.

Versioning

In software development projects it is typical that source code, its configurations and also infrastructure code are versioned. Tracking and controlling changes to the code enables roll-backs to previous versions in case of failures and helps developers to understand the evolution of the solution. In data science projects source code and infrastructure are important to version as well, but in addition to them, there are other parts that need to be versioned, too.

Typically a data scientist runs training jobs multiple times with different setups. For example, hyperparameters and the features used may vary between runs, and they affect the accuracy of the model. If the information about training data, hyperparameters, the model itself and model accuracy for the different combinations is not saved anywhere, it can be hard to compare the models and choose the best one to deploy to production.

Templates and shared libraries

Data scientists might lack knowledge on infrastructure development or networking, but if there is a ready template and framework, they only need to adapt the steps of a process. Templating and using shared libraries frees time from data scientists so they can focus on their core expertise.

Existing templates and shared libraries that abstract away the underlying infrastructure, platforms and databases not only speed up building new machine learning models but also help in onboarding new data scientists.

Project templates can automate the creation of the infrastructure needed for running the preprocessing or training code. When building the infrastructure is automated with Infrastructure as Code, for example, it is easier to create different environments and be sure they are similar. It usually also means that infrastructure security practices are automated and do not vary from project to project.

Templates can also include scripts for packaging and deploying code. When the libraries used are mostly the same across projects, those scripts rarely need to change and data scientists do not have to write them separately for every project.

Shared libraries mean less duplicate code and a smaller chance of bugs in repetitive tasks. They can also hide database and platform details from data scientists, who can instead use ready-made functions for, for instance, reading from and writing to the database or saving the model. Versioning can be built into these shared functions as well, so it is not up to the data scientist to remember what needs to be versioned.
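As a sketch of what such a shared function could look like, the hypothetical helper below hides the storage details and stamps every saved model with a version automatically; the module name, storage location and versioning scheme are assumptions made for the example.

```python
# shared_ml_lib/persistence.py -- hypothetical shared library module
import json
import pickle
from datetime import datetime, timezone
from pathlib import Path


def save_model(model, name: str, base_dir: str = "/models") -> Path:
    """Save a model with an automatically generated version tag.

    Data scientists only call save_model(model, "churn"); where the model
    goes and how it is versioned is decided once, here, for every project.
    """
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(base_dir) / name / version
    target.mkdir(parents=True, exist_ok=True)

    with open(target / "model.pkl", "wb") as f:
        pickle.dump(model, f)

    # Versioning metadata is written automatically, not left to memory
    (target / "metadata.json").write_text(
        json.dumps({"name": name, "version": version})
    )
    return target
```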

Deployment pipeline

When deploying either a traditional software solution or an ML solution, the steps in the process are highly repetitive but also error-prone. An automated deployment pipeline in a CI/CD service can take care of packaging the code, running automated tests and deploying the package to a selected environment. This not only reduces the risk of errors in deployment but also frees time from deployment tasks for actual development work.

Machine learning deployments need tests just like any other software, including the typical unit and integration tests of the system. In addition, you need to validate the data and the model and evaluate the quality of the trained model. Adding the necessary validation creates a bit more complexity, and it requires automating steps that data scientists previously performed manually before deployment to train and validate new models. Depending on your solution, you might need to deploy a multi-step pipeline that automatically retrains and deploys models.
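As a rough illustration of the kind of automated checks a pipeline can run before deployment, the pytest-style sketch below validates input data and gates the candidate model on a minimum accuracy; the quality threshold, the columns and the stand-in training step are invented for the example.

```python
# test_model_quality.py -- run by the CI/CD pipeline before deployment
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

MIN_ACCURACY = 0.80  # example quality gate, tuned per use case


def _train_candidate():
    # Stand-in for the real training step; a real pipeline would load
    # versioned data and run the project's actual training code here.
    X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    return model, X_test, y_test


def test_input_data_has_no_missing_values():
    # Validate the data before it ever reaches training
    df = pd.DataFrame({"feature_a": [1.0, 2.0], "feature_b": [0.1, 0.2]})
    assert not df.isnull().any().any(), "training data contains missing values"


def test_model_meets_quality_gate():
    model, X_test, y_test = _train_candidate()
    accuracy = accuracy_score(y_test, model.predict(X_test))
    assert accuracy >= MIN_ACCURACY, f"accuracy {accuracy:.2f} is below the gate"
```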

Monitoring

After a model is deployed to production, some might assume that it keeps working indefinitely and only decays the way any traditional software system does. In fact, machine learning models can decay in more ways than traditional software. In addition to monitoring the performance of the system, the performance of the models themselves needs to be monitored. Because machine learning models make assumptions about the real world based on the data they were trained on, the accuracy of a model may decrease when the surrounding world changes. This is especially true for models that try to capture human behavior. Decreasing accuracy means the model needs to be retrained to reflect the surrounding world better, and with monitoring in place, retraining is done neither too seldom nor too often. By tracking summary statistics of your data and monitoring the performance of your model, you can send notifications or roll back when values deviate from the expectations set at the time of the last training.
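A minimal sketch of that idea: compare summary statistics of recent inference data against the values recorded at training time and alert when they deviate too much. The baseline values, threshold and features below are illustrative assumptions, not a definitive monitoring implementation.

```python
import numpy as np

# Summary statistics recorded when the model was last trained (illustrative values)
TRAINING_BASELINE = {"age": {"mean": 41.2, "std": 12.5},
                     "basket_value": {"mean": 56.7, "std": 30.1}}

DRIFT_THRESHOLD = 1.0  # alert when the live mean is more than one training std dev away


def check_drift(live_data: dict) -> list:
    """Return the features whose live mean has drifted from the training baseline."""
    drifted = []
    for feature, stats in TRAINING_BASELINE.items():
        live_mean = float(np.mean(live_data[feature]))
        z = abs(live_mean - stats["mean"]) / stats["std"]
        if z > DRIFT_THRESHOLD:
            drifted.append(feature)
    return drifted


# Example usage: in production this would read a recent window of inference data
live = {"age": np.random.normal(65, 12, size=1_000),        # clearly drifted
        "basket_value": np.random.normal(57, 30, size=1_000)}
for feature in check_drift(live):
    print(f"ALERT: {feature} has drifted - consider retraining the model")
```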

Applying MLOps

Bringing MLOps thinking into machine learning model development helps you actually get your models to production if you are not there yet, makes your deployment cycles faster and more reliable, reduces manual effort and errors, and frees your data scientists from tasks outside their core competences so they can focus on actual model development. Cloud providers (such as AWS, Azure or GCP) are especially good places to start implementing MLOps in small steps, as they offer ready-made components you can use, as well as all the CPU and GPU capacity needed for model training on a pay-as-you-go basis.

If your AI journey is still in an early phase (PoCs don't need heavy processes like this), a robust development framework and pipeline infrastructure might not be the highest priority. However, any effort invested in automating the development process early on will pay back later and reduce machine learning technical debt in the long run. Start small and move your ML development towards MLOps by at least putting the development work under version control and automating the steps for retraining and deployment.

DevOps was born as a reaction to the systematic organization needed around rapidly expanding software development, and the same problems are now being faced in the field of machine learning. Take the needed steps towards MLOps, just as was done successfully with DevOps before.


Introduction to AWS Glue for big data ETL

AWS Glue works well for big data processing. This is a brief introduction to Glue including use cases, pricing and a detailed example.

AWS Glue is a serverless ETL tool in the cloud. In brief, ETL means extracting data from a source system, transforming it for analytics and other applications, and then loading it into, for example, a data warehouse.

In this blog post I will introduce the basic idea behind AWS Glue and present potential use cases.

The emphasis is on big data processing; Glue's data catalog, and data catalogs in general, are covered in separate posts.

Why use AWS Glue?

Replacing Hadoop. Hadoop can be expensive and a pain to configure; AWS Glue is simple. Some say Glue is expensive, but it depends on what you compare it to. Because of on-demand pricing you only pay for what you use, which can make AWS Glue significantly cheaper than a fixed-size on-premise Hadoop cluster.

AWS Lambda cannot be used. A wise man once said: use Lambda functions in AWS whenever possible. Lambdas are simple, scalable and cost-efficient, and they can be triggered by events. For big data, however, Lambda functions are not suitable because of the 3 GB memory limitation and 15-minute timeout. AWS Glue is specifically built to process large datasets.

Apply DataOps practices. Drag-and-drop ETL tools are easy for users, but from the DataOps perspective code-based development is a superior approach. With AWS Glue both code and configuration can be stored in version control, and data development becomes similar to any other software development. The data transformation scripts, written in Scala or Python, are not limited to the AWS cloud, and environment setup is easy to automate and parameterize when everything is scripted.

An example use case for AWS Glue

Here is a practical example of how AWS Glue works.

A production machine in a factory produces multiple data files daily, each about 10 GB in size. A server in the factory pushes the files to AWS S3 once a day.

The factory data is needed to predict machine breakdowns, so the raw data should be pre-processed for the data science team.

Lambda is not an option for the pre-processing because of its memory and timeout limitations. Glue seems a reasonable option when work hours and costs are compared to alternative tools.

The simplest way to get started with the ETL process is to create a new Glue job and write the code in the editor. The script can be written in either Scala or Python.

Extract. The script first reads all the files from the specified S3 bucket into a single data frame. You can think of a data frame as a table in Excel. The reading can be just a one-liner.

Transform. This is where most of the code goes. Let's say the original data has 100 records per second and the data science team wants it aggregated to one-minute intervals with specific logic. If the logic is simple, this can be just tens of lines of code.

Load. Write the data back to another S3 bucket for the data science team. A single line of code may well be enough.
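Put together, the three steps could look roughly like the PySpark sketch below when written as a Glue job script; the bucket names, column names and aggregation logic are hypothetical, and a real Glue job would also read its job arguments from the service.

```python
# Hypothetical Glue job: aggregate raw factory measurements to 1-minute averages
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("factory-preprocessing").getOrCreate()

# Extract: read all daily files from the raw bucket into one data frame
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3://example-factory-raw/"))            # placeholder bucket

# Transform: aggregate the ~100 records/second to one row per machine per minute
aggregated = (raw
              .withColumn("ts", F.to_timestamp("timestamp"))
              .groupBy("machine_id", F.window("ts", "1 minute").alias("minute"))
              .agg(F.avg("temperature").alias("avg_temperature"),
                   F.max("vibration").alias("max_vibration")))

# Load: write the result to the bucket used by the data science team
(aggregated
 .write
 .mode("overwrite")
 .parquet("s3://example-factory-curated/minute_aggregates/"))  # placeholder bucket
```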

The code runs on top of the Spark framework, which Glue configures automatically. Thanks to Spark, the data is divided into small chunks and processed in parallel on multiple machines simultaneously.

What makes AWS Glue serverless?

Serverless means you don’t have machines to configure. AWS provisions and allocates the resources automatically.

The processing power is adjusted by the number of data processing units (DPU). You can do additional configuration, but it’s likely that a proof of concept works out of the box.

In an on-premise environment you would have to make a decision about the computation cluster size. A big cluster is expensive but fast. A small cluster would be cheaper but slow to run.

With AWS Glue your bill is the result of the following equation:

[ETL job cost] = [Processing time in hours] * [Number of DPUs] * [Price per DPU-hour]

 

The on-demand pricing means that increasing the processing power does not necessarily increase the price of the ETL job, since more DPUs shorten the processing time. At least in theory, as too many DPUs might cause overhead in processing time.

When is AWS Glue a wrong choice?

This is not an advertisement, so let's offer some criticism of Glue as well.

Lots of small ETL jobs. Glue has a minimum billing of 10 minutes and 2 DPUs. At the price of $0.44 per DPU-hour, the minimum cost of a run is around $0.15. This starting price makes Glue an unappealing alternative for processing small amounts of data frequently.
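The minimum-billing figure is easy to verify with a back-of-the-envelope calculation, using the list price quoted above:

```python
PRICE_PER_DPU_HOUR = 0.44  # USD, list price quoted above
MIN_DPUS = 2
MIN_BILLED_MINUTES = 10

minimum_cost = MIN_DPUS * (MIN_BILLED_MINUTES / 60) * PRICE_PER_DPU_HOUR
print(f"Minimum cost per run: ${minimum_cost:.3f}")  # -> about $0.15
```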

Specific networking requirements. If you spin up a standard EC2 virtual machine, an IP address is attached to it. The serverless nature of Glue means you have to put more effort into network planning in some cases, for example when a Glue job has to be whitelisted in a firewall to extract data from an external system.

Summary about AWS Glue

The most common argument against Glue is "it's expensive". True, in the sense that the first few test runs can already cost a few dollars. In a nutshell, Glue is cost-efficient for infrequent big data workloads.

In the big picture, AWS Glue saves a lot of time and unnecessary hardware engineering. The costs should be compared against alternative options such as an on-premise Hadoop cluster or the development hours required for a custom solution.

As Glue's pricing model is predictable, the business cases are straightforward to calculate. It might be enough to test just the critical parts of the ETL pipeline to become confident about performance and costs.

I feel that optimizing the code for distributed computing has been more of a challenge than the Glue service itself. The next blog post will focus on how data developers can get started with Glue using Python and Spark.

Why and how to enable DataOps in an organization?

It can be a daunting task to drive a DataOps initiative forward in an organization. By understanding its implications, you will improve your odds of success.

When talking with a colleague about the introductory post of this blog series, I was asked whether we could jump straight into the technological side of DataOps in the second part. Unfortunately not. Technology is an important part of the phenomenon, but the soft side is even more important.

Defining the undefined

There are still no official standards or frameworks for DataOps. So how can we even talk about enabling DataOps in an organization if we only understand it at a high level? To get started, we have to break the concept down.

The DataOps Manifesto [1] that DataKitchen has brewed together does a good job describing the principles behind DataOps, but it is somewhat limited to an analytics and machine learning point of view; modern data platforms can be much more than that. Gartner [2] has the same spirit in their definition but does not scope it as tightly to analytics; their focus is more on DataOps as a tool for organizational change in the data domain. The CAMS model, coined by Damon Edwards and John Willis [3] to describe DevOps, also works fine for DataOps. CAMS stands for Culture, Automation, Measurement and Sharing, and as you can see, automation is only one of the four elements. Today we will dive into the cultural aspect.

DataOps culture everywhere

How do you build a DataOps culture? One does not simply build culture. Culture is a puzzle that is put together piece by piece, and it will devour your beloved DataOps initiative for breakfast if you try to drive it forward as a purely technological project. You can't directly change culture. But you can change behavior, and behavior becomes culture [3].

Let's take an example of what the cultural implications can be. I like the "make it work, make it fast, make it last" mentality, which prioritizes delivering value fast and making things last once the business already benefits from the solution. The problem is that culture seldom supports the last of the three.

Once value has been delivered, no one prioritizes the work of making the solution last, as it does not produce immediate business value.

By skipping the last part, you slowly accumulate technical debt, which means that a larger part of development time goes to unplanned work instead of producing new features for the business. The term death spiral, popularized in the book The Phoenix Project [4], describes the phenomenon well.

Kaizen (改善)

An important part of DataOps is that the organization makes a collective commitment to high quality. By compromising on this, the maintenance cost of your data platform slowly starts to rise and new development gets slower. Related to this, we also need some Kaizen mentality. Kaizen (kai/改 = change, zen/善 = good) means continuous improvement that involves everyone. In a data development context this means that we continuously try to find inefficiencies in our development processes and also prioritize the work that removes that waste. But can you limit the effects to development alone? Not really. This can affect all stakeholders involved with your data, meaning you should understand your data value streams in order to control the change.

As Gartner [2] states, the "focus and benefit of DataOps is as a lever for organizational change, to steer behaviour and enable agility". DataOps can be utilized as a tool for organizational transformation. Typical endgame goals for DataOps are faster lead time from idea to business value, reduced total cost of ownership, and empowered developers and users.

Faster time to value

This is usually the main driver for DataOps. The time from idea to business value is crucial for an organization to flourish. Lead-time reduction comes from a faster development process and less waiting between phases, but also from the fact that building and releasing in smaller increments makes it possible to take solutions into use gradually. Agile methodology and lean thinking play a big part in this, and the technology is there to support it.

If your data development cycle is too slow, it tends to lead to shadow IT, meaning each business unit builds its own solution because they feel they have no other choice. Long development cycles also mean you will build solutions no one uses. The faster you get feedback, the better you can steer your development and build the solution the customer actually needs instead of what the initial request described (let's face it, at the beginning you usually have no clue about all the details needed to get the solution done).

All in all, faster time to value should be a fairly universal goal because of its positive effects on the business.

Reduced Total Cost of Ownership (TCO)

Reduced TCO is a consequence of many drivers. The hypothesis is that the quality of solutions improves, resulting in less error fixing, faster recovery times and less unplanned work in general.

Many cloud data solutions have started small and gradually grown larger and larger. By the time you realize you might need some sort of governance and practices, the environment can already be challenging to manage. By utilizing DataOps you can make the environment much more manageable, more secure and easier to develop on.

Empowered developers and users

One often overlooked factor is how DataOps affects the people responsible for building the solutions. Less manual work and more automation means developers can focus more on business problems and less on repetitive chores, which can make the content of the work more meaningful. At the same time it can create skill gaps that are agonizing for the individual and a challenge for the organization in terms of how to organize the work. Development work can actually also feel more stressful, as there is less waiting (for loads to complete, etc.) and more concentrated development.

Some definitions of DataOps [5] emphasize the collaboration and communication side of DataOps. Better collaboration builds trust in the data and between different stakeholders. Faster development cycles can bring developers and users closer together and engage users in the development itself. This can raise enthusiasm among end users and break the disbelief that data development can't support business processes fast enough.

Skills and know-how

One major reason DataOps is hard to approach is that doing it technically requires a remarkably different skill set than traditional ETL/data warehouse development. You still need to model your data. (There seems to be a common misconception that you just put all your data into a data lake and then utilize it. Believe me, database management systems (DBMS) were not invented by accident, and there really is a need for curated data models and the like. But that is a different story.)

You also need to understand the business logic behind your data. This remains the trickiest part of data integrations, as it requires a human to interpret and integrate the logic.

So on a higher level you are still doing the same things, integrating and modelling your data. But the technologies and development methods are different.

Back in the day, as an ETL/DW developer you could do almost all of your work with one GUI-oriented tool, be it Informatica, SSIS or DataStage, for example. This changes in the DataOps world. As a developer you should know cloud ecosystems and their components, be able to code (Python, NodeJS and C# are a good start), understand serverless and its implications, and be committed to continuous integration and everything it requires.

And the list goes on. It's overwhelming! Well, it can be if you try to convert your developers to DataOps without help. There are ways to make the change easier by using automation and prebuilt modular components, but I will tease you a bit on this one and come back to the solutions later, as it is a big and important subject.

Yesterday’s news

One could argue that data engineers and organizations have been doing this DataOps stuff for years, but at least from what I have seen, the emphasis has been on the data development side and the operations part has been an afterthought.

This has led to "data platforms" that are technologically state of the art, but when you show the environment to an experienced data professional she is horrified. Development has been done to produce quick business value, but the perceived value is deceptive: it starts to fade as the data platform becomes a burden to maintain and the point solutions created all live their own lives instead of producing cumulative value on top of each other.

Success factors for enabling DataOps

To wrap up, I would like to share a few pointers on how to improve your chances of success if you dare to embark on your DataOps journey. Unfortunately, by blindly adopting "best practices" you can fall victim to a cargo cult, meaning you try to adopt practices that do not work in your organization. Still, there are some universal things that can help you make DataOps work in your organization.

Start small and show results early

You need to build trust in what you are building. What has worked in organizations is vertical slicing (building a narrow end-to-end solution) and delivering value as soon as possible. This proves that the new ways of working bring the promised benefits, and you will get a mandate to take your initiative forward.

Support from senior executives

Oh, the classic. Even if it sounds a bit clichéd, you still can't get away from it: you will need a high-level sponsor to improve your odds of success. Organizational changes are built bottom-up, but it smooths your way if you get support from someone high up. As DataOps is not only a technological initiative, you will need to break a few established processes along the way. People will question you, and having someone backing you up helps a great deal!

Build cross-functional teams

If you don't have a team full of unicorns, you will be better off mixing different competences in your development team and not framing roles too tightly. Also mix your own people with consultants that have deep knowledge in certain fields. You need expertise in data development as well as in operating the cloud. The key is to enable close collaboration so that people learn from each other and the flow of information is seamless.

But remember the most important thing, without which you will not succeed: actually start and commit to the change. Like all initiatives that affect people's daily routines, this one will be protested against as well. Show the value, engage people and you will prevail!

[1] The DataOps Manifesto. http://dataopsmanifesto.org/dataops-manifesto.html
[2] Nick Heudecker, Ted Friedman, Alan Dayley. Innovation Insight for DataOps. 2018. https://www.gartner.com/document/3896766
[3] John Willis. DevOps Culture (Part 1). 2012. https://itrevolution.com/devops-culture-part-1/
[4] Gene Kim, Kevin Behr, George Spafford. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. 2013. IT Revolution Press.
[5] Andy Palmer. From DevOps to DataOps. 2015. https://www.tamr.com/from-devops-to-dataops-by-andy-palmer/


DataOps – new kid on the data block

DataOps is a set of practices that aim to automate the delivery of data and data models and to make change management more predictable. The DataOps phenomenon illustrates the change from traditional data warehousing to modern data platforms. It is not something that can simply be brought in as an add-on; it requires a fundamental change in mindset rather than just a new set of tools. This blog post starts a new series in which we will get familiar with the concept of DataOps, go through the benefits that utilizing DataOps can offer, and look at examples of how it can be applied when building data platforms.

When meeting customers that are starting to build new data warehouses and platforms, I have started hearing requirements like "we want our solution to follow DataOps principles" or "to be DataOps compatible". At the same time, Gartner [1] recognizes DataOps on their latest Data Management Hype Cycle as the most emergent concept on the chart. With DataOps still in the Innovation Trigger phase, Gartner expects adoption of the practice to remain limited over the next 12 to 18 months.

So DataOps is clearly not yet a mainstream approach to building data solutions, but it has raised a lot of interest among people who actively follow the market. As we at Solita have been utilizing DataOps practices in our solutions for several years already, it is easy to forget that the rest of the world is not there yet. Gartner [2] estimates the current adoption rate of DataOps at less than 1% of the addressable market, so if DataOps really is here to stay, there is a lot to be done before it becomes the common way of working.

Several software vendors provide solutions for DataOps (Composable, DataKitchen and Nexla, for instance), but are they the real deal or are they selling snake oil? It's hard to tell. Then again, should DataOps even be something to go after? We evidently need to understand what forces drive DataOps from an emergent concept to the mainstream.

From DevOps to DataOps

Before going any further into DataOps itself, let's first look at where it's coming from and what has made it more relevant than before. We'll start from DevOps. DevOps has become the prevalent methodology in software development in recent years. It has changed the way of thinking about delivering new features and fixes to production more frequently while ensuring high quality, and it is nowadays the default way of developing and operating new software. But how does this relate to DataOps, and what do we actually know about it?

To better understand the concept of DataOps we need to go through how building data solutions has changed in the recent past. 

A few years ago, the predominant way of developing data solutions was to pick an ETL tool and a database, install and configure them on your own or leased (IaaS) hardware, and start bringing in the source databases and CSV files. This basically meant a lot of manual work in creating tables and making hundreds or thousands of field mappings in your beloved ETL tool.

Reach for the clouds

Cloud platforms such as Microsoft Azure or Amazon Web Services have changed the way data solutions are developed. In addition, the Big Data trend brought new types of data storage (Hadoop and NoSQL) to the table. When speaking with different customers I have noticed a terminological shift from "data warehousing" to "building data platforms". Why is this? Traditionally the scope of data warehouses has been to serve finance, HR and other functions in their mandatory reporting obligations. However, both the possibilities and the ambition level have moved forward, and nowadays these solutions have much more diverse usage. Reporting has not gone anywhere, but the same data is used on the strategic, tactical and operative levels alike. Enriching an organization's data assets with machine learning can mean better and more efficient data-driven processes and improvements in value chains that lead to actual competitive advantage, as we have seen in several customer cases.

There is also more volume, velocity and variety in the source data (I hate the term Big Data but its definition is fine).

In addition to internal operative systems, the data can come from IoT devices, different SaaS services in divergent semi-structured formats, and what not. It is also common that the same architecture supports use cases that combine hot and cold data, and some parts must update in near real time.

You can build a "cloud data warehouse" by lifting and shifting your on-premises solution. This can mean, for example, installing SQL Server on an Azure virtual machine or running an RDS database on AWS, using the same data model you have used for years and loading the database with your go-to ETL tool in the same way as before. Repeating your on-premises architecture in the cloud will unfortunately not bring you the benefits of a cloud-native solution (performance, scalability, globality, new development models) and might even bring new challenges in managing the environment.

To boldly go to the unknown

Building data solutions on cloud platforms can feel like a daunting task. Instead of your familiar tools, you will face a legion of new services and components that require more coding than using a graphical user interface. It is true that if you start from scratch and try to build your data platform all by yourself, there is a lot of coding to be done before you start creating value, even if you use all the available services on the selected platform.

This is where DataOps kicks in!

Like DevOps, DataOps sets out a framework and practices (and possibly even tools) so that you can concentrate on creating business value instead of spending time on non-profitable tasks. DataOps covers infrastructure management, development practices, orchestration, testing, deployment and monitoring. If truly embraced, DataOps can improve your data warehouse's accustomed release schedule: instead of involving several testing managers in a burdensome testing process, you may be able to move to a fully automated continuous release pipeline and make several releases to production each day. This is something that is hard to believe in a data warehousing context before you see it in action.

Innovator’s dilemma

Competition forces companies to constantly innovate. As data has become a central resource for innovation, it is crucial that data architectures support experimentation and, subsequently, innovation. Unfortunately, legacy data warehouse architectures seldom spur innovative solutions, as they can be clunky to develop and most of your limited budget goes to maintaining the solution. You also have to cope with the burden of tradition, as many processes and procedures have been in place for ages and are hard to change. In addition, the skill set of your personnel focuses on the old data architecture, and it takes time to teach them new ways of working. Still, I believe that cloud data platforms are the central piece an organization needs to be able to do data-driven innovation. By watching from the sidelines for too long, you risk letting your competitors get too far ahead of you.

If you want to increase innovation, you need to cut the cost of failure. If failure (or learning) actually improves your standing, then you will take risks. Without risk there will be no reward.

In the next parts of this blog series we will take a closer look at different parts of DataOps, think about how actual implementations can be made, and consider what implications DataOps has on the required competences, organizational structures and processes. Stay tuned!

REFERENCES

[1] Donald Feinberg, Adam Ronthal. Hype Cycle for Data Management. 2018. https://www.gartner.com/document/3884077

[2] Nick Heudecker, Ted Friedman, Alan Dayley. Innovation Insight for DataOps. 2018. https://www.gartner.com/document/3896766
