AWS Glue works well for big data processing. This is a brief introduction to Glue including use cases, pricing and a detailed example.

Introduction to AWS Glue for big data ETL

AWS Glue is a serverless ETL tool in the cloud. In brief, ETL means extracting data from a source system, transforming it for analysis and other applications, and then loading it into a target such as a data warehouse.

In this blog post I will introduce the basic idea behind AWS Glue and present potential use cases.

The emphasis is on big data processing. Glue catalogs and data catalogs in general are covered in separate posts.

Why use AWS Glue?

Replacing Hadoop. Hadoop can be expensive and a pain to configure. AWS Glue is simple. Some say that Glue is expensive, but it depends on what you compare it to. Because of on-demand pricing you only pay for what you use. This might make AWS Glue significantly cheaper than a fixed-size on-premises Hadoop cluster.

AWS Lambda cannot be used. A wise man said, use Lambda functions in AWS whenever possible. Lambdas are simple, scalable and cost efficient. They can also be triggered by events. For big data, Lambda functions are not suitable because of the memory limit (3 GB when this was written, 10 GB since late 2020) and the 15-minute timeout. AWS Glue is specifically built to process large datasets.

Apply DataOps practices. Drag-and-drop ETL tools are easy for users, but from the DataOps perspective code-based development is a superior approach. With AWS Glue both code and configuration can be stored in version control. Data development becomes similar to any other software development. For example, the data transformation scripts written in Scala or Python are not limited to the AWS cloud. Environment setup is easy to automate and parameterize when the code is scripted.

An example use case for AWS Glue

Now for an example of how AWS Glue would work in practice.

A production machine in a factory produces multiple data files daily. Each file is 10 GB in size. The server in the factory pushes the files to AWS S3 once a day.

The factory data is needed to predict machine breakdowns. For that, the raw data should be pre-processed for the data science team.

Lambda is not an option for the pre-processing because of the memory and timeout limitations. Glue seems to be a reasonable option when work hours and costs are compared to alternative tools.

The simplest way to get started with the ETL process is to create a new Glue job and write code in the editor. The script can be written in either Scala or Python.

Extract. The script first reads all the files from the specified S3 bucket into a single data frame. You can think of a data frame as a table in Excel. The reading can be just a one-liner.

Transform. This is where most of the code goes. Let’s say that the original data has 100 records per second. The data science team wants the data aggregated per minute with a specific logic. This could take just tens of lines of code if the logic is simple.

Load. Write data back to another S3 bucket for the data science team. It’s possible that a single line of code will do.

The code runs on top of the Spark framework, which is configured automatically in Glue. Thanks to Spark, the data is divided into small chunks and processed in parallel on multiple machines.
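The three steps can be illustrated without Spark. The following is a minimal local sketch in plain Python, not Glue code: in a Glue job the same logic would typically be expressed with Spark data frame operations. The record layout (timestamp, machine id, value), the function name aggregate_per_minute and the use of a plain average as the "specific logic" are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_per_minute(records):
    """Average the readings of each machine for every one-minute window.

    `records` is an iterable of (iso_timestamp, machine_id, value) tuples;
    this layout is hypothetical and stands in for the factory data.
    """
    buckets = defaultdict(list)
    for timestamp, machine_id, value in records:
        # Truncate the timestamp to the start of its minute.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        buckets[(machine_id, minute)].append(value)
    # One aggregated row per machine and minute.
    return {key: mean(values) for key, values in buckets.items()}


# "Extract": a few in-memory rows instead of 10 GB files in S3.
raw = [
    ("2019-05-01T12:00:01", "machine-1", 10.0),
    ("2019-05-01T12:00:30", "machine-1", 20.0),
    ("2019-05-01T12:01:05", "machine-1", 40.0),
]

# "Transform": aggregate to one value per machine and minute.
per_minute = aggregate_per_minute(raw)

# "Load": print instead of writing to another S3 bucket.
for (machine_id, minute), value in sorted(per_minute.items()):
    print(machine_id, minute.isoformat(), value)
```

The grouping key (machine, minute) is the part that a distributed framework can split across machines, since each group can be aggregated independently of the others.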

What makes AWS Glue serverless?

Serverless means you don’t have machines to configure. AWS provisions and allocates the resources automatically.

The processing power is adjusted by the number of data processing units (DPU). You can do additional configuration, but it’s likely that a proof of concept works out of the box.

In an on-premises environment you would have to decide on the size of the computation cluster. A big cluster is expensive but fast. A small cluster would be cheaper but slow to run.

With AWS Glue your bill is the result of the following equation:

[ETL job price] = [Processing time in hours] * [Number of DPUs] * [Price per DPU hour]

On-demand pricing means that adding processing power does not necessarily raise the price of the ETL job: twice the DPUs for half the time costs the same. At least in theory, as too many DPUs might cause overhead in processing time.
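As a rough illustration of the pricing relation, the sketch below compares a small, slow configuration with a large, fast one. It assumes ideal scaling (no overhead from the extra DPUs) and uses the 0.44 dollars per DPU hour quoted later in this post as a default; the function name etl_job_price is made up for this example.

```python
def etl_job_price(hours, dpus, price_per_dpu_hour=0.44):
    """Price of one job run: processing time * DPUs * price per DPU hour."""
    return hours * dpus * price_per_dpu_hour


# Assumption: under ideal scaling, five times the DPUs finish in a fifth of the time.
small_and_slow = etl_job_price(hours=5.0, dpus=2)
big_and_fast = etl_job_price(hours=1.0, dpus=10)

print(round(small_and_slow, 2), round(big_and_fast, 2))
```

Both runs consume the same number of DPU hours, so the bill is the same. In practice the larger configuration only pays off if the job actually parallelizes that well.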

When is AWS Glue a wrong choice?

This is not an advertisement, so let’s look at Glue critically as well.

Lots of small ETL jobs. Glue has a minimum billing of 10 minutes and 2 DPUs. With a price of $0.44 per DPU hour, the minimum cost for a run becomes around $0.15. The starting price makes Glue an unappealing alternative for frequently processing small amounts of data.
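The effect of the minimum billing can be sketched with the figures above (10 minutes, 2 DPUs, 0.44 dollars per DPU hour). The function name billed_cost and the schedule of one run every five minutes are hypothetical and only serve to show how the minimum adds up.

```python
def billed_cost(run_minutes, dpus, price_per_dpu_hour=0.44,
                min_minutes=10, min_dpus=2):
    """Cost of one run when short or small runs are rounded up to the minimum."""
    billed_minutes = max(run_minutes, min_minutes)
    billed_dpus = max(dpus, min_dpus)
    return billed_minutes / 60 * billed_dpus * price_per_dpu_hour


# A one-minute job is still billed as 10 minutes on 2 DPUs.
one_short_run = billed_cost(run_minutes=1, dpus=2)

# Hypothetical schedule: the same small job every five minutes, all day.
runs_per_day = 24 * 12
daily_cost = runs_per_day * one_short_run

print(round(one_short_run, 2), round(daily_cost, 2))
```

Under these assumptions a job that does one minute of real work per run is billed for ten, which is why frequent small loads are a poor fit.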

Specific networking requirements. If you spin up a standard EC2 virtual machine, an IP address will be attached to it. The serverless nature of Glue means you have to put more effort into network planning in some cases. One such scenario would be whitelisting a Glue job in a firewall to extract data from an external system.

Summary about AWS Glue

The most common argument against Glue is “It’s expensive”. True, in the sense that the first few test runs can already cost a few dollars. In a nutshell, Glue is cost efficient for infrequent big data workloads.

In the big picture AWS Glue saves a lot of time and unnecessary hardware engineering. The costs should be compared against alternative options such as an on-premises Hadoop cluster or the development hours required for a custom solution.

As the Glue pricing model is predictable, the business cases are straightforward to calculate. It might be enough to test just the critical parts of the ETL pipeline to become confident about the performance and costs.

I feel that optimizing the code for distributed computing has been more of a challenge than the Glue service itself. The next blog post will focus on how data developers get started with Glue using Python and Spark.

Why and how to enable DataOps in an organization?

It can be a daunting task to drive a DataOps initiative forward in an organization. By understanding its implications, you will increase your odds of succeeding.

When talking with my colleague about the introductory post to the blog series I was asked if we can already jump into the technological side of DataOps in the second part. Unfortunately not. Technology is an important part of the phenomenon, but the soft side is even more important.

Defining the undefined

There are still no official standards or frameworks regarding DataOps. So how can we even talk about enabling DataOps in an organization if we only understand it on a high level? To get things started, we have to break down the concept.

The DataOps Manifesto [1] that DataKitchen has brewed together does a good job of describing the principles that are part of DataOps. However, it is somewhat limited to the analytics/machine learning point of view. Modern data platforms can be much more than just that. Gartner [2] has the same spirit in their definition, but it is not as tightly scoped to analytics. The focus is more on thinking of DataOps as a tool for organizational change in the data domain. The CAMS model, which was coined by Damon Edwards and John Willis [3] for describing DevOps, also works fine with DataOps. CAMS stands for Culture, Automation, Measurement and Sharing. As you can see, automation is only one of the four elements. Today we will dive into the cultural aspect.

DataOps culture everywhere

How do you build a DataOps culture? One does not simply build culture. Culture is a puzzle that is put together piece by piece. Culture will eat your beloved DataOps initiative for breakfast if you try to drive it forward as a technological project. You can’t directly change culture. But you can change behavior, and behavior becomes culture [3].

Let’s take an example of what the implications of cultural aspects can be. I like the “Make it work, make it fast, make it last” mentality which prioritizes delivering value fast and making things last once business already benefits from the solution. The problem is that culture seldom supports the last of the three.

Once value has been delivered, no one prioritizes the work related to making lasting solutions as it does not produce immediate business value.

By skipping the last part, you slowly add technical debt, which means that a larger part of development time goes to unplanned work instead of producing new features for the business. The term death spiral (popularized in the book The Phoenix Project [4]) describes the phenomenon well.

改善

An important part of DataOps is that the organization makes a collective commitment to high quality. If you compromise on this, the maintenance cost of your data platform slowly starts to rise and new development also gets slower. Related to this, we also need some Kaizen mentality. Kaizen (kai/改=change, zen/善=good) means continuous improvement that involves everyone. In the data development context this means that we continuously try to find inefficiencies in our development processes and also prioritize the work that removes that waste. But can you delimit the effects there? Not really. This can affect all stakeholders that are involved with your data, meaning you should understand your data value streams in order to control the change.

As Gartner [2] states, the “focus and benefit of DataOps is as a lever for organizational change, to steer behaviour and enable agility”. DataOps could be utilized as a tool for organizational transformation. Typical end goals for DataOps are faster lead time from idea to business value, reduced total cost of ownership, and empowered developers and users.

Faster time to value

This is usually the main driver for DataOps. The time from idea to business value is crucial for an organization to flourish. Lead time reduction comes from a faster development process and less waiting between different phases, but also from the fact that building and releasing in smaller increments makes it possible to take the solutions into use gradually. Agile methodology and lean thinking play a big part in this, and the technology is there to support them.

If your data development cycle is too slow, it tends to lead to shadow IT, meaning each business unit will build its own solution because it feels it has no other choice. Long development cycles also mean that you will build solutions no one uses. The faster you get feedback, the better you can steer your development and build the solution the customer needs instead of what the initial request was (let’s face it, usually at the beginning you have no clue about all the details needed to get the solution done).

All in all, faster time to value should be quite a universal goal because of its positive effects on the business.

Reduced Total Cost of Ownership (TCO)

Reduced TCO is a consequence of many drivers. The hypothesis is that the quality of solutions is better, resulting in less error fixing, faster recovery times and less unplanned work in general.

Many cloud data solutions have started small and gradually grown larger and larger. By the time you realize that you might need some sort of governance and practices, the environment can already be challenging to manage. By utilizing DataOps you can make the environment a lot more manageable, more secure and easier to develop on.

Empowered developers and users

One often overlooked factor is how DataOps affects the people who are responsible for building the solutions. Less manual work and more automation mean that developers can focus more on the business problems and less on doing monkey work. This can make the work more meaningful. But at the same time it can lead to skill gaps that can be agonizing for the individual and also a challenge for the organization in how to organize the work. Development work can actually also feel more stressful as there will be less waiting (for loads to complete, etc.) and more actual concentrated development.

Some definitions of DataOps [5] emphasize the collaboration and communication side of DataOps. Better collaboration builds trust towards the data and between different stakeholders. Faster development cycles can in part bring developers and users closer to each other and engage the user to take part in the development itself. This can raise enthusiasm among end users and dispel the belief that data development can’t support business processes fast enough.

Skills and know-how

One major reason DataOps is hard to approach is that doing it technically requires a remarkably different skill set than traditional ETL/data warehouse development. You still need to model your data. There seems to be a common misconception that you just put all your data into a data lake and then utilise it. Believe me, database management systems (DBMS) were not invented by accident, and there really is a need for curated data models. But that is a different story.

You also need to understand the business logic behind your data. This remains the trickiest part of data integrations as it requires a human to interpret and integrate the logic.

So on a higher level you are still doing the same things, integrating and modelling your data. But the technologies and development methods used are different.

Back in the day, as an ETL/DW developer you could have done almost all of your work with one GUI-oriented tool, be it Informatica, SSIS or DataStage for example. This changes in the DataOps world. As a developer you should know cloud ecosystems and their components, be able to code (Python, Node.js and C# are a good start), understand serverless and its implications, and be committed to continuous integration and the things it requires.

And the list goes on. It’s overwhelming! Well, it can be if you try to convert your developers to DataOps without help. There are ways to make the change easier by using automation and prebuilt modular components, but I will tease you a bit on this one and come back to the solutions later, as this is a big and important subject.

Yesterday’s news

One could argue that data engineers and organizations have been doing this DataOps stuff for years, but at least from what I have seen, the emphasis has been on the data development side and the operations part has been an afterthought.

This has led to “data platforms” that are technologically state of the art, but when you show the environment to an experienced data professional, she is horrified. Development has been done to produce quick business value, but the perceived value is deceptive. The value starts to fade as the data platform becomes a burden to maintain and the point solutions all live their own lives instead of producing cumulative value on top of each other.

Success factors for enabling DataOps

To finish, I would like to share a few pointers on how to improve your chances of succeeding if you dare to embark on your DataOps journey. Unfortunately, by blindly adopting “best practices” you can fall victim to cargo culting, meaning that you try to adopt practices that do not work in your organization. But there are still some universal things that can help you make DataOps run in your organization.

Start small and show results early

You need to build trust towards what you are building. What has worked in organisations is utilising vertical slicing (building a narrow end-to-end solution) and delivering value as soon as possible. This can prove that the new ways of working bring the promised benefits, and you’ll get a mandate to go forward with your initiative.

Support from senior executives

Oh, the classic. Even if it might sound a bit clichéd, you still can’t get away from it. You will need a high-level sponsor in order to improve your odds of succeeding. Organizational changes are built bottom-up, but it will smooth your way if you get support from someone high up. As DataOps is not only a technological initiative, you need to break a few established processes along the way. People will question you, and having someone backing you up can help a great deal!

Build cross-functional teams

If you don’t have a team full of unicorns, you will be better off mixing different competences in your development team and not framing their roles too tightly. Also mix your own people with consultants that have deep knowledge in certain fields. You need expertise in data development and also in operating the cloud. The key is to enable close collaboration so that people learn from each other and the flow of information is seamless.

But remember the most important thing! Without it you will not succeed. The thing is to actually start and commit to the change. Like all initiatives that affect people’s daily routines, this will also be protested against. Show the value, engage people and you will prevail!

[1] The DataOps Manifesto. http://dataopsmanifesto.org/dataops-manifesto.html
[2] Nick Heudecker, Ted Friedman, Alan Dayley. Innovation Insight for DataOps. 2018. https://www.gartner.com/document/3896766
[3] John Willis. DevOps Culture (Part 1). 2012. https://itrevolution.com/devops-culture-part-1/
[4] Gene Kim, Kevin Behr, George Spafford. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. 2013. IT Revolution Press.
[5] Andy Palmer. From DevOps to DataOps. 2015. https://www.tamr.com/from-devops-to-dataops-by-andy-palmer/

DataOps – new kid on the data block

DataOps is a set of practices that aim to automate the delivery of data and data models and make change management more predictable. The DataOps phenomenon illustrates the change from traditional data warehousing to modern data platforms. It is something that can't simply be brought in as an add-on, as it requires a more fundamental change in mindset instead of just starting to use a new set of tools. This blog post starts a new series where we will familiarize ourselves with the concept of DataOps and go through the benefits that utilizing DataOps can offer and examples of how it can be applied when building data platforms.

When meeting customers that are starting to build new data warehouses and platforms, I have started hearing the requirement that “We want our solution to follow DataOps principles” or “to be DataOps compatible”. At the same time Gartner [1] recognizes DataOps in its latest Data Management Hype Cycle as the most emergent concept on the chart. DataOps is in the Innovation Trigger phase on the chart, and Gartner sees that DataOps will likely inhibit adoption of the practice in the next 12 to 18 months.

So DataOps is clearly not yet a mainstream thing when building data solutions, but it has raised a lot of interest among people that actively follow the market evolution. As we at Solita have been utilizing DataOps practices in our solutions for several years already, it is easy to forget that the rest of the world is not there yet. Gartner [2] states that the current adoption rate of DataOps is estimated at less than 1% of the addressable market, so if DataOps really is something worth pursuing, there is a lot to be done before it becomes the common way of working.

Several software vendors provide solutions for DataOps (like Composable, DataKitchen and Nexla for instance), but are they the real deal or are they selling snake oil? It’s hard to tell. Then again, should DataOps even be something to go after? We evidently need to understand what forces drive DataOps from an emergent concept to the mainstream.

From DevOps to DataOps

Before going any further into DataOps itself, let’s first look at where it’s coming from and what has made it more relevant now than before. We’ll start from DevOps. DevOps has become the prevalent methodology in software development in recent years. It has changed the way of thinking about delivering new features and fixes to production more frequently while ensuring high quality. DevOps is nowadays the default way of working when developing and operating new software. But how does this relate to DataOps and what do we actually know about it?

To better understand the concept of DataOps we need to go through how building data solutions has changed in the recent past. 

A few years ago, the predominant way of developing data solutions was to pick an ETL tool and a database, install and configure them on your own or leased (IaaS) hardware and start bringing those source databases and CSV files in. This basically meant a lot of manual work in creating tables and making hundreds or thousands of field mappings in your beloved ETL tool.

Reach for the clouds

Cloud platforms such as Microsoft Azure or Amazon Web Services have changed the way data solutions are developed. In addition, the Big Data trend brought new types of data storage solutions (Hadoop and NoSQL) to the table. When speaking with different customers I have noticed that there has even been a terminological shift from “data warehousing” to “building data platforms”. Why is this? Traditionally the scope of data warehouses has been to serve finance, HR and other functions in their mandatory reporting obligations. However, both the possibilities and the ambition level have taken steps forward, and nowadays these solutions have much more diverse usage. Reporting has not gone anywhere, but the same data is used on all strategic, tactical and operative levels. Enriching the organization’s data assets with machine learning can mean better and more efficient data-driven processes and improvements in value chains that lead to actual competitive advantage, as we have seen in several customer cases.

There is also more volume, velocity and variety in the source data (I hate the term Big Data but its definition is fine).

In addition to internal operative systems, the data can come from IoT devices, different SaaS services in divergent semi-structured formats and what not. It is also common that the same architecture supports use cases that combine hot and cold data, and some parts must update in near real time.

You can build a “cloud data warehouse” by lifting and shifting your on-premises solution. This can mean for example installing SQL Server on an Azure Virtual Machine or running an RDS database on AWS and using the same data model as you have used for years. You can load this database with your go-to ETL tool in the same way as you have done previously. Repeating your on-premises architecture in the cloud will unfortunately not bring you the benefits (performance, scalability, global reach, new development models) of a cloud native solution and might even bring in new challenges in managing the environment.

To boldly go to the unknown

Building data solutions on cloud platforms can feel like a daunting task. Instead of your familiar tools you will face a legion of new services and components that require more coding than using a graphical user interface. It is true that if you start from scratch and try to build your data platform all by yourself, there is a lot of coding to be done before you start creating value, even if you use all available services on the selected platform.

This is where DataOps kicks in!

Like DevOps, DataOps sets out a framework and practices (and possibly even tools) so that you can concentrate on creating business value instead of spending time on non-profitable tasks. DataOps covers infrastructure management, development practices, orchestration, testing, deploying and monitoring. If truly embraced, DataOps can improve your data warehouse’s accustomed release schedule. Instead of involving several testing managers in a burdensome testing process, you may be able to move to a fully automated continuous release pipeline and make several releases to production each day. This is something that is hard to believe in a data warehousing context before you see it in action.

Innovator’s dilemma

Competition forces companies to constantly innovate. As data has become the central resource in making new innovations, it is crucial that data architectures support experimentation and subsequently innovation. Unfortunately, legacy data warehouse architectures seldom spur innovative solutions, as they can be clunky to develop and most of your limited budget goes to maintaining the solution. You will also have to cope with the burden of tradition, as many of the processes and procedures have been in place for ages and are hard to change. Also, the skill set of your personnel focuses on the old data architecture and it takes time to teach them new ways of working. Still, I believe that cloud data platforms are the central piece for an organization to be able to do data-driven innovations. By watching from the sidelines for too long, you risk letting your competitors get too far ahead of you.

If you want to increase innovation, you need to cut the cost of failure. If failure (or learning) actually improves your standing, then you will take risks. Without risk there will be no reward.

In the next parts of this blog series we will take a closer look at different parts of DataOps, consider how actual implementations can be made, and discuss what implications DataOps has for the needed competences, organizational structures and processes. Stay tuned!

REFERENCES

[1] Donald Feinberg, Adam Ronthal. Hype Cycle for Data Management. 2018. https://www.gartner.com/document/3884077

[2] Nick Heudecker, Ted Friedman, Alan Dayley. Innovation Insight for DataOps. 2018. https://www.gartner.com/document/3896766