DataOps machine

What to look for in a DataOps platform?

The DataOps series is back at last. This time we go through what to look for when choosing your technical approach to DataOps.

Building a resilient data capability can be tricky, as there is a vast number of viable options for how to proceed. Price and marketing materials don’t tell the whole story, and you need to decide what is valuable for your organization when building a data platform.

Before going into actual features, I want to share a few thoughts on cloud computing platforms and their native services and components. Is it always best to stay platform native? All those services are “seamlessly integrated” and easy to take into use. But to what extent do those components support building a resilient data platform? You could say that depends on how you define a “data platform”.

My argument is that a data warehouse is still a focal component of an enterprise-wide data platform, and when building one, the requirements are broader than what the cloud platforms themselves can offer at the moment.

But let’s return to this argument later and first go through the requirements you might have for DataOps tooling. There are some basic features like version control and managing the environments as code (Infrastructure as Code), but let’s concentrate on the more interesting ones.

Model your data and let the computer generate the rest

This is the beef of classic data warehousing. Even if you call it a data platform instead of a data warehouse, build it in the cloud and use development methods that originate from software development, you still need to model your data and also integrate and harmonize it across sources. There is sometimes a misconception that if you build a data platform in the cloud and utilize some kind of data lake in it, you do not need to care about data models.

This tends to lead to different kinds of “data swamps”, which can be painful to use, and the burden of creating a data model is pushed onto the applications built on top of the lake.

Sure, there are various schema-on-read services that sit on top of data lakes, but they tend to have shortcomings compared to real analytical databases (in access control, schema changes, handling deletes and updates, query optimizer performance, concurrent query limits, etc.).

To minimize the manual work related to data modelling, you want developers to do only the required logical data modelling and to derive the physical model and data loads from that model. This saves a lot of time, as developers can concentrate on the big picture instead of technical details. Generating the physical model also keeps it consistent, because there are no personal variations in the implementation: developers don’t write the code by hand, it is generated automatically based on the model and commonly shared rules.
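
To make this concrete, below is a minimal sketch of what such generation can look like. The model structure, type mappings and helper functions are hypothetical, not the API of any particular product; a real tool would of course handle keys, history and load patterns in far more depth.

```python
# Minimal sketch (hypothetical structures): a logical model is declared as data,
# and the physical DDL plus a load statement are generated from it using shared
# rules instead of being written by hand for each table.

LOGICAL_MODEL = {
    "customer": {
        "columns": {"customer_id": "integer", "name": "string", "email": "string"},
        "keys": ["customer_id"],
        "source": "staging.crm_customer",
    },
}

TYPE_MAP = {"integer": "BIGINT", "string": "VARCHAR(255)"}  # shared rule, not a personal choice

def generate_ddl(entity: str, spec: dict) -> str:
    cols = ",\n  ".join(f"{name} {TYPE_MAP[t]}" for name, t in spec["columns"].items())
    return f"CREATE TABLE dw.{entity} (\n  {cols},\n  PRIMARY KEY ({', '.join(spec['keys'])})\n);"

def generate_load(entity: str, spec: dict) -> str:
    cols = ", ".join(spec["columns"])
    return f"INSERT INTO dw.{entity} ({cols})\nSELECT {cols}\nFROM {spec['source']};"

for entity, spec in LOGICAL_MODEL.items():
    print(generate_ddl(entity, spec))
    print(generate_load(entity, spec))
```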

Deploy all the things

First of all, use several separate environments. Developers need an environment where it is safe to try things out without the fear of breaking anything or hampering the production system. You also need an environment where production-ready features can be tested with production-like data. And of course you need the production environment, which exists only to serve the actual needs of the business. Some can cope with two environments and some prefer four, but the separation of environments and the automation of deployments are the key.

In the past, it has been a hassle to move data models and pipelines from one environment to another, but it should not be like that anymore.

Continuous integration and deployment are the heart of a modern data platform. The process can be defined once, and automation handles promoting changes between environments happily ever after.
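
As a rough illustration (the environment names, package format and deploy function below are made up for this sketch, not any specific tool’s interface), the promotion logic a CI/CD pipeline runs can be as simple as applying the same versioned package to whichever environment the pipeline targets:

```python
# Minimal sketch: the same versioned package is promoted through environments
# by automation; only the connection configuration differs per environment.

ENVIRONMENTS = {
    "dev":  {"dsn": "dev-warehouse.example.com"},
    "test": {"dsn": "test-warehouse.example.com"},
    "prod": {"dsn": "prod-warehouse.example.com"},
}

def deploy(package: dict, env: str) -> None:
    """Apply every statement in the package to the target environment."""
    target = ENVIRONMENTS[env]
    print(f"Deploying version {package['version']} to {env} ({target['dsn']})")
    for statement in package["statements"]:
        # A real pipeline would execute this against the database; printing
        # keeps the sketch self-contained.
        print(f"  applying: {statement}")

package = {"version": "1.4.0",
           "statements": ["ALTER TABLE dw.customer ADD COLUMN email VARCHAR(255);"]}
deploy(package, "test")   # the exact same package is later promoted with deploy(package, "prod")
```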

It would also be good if your development tooling supports “late building”, meaning that you do the development in a DBMS (Database Management System) agnostic way and the target environment is taken into account only in the deployment phase. This means you can change the target database engine without too much overhead, which can potentially save a lot of money in the future. In short, late building helps you avoid database lock-in.
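
A minimal sketch of the late-building idea (the dialect table below is hypothetical and far simpler than anything real): the model is stored in a neutral form, and target-specific SQL is rendered only when a deployment is built.

```python
# Minimal sketch of "late building": column definitions are kept DBMS-agnostic,
# and the target dialect is chosen only at deployment time.

DIALECTS = {
    "snowflake": {"string": "VARCHAR", "timestamp": "TIMESTAMP_NTZ"},
    "redshift":  {"string": "VARCHAR(256)", "timestamp": "TIMESTAMP"},
    "synapse":   {"string": "NVARCHAR(256)", "timestamp": "DATETIME2"},
}

def render_column(name: str, neutral_type: str, dialect: str) -> str:
    return f"{name} {DIALECTS[dialect][neutral_type]}"

# The same neutral definition produces valid DDL fragments for different engines:
for dialect in ("snowflake", "redshift", "synapse"):
    print(dialect, "->", render_column("load_time", "timestamp", dialect))
```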

Automated orchestration

Handling workflow orchestration can be a heck of a task if done manually: when can a certain run start, what dependencies does it have, and so on. The DataOps way of doing orchestration is to automate it fully. Orchestrations can be populated based on the logical data model and the references and dependencies it contains.

In this approach the developer can concentrate on the data model itself, while automation optimizes the workflow orchestration to fully utilize the parallel performance of the underlying database. If you scale the performance of your database, the orchestrations can be regenerated to take this into account.

The orchestrations should be populated so that concurrency is controlled and fully utilized: steps that can run in parallel do run in parallel. When you add a step to a pipeline, the changes to the orchestration should be deployed automatically. It sounds like a small thing, but in a large environment it really saves time and nerves.
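
A minimal sketch of how such an orchestration can be derived (the dependency data here is a made-up example; in practice the dependencies would come from the model metadata): the loads form a DAG, and a simple topological grouping yields waves of loads that can run in parallel.

```python
# Minimal sketch: derive an orchestration from load dependencies.
# Loads with no unmet dependencies form a "wave" and can run in parallel;
# the next wave is computed after the previous one completes.

DEPENDENCIES = {           # load -> loads it depends on (hypothetical example)
    "stg_customer": [],
    "stg_order": [],
    "dw_customer": ["stg_customer"],
    "dw_order": ["stg_order", "dw_customer"],
    "mart_sales": ["dw_order"],
}

def build_waves(deps):
    """Group loads into waves; each wave only contains loads whose dependencies are met."""
    remaining = dict(deps)
    waves = []
    while remaining:
        ready = sorted(load for load, needs in remaining.items()
                       if not set(needs) & set(remaining))
        if not ready:
            raise ValueError("Circular dependency in the data model")
        waves.append(ready)
        for load in ready:
            del remaining[load]
    return waves

for i, wave in enumerate(build_waves(DEPENDENCIES), start=1):
    print(f"wave {i}: run in parallel -> {wave}")
# wave 1: ['stg_customer', 'stg_order'], wave 2: ['dw_customer'], ...
```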

Assuring the quality

It’s important that you can control data quality. At best, you integrate data testing into your workflows so that quality checks are triggered every time the flow runs and possible problems get reacted to.
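
As a minimal sketch of what testing inside the flow can mean (the checks and the data below are made up for illustration), each load step can be followed by assertions over the loaded data, and a failing assertion stops or flags the run instead of silently letting questionable data through.

```python
# Minimal sketch: quality checks that run as a step of the load workflow.
# If a check fails, the run is flagged (or stopped) instead of publishing
# questionable data to end users.

def check_row_count(rows, minimum):
    return [] if len(rows) >= minimum else [f"expected at least {minimum} rows, got {len(rows)}"]

def check_not_null(rows, column):
    missing = sum(1 for r in rows if r.get(column) is None)
    return [] if missing == 0 else [f"{missing} row(s) have NULL in {column}"]

def run_quality_gate(rows):
    problems = check_row_count(rows, minimum=1) + check_not_null(rows, "customer_id")
    if problems:
        # In a real pipeline this would alert the team and mark the run as failed.
        raise RuntimeError("Data quality gate failed: " + "; ".join(problems))

loaded = [{"customer_id": 1, "name": "Acme"}, {"customer_id": None, "name": "Unknown"}]
try:
    run_quality_gate(loaded)
except RuntimeError as err:
    print(err)   # Data quality gate failed: 1 row(s) have NULL in customer_id
```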

Data lineage can help you understand what happens to the data before it ends up in the hands of end users.

It can also be a tool for end users to better understand what the data is and where it comes from. You could also tag elements in your warehouse so that it’s easier to find everything related to, for example, personally identifiable information (PII).
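
A minimal sketch of such tagging (the metadata structure is hypothetical): once columns carry tags, finding everything related to PII becomes a metadata query instead of a manual audit.

```python
# Minimal sketch: tag columns in warehouse metadata and query the tags,
# e.g. to list everything that stores personally identifiable information.

COLUMN_TAGS = {
    ("dw.customer", "email"):       {"PII"},
    ("dw.customer", "name"):        {"PII"},
    ("dw.order", "order_total"):    {"finance"},
    ("dw.order", "delivery_notes"): {"PII", "free_text"},
}

def columns_with_tag(tag):
    return sorted(key for key, tags in COLUMN_TAGS.items() if tag in tags)

print(columns_with_tag("PII"))
# [('dw.customer', 'email'), ('dw.customer', 'name'), ('dw.order', 'delivery_notes')]
```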

Cloud is great but could it be more?

So, about the cloud computing platforms. Many of the previously mentioned features can be found in, or built with, native cloud platform components. But how about keeping at least the parts you invest the most work in platform agnostic, in case you have to move platforms at some point for some reason (changes in pricing, corporate mergers & acquisitions, etc.)? For big corporations it is also increasingly common to use more than one cloud platform, and common services and tooling across clouds can be a guiding factor as well.

Data modelling in particular is an integral part of data platform development, but it is still not integrated into the cloud platforms themselves at a sufficient level.

In the past, data professionals were used to a fairly seamless developer experience with a limited amount of software needed to do the development. On cloud platforms this changed, as you needed more services, most with a somewhat different user experience. This has since changed a bit as cloud vendors have started to package services (such as AWS SageMaker or Azure DevOps), but we are still in the early phases of packaging that kind of tooling.

If DataOps capabilities are something you would like to get out of the box in your data platform, check out our DataOps platform, Agile Data Engine. It enables you to significantly reduce time to value from a business perspective, minimize total cost of ownership and make your data platform future-proof.

Also check out the previous posts in the DataOps series:

DataOps – new kid on the data block

Why and how to enable DataOps in an organization?

Why and how to enable DataOps in an organization?

It can be a daunting task to drive a DataOps initiative forward in an organization. By understanding its implications, you will increase your odds of succeeding.

When talking with my colleague about the introductory post of the blog series, I was asked if we could jump straight into the technological side of DataOps in the second part. Unfortunately not. Technology is an important part of the phenomenon, but the soft side is even more important.

Defining the undefined

There are still no official standards or frameworks for DataOps. So how can we even talk about enabling DataOps in an organization if we only understand it on a high level? To get things started, we have to break down the concept.

The DataOps Manifesto [1] that DataKitchen has brewed together does a good job of describing the principles that are part of DataOps. However, it is somewhat limited to an analytics/machine learning point of view, and modern data platforms can be much more than that. Gartner [2] has the same spirit in its definition, but it is not as tightly scoped to analytics; the focus is more on thinking of DataOps as a tool for organizational change in the data domain. The CAMS model, coined by Damon Edwards and John Willis [3] to describe DevOps, works fine for DataOps as well. CAMS stands for Culture, Automation, Measurement and Sharing. As you can see, automation is only one of the four elements. Today we will dive into the cultural aspect.

DataOps culture everywhere

How do you build a DataOps culture? One does not simply build culture. Culture is a puzzle that is put together piece by piece. Culture will eat your beloved DataOps initiative for breakfast if you try to drive it forward as a technology project. You can’t change culture directly. But you can change behavior, and behavior becomes culture [3].

Let’s take an example of what the implications of the cultural aspects can be. I like the “Make it work, make it fast, make it last” mentality, which prioritizes delivering value fast and making things last once the business already benefits from the solution. The problem is that culture seldom supports the last of the three.

Once value has been delivered, no one prioritizes the work of making solutions last, as it does not produce immediate business value.

By skipping the last part, you slowly accumulate technical debt, which means that a larger part of development time goes to unplanned work instead of producing new features for the business. The term death spiral (popularized in the book The Phoenix Project [4]) describes the phenomenon well.

Kaizen (改善)

An important part of DataOps is that the organization makes a collective commitment to high quality. If this is compromised, the maintenance cost of your data platform slowly starts to rise and new development gets slower. Related to this, we also need some Kaizen mentality. Kaizen (kai/改 = change, zen/善 = good) means continuous improvement that involves everyone. In the data development context this means that we continuously look for inefficiencies in our development processes and also prioritize the work that removes that waste. But can you limit the effects to development alone? Not really. This can affect all stakeholders involved with your data, meaning you should understand your data value streams in order to control the change.

As Gartner [2] states, the “focus and benefit of DataOps is as a lever for organizational change, to steer behaviour and enable agility”. DataOps can be utilized as a tool for organizational transformation. Typical endgame goals for DataOps are a faster lead time from idea to business value, reduced total cost of ownership, and empowered developers and users.

Faster time to value

This is usually the main driver for DataOps. The time from idea to business value is crucial for an organization to flourish. Lead-time reduction comes from a faster development process and less waiting between phases, but also from the fact that building and releasing in smaller increments makes it possible to take solutions into use gradually. Agile methodology and lean thinking play a big part in this, and the technology is there to support them.

If your data development cycle is too slow, it tends to lead to shadow IT: each business unit builds its own solution because it feels it has no other choice. Long development cycles also mean that you will build solutions no one uses. The faster you get feedback, the better you can steer development and build the solution the customer actually needs instead of what the initial request described (let’s face it, at the beginning you usually have no clue about all the details needed to get the solution done).

All in all, faster time to value should be quite a universal goal because of its positive effects on the business.

Reduced Total Cost of Ownership (TCO)

Reduced TCO is a consequence of many drivers. The hypothesis is that the quality of solutions is better, resulting in less error fixing, faster recovery times and less unplanned work in general.

Many cloud data solutions have started small and gradually grown larger and larger. By the time you realize that you need some sort of governance and shared practices, the environment can already be challenging to manage. By utilizing DataOps you can make the environment much more manageable, more secure and easier to develop in.

Empowered developers and users

One often overlooked factor is how DataOps affects the people responsible for building the solutions. Less manual work and more automation means that developers can focus more on business problems and less on monkey work. This can make the content of the work more meaningful. At the same time it can create skill gaps that are agonizing for the individual and also a challenge for the organization in terms of how to organize the work. Development work can actually also feel more stressful, as there is less waiting (for loads to complete, etc.) and more actual concentrated development.

Some definitions of DataOps [5] emphasize its collaborative and communicative side. Better collaboration builds trust in the data and between different stakeholders. Faster development cycles can also bring developers and users closer to each other and engage the users in the development itself. This can raise enthusiasm among end users and break the disbelief that data development can’t support business processes fast enough.

Skills and know-how

One major reason DataOps is hard to approach is that, technically, it requires a remarkably different skill set than traditional ETL/data warehouse development. You still need to model your data. (There seems to be a common misconception that you just put all your data into a data lake and then utilize it. Believe me, database management systems (DBMS) were not invented by accident, and there really is a need for curated data models. But that is a different story.)

You also need to understand the business logic behind your data. This remains the trickiest part of data integration, as it requires a human to interpret and integrate the logic.

So on a higher level you are still doing the same things: integrating and modelling your data. But the technologies and development methods used are different.

Back in the day, as an ETL/DW developer you could do almost all of your work with one GUI-oriented tool, be it Informatica, SSIS or DataStage, for example. This changes in the DataOps world. As a developer you should know cloud ecosystems and their components, be able to code (Python, NodeJS and C# are a good start), understand serverless and its implications, and be committed to continuous integration and everything it requires.

And the list goes on. It’s overwhelming! Well, it can be if you try to convert your developers to DataOps without help. There are ways to make the change easier by using automation and prebuilt modular components, but I will still tease you a bit on this one and come back to the solutions later, as this is a big and important subject.

Yesterday’s news

One could argue that data engineers and organizations have been doing this DataOps stuff for years, but at least from what I have seen, the emphasis has been on the data development side and the operations part has been an afterthought.

This has led to “data platforms” that are technologically state of the art, but when you show the environment to an experienced data professional, she is horrified. Development has been done to produce quick business value, but the perceived value is rigged: it starts to fade as the data platform becomes a burden to maintain and the point solutions all live their own lives instead of producing cumulative value on top of each other.

Success factors for enabling DataOps

Finally, I would like to share a few pointers on how to improve your chances of succeeding if you dare to embark on your DataOps journey. Unfortunately, by blindly adopting “best practices” you can fall victim to a cargo cult, meaning that you try to adopt practices that do not work in your organization. Still, there are some universal things that can help you make DataOps work in your organization.

Start small and show results early

You need to build trust in what you are building. What has worked in organisations is vertical slicing (building a narrow end-to-end solution) and delivering value as soon as possible. This proves that the new ways of working bring the promised benefits, and you’ll get a mandate to take your initiative forward.

Support from senior executives

Oh, the classic. Even if it might sound a bit clichéd, you still can’t get away from it. You will need a high-level sponsor to improve your odds of succeeding. Organizational changes are built bottom-up, but it will smooth your way if you get support from someone high up. As DataOps is not only a technological initiative, you will need to break a few established processes along the way. People will question you, and having someone backing you up can help a great deal!

Build cross-functional teams

If you don’t have a team full of unicorns, you will be better off mixing different competences in your development team and not framing roles too tightly. Also mix your own people with consultants who have deep knowledge in certain fields. You need expertise both in data development and in operating the cloud. The key is to enable close collaboration so that people learn from each other and the flow of information is seamless.

But remember the most important thing! Without it you will not succeed: actually start, and commit to the change. Like all initiatives that affect people’s daily routines, this one will be resisted too. Show the value, engage people and you will prevail!

[1] The DataOps Manifesto. http://dataopsmanifesto.org/dataops-manifesto.html
[2] Nick Heudecker, Ted Friedman, Alan Dayley. Innovation Insight for DataOps. 2018. https://www.gartner.com/document/3896766
[3] John Willis. DevOps Culture (Part 1). 2012. https://itrevolution.com/devops-culture-part-1/
[4] Gene Kim, Kevin Behr, George Spafford. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. 2013. IT Revolution Press.
[5] Andy Palmer. From DevOps to DataOps. 2015. https://www.tamr.com/from-devops-to-dataops-by-andy-palmer/


DataOps – new kid on the data block

DataOps is a set of practices that aims to automate the delivery of data and data models and make change management more predictable. The DataOps phenomenon illustrates the shift from traditional data warehousing to modern data platforms. It is not something that can simply be brought in as an add-on: it requires a fundamental change in mindset rather than just a new set of tools. This blog post starts a new series in which we will get familiar with the concept of DataOps, go through the benefits that utilizing DataOps can offer, and look at examples of how it can be applied when building data platforms.

When meeting customers that are starting to build new data warehouses and platforms, I have started hearing the requirement that “we want our solution to follow DataOps principles” or “to be DataOps compatible”. At the same time, Gartner [1] recognizes DataOps on its latest Data Management Hype Cycle as the most emergent concept on the chart. While DataOps sits in the Innovation Trigger phase, Gartner expects that adoption of the practice will likely remain limited over the next 12 to 18 months.

So DataOps is clearly not yet a mainstream thing in building data solutions, but it has raised a lot of interest among people who actively follow the evolution of the market. As we at Solita have been utilizing DataOps practices in our solutions for several years already, it is easy to forget that the rest of the world is not there yet. Gartner [2] states that the current adoption rate of DataOps is estimated at less than 1% of the addressable market, so if DataOps really is something worthwhile, there is a lot to be done before it becomes the common way of working.

Several software vendors provide solutions for DataOps (Composable, DataKitchen and Nexla, for instance), but are they the real deal or are they selling snake oil? It’s hard to tell. Then again, should DataOps even be something to go after? We evidently need to understand what forces could drive DataOps from an emergent concept to the mainstream.

From DevOps to DataOps

Before going any further into DataOps itself, let’s first look at where it is coming from and what has made it more relevant than before. We’ll start with DevOps. DevOps has become the prevalent methodology in software development in recent years. It has changed the way of thinking about delivering new features and fixes to production more frequently while ensuring high quality. DevOps is nowadays the default way of developing and operating new software. But how does this relate to DataOps, and what do we actually know about it?

To better understand the concept of DataOps we need to go through how building data solutions has changed in the recent past. 

A few years ago, the predominant way of developing data solutions was to pick an ETL tool and a database, install and configure them on your own or leased (IaaS) hardware, and start bringing in those source databases and CSV files. This basically meant a lot of manual work in creating tables and making hundreds and thousands of field mappings in your beloved ETL tool.

Reach for the clouds

Cloud platforms such as Microsoft Azure or Amazon Web Services have changed the way data solutions are developed. In addition, the Big Data trend brought new types of data storage (Hadoop and NoSQL) to the table. When speaking with different customers, I have noticed that there has even been a terminological shift from “data warehousing” to “building data platforms”. Why is this? Traditionally, the scope of data warehouses has been to serve finance, HR and other functions in their mandatory reporting obligations. However, both the possibilities and the ambition level have moved forward, and nowadays these solutions have much more diverse usage. Reporting has not gone anywhere, but the same data is used at the strategic, tactical and operative levels. Enriching the organization’s data assets with machine learning can mean better and more efficient data-driven processes and improvements in value chains that lead to real competitive advantage, as we have seen in several customer cases.

There is also more volume, velocity and variety in the source data (I hate the term Big Data but its definition is fine).

In addition to internal operative systems, the data can come from IoT devices, different SaaS services in divergent semi-structured formats, and so on. It is also common that the same architecture supports use cases that combine hot and cold data, and that some parts must update in near real time.

You can build a “cloud data warehouse” by lifting and shifting your on-premises solution. This can mean, for example, installing SQL Server on an Azure virtual machine or running an RDS database on AWS, and using the same data model you have used for years. You can load this database with your go-to ETL tool in the same way as before. Unfortunately, replicating your on-premises architecture in the cloud will not bring you the benefits (performance, scalability, globality, new development models) of a cloud-native solution and might even introduce new challenges in managing the environment.

To boldly go to the unknown

Building data solutions on cloud platforms can feel like a daunting task. Instead of your familiar tools you will face a legion of new services and components that require more coding than using a graphical user interface. It is true that if you start from scratch and try to build your data platform all by yourself, there is a lot of coding to be done before you start creating value, even if you use all the available services on the selected platform.

This is where DataOps kicks in!

Like DevOps, DataOps sets out a framework and practices (and possibly even tools) so that you can concentrate on creating business value instead of spending time on non-profitable tasks. DataOps covers infrastructure management, development practices, orchestration, testing, deployment and monitoring. If truly embraced, DataOps can transform the release schedule your data warehouse is accustomed to. Instead of involving several test managers in a burdensome testing process, you may be able to move to a fully automated continuous release pipeline and make several releases to production each day. This is something that is hard to believe in a data warehousing context before you see it in action.

Innovator’s dilemma

Competition forces companies to constantly innovate. As data has become a central resource for innovation, it is crucial that data architectures support experimentation and, subsequently, innovation. Unfortunately, legacy data warehouse architectures seldom spur innovative solutions, as they can be clunky to develop and most of your limited budget goes to maintaining the solution. You also have to cope with the burden of tradition, as many processes and procedures have been in place for ages and are hard to change. In addition, the skill set of your personnel focuses on the old data architecture, and it takes time to teach them new ways of working. Still, I believe that cloud data platforms are the central piece an organization needs to make data-driven innovations. By watching from the sidelines for too long, you risk letting your competitors get too far ahead of you.

If you want to increase innovation, you need to cut the cost of failure. If failure (or learning) actually improves your standing, then you will take risks. Without risk there will be no reward.

In the next parts of this blog series we will take a closer look at the different parts of DataOps, think about how actual implementations can be made, and consider what implications DataOps has on needed competences, organizational structures and processes. Stay tuned!

REFERENCES

[1] Donald Feinberg, Adam Ronthal. Hype Cycle for Data Management. 2018. https://www.gartner.com/document/3884077

[2] Nick Heudecker, Ted Friedman, Alan Dayley. Innovation Insight for DataOps. 2018. https://www.gartner.com/document/3896766
