Building a resilient data capability can be tricky because there is a vast number of viable ways to proceed. Price and marketing materials don’t tell the whole story, and you need to decide what is valuable for your organization when building a data platform.
Before going into actual features, I want to share a few thoughts on cloud computing platforms and their native services and components. Is it always best to just stay platform native? All those services are “seamlessly integrated” and easy to take into use. But to what extent do those components support building a resilient data platform? You could say that depends on how you define a “data platform”.
My argument is that a data warehouse is still a focal component of an enterprise-wide data platform, and when building one, the requirements are broader than what the cloud platforms themselves can offer at the moment.
But let’s return to this argument later and first go through the requirements you might have for DataOps tooling. There are some basic features, like version control and control over the environments as code (infrastructure as code), but let’s concentrate on the more interesting ones.
Model your data and let the computer generate the rest
This is the beef of classic data warehousing. Even if you call it a data platform instead of a data warehouse, build it on the cloud and use development methods that originate from software development, you still need to model your data and also integrate and harmonize it across sources. There is sometimes a misconception that if you build a data platform on the cloud and utilize some kind of data lake in it, you don’t need to care about data models.
This tends to lead to different kinds of “data swamps”, which can be painful to use, and the burden of creating a data model is passed on to the applications built on top of the lake.
Sure, there are different schema-on-read services that sit on top of data lakes, but they tend to have some shortcomings compared to real analytical databases (in access control, schema changes, handling deletes and updates, performance of the query optimizer, concurrent query limits, etc.).
To minimize the manual work related to data modelling, you want the developers to do only the required logical data modelling and have the physical model and data loads derived from that model. This saves a lot of time, as developers can concentrate on the big picture instead of technical details. Generating the physical model also keeps it more consistent: developers don’t write the code by hand, so personal implementation styles don’t creep in, and everything is generated automatically based on the model and commonly shared rules.
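To make the idea concrete, here is a minimal sketch of deriving a physical table from a logical model. The `Entity` and `Attribute` classes, the type rules and the `load_time` audit column are assumptions made up for this illustration, not how any particular tool works.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    logical_type: str  # one of the keys in TYPE_RULES below
    is_key: bool = False


@dataclass
class Entity:
    name: str
    attributes: list[Attribute] = field(default_factory=list)


# Commonly shared rules (hypothetical): applied the same way to every entity,
# so no developer-specific style ends up in the physical model.
TYPE_RULES = {"string": "VARCHAR(255)", "integer": "BIGINT", "timestamp": "TIMESTAMP"}
AUDIT_COLUMNS = ["load_time TIMESTAMP NOT NULL"]


def generate_ddl(entity: Entity) -> str:
    """Derive a CREATE TABLE statement from the logical entity."""
    columns = []
    for attr in entity.attributes:
        not_null = " NOT NULL" if attr.is_key else ""
        columns.append(f"{attr.name} {TYPE_RULES[attr.logical_type]}{not_null}")
    columns.extend(AUDIT_COLUMNS)
    keys = [attr.name for attr in entity.attributes if attr.is_key]
    if keys:
        columns.append(f"PRIMARY KEY ({', '.join(keys)})")
    body = ",\n  ".join(columns)
    return f"CREATE TABLE {entity.name} (\n  {body}\n);"


customer = Entity(
    "customer",
    [
        Attribute("customer_id", "integer", is_key=True),
        Attribute("customer_name", "string"),
    ],
)
ddl = generate_ddl(customer)
print(ddl)
```

The developer only describes `customer` on the logical level. Naming, types, keys and audit columns come from the shared rules, so changing a rule changes every generated table in the same way.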
Deploy all the things
First of all, use several separate environments. Developers need an environment where it is safe to try things out without the fear of breaking something or hampering the production system. You also need an environment where production-ready features can be tested with production-like data. And of course you need the production environment, which is there only to serve the actual needs of the business. Some can cope with two environments and some prefer four, but the separation of environments and the automation of deployments are the key.
In the past, it has been a hassle to move data models and pipelines from one environment to another but it should not be like that anymore.
Continuous integration and deployment are the heart of a modern data platform. The process can be defined once, and automation handles changes between environments happily ever after.
It would also be good if your development tooling supports “late building”, meaning that you can do the development in a DBMS (Database Management System) agnostic way and your target environment is taken into account only in the deployment phase. This means that you can change the target database engine to another without too much overhead, which can potentially save a lot of money in the future. In short, by utilizing late building, you can avoid database lock-in.
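As a rough sketch of late building, the snippet below keeps the model in engine-neutral logical types and picks the physical types only when a target is chosen at deployment. The two engines, the deliberately tiny type mappings and the `build` function are assumptions chosen for illustration; a real tool would cover far more than column types.

```python
# Illustrative per-engine type mappings (an assumption, kept intentionally small).
DIALECTS = {
    "postgresql": {"string": "TEXT", "integer": "BIGINT", "timestamp": "TIMESTAMP"},
    "snowflake": {
        "string": "VARCHAR",
        "integer": "NUMBER(38,0)",
        "timestamp": "TIMESTAMP_NTZ",
    },
}

# The model itself never mentions an engine.
ORDER_COLUMNS = {"order_id": "integer", "note": "string", "created_at": "timestamp"}


def build(table: str, columns: dict[str, str], target: str) -> str:
    """Render engine-specific DDL; the target is decided only at deploy time."""
    if target not in DIALECTS:
        raise ValueError(f"unsupported target engine: {target}")
    types = DIALECTS[target]
    cols = ", ".join(f"{name} {types[logical]}" for name, logical in columns.items())
    return f"CREATE TABLE {table} ({cols});"


pg_ddl = build("orders", ORDER_COLUMNS, "postgresql")
sf_ddl = build("orders", ORDER_COLUMNS, "snowflake")
print(pg_ddl)
print(sf_ddl)
```

Switching the engine here means changing one argument in the deployment step, while `ORDER_COLUMNS` and everything built on it stays untouched.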
Handling workflow orchestration can be a heck of a task if done manually: when can a certain run start, what dependencies does it have, and so on. The DataOps way of doing orchestrations is to fully automate them. Orchestrations can be populated based on the logical data model and the references and dependencies it contains.
In this approach the developer can concentrate on the data model itself and automation optimizes the workflow orchestration to fully utilize the parallel performance of the used database. If you scale the performance of your database, orchestrations can be generated again to take this into account.
The orchestrations should be populated in a way that concurrency is controlled and fully utilized, so that things that can be run in parallel are run in parallel. When you add a step to the pipeline, changes to orchestrations should be deployed automatically. It sounds like a small thing, but in a large environment it really saves time and nerves.
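A simplified sketch of generating an orchestration from dependencies could look like the following. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the load names, the dependency map and the `max_parallel` limit are hypothetical, and a real orchestrator would schedule more cleverly than these fixed batches.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies, as they could be derived from references in the
# data model: each load maps to the set of loads it depends on.
DEPENDENCIES = {
    "stg_customer": set(),
    "stg_order": set(),
    "dim_customer": {"stg_customer"},
    "fact_order": {"stg_order", "dim_customer"},
}


def plan_batches(dependencies: dict[str, set[str]], max_parallel: int) -> list[list[str]]:
    """Group loads into batches; loads inside one batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        # Controlled concurrency: never put more than max_parallel in a batch.
        for start in range(0, len(ready), max_parallel):
            batches.append(ready[start:start + max_parallel])
        sorter.done(*ready)
    return batches


wide_plan = plan_batches(DEPENDENCIES, max_parallel=2)
narrow_plan = plan_batches(DEPENDENCIES, max_parallel=1)
print(wide_plan)
print(narrow_plan)
```

Adding a new load is just a new entry in `DEPENDENCIES`, and if the database is scaled up, regenerating the plan with a higher `max_parallel` is enough to use the extra capacity.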
Assuring the quality
It’s important that you can control the data quality. Ideally, you integrate data testing into your workflows so that the quality measures are triggered every time your flow runs and possible problems are reacted to.
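One way to picture data testing inside a workflow is a load step that always ends by running a set of checks. The sketch below uses an in-memory SQLite database from the Python standard library; the `customer` table, the check names and the rule that a check passes when its query returns zero are all assumptions for illustration.

```python
import sqlite3

# Hypothetical checks: each query counts offending cases, so 0 means "passed".
CHECKS = {
    "customer_id is never null": (
        "SELECT COUNT(*) FROM customer WHERE customer_id IS NULL"
    ),
    "customer_id is unique": (
        "SELECT COUNT(*) FROM ("
        "SELECT customer_id FROM customer GROUP BY customer_id HAVING COUNT(*) > 1)"
    ),
}


def run_checks(conn: sqlite3.Connection, checks: dict[str, str]) -> list[str]:
    """Run every check and return a description of each one that failed."""
    failures = []
    for name, sql in checks.items():
        offending = conn.execute(sql).fetchone()[0]
        if offending > 0:
            failures.append(f"{name}: {offending} offending case(s)")
    return failures


def run_flow(conn: sqlite3.Connection, rows: list[tuple]) -> list[str]:
    """Load the rows, then run the quality checks as part of the same flow."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer (customer_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    return run_checks(conn, CHECKS)


bad_failures = run_flow(
    sqlite3.connect(":memory:"), [(1, "Ada"), (1, "Grace"), (None, "Linus")]
)
good_failures = run_flow(sqlite3.connect(":memory:"), [(1, "Ada"), (2, "Grace")])
print(bad_failures)
print(good_failures)
```

What happens with the returned failures is up to you: stopping the flow, alerting someone or just logging are all reasonable reactions depending on the check.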
Data lineage can help you understand what happens with the data before it ends up in the use of end users.
It can also be a tool for the end users to better understand what the data is and where it comes from. You could also use tags for elements in your warehouse so that it’s easier to find everything related to, for example, personally identifiable information (PII).
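Lineage and tags can both be thought of as simple metadata lookups. In the sketch below, the element names, the `LINEAGE` map and the `TAGS` map are hypothetical; in practice this metadata would come from your modelling tool or data catalog.

```python
# Hypothetical lineage metadata: each element maps to its direct sources.
LINEAGE = {
    "report_sales": ["fact_order", "dim_customer"],
    "fact_order": ["stg_order"],
    "dim_customer": ["stg_customer"],
    "stg_order": [],
    "stg_customer": [],
}

# Hypothetical tags attached to warehouse elements.
TAGS = {
    "stg_customer": {"PII"},
    "dim_customer": {"PII"},
    "fact_order": set(),
}


def upstream(element: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return everything the element is built from, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(element, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage.get(node, []))
    return seen


def tagged(tag: str, tags: dict[str, set[str]]) -> list[str]:
    """Return all elements carrying the given tag, sorted by name."""
    return sorted(name for name, element_tags in tags.items() if tag in element_tags)


report_sources = upstream("report_sales", LINEAGE)
pii_elements = tagged("PII", TAGS)
pii_behind_report = sorted(report_sources & set(pii_elements))
print(report_sources)
print(pii_elements)
print(pii_behind_report)
```

Combining the two answers questions like "which PII elements feed this report?", which is exactly the kind of thing end users and auditors tend to ask.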
Cloud is great but could it be more?
So, about the cloud computing platforms. Many of the previously mentioned features can be found in, or built with, native cloud platform components. But how about keeping at least the parts you invest heavily in platform agnostic, in case you at some point have to move to another platform (reasons can be changes in pricing, corporate mergers & acquisitions, etc.)? Big corporations also more commonly utilize more than one cloud platform, so common services and tooling across the clouds can be a guiding factor as well.
Data modelling especially is an integral part of data platform development but it still is not something that’s integrated in the cloud platforms themselves on a sufficient level.
In the past, data professionals were used to a quite seamless developer experience, with a limited amount of software required to do the development. On cloud platforms this changed, as you needed more services, most with a slightly different user experience. This has since improved a bit as cloud vendors have started to “package” the services (such as AWS SageMaker or Azure DevOps), but we are still in the early phases of packaging that kind of tooling.
If DataOps capabilities are something you would like to see out of the box in your data platform, go check out our DataOps platform called Agile Data Engine. It enables you to significantly reduce time to value from a business perspective, minimize total cost of ownership and make your data platform future-proof.