Cloud data transformation
Tipi shared thoughts on how data assets could be utilized on Cloud. We had few question after blog post and one of those was “how to tackle real time requirements ?
Let’s go real time ?
Real-time business intelligence is a concept describing the process of delivering business intelligence or information about business operations as they occur. Real time means near to zero latency and access to information whenever it is required.
We all remember those nightly batch loads and preprocessing data – waiting a few hours before data is ready for reports. Someone is looking if sales numbers are dropped and the manager will ask quality reports from production. Report is evidence to some other team what is happening in our business.
Let’s go back to the definition that says “information whenever it is required” so actually for some of the team(s) even one week or day can be realtime. Business processes and humans are not software robots so taking action based on any data will take more than a few milliseconds so where is this real time requirement coming from ?
Marko had a nice article related to OT systems and Factory Floor and Edge computing. Any factory issue can be a major pain and downtime is not an option and explained how most of the data assets like metrics and logs must be available immediately in order to recover and understand the root cause.
Hyperscalers and real time computing
In March 2005, Google acquired the web statistics analysis program Urchin, later known as Google Analytics. That was one of the customer facing solutions to gather massive amount of data. Industrial protocols like Modbus from 1970 was designed to work real time on that time and era. Generally speaking real time computing has three categories:
- Hard – missing a deadline is a total system failure.
- Firm – infrequent deadline misses are tolerable, but may degrade the system’s quality of service. The usefulness of a result is zero after its deadline.
- Soft – the usefulness of a result degrades after its deadline, thereby degrading the system’s quality of service.
So it’s easy to understand that airplane turbine and rolling 12 months sales forecast have different requirements. .
What is the cost of (data) delay ?
“A small boat that sails the river is better than a large ship that sinks in the sea.”― Matshona Dhliwayo
We can simply estimate the value a specific feature would bring in after its launch and multiply this value with the time it will take to build. That will tell the economic impact that postponing a task will have.
High performing teams can do cost of delay estimation to understand which task should take first. Can we calculate and understand the cost of delayed data? How much that will cost to your organization if service or product must be postponed because you are missing data or you can not use it.
Start defining real-time
You can easily start discussing what kind of data is needed to improve customer experience. Real time requirements might be different for each use case and that is totally fine. It’s a good practice to specify near real time requirements in factual numbers and few examples. It’s good to remember that end to end can have totally different meanings. Working with OT systems for example the term First Mile is used when protect and connect OT systems with IT.
Any equipment failure must be visible to technicians at site in less than 60 seconds. ― Customer requirement
Understand team topologies
Incorrect team topology can block any near real time use cases. That means that adding each component and team deliverable to work together might end up having unexpected data delays. Or in the worst case scenario a team is built too much around one product / feature that will have come a bottleneck later when building more new services.
Data as a product refers to an idea where the job of the data team is to provide the data that the company needs. Data as a Service team partners with stakeholders and have more functional experience and are responsible for providing insight as opposed to rows and columns. Data Mesh is about the logical and physical interconnections of the data from producers through to consumers.
Team topologies will have a huge impact on how data driven services are built and can data land to business case purposes just on the right time.
Enable Edge streaming and APIs capabilities
On cloud services like AWS Kinesis is great, it is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second. Apache Kafka is a framework implementation of a software bus using stream-processing. Apache Spark is an open-source unified analytics engine for large-scale data processing.
I am sure that at least one of these you are already familiar with. In order to control data flow we have two parameters: amount of messages and time. Which will come first will se served.
Is your data solution idempotent and able to handle data delays ? ― Customer requirement
Modern purpose-built databases have capability to process streaming data. Any extra layer of data modeling will add a delay for data consumption. On Edge we typically run purpose-built robust database services in order to capture all factory floor events with industry standard data models.
Site and Cloud API is a contact between different parties and will improve connectivity and collaboration. API calls on Edge works nicely and you can have data available in less than 70-300ms from Cloud endpoint (example below). Same data is available on Edge endpoint where client response is even faster so building factory floor applications is easy.
curl --location --request GET 'https://data.iotsitewise.eu-west-1.amazonaws.com/properties/history?assetId=aa&maxResults=1&propertyId=pp --header 'X-Amz-Date: 20211118T152104Z' --header 'Authorization: AWS4-HMAC-SHA256 Credential=xxx, SignedHeaders=host;x-amz-date, Signature=xxxx
Quite many databases has built-in Data API. It’s still good to remember that underlying engine, data model and many factors will determine how scalable solution really is.
AWS GreenGrass StreamManager is a component that enables you to process data streams to transfer to the AWS Cloud from Greengrass core devices. Other services like Firehose is supported using specific aws.greengrass.KinesisFirehose component. These components will support also building Machine Learning (ML) features on Edge as well.
Business case will define the requirement of real time. Build your near real time capabilities according to your future proof architecture – adding real time capabilities later might come almost impossible.
If business case is not clear enough what should I do ? Maybe a cup of tea, relax and read blog post from Johannes The gap between design thinking and business impact
Let’s stay fresh-minded !