Introduction to Edge AI with HPE Ezmeral Data Fabric

In this blog, we will talk about how technology has shifted from on-premises data centers to the cloud and from the cloud to the edge. Then, we will explain what a data fabric is, introduce HPE Ezmeral Data Fabric, and investigate its capabilities. Finally, we will talk about Edge AI with HPE Ezmeral Data Fabric.

To see what Edge AI is, we need to take a deeper look at the history of data processing over time.

The evolution of data-intensive workloads

On-premises data centers

Back in 2000, almost everything was running locally in on-premises data centers. This meant that everything from management to maintenance was on the company’s shoulders. That worked for a while, but as businesses became more and more dependent on the internet, they faced some challenges. Here are some of the most important ones:

Infrastructure inflexibility

New services and technologies are released all the time, so a business has to take into account that it may need to update its infrastructure or apply changes to its services.

This can be challenging when it comes to hardware changes. The only solution seems to be purchasing the desired hardware and then configuring it manually. It gets worse if, at some point, we realize that the new changes are not beneficial. In that case, we have to start all over again!

This inflexibility wastes money and energy.

How about scaling on demand

A good business invests a lot of money to satisfy its customers. This can be seen from different angles, but one of the most important is always having the capacity to respond to clients as quickly as possible. The same rule applies to the digital world: even loyal customers might change their minds if they see that the servers are not responding because they have reached their maximum capacity.

Therefore, demand has to be estimated. The challenging part of this estimation is that demand can spike on certain days of the year, and those spikes have to be forecast. Demand forecasting has many aspects and is not limited to the digital traffic from clients to servers; having a good estimate of the demand for a particular item in the inventory is also highly valuable.

Black Friday is a good example of such a situation. 

There are two ways to cope with this unusually high demand:

  1. Purchase extra hardware to ensure that there will be no delay in responding to the customers’ requests. This strategy seems to be safe, but it has some disadvantages. First, since the demand is high on only certain days, many resources are in idle mode for a long time. Second, the manual configuration of the newly purchased devices should be considered. All in all, it is not a wise decision financially.
  2. Ignore that demand and let customers experience downtime while waiting for servers to become available. As is easy to guess, this is not good for the reputation of the business.

This inflexibility is hard to address, and it gets worse over time.

Expansion 

One might want to expand the business geographically. Along with marketing, there are some technical challenges. 

The issue with the geographical expansion is the delay that is caused by the physical distance between the clients and servers. A good strategy is to distribute the data centers around the world and locate them somewhere closer to the customers.

The configuration of these new data centers along with the security, networking, and data management might be very hard.

Cloud Computing

Given the challenges of on-premises data centers, the first evolution of data-intensive workloads happened around 2010, when third-party cloud providers such as Amazon Web Services and Microsoft Azure were introduced.

They provided companies with infrastructure and services on a pay-as-you-go basis.

Cloud Computing solved many problems with on-premises approaches. 

Risto and Timo have a great blog post about “Cloud Data Transformation” and I recommend checking it out to know more about the advantages of Cloud Computing.

Edge Computing

Over time, more applications were developed, and Cloud Computing seemed to be the proper solution for them. Around 2020, however, Edge Computing got more and more attention as the solution for a group of newly introduced applications that were more demanding.

The common feature of these applications is that they are time-sensitive. Cloud computing can perform poorly in such cases, since the data transmission to the cloud is itself time-consuming.

The basic idea of Edge Computing is to process data close to where it is produced. This decentralization has some benefits such as:

Reducing latency

As discussed earlier, the main advantage of Edge Computing is that it reduces latency by eliminating the data transmission between the data source and the cloud.

Saving Network Bandwidth 

Since the data is processed on edge nodes, network bandwidth is saved. This matters a lot when a continuous stream of data needs to be processed.

Privacy-preserving

Another essential advantage of Edge Computing is that the data does not need to leave its source. Therefore, it can be used in applications where sending data to the cloud or on-prem data centers is not aligned with regulations.

AI applications

Many real-world use cases in the industry were introduced along with the advances in Artificial Intelligence. 

There are two options for deploying the models: Cloud-based AI and Edge AI. There is also another categorization for training the model (centralized and decentralized) but it is beyond the scope of this blog.

Cloud-based AI

With this approach, everything happens in the cloud, from data gathering to training and deploying the model.

Cloud-based AI has many advantages, such as cost savings: it is much cheaper to use cloud infrastructure for training a model than to purchase physical GPU-enabled computers.

The workflow of such an application is as follows: after the model is deployed, new, unseen data from the business unit (or wherever the source of data is) is sent to the cloud, the decision is made there, and the result is sent back to the business unit.

Edge AI

As you might have guessed, Edge AI addresses the time-sensitivity issue. This time, data gathering and model training still happen in the cloud, but the model is deployed on the edge nodes. This change in the workflow not only saves network bandwidth but also reduces latency.

Edge AI opens the doors to many real-time AI-driven applications in the industry. Here are some examples: 

  • Autonomous Vehicles
  • Traffic Management Systems
  • Healthcare systems
  • Digital Twins

Data Fabric

So far, we have discussed a bit about the concepts of Cloud/Edge computing, but as always, the story is different in real-world applications.

We talked about the benefits of cloud computing, but it is important to ask ourselves these questions:

  • What would be the architecture of having such services in the Cloud/Edge?
  • What is the process of migration from on-prem to cloud? What are the challenges? How can we solve them? 
  • How can we manage and access data in a unified manner to avoid data silos?
  • How can we orchestrate distributed servers or edge nodes in an optimized and secure way?
  • How about monitoring and visualization?

Many companies came up with their own, largely manual solutions to the questions above, but a business needs a better way so that it can focus on creating value rather than dealing with these issues. This is where Data Fabric comes into the game.

Data Fabric is an approach for managing data in an organization. Its architecture consists of a set of services that make accessing data easier regardless of its location (on-prem, cloud, edge). This architecture is flexible, secure, and adaptive.

Data Fabric can reduce the integration time, the maintenance time, and the deployment time. 

Next, we will be talking about HPE Ezmeral Data Fabric (Data Fabric is offered as a solution by many vendors, and a comparison between them is beyond the scope of this blog).

HPE Ezmeral Data Fabric

HPE Ezmeral Data Fabric is an Edge to Cloud solution that supports industry-standard APIs such as REST, S3, POSIX, and NFS. It also has an ecosystem package that contains many open-source tools such as Apache Spark and allows you to do data analysis. 

You can find more information about the benefits of using HPE Ezmeral Data Fabric here.

Among these capabilities, there is an eye-catching one named “Data Fabric Event Stream”. This is the key feature that allows us to develop Edge AI applications with HPE Ezmeral Data Fabric.

Edge AI with HPE Ezmeral Data Fabric – application

An Edge AI application needs at least a platform for orchestrating a broker cluster such as Kafka, some processing tools such as Apache Spark, and a data store. Setting this up might not be as easy as it seems, especially in large-scale applications with millions of sensors, thousands of edge sites, and the cloud.

Fortunately, with HPE Ezmeral Data Fabric Event Stream, this task becomes much easier. We will go through it by demonstrating a simple application that we developed.

Once you set up the cluster, the only thing you need to do is install the client on the edge nodes, connect them to the cluster (with a simple maprlogin command), and then enable the services that you want to use.

The event stream is already there; again, it takes just a single command to create a stream and then create topics in it.

For the publisher (also called the producer), you just need to send the data from any source to the broker, and for the subscriber (also called the consumer) the story is the same.
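To make this concrete, here is a minimal publish/subscribe sketch in Python. It assumes the stream and topic have already been created on the cluster as described above, and it uses the Kafka-compatible interface that the Data Fabric Python client mirrors (the confluent-kafka API is shown here); the stream path, topic name and payload are made up for illustration.

```python
# Minimal publish/subscribe sketch against a Data Fabric event stream.
# HPE Ezmeral Data Fabric Event Streams exposes a Kafka-compatible API, and its
# Python client mirrors the confluent-kafka interface used below. With a plain
# Kafka cluster you would also set "bootstrap.servers" in both configs.
from confluent_kafka import Consumer, Producer

STREAM_TOPIC = "/apps/iot_stream:sensor_readings"  # "<stream path>:<topic>"

# Publisher (producer): send data from any source to the broker.
producer = Producer({"client.id": "edge-sensor-01"})
producer.produce(STREAM_TOPIC, value=b'{"sensor": "s1", "temperature": 21.4}')
producer.flush()

# Subscriber (consumer): poll the same topic and process each message.
consumer = Consumer({"group.id": "edge-analytics", "auto.offset.reset": "earliest"})
consumer.subscribe([STREAM_TOPIC])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print("Consumer error:", msg.error())
            continue
        print("Received:", msg.value().decode("utf-8"))
finally:
    consumer.close()
```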

To use open-source tools such as Apache Spark (or, in our case, Spark Structured Streaming), you just need to install them on the MapR client, and the connection between the client and the cluster is established automatically. You can then run a script on the edge nodes and access data in the cluster.
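As an illustration of that last step, the sketch below reads the same stream with Spark Structured Streaming and writes the parsed records to a path on the cluster. The schema, stream path, sink path and the bootstrap-servers placeholder are assumptions for the example; the Kafka source options themselves are standard Spark.

```python
# Sketch: process the event stream with Spark Structured Streaming on an edge node.
# With the Data Fabric client the topic is addressed by its stream path and the
# cluster connection comes from the client configuration, so the bootstrap
# servers value below is only a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("edge-stream-processing").getOrCreate()

schema = StructType([
    StructField("sensor", StringType()),
    StructField("temperature", DoubleType()),
])

readings = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "unused-with-data-fabric-client:9092")
    .option("subscribe", "/apps/iot_stream:sensor_readings")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*")
)

# Write the parsed readings to storage on the cluster (paths are illustrative).
query = (
    readings.writeStream.format("parquet")
    .option("path", "/apps/iot_parsed")
    .option("checkpointLocation", "/apps/iot_checkpoints")
    .start()
)
query.awaitTermination()
```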

Storing data is just as simple: the table can be created with a single command, and writing to it is also straightforward.

Conclusion

To sum up, Edge AI has a promising future, and leveraging it with different tools such as Data Fabric can be a game changer.

Thank you for reading this blog! I would also like to invite you to our talk about the benefits of Edge Computing in Pori on 23/09/2022!

More information can be found here.

Sadaf Nazari.

Metadata driven development realises “smart manufacturing” of data ecosystems – Part 1

Data development follows a few steps behind the evolution of car manufacturing. Waterfall development is like chain manufacturing. An agile team works like a manufacturing cell. Metadata plays the role of the digital twin in smart manufacturing. This is the next step in the evolution.

Business Problem

A lot of companies are making a digital transformation with ambitious goals for creating high value outcomes using data and analytics. Leveraging siloed data assets efficiently is one of the biggest challenges in doing analytics at scale. Companies often struggle to provide an efficient supply of trusted data to meet the growing demand for digitalization, analytics, and ML. Many companies have realised that they must resolve data silos to overcome these challenges.

Business Solution

Metadata enables finding, understanding, management and effective use of data. Metadata should be used as a foundation and as an accelerator for any development of data & analytics solutions. Poor metadata management has been one of the key reasons why the large “monolith” data warehouses have failed to deliver their promises.

Data Development Maturity Model

Many companies want to implement a modern cloud-based data ecosystem. They want to migrate away from the old monolith data warehouses. It is important to know the history of how the monoliths were created to avoid repeating the old mistakes.   

The history of developing data ecosystems – data marts, lakes, and warehouses – can be illustrated with the below maturity model & analogy to car manufacturing.

Data development maturity model

1. Application Centric Age

In the application centric age, the dialogue between business & IT related mostly to functionality needs. Realising the functionality would require some data, but the data was treated just as a by-product of the IT applications and their functionality.

The artisan workshop created customised solutions tuned for a specific use case & functionality – like custom cars or data marts – which were not optimised from a data reuse (semantics & integrity etc.) point of view.

Pitfalls – Spaghetti Architecture

Projects were funded, planned, and realised in organisational/ departmental silos. Data was integrated for overlapping point solutions for tactical silo value. Preparation of data for analytics is about 80% of the effort and this was done repeatedly.

As a consequence of the lack of focus on data, the different applications were integrated with “point-to-point” integrations, resulting in the so-called “spaghetti architecture”. Many companies realised that this was very costly, as IT was spending a very large part of its resources on connecting different silos in this inefficient style.

From data silos to efficient sharing of data

2. Data Centric Age

In the data centric age companies wanted to share data for reuse. They wanted to save costs, improve agility, and enable the business to exploit new opportunities. Data became a business enabler and a key factor in the dialogue between business & IT. Data demand & supply needed to be optimised.

Companies also realised that they need to develop a target architecture – like enterprise data warehouse (EDW) – that enables them to provide data for reuse. Data integration for reuse required common semantics & data models to enable loose coupling of data producers and data consumers.

Pitfalls – Production line for “monolith” EDW

Many companies created a production line – like chain production or waterfall data program for building analytical ecosystems or EDWs. Sadly, most of these companies got into the data centric age with poor management of metadata. Without proper metadata management the objective of data reuse is hard to achieve.

Many studies indicate that 90% of process lead time is spent on handovers between responsibilities. Within data development processes, metadata is handed over between different development phases and teams, spanning the entire lifecycle of a solution.

Waterfall development splits the “production line” into specialised teams. It includes a lot of documentation overhead to orchestrate the teams and many handovers between them. In most cases metadata was left in different silo tools and files (Excel & SharePoint). The development process then did not work smoothly, as the handovers lacked quality and control. Handing over to an offshore team added more challenges.

Sometimes prototyping was added, which helped to reduce some of these problems by increasing user involvement in the process.

Waterfall development model with poor metadata management

The biggest headache, however, is the ability to manage the data assets. Metadata was in siloed tools and files, and there were no integrated views of the data assets. To create integrated views, the development teams invented yet another Excel sheet.

Poor metadata management made it very hard to:

  • understand DW content without the help of experts
  • create any integrated views of the data assets that enable reusing data
  • analyse impacts of the changes
  • understand data lineage to study root causes of problems

Because of slow progress these companies made a lot of shortcuts (technical debt) that started violating the target architecture, which basically meant that they started drifting back to the application centric age.

The beautiful data centric vision resulted in a lot of monolithic EDWs that are hard to change, provide slow progress and have increasing TCO. They have become legacy systems that everyone wants to migrate away from.

3. Metadata Centric Age

Leading companies have realised that one of the main pitfalls of the data centric age was the lack of metadata management. They have started to move into the “metadata centric age” with a fundamental change in the way of working. They apply metadata driven development by embedding the usage of collaborative metadata capabilities into the development processes. All processes are able to consume from and contribute to a common repository. Process handovers are simplified, standardised, integrated, and even automated.

Metadata driven development brings the development of modern data warehouses into the level of “smart manufacturing”.

Metadata enables collaboration around shared knowledge

Smart manufacturing enables us to deliver high quality products at frequent intervals and with a batch size of one. Smart manufacturing uses digital twins to manage designs and processes, and to track the quality of the physical products. It also enables scaling, because knowledge is shared even between distributed teams and manufacturing sites.

Thanks to open access article: Development of a Smart Cyber-Physical Manufacturing System in the Industry 4.0 Context


Metadata driven development uses metadata as the digital twin to manage data product designs and the efficiency of the data supply chain. A common metadata repository – nowadays branded as a Data Catalog – centralises knowledge, reduces dependency on bottleneck resources and enables scalable development.

Metadata driven data supply chain produces data products through a life cycle where the artefacts evolve through conceptual, logical, physical, and operational stages. That also reflects the metadata needs in different stages.

Metadata driven data supply chain


Data Plan & Discovery act as a front end of the data supply chain. That is the place where business plans are translated into data requirements and prioritised into a data driven roadmap or backlog. This part of the data supply chain is a perfect opportunity window for ensuring cross functional alignment and collaboration using common semantics and a common Data Catalog solution. It gives a solid start for the supply of trusted data products for cross functional reuse.

Once the priority is confirmed and the feasibility is evaluated with data discovery, we have reached a commitment point from where the actual development of the vertical slices can continue.

DataOps automation enables smart manufacturing of data products. It enables highly optimised development, test and deployment practices required for frequent delivery of high-quality solution increments for user feedback and acceptance.

DataOps enables consistent, auditable, repeatable & scalable development. More teams – even in distributed locations – can be added to scale development. Solita has experience of forming agile teams that each serve a certain business function. The team members are distributed across multiple countries, but they all share the same DataOps repository. This enables a transparent & automatically documented architecture.

Example of transparent & automatically documented architecture with ADE


Learn more about Solita DataOps with Agile Data Engine (ADE):

Future-proof your data delivery http://www.agiledataengine.com/

Learn more about Solita experiences with Data Catalogs:

https://www.solita.fi/en/data-catalogs/


Stay tuned for more blogs about “Smart Manufacturing”.

This blog is the first in the series. It focused on the maturity model and explained how the large monolith data warehouses were created. It only briefly touched on metadata driven development; there is a lot more to tell about that.

Data Academy 2022 Spring

Data Academy – launching my career as a Data Consultant

After a decade in my previous profession, I felt it was time for a change. I used to be a senior level expert, so making this kind of a change was exciting but also a bit terrifying. The Data Academy was ideal, because I felt it would better support my transition. After my studies, I applied to the Data Academy and I was accepted.

Our Data Academy group had future data platform engineers, future master data management engineers and me, the future visual data consultant. Everyone would learn a bit of each role, giving an introductory-level view of the topics. Solita’s internal experts held hybrid lessons, which meant real-life knowledge combined with expert tips. Regardless of your career path, the topics will be important to you at some point in your data career.

The best part of the Academy was the network that it offered to me. Firstly, I had my fellow academians. Secondly, I got a good look at all the departments and met colleagues. During the Academy, I met over 70 Solitans and got to know colleagues in different offices.

“The best part of the Academy was the network that it offered to me.”

Data Academy 2022 Spring

Growing as a specialist

After the Academy I dedicated my time to self-studies: Power BI and Azure certificates were my first targets, but I also continued my AWS studies, together with the Mimmit Koodaa community.

I will learn a lot through my work as well, because all the projects at Solita are different. Most importantly, I can commit self-study time during my work weeks. I am participating in internal training covering agile methods, Tableau, and service design. These courses will contribute to my work in the future.

The Solita community has welcomed me warmly. My colleagues are helpful, and they share their knowledge eagerly. I received work straight after the Academy, even quite demanding tasks, but there are always senior colleagues to turn to and discuss matters with.

Data Academy Spring 2022

Three tips on how to become a Data Consultant

Check what my colleagues Johanna, Tuomas and Tero think of their work as Data Consultants. The article gives you a good picture of what our work is all about!

Learn one visualization tool well: it is a small hop to learn a second one later. Also, it is important to understand the licensing. Read my colleagues’ blog post about taking a deep dive into the world of Power BI. 

Of course, you need to understand the core fundamentals of data and how you can utilize data in business. Here is a great example of how data creates value.

Finally, notable topics to learn are cloud technologies, databases and data modelling. They are strongly present in our everyday work.

I could not be happier with my choice to join Solita via the Academy, and I sincerely recommend it!

The application to the Solita Data Academy is now open!

Are you interested in attending the Data Academy? The application is now open. Apply here!

Data Consultant

Unfolding the work of a Data Consultant

Meet Johanna, Tuomas and Tero! Our Data Consultants, who all work with data analysis and visualizations. Let’s map out their journey at Solita and demystify the work of Data Consultants!

All three have had different journeys to become a Data Consultant. Tuomas has a business degree and Tero started his career working with telecommunications technology. Johanna however found her way to visualizations quite young: “I created my first IBM Cognos reports as a summer trainee when I was 18 and somehow, I ended up studying Information Systems Science.” It has been, however, love at first sight for all of them. Now they work at Solita’s Data Science and Analytics Cell.

What is a typical Data Consultant’s work day like?

An interest in versatile work tasks is what unites our Data Consultants. Tuomas describes himself as “a Power BI Expert”. His days go by fast designing Power BI phases, modelling data, and doing classical pipeline work. “Sometimes I’d say my role has been something between project or service manager.”

Tero, on the other hand, focuses on report development and visualizations. He defines backlogs, develops metadata models, and holds client workshops.

Johanna sees herself as a Data Visualization Specialist, who develops reports for her customers. She creates datasets, and defines report designs and themes. “My work also includes data governance and the occasional maintenance work,” Johanna adds.

All three agree that development work is one of their main tasks. “I could say that a third of my time goes to development,” Tuomas estimates. “In my case I would say even half of my time goes to development,” Tero states.

Power BI is the main tool that they are using. Microsoft Azure and Snowflake are also in daily use. Tools vary between projects, so Tuomas highlights that “it is important to understand the nature of different tools even though one would not work straight with them”.

What is the best part of a Data Consultant’s work?

The possibility to work with real-life problems and create concrete solutions brings the most joy to our consultants. “It is really satisfying to provide user experiences which deliver the necessary information and functionality that the end users need to solve their business-related questions,” Johanna clarifies her thoughts.

And of course, collaborating with people keeps our consultants going! Tuomas estimates that 35% of his time is dedicated to stakeholder communications: he mentions customer meetings, but also writing documentation and creating project definitions, “specs”, with his customers.

Our consultants agree that communication is one of the key soft skills to master if you want to become a Data Consultant! Tuomas says that working and communicating with end users has always felt natural to him.

Tero is intrigued by the possibility to work with different industries: “I will learn how different industries and companies work, what kind of processes they have and how legislation affects them. This work is all about understanding the industry and being customer-oriented.”

“Each workday is different and interesting! I am dealing with many different kinds of customers and business domains every day.”

When asked what keeps the consultants working with visualizations, they all ponder for a few seconds. “A report, which I create, will provide straight benefit for the users. That is important to me,” Tuomas sums up his thoughts. “Each workday is unique and interesting! I am dealing with many different customers and business domains every day,” Johanna answers. Tero smiles and concludes: “When my customers get excited about my visualization, that is the best feeling!”

How are our Data Consultants developing their careers?

After working over 10 years with reporting and visualizations, Tero feels that he has found his home: “This role feels good to me, and it suits my personality well. Of course, I am interested in getting involved with new industries and learning new tools, but now I am really contented!”

Tuomas, who is a newcomer compared to Tero, has a strong urge to learn more: “Next target is to get deeper and more technical understanding of data engineering tools. I would say there are good opportunities at Solita to find the most suitable path for you.”

Johanna has had different roles in her Solita journey, but she keeps returning to work with visualizations: “I will develop my skills in design, and I would love to learn a new tool too! This role is all about continuous learning and that is an important capability of a Data Consultant!”

“I would say there are good opportunities at Solita to find the most suitable path for you.”

How to become an excellent Data Consultant? Here are our experts’ tips:

Johanna: “Work together with different stakeholders to produce the best solutions. Do not be afraid to challenge the customer, ask questions or make mistakes.”

Tuomas: “Be curious to try and learn new things. Don’t be afraid to fail. Ask from colleagues and remember to challenge customer’s point of view when needed.”

Tero: “Be proactive! From the point of view of technical solutions and data. Customers expect us to bring them innovative ideas!”

Would you like to join our Data Consultant team? Check our open positions.

Read our Power BI Experts’ blog post: Power BI Deep Dive


Overview of the Tableau product roadmap based on TC22 and TC21

Tableau Conference (TC22) was held last week in person in Las Vegas (with the possibility of virtual participation). The majority of the new features and functionalities introduced were related to data preparation & modeling, easy and automated data science (business science, as Tableau calls it), and Tableau Cloud management & governance capabilities. Tableau is on its journey from a visual analytics platform to a full-scale end-to-end analytics platform.

In the keynote, Tableau CEO Mark Nelson emphasised the role of both the Tableau and Salesforce user communities in driving change with data: there are over 1M Tableau Datafam members and over 16M Salesforce Trailblazers. Once again, the importance of data for businesses and organisations was highlighted, but the viewpoint was data skills – or the lack of them – and data cultures more than technologies. Mark Nelson underlined the importance of the cloud, saying 70% of new customers start their analytics journey in the cloud. One of the big announcements was rebranding Tableau Online to Tableau Cloud and introducing plenty of new features to it.

Taking into account the new features introduced at TC22, the Tableau platform includes good data preparation and modelling capabilities with many connectors to a variety of data sources, services and APIs. Tableau’s visual analytics and dashboarding capabilities are already among the best in the market. At TC21 last year, Tableau talked a lot about Slack integration and embedding to boost collaboration and the sharing of insights. At the moment, effort is being put especially into democratizing data analytics for everyone despite gaps in data skills. This is done using autoML-type functionalities to automatically describe and explain data, show outliers, create predictions and help build and act on scenarios. The cloud offering, with better governance, security and manageability, was also a high priority.

Next, I’ll go through the key features introduced at TC22 and also list the functionalities presented at TC21 to give the big picture. More info about the features released around TC21 can be found in a previous blog post: A complete list of new features introduced at the Tableau Conference 2021. These feature lists don’t contain all the features included in previous releases, only the ones mentioned at TC21.

Note: All the images are created using screenshots from the TC22 Opening Keynote / Devs on Stage session and Tableau new product innovations blog post. You can watch the sessions at any time on Tableau site.

Workbook authoring & data visualization

At TC22 there weren’t too many features related to workbook authoring. The only bigger announcement was the new image role, which enables dynamic images in visualizations. These could be, for example, product images or any other images that can be reached via a URL link in the source data. From TC21 there are still a couple of very interesting features waiting to be released; I’m especially waiting for dynamic dashboard layouts.

  • Introduced in TC22
    • Image role: Dynamically render images in the viz based on a link field in the data.
  • Introduced in TC21 (but not yet released)
    • Dynamic Dashboard Layouts (~2022 H1): Use parameters & field values to show/hide layout containers and visualizations.
    • Visualization Extensions (~2022 H2): Custom mark types, mark designer to fine tune the visualization details, share custom viz types.
  • Introduced in TC21 (and already released)
    • Multi Data Source Spatial Layers (2021.4): Use data from different data sources in different layers of a single map visualization.
    • Redesigned View Data (2022.1): View/hide columns, reorder columns, sort data, etc.
    • Workbook Optimizer (2022.1): Suggest performance improvements when publishing a workbook.
Tableau Image Role Example
Image role example to dynamically render images presented in TC22. Side note: have to appreciate the “Loves Tableau: True” filter.

Augmented analytics & understand data

In this area there were a couple of brand new announcements and more info about a few major functionalities already unveiled at TC21. Data Stories is an automated feature that creates descriptive stories about the data insights in a single visualization. Data Stories explains what data and insights are presented in the visualization, and the explanation changes dynamically when data is filtered or selected in the viz. With the data orientation pane, the author can partly automate the documentation of dashboards and visualizations. It shows information about data fields, applied filters, data outliers and a data summary, and possible links to external documentation.

Tableau Data Stories example
Example of automatically created descriptive data story within a dashboard presented in TC22.


A few features originally introduced at TC21 were also mentioned at TC22. Model Builder is a big step toward guided data science. It will help build ML-model-driven predictions fully integrated within Tableau, and it’s based on the same technology as Salesforce’s Einstein Analytics. Scenario Planner is a functionality for building what-if analyses to understand different options and the outcomes of different decisions.

  • Introduced in TC22
    • Data Stories (beta in Tableau Cloud):  Dynamic and automated data story component in Tableau Dashboard. Automatically describes data contents.
    • Data orientation pane: Contain information about dashboard and fields, applied filters, data outliers and data summary, and links to external resources.
    • Model Builder: Use autoML to build and deploy predictive models within Tableau. Based on Salesforce’s Einstein platform.
    • Scenario Planner: Easy what-if-analysis. View how changes in certain variables affect target variables and how certain targets could be achieved.
  • Introduced in TC21 (but not yet released)
    • Data Change Radar (~2022 H1): Alert and show details about meaningful data changes, detect new outliers or anomalies, alert and explain these.
    • Multiple Smaller Improvements in Ask Data (~2022 H1): Contact Lens author, Personal pinning, Lens lineage in Catalog, Embed Ask Data.
    • Explain the Viz (~2022 H2): Show outliers and anomalies in the data, explain changes, explain mark etc.
  • Introduced in TC21 (and already released)
    • Ask Data improvements (2022.1): Phrase builder already available, phrase recommendations available later this year.

Collaborate, embed and act

At TC21, collaboration and Slack integration were among the big development areas. At TC22 there wasn’t much new about this topic, but Tableau Actions were again demonstrated as a way to build actionable dashboards. The possibility to share dashboards publicly with unauthenticated, non-licenced users was also shown again at TC22. This functionality is coming to Tableau Cloud later this year.

  • Introduced in TC22
    • Tableau Actions: Trigger actions outside Tableau, for example Salesforce Flow actions. Support for other workflow engines will be added later.
    • Publicly share dashboards (~2022 H2): Share content via external public facing site to give access to unauthenticated non-licenced users, only Tableau Cloud.
  • Introduced in TC21 (but not yet released)
    • 3rd party Identity & Access Providers: Better capabilities to manage users externally outside Tableau.
    • Embeddable Web Authoring: No need for desktop when creating & editing embedded contents, full embedded visual analytics.
    • Embeddable Ask Data 
  • Introduced in TC21 (and already released)
    • Connected Apps (2021.4): More easily embed to external apps, create secure handshake between Tableau and other apps.
    • Tableau search, Explain Data and Ask Data in Slack (2021.4)
    • Tableau Prep notifications in Slack (2022.1)

Data preparation, modeling and management

My personal favourite of the new features can be found here. Shared dimensions enable more flexible multi-fact data models where multiple fact tables can relate to shared dimension tables. This feature makes the logical data model layer, introduced a couple of years ago, more comprehensive and very powerful. Tableau finally supports the creation of enterprise-level data models that can be leveraged in very flexible ways and managed in a centralized manner. Another data-model-related new feature was Table Extensions, which enable the use of Python and R scripts directly in the data model layer.

Tableau Shared Dimensions Example
Shared dimensions enabled multi-fact data source example presented in TC22.


There are also features that boost data source connectivity. Web Data Connector 3.0 makes it easier to connect to different web data sources, services and APIs. One important new data source is Amazon S3, which will enable connecting directly to the data lake layer. Tableau Prep is also getting a few new functionalities. The row number column and null value cleaning are rather small features. Multi-row calculations are a bigger thing, although the examples Tableau mentioned (running totals and moving averages) might not be very relevant in data prep, because these usually must take into account filters and row-level security and therefore often have to be calculated at runtime.
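As a side note for readers less familiar with multi-row calculations: a running total and a moving average are simply window operations over ordered rows. A small pandas sketch with made-up sales data shows the idea outside of Tableau; precomputing these in Prep bakes the results into the extract, which is exactly why filters and row-level security applied later can make them misleading.

```python
import pandas as pd

# Made-up daily sales data for illustration.
sales = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=6, freq="D"),
    "amount": [100, 80, 120, 90, 110, 95],
})

sales["running_total"] = sales["amount"].cumsum()           # running total
sales["moving_avg_3d"] = sales["amount"].rolling(3).mean()  # 3-row moving average
print(sales)
```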

  • Introduced in TC22
    • Shared dimensions: Build multi-fact data models where facts relate to many shared dimensions,
    • Web data connector 3.0: Easily connect to web data and APIs, for example to AWS S3, Twitter etc.
    • Table extensions: Leverage python and R scripts in the data model layer.
    • Insert row number and clean null values in Prep: Easily insert row number column and clean & fill null values.
    • Multi-row calculations in Prep: Calculate for example running total or moving average in Tableau Prep.
    • New AWS data sources: Amazon S3, Amazon DocumentDB, Amazon OpenSearch, Amazon Neptune.
  • Introduced in TC21 (but not yet released)
    • Data Catalog Integration: Sync external metadata to Tableau (from Collibra, Alation, & Informatica).
    • Tableau Prep Extensions: Leverage and build extension for Tableau Prep (sentiment analysis, OCR, geocoding, feature engineering etc.).
  • Introduced in TC21 (and already released)
    • Virtual Connections (2021.4): Centrally managed and reusable access points to source data with single point to define security policy and data standards.
    • Centralized row level security (2021.4): Centralized RLS and data management for virtual connections.
    • Parameters in Tableau Prep (2021.4): Leverage parameters in Tableau Prep workflows.

Tableau Cloud management

Rebranding Tableau Online to Tableau Cloud, along with a bunch of new management and governance features, was one important area of TC22. Tableau Cloud can now be managed as a whole with multi-site management. Security has already been a key area when moving to the cloud, and Tableau now finally supports customer-managed encryption keys (BYOK). From a monitoring point of view, both the activity log and admin insights provide information on how Tableau Cloud and the content in it are used.

  • Introduced in TC22
    • Multi-site management for Tableau Cloud: Manage centrally all Tableau Cloud sites.
    • Customer managed encryption keys (later 2022): BYOK (Bring Your Own Keys). 
    • Activity Log: More insights on how people are using Tableau, permission auditing etc.
    • Admin Insights: Maximise performance, boost adoption, and manage contents.
Tableau Admin Insights Example
Tableau Cloud Admin Insights example presented in TC22.

Tableau Server management

There weren’t too many new features in Tableau Server management, I guess partly because of the effort put into Tableau Cloud management instead. However, Tableau Server auto-scaling was mentioned again, and it will be coming soon, starting with backgrounder auto-scaling.

  • Introduced in TC22
    • Auto-scaling for Tableau Server (2022 H1): Starting with backgrounder auto-scaling for container deployments.
  • Introduced in TC21 (but not yet released)
    • Resource Monitoring Improvements (~2022 H1): Show view load requests, establish new baseline etc.
    • Backgrounder resource limits (~2022 H1): Set limits for backgrounder resource consumption.
  • Introduced in TC21 (and already released)
    • Time Stamped log Zips (2021.4)

Tableau ecosystem & Tableau Public

Last year at TC21, the Tableau ecosystem and the upcoming Tableau Public features had a big role. This year there wasn’t much new in this area, but the Tableau Exchange and accelerators were still mentioned and shown in the demos a couple of times.

  • Introduced in TC21 (but not yet released)
    • Tableau Public Slack Integration (~2022 H1)
    • More connectors to Tableau Public (~2022 H1): Box, Dropbox, OneDrive.
    • Publish Prep flows to Tableau Public: Will there be a Public version for Tableau Prep?
    • Tableau Public custom Channels (~2022 H1):  Custom channels around certain topics.
  • Introduced in TC21 (and already released)
    • Tableau exchange: Search and leverage shared extensions, connectors, more than 100 accelerators. Possibility to share dataset may be added later on.
    • Accelerators: Dashboard starters for certain use cases and source data (e.g. call center analysis, Marketo data, Salesforce data etc.). Can soon be used directly from Tableau.

Want to know more?

If you are looking for more info about Tableau read our previous blog posts:

More info about the upcoming features on the Tableau coming soon page.

Check out our offering about visual analytics & Tableau, and book a demo to find out more:


A Beginner’s Guide to AutoML

In a world driven by data, Machine Learning plays the most central role. Not everyone has the knowledge and skills required to work with Machine Learning. Moreover, the creation of Machine Learning models requires a sequence of complex tasks that need to be handled by experts.

Automated Machine Learning (AutoML) is a concept that provides non-Machine Learning experts with the means to utilise existing data and create models. In addition to that, AutoML provides Machine Learning (ML) professionals with ways to develop and use effective models without spending time on tasks such as data cleaning and preprocessing, feature engineering, model selection, hyperparameter tuning, etc.

Before we move any further, it is important to note that AutoML is not a system developed by a single entity. Several organisations have developed their own AutoML packages. These packages cover a broad area and target people at different skill levels.

In this blog, we will cover low-code approaches to AutoML that require very little knowledge about ML. There are AutoML systems that are available in the form of Python packages that we will cover in the future.

At the simplest level, both AWS and Google have introduced Amazon Sagemaker and Cloud AutoML, which are low-code PaaS solutions for AutoML. These cloud solutions are capable of automatically building effective ML models. The models can then be deployed and utilised as needed.

Data

In most cases, a person working with the platform doesn’t even need to know much about the dataset they want to analyse. The work carried out here is as simple as uploading a CSV file and generating a model. We will take a look at Amazon Sagemaker as an example. However, the process is similar in other existing cloud offerings.

With Sagemaker, we can upload our dataset to an S3 bucket and tell Canvas that we want to work with that dataset. This is achieved using Sagemaker Canvas, which is a visual, no-code platform.
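If you prefer to script that first step instead of using the console, uploading the CSV with boto3 looks roughly like this; the bucket name, region and file names are made up for the example.

```python
# Sketch: push the training data to S3 so Sagemaker Canvas can use it as a source.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")
s3.upload_file(
    Filename="scooter_rides.csv",         # local CSV with the scooter data
    Bucket="my-canvas-datasets",          # bucket registered as a Canvas data source
    Key="scooters/scooter_rides.csv",
)
```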

The dataset we are working with in this example contains data about electric scooters. Our goal is to create a model that predicts the battery level of a scooter given a set of conditions.

Creating the model

In this case, we say that our target column is “battery”. We can also see details of the other columns in our dataset. For example, the “latitude” and “longitude” columns have a significant amount of missing data. Thus, we can choose not to include those in our analysis.

Afterwards, we can choose the type of model we want to create. By default, Sagemaker suggests creating a model that classifies the battery level into 3 or more categories. However, what we want is to predict the battery level.

Therefore, we can change the model type to “numeric” in order to predict battery level.

Thereafter, we can begin building our models. This is a process that takes a considerable amount of time. Sagemaker gives you the option to “preview” the model that would be built before starting the actual build.

The preview only takes a few minutes, and provides an estimate of the performance we can expect from the final model. Since our goal is to predict the battery level, we will have a regression model. This model can be evaluated with RMSE (root mean square error).
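For reference, RMSE is just the square root of the mean squared difference between predicted and actual values, so a lower value is better and it is expressed in the same units as the target (battery level here). A tiny Python sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: lower is better, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative battery-level predictions vs. actual values.
print(rmse([80, 55, 90], [75, 60, 88]))  # ~4.24
```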

It also shows the impact different features have on the model. Therefore, we can choose to ignore features that have little or no impact.

Once we have selected the features we want to analyse, we select “standard build” and begin building the model. Sagemaker trains the dataset with different models along with multiple hyperparameter values for each model. This is done in order to figure out an optimal solution. As a result, the process of building the model takes a long time.

Once the build is complete, you are presented with information about the performance of the model. The model performance can be analysed in further detail with advanced metrics if necessary.

Making predictions

As a final step, we can use the model that was just built to make predictions. We can provide specific values and make a single prediction. We can also provide multiple rows of data in the form of a CSV file and make batch predictions.

If we are satisfied with the model, we can share it to Amazon Sagemaker Studio for further analysis. Sagemaker Studio is a web-based IDE that can be used for ML development. It is a more advanced ML platform geared towards data scientists who perform complex tasks with Machine Learning models. The model can be deployed and made available through an endpoint. Thereafter, existing systems can use these endpoints to make their predictions.
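As a rough sketch of that last part, an application can call a deployed endpoint through the SageMaker runtime API. The endpoint name, region and feature values below are assumptions for illustration; the input must match the feature order the model was trained on.

```python
# Sketch: call a deployed Sagemaker endpoint from an existing application.
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="eu-west-1")

response = runtime.invoke_endpoint(
    EndpointName="battery-level-model",             # assumed endpoint name
    ContentType="text/csv",
    Body="21.4,60.2499,24.9384,5".encode("utf-8"),  # one row of feature values
)
prediction = response["Body"].read().decode("utf-8")
print("Predicted battery level:", prediction)
```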

We will not be going over Sagemaker Studio as it is something that goes beyond AutoML. However, it is important to note that these AutoML cloud platforms are capable of going beyond tabular data. Both Sagemaker and Google AutoML are also capable of working with images, video, as well as text.

Conclusion

While there are many useful applications for AutoML, its simplicity comes with some drawbacks. The main issue that we noticed about AutoML especially with Sagemaker is the lack of flexibility. The platform provides features such as basic filtering, removal, and joining of multiple datasets. However, we could not perform basic derivations such as calculating the distance traveled using the coordinates, or measuring the duration of rentals. All of these should have been simple mathematical derivations based on existing features.

We also noticed issues with flexibility for the classification of battery levels. The ideal approach to this would be to have categories such as “low”, “medium”, and “high”. However, we were not allowed to define these categories or their corresponding threshold values. Instead, the values were chosen by the system automatically.

The main purpose of AutoML is to make Machine Learning available to those who are not experts in the field. As a side benefit, it is also useful to people like data scientists, who no longer have to spend a large amount of time and effort on selecting an optimal model and tuning hyperparameters.

Experts can make good use of low code AutoML platforms such as Sagemaker to validate any data they have collected. These systems could be utilised as a quick and easy way to produce well-optimised models for new datasets. The models would measure how good the data is. Experts also get an understanding about the type of model and hyperparameters that are best suited for their requirements.


Data classification methods for data governance

Data classification is an important process in enterprise data governance and cybersecurity risk management. Data is categorized into security and sensitivity levels to make it easier to keep the data safe, managed and accessible. The risks of poor data classification are relevant for any business: without data confidentiality policies, and preferably automation to enforce them, an enterprise can expose its trusted data to unwanted visitors through a simple human error or accident. Besides the governance and availability points of view, proper data classification policies provide security and coherent data life cycles. They are also a good way to prove that your organization follows compliance standards (e.g. GDPR) to promote trust and integrity.

In the process of data classification, data is initially organized into categories based on type, contents and other metadata. Afterwards, these categories are used to determine the proper level of controls for the confidentiality, integrity, and availability of data based on the risk to the organization. It also implies likely outcomes if the data is compromised, lost or misused, such as the loss of trust or reputational damage.

Though there are multiple ways and labels for classifying company data, the standard way is to use high risk, medium risk and low/no risk levels. Based on specific data governance needs and the data itself, organizations can select their own descriptive labels for these levels. For this blog, I will label the levels confidential (high risk), sensitive (medium risk) and public (low/no risk). The risk levels are always mutually exclusive.

  • Confidential (high risk) data is the most critical level of data. If not properly controlled, it can cause the most significant harm to the organization if compromised. Examples: financial records, IP, authentication data
  • Sensitive (medium risk) data is intended for internal use only. If medium risk data is breached, the results are not disastrous but not desirable either. Examples: strategy documents, anonymous employee data or financial statements
  • Public (low risk or no risk) data does not require any security or access measures. Examples: publicly available information such as contact information, job or position postings or this blog post.

High risk can be divided into confidential and restricted levels. Medium risk is sometimes split into private data and internal data. Because a three-level design may not fit every organization, it is important to remember that the main goal of data classification is to define a fitting policy that works for your company or your use case. For example, governments or public organizations with sensitive data may have multiple levels of data classification, but for a smaller entity, two or three levels can be enough. Guidelines and recommendations for data classification can be found from standards organizations such as the International Organization for Standardization (ISO 27001) and the National Institute of Standards and Technology (NIST SP 800-53).

Besides standards and recommendations, the process of data classification itself should be tangible. AWS (Amazon Web Services) offers a five-step framework for developing company data classification policies. The steps are:

  1. Establishing a data catalog
  2. Assessing business-critical functions and conducting an impact assessment
  3. Labeling information
  4. Handling of assets
  5. Continuous monitoring

These steps are based on general good practices for data classification. First, a catalog for various data types is established and the data types are grouped based on the organization’s own classification levels.

The security level of data is also determined by its criticality to the business. Each data type should be assessed by its impact. Labeling the information is recommended for quality assurance purposes.

AWS uses services like Amazon SageMaker (SageMaker provides tools for building, training and deploying machine learning models in AWS) and AWS Glue (an event-driven ETL service used for, e.g., data identification and categorization) to provide insight and support for data labels. After this step, the data sets are handled according to their security level. Specific security and access controls are applied here. After this, continuous monitoring kicks in. Automation handles monitoring, identifies external threats and maintains normal functions.

Automating the process

The data classification process is fairly complex work and takes a lot of effort. Managing it manually every single time is time-consuming and prone to errors. Automating the classification and identification of data can help control the process and reduce the risk of human error and breaches of high risk data. There are plenty of tools available for automating this task. AWS uses Amazon Macie for machine-learning-based automation. Macie uses machine learning to discover, classify and protect confidential and sensitive data in AWS. Macie recognizes sensitive data and provides dashboards and alerts for a visual presentation of how this data is being used and accessed.

Amazon Macie dashboard shows enabled S3 bucket and policy findings


After selecting the S3 buckets the user wants to enable for Macie, different options can be configured. In addition to the frequency of object checks and filtering objects by tags, the user can use custom data identification. Custom data identifiers are a set of criteria defined to detect sensitive data. The user can define regular expressions, keywords and a maximum match distance to target specific data for analysis purposes.
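As a rough boto3 sketch of defining such a custom data identifier (the name, regex, keywords and region are made up, and the exact parameter set should be verified against the boto3 macie2 documentation):

```python
# Sketch: define a Macie custom data identifier for, e.g., internal project codes.
import boto3

macie = boto3.client("macie2", region_name="eu-west-1")

response = macie.create_custom_data_identifier(
    name="internal-project-code",
    description="Matches internal project codes such as PRJ-12345",
    regex=r"PRJ-\d{5}",
    keywords=["project", "PRJ"],   # words that must appear near a match
    maximumMatchDistance=50,       # max characters between keyword and match
)
print("Created identifier:", response["customDataIdentifierId"])
```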

As a case example, Edmunds, a car shopping website, promotes Macie and data classification as an “automated magnifying glass” into critical data that would be difficult to notice otherwise. For Edmunds, the main benefits of Macie are better visibility into business-critical data, identification of shared access credentials and protection of user data.

Though Amazon Macie is useful for AWS and S3 buckets, it is not the only option for automating data classification. A simple Google search offers tens of alternative tools for both small and large scale companies. Data classification is needed almost everywhere and the business benefit is well-recognized.

For more information about this subject, please contact Solita Industrial.

Your AI partner can make or break you!

Industries have resorted to using AI partner services to fuel their AI aspirations and quickly bring their products and services to market. Choosing the right partner is challenging, and this blog lists a few pointers that industries can use in their decision-making process.


Large investments in AI clearly indicate industries have embraced the value of AI. Such a high AI adoption rate has induced a severe lack of talented data scientists, data engineers and machine learning engineers. Moreover, with the availability of alternative options, high paying jobs and numerous benefits, it is clearly an employee’s market.

The market has a plethora of AI consulting companies ready to fill the role of AI partner for leading industries. Among such companies, on one end are the traditional IT services companies, which have evolved to provide AI services, and on the other end are the AI start-up companies with backgrounds in academia and a research focus, striving to deliver top specialists to industries.

Consider a company that is willing to venture into AI with an AI partner. In this blog, I shall enumerate the essentials one can look for before deciding on a preferred AI partner.

AI knowledge and experience: AI is evolving fast, with new technologies developed by both industry and academia. Use cases in AI also span multiple areas within a single company. Most cases usually fall into the following domains: computer vision, computer audition, natural language processing, interpersonally intelligent machines, routing, and motion and robotics. It is natural to look for AI partners with specialists in the above areas.

It is worth remembering that most AI use cases do not require AI specialists or super specialists; generalists with wide AI experience can handle such cases well.

Also, specialising in AI alone does not suffice to successfully bring a case to production. The art of handling industrial AI use cases is not trivial, and novice AI specialists, as well as those freshly out of university, need oversight. Hence, companies have to be careful with AI specialists who have only academic experience or little industrial experience.

Domain experience: Many AI techniques are applicable across cases in multiple domains. Hence it is not always necessary to seek consultants with domain expertise, and doing so is often overkill that brings additional expert costs. Too much domain knowledge can also restrict thinking in some ways. However, there are exceptions when domain knowledge might be helpful, especially when limited data are available.

A domain-agnostic AI consultant can create and deliver AI models in multiple domains in collaboration with the company’s own domain experts. Making those experts available for such projects is therefore important for the company.

Problem-solving approach: This is probably the most important attribute when evaluating an AI partner. Company cases can be categorised into one of the following silos:

  • Open sea: Though uncommon, it is possible to see a few such scenarios when companies are at an early stage of their AI strategy. This is attractive for many AI consultants, who get the freedom to carve out an AI strategy and the succeeding steps to boost the AI capabilities of their clients. With such freedom comes great responsibility, and AI partners for such scenarios must be chosen carefully from those with a long-standing position as a trusted partner within the industry.
  • Straits: This is most common when the use case is at least coarsely defined and suitable ML technologies need to be chosen to take the AI use case to production. Such cases often don’t need high-grade AI researchers or scientists; any generalist data scientist and engineer with experience of working in an agile way can be a perfect match.
  • Stormy seas: This is possibly the hardest case, where the use case is not clearly defined and no ready solution is available. The use case itself can be defined together with data and AI strategists, but the research and development of new technologies requires AI specialists and scientists. Hence, special emphasis should be placed on checking for the presence of such specialists. It is worth noting that the availability of AI specialists alone does not guarantee a conversion to production.

Data security: Data is the fuel for growth for many companies, so it is quite natural that companies are extremely careful with safeguarding data and its use. Thus, when choosing an AI partner it is important to look at and ask about the data security measures currently practised in the candidate organisation. In my experience it is quite common that AI specialists do not have data security training. If the partner does not emphasise ethics and security, the data will most likely end up stored all over the internet (i.e. personal Dropbox and OneDrive accounts), including on private laptops.

Data platform skills: AI technologies are usually built on data, yet it is quite common that companies have multiple databases and no clear data strategy. An AI partner with in-house data engineering experience will serve you well; otherwise a separate partner would be needed.

Design thinking: Design thinking is rarely considered a priority expertise when it comes to AI partnering and development. However, it is probably the hidden gem behind every successful deployment of an AI use case. AI design thinking adopts a human-centric approach, where the user is at the centre of the entire development process and her/his wishes are the most important. Adoption of AI products increases significantly when the users’ problems, including AI ethics, are accounted for.

Overblown marketing: AI partner marketing slides usually boast about successful AI projects. Companies must be careful interpreting these numbers, as a major portion of such projects are often just proofs of concept that never saw the light of day for various reasons. Companies should ask for the percentage of those projects that entered production, or at least reached a minimum viable product stage.

Above I have highlighted a few points to look for in an AI partner. What matters above all, however, is the market perception of the candidate partner and whether you, as a buyer, believe there is a culture fit: they understand your values and terms of cooperation and are able to co-define the value proposition of the AI case. AI consultants should also stand up for their choices and not shy away from pointing out infeasibility or a lack of technologies or data for the desired goals of an AI use case, even at the risk of losing a sale.

Finding the right partner is not that difficult. If you wish to understand Solita’s position on the above pointers and are looking for an AI partner, don’t hesitate to contact us.

Author: Karthik Sindhya, PhD, AI strategist, Data Science, AI & Analytics,
Tel. +358 40 5020418, karthik.sindhya@solita.fi

Workshop with AWS: Lookout for Vision

Have you ever wondered how much value a picture can give your business? Solita participated in a state-of-the-art computer vision workshop given by Amazon Web Services in Munich. We built an anomaly detection pipeline with AWS's new managed service called Lookout for Vision.

What problem are we solving?

On a more fundamental level, computer vision at the edge enables efficient quality control and evaluation of manufacturing quality. Quickly detecting manufacturing anomalies means that you can take corrective action and decrease costs. If you have pictures, we at Solita have the knowledge to turn them into value-generating assets.

Building the pipeline

At the premises we had a room filled with specialised cameras and edge hardware for running neural networks: Basler 2D grayscale cameras and an edge computer, an Adlink DLAP-301 with the MXE-211 gateway. All the necessary components to build an end-to-end working demo.

We started the day by building the training pipeline. With Adlink software, we get a real-time stream from the camera to the computer and can integrate the stream with an S3 bucket. When a picture is taken, it automatically syncs to the assigned S3 bucket in AWS. After creating the training data, you simply initiate a model in the Lookout for Vision service, point it to the corresponding S3 bucket and start training.

Lookout for Vision is a fully managed service and as a user you have little control over configuration. In other words, you do make a compromise between configurability and speed to deployment. Since the service has little configuration, you won’t need a deep understanding of machine learning to use it. But knowing how to interpret the basic performance metrics is definitely useful for tweaking and retraining the model.

After we were satisfied with our model we used the AWS Greengrass service to deploy it to the edge device. Here again the way Adlink and AWS are integrated makes things easier. Once the model was up and running we could use the Basler camera stream to get a real-time result on whether the object had anomalies.

Short outline of the workflow:

  1. Generate data
  2. Data is automatically synced to S3
  3. Train model with AWS Lookout for Vision, which receives data from the S3 bucket mentioned above
  4. Evaluate model performance and retrain if needed
  5. Once model training is done, deploy it with AWS Greengrass to the edge device
  6. Get real-time anomaly detection.

All in all this service abstracts away a lot of the machine learning part, and the focus is on solving a well defined problem with speed and accuracy. We were satisfied with the workshop and learned a lot about how to solve business problems with computer vision solutions.
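
As a hedged illustration of step 6's real-time inference, here is a minimal boto3 sketch of calling a hosted Lookout for Vision model; the project name and model version are placeholders, and in our workshop the model actually ran at the edge via Greengrass rather than through this cloud API.

```python
import boto3

client = boto3.client("lookoutvision")

# Placeholder project name and model version; the model must be started (hosted) first.
with open("captured_frame.jpg", "rb") as image:
    response = client.detect_anomalies(
        ProjectName="soil-compactor-qc",
        ModelVersion="1",
        Body=image.read(),
        ContentType="image/jpeg",
    )

result = response["DetectAnomalyResult"]
print("Anomalous:", result["IsAnomalous"], "Confidence:", result["Confidence"])
```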

If you are interested in how to use Lookout for Vision or how to apply it to your business problem please reach out to us or the Solita Industrial team.

A sad person looking at a messy table with crows foot prints. Birds flying away holding silverware.

Data Academians share 5 tips to improve data management

Is your data management like a messy dinner table, where birds took “data silverware” to their nests? More technically, is your data split to organizational silos and applications with uncontrolled connections all around? This causes many problems for operations and reporting in all companies. Better data management alone won’t solve the challenges, but it has a huge impact.

While the challenges may seem like a nightmare, beginning to tackle them is easier than you think. Let our Data Academians, Anttoni and Pauliina, share their experiences and learnings. Though they’ve only worked at Solita for a short time, they’ve already got the hang of data management.

What does data management mean?

Anttoni: Good data management means taking care of your organization’s know-how and distributing it to employees. Imagine your data and AI being almost like a person who can answer questions like “how are our sales doing?” and “what are the current market trends?”. You probably would like to have the answer in a language you understand and with terms that everyone is familiar with. Most importantly, you want the answer to be trustworthy. With proper data management, your data could be this person.

Pauliina: For me data management compares to taking care of your closet, with socks, shirts and jeans being your data. You have a designated spot for each clothing type in your closet and you know how to wash and care for them. Imagine you’re searching for that one nice shirt you wore last summer when it could be hidden under your jeans. Or better yet, lost in your spouse or children’s closet! And when you finally find the shirt, someone washed it so that it shrank two sizes – it’s ruined. The data you need is that shirt and with data management you make sure it’s located where it should be, and it’s been taken care of so that it’s useful.

How do challenges manifest?

Anttoni: Bad data management costs money and wastes valuable resources in businesses. As an example of a data quality related issue from my experience: if employees are maybe not allowed, but technically able, to enter poor data into a system, like CRM or WMS, they will most likely do that at some point. This leads to poor data quality, which causes operational and sometimes technical issues. The result is hours and hours of cleaning and interpretation work that the business could have avoided with a few technical fixes.

Pauliina: The most profound problem I’ve seen bad data management cause is the hindering of a data-driven culture. This happened in real life when presenters collected material for a company’s management meeting from different sources and calculated key KPIs differently. Suddenly, the management team had three contradicting numbers for e.g. marketing and sales performance. Each one of them came from a different system and had different filtering and calculations applied. In conclusion, decision making was delayed because no one trusted each other’s numbers. Additionally, I had to check and validate them all. This wouldn’t happen if the company managed its data properly.

Person handing silverware back to another person with a bird standing on his shoulder. They are both smiling.

Bringing the data silverware from silos to one place and modelling and storing it appropriately will clean the dinner table. This contributes towards meeting the strategic challenges around data – though might not solve them fully. The following actions will move you towards a better data management and thus your goals.

How to improve your data management?

Pauliina & Anttoni:

  1. We could fill all five bullets with communication. Improving your company’s data management is a change in organization culture. The whole organization will need to commit to the change. Therefore, take enough time to explain why data management is important.
  2. Start with analyzing the current state of your data. Pick one or two areas that contribute to one or two of your company or department KPIs. After that, find out what data you have in your chosen area: what are the sources, what data is stored there, who creates, edits, and uses the data, how is it used in reporting, where, and by whom.
  3. Stop entering bad data. Uncontrolled data input is one of the biggest causes of poor data quality. Although you can instruct users on how they should enter data to the system, it would be smart to make it impossible to enter bad data. Also pay attention to who creates and edits the data – not everyone needs the rights to create and edit.
  4. Establish a single source of truth, SSOT. This is often a data platform solution, and your official reporting is built on top of it. In addition, have an owner for your solution even when it requires a new hire.
  5. Often you can name a department responsible for each of your source system’s data. Better yet, you can name a person from each department to own the data and be a link between the technical data people and department employees.

Pink circle with a crows foot inside it and hearts around. Next to it a happy person with an excited bird on his shoulder.

About the writers:

My name is Anttoni, and I am a Data Engineer/4th year Information and Knowledge Management student from Tampere, Finland. After Data Academy, I’ll be joining the MDM-team. I got interested in data when I saw how much trouble bad data management causes in businesses. Consequently, I gained a desire to fix those problems.

I’m Pauliina, MSc in Industrial Engineering and Management. I work at Solita as a Data Engineer. While I don’t have an education in data, I’ve worked on data projects for a few years in the SMB sector. Most of my work has circled around building official reporting for the business.

 

The application to the Solita Data academy is now open!

Are you interested in attending Data academy? The application is now open, apply here!

Two zebras

Short introduction to digital twins

What are digital twins, and how can they help you understand complex structures?

What are digital twins?

A digital twin is a virtual model of a physical object or process, such as a production line or a building. When sensors collect data from a device, the sensor data can be used to update a digital twin copy of the device’s state in real time, so it can be used for things like monitoring and diagnostics.

There are different types of digital twins for designing and testing parts or products, but let’s focus more on system and process related twins.

For a simple example, say you have a water heater connected to a radiator. Your virtual model gets data from the heater’s sensors and knows the temperature of the heater. The radiator, on the other hand, has no sensor attached to it, but the link between the heater and the radiator exists in your digital model. Now you can see virtually that when the heater is malfunctioning, your radiator gets colder. Not only sensors are connected to your digital twin; manuals and other documents are too, so you can view the heater’s manual right there in the dashboard.

Industrial point of view benefits

We are living in an age when everything is connected to the internet, and industrial devices are no different. Huge amounts of data flow from devices to different endpoints. That’s where digital twins show their strength: connecting all those dots to form a bigger picture of processes and assets, making it easier to understand complex structures. It’s also a two-way street, so digital twins can generate more useful data or update existing data.

Many times industrial processes consist of other processes that aren’t connected to each other, like a lonely motor spinning without a real connection to the other parts of the process. Those are easily forgotten, even when they are a crucial part of the process. As complexity grows, there will be even more loose ends that aren’t connected to each other. From an industrial point of view, the key benefits of digital twins include:

  • Predictive maintenance lowers maintenance costs.
  • Productivity will improve because of reduced downtime and improved performance via optimization.
  • Testing in the digital world before real-world application.
  • Allows you to make more informed decisions at the beginning of the process.
  • Continuous improvement through simulations.

Digital twins offer great potential for predicting the future instead of only analyzing the past. Real-world experiments aren’t a cost-effective way to test ideas. With a digital counterpart you can test ideas cost-effectively and see if you missed something important.

Quick overview of creating digital twins with AWS IoT TwinMaker

In a workspace you create entities that are digital versions of your devices. Those entities are connected with components that handle the data connections. Components can connect to AWS IoT SiteWise or other data sources via AWS Lambda. When creating a component, you define it in JSON format, and it can inherit from other components.
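
As a rough sketch of what entity creation can look like in code, assuming boto3's iottwinmaker client: the workspace, entity name and component type ID below are placeholders, and the exact component payload depends on the component types you have defined.

```python
import boto3

tm = boto3.client("iottwinmaker")

# Placeholder IDs: the workspace, entity name and component type are illustrative only.
tm.create_entity(
    workspaceId="factory-workspace",
    entityName="water-heater-01",
    components={
        "temperatureSensor": {
            "componentTypeId": "com.example.sitewise.connector",
            "description": "Links the entity to its sensor data source",
        }
    },
)
```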

The next step is to upload your CAD models to TwinMaker. Once your models are uploaded, you can start creating 3D scenes that visualize your digital twin. Visual rules, like tags that change their appearance, can also be added in this phase.

Now the digital twin is almost ready, and the only thing left to do is connect Grafana with TwinMaker and create a dashboard in Grafana. Grafana has a plugin for TwinMaker that helps with connecting 3D scenes and data.

There are many other tools for creating digital twins, and which one to use depends on your needs.

If you are interested in how to create digital twins, please reach out to me or the Solita Industrial team. Please also check our kickstart for Connected Factory and our blog posts related to Smart and Connected Factories.

 

Do machines speak? Audio fingerprinting of internal combustion engines

Sensor data analytics is a fast-growing trend in the industrial domain. Audio, despite its holistic nature and huge importance to human machine operators, is usually not utilised to its full potential. In this blog post we showcase some of these possibilities through a research experiment case study conducted as part of the IVVES research project.

On a cold winter morning in December 2021 in the Solita Research R&D group we packed our bags with various audio recording equipment and set our sights on a local industrial machine rental company. We wanted to answer a simple question: do machines speak? Our aim was to record sound from multiple identical industrial grade machines (which turned out to be 53 kg soil compactors) in order to investigate whether we could consistently distinguish them based on their sound alone. In other words, just as each human has a very unique voice, our hypothesis was that the same would be true for machines, that is, we wanted to construct an audio fingerprint. This could then be used not only to identify each machine, but to detect if a particular machine’s sound starts to drift (indicating a potential incoming fault) or to check whether the fingerprint matches before and after renting out the machine, for example.

It is always important to keep the business use case and real-world limitations in mind when designing solutions to data-based (no pun intended) problems. In this case, we identified the following important aspects in our research problem:

  1. The solution would have to be lightweight, capable of being run on the edge with limited computational resources and internet connectivity.
  2. Our methods should be robust against interference from varying levels of background noise and variances in how users hold the microphone when recording a machine’s sound.
  3. It would be important to be able to communicate our results and analysis to domain experts and eventual end users. Therefore, we should focus on physically meaningful features over arbitrary ones and on explainable algorithms over black boxes.
  4. The set-up of our experiment should be planned to ensure high-quality uncontaminated data that at the same time would serve to produce the best possible research outcome while being representative of the data we might expect for a productionalised solution.

In this blog post we will focus on points 1. and 3. and we’ll return to 2. and 4. in a follow-up post.

Analysing Sound

We are surrounded by a constant stream of sound mixed together from a multitude of sources: cars speeding along on the street, your colleague typing on their keyboard or a dog barking at songbirds outside your window. Yet, seemingly without any effort, your brain can process this jumbled-up signal and tell you exactly what is happening around you in real time. Our hope is that we could somehow imitate this process by developing audio analysis methods with similar properties.

Waveform of speech on the left, corresponding spectrogram on the right.
Figure 1. On the left is the waveform for the sound produced by the author uttering “hello”. On the right is the corresponding spectrogram.

It is quite futile to try to analyse raw signals of this type directly: each sound source emits vibrations in multiple frequencies and these get combined over all the different sources into one big mess. Luckily there is a classical mathematical tool which can help us to figure out the frequency content of an audio (or any other type) wave: the Fourier transform. By computing the Fourier transform for consecutive small windows of the input signal, we can determine how much of each frequency is present at a given time. We can then arrange this data in the form of a matrix, where the rows correspond to different frequency ranges and columns are consecutive time steps (typically in the order of 10-20 milliseconds each). Hence, the entries of the matrix tell you how much of each frequency is present at that particular moment. The resulting matrix is called a spectrogram, which we can visualise by colouring the values based on their magnitudes: dark for values close to zero with lighter colours signifying higher intensity. In Figure 1 you can see an example of the waveform produced by the author uttering “hello” and the resulting spectrogram. The process of transforming the original signal to its constituent frequencies and studying this decomposition is called spectral analysis.
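
For the curious, here is a minimal sketch of computing and plotting a spectrogram with the librosa library; the file name is a placeholder, and this is not the exact code used in our experiment.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a recording and compute the short-time Fourier transform.
y, sr = librosa.load("machine_recording.wav", sr=None)     # placeholder file name
stft = librosa.stft(y, n_fft=2048, hop_length=512)         # Fourier transform over small windows
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Visualise: rows are frequency ranges, columns are consecutive time steps.
librosa.display.specshow(spectrogram_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()
```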

From Raw to Refined Features

The raw frequency data by itself is still not the most useful. This is because different audio sources can of course produce sounds in overlapping frequency ranges. In particular, a single machine can have multiple vibrating parts which each produce their distinct sound. Instead, we should try to extract features that are meaningful to the problem at hand—classification of fuel powered machines in this case. There are many spectral features that could be useful (for some inspiration you can check out our public Google Colab notebooks or the documentation of librosa, a popular Python audio analysis library).

In this blog we’ll take a slightly different approach. Our goal is to be able to compare the frequency data of different machines at two points in time, but this won’t be efficient (let alone robust) if we rely on raw frequencies. This is because of background noise and the varying operating speed of the engines (think about how the pitch of the sound is affected by how fast the engine is running). Instead, we want to pool together individual frequencies in a way that would allow us to express our high-dimensional spectrogram in terms of a handful of distinct frequency range combinations.

Weights of the coefficients for the first two principal components.
Figure 2. Each principal component is some combination of frequencies with different weights.

Luckily there is, yet again, a classical mathematical tool which does exactly this: principal component analysis (PCA). If you’ve taken a course in linear algebra then this is nothing more than matrix diagonalisation, but it has become one of the staple methods of dimensionality reduction in the machine learning world. The output of the PCA-algorithm is a set of principal components each of which is some combination of the original frequencies. In Figure 2 we plot the weight of each frequency for two principal components: in the first component we have positive weights for all but the lowest of frequencies while for the second one the midrange has negative weights. An additional reason for why PCA is an attractive method for our problem is that the resulting frequency combinations will be linearly independent (i.e. you cannot obtain one component by adding together multiples of the other components). This is a crude imitation of our earlier observation that a single machine can have multiple separate parts producing sound at the same time. The crux of the algorithm is that in order to faithfully represent our original data, we only have to keep a small number of these principal components thus effectively reducing the dimensionality of our problem to a more manageable scale.
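
In code, this step can be as short as the following sketch with scikit-learn, continuing from the spectrogram above; the number of components kept is illustrative.

```python
from sklearn.decomposition import PCA

# One row per time step, one column per frequency bin.
frames = spectrogram_db.T

# Keep a handful of linearly independent frequency combinations.
pca = PCA(n_components=8)                    # illustrative choice
features = pca.fit_transform(frames)         # low-dimensional feature vector per time step

print(features.shape)                        # (n_time_steps, 8)
print(pca.explained_variance_ratio_.sum())   # share of variance retained by the 8 components
```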

Structure in Audio

Now that we have a sequence of low-dimensional feature vectors that capture the most important aspects of the original signal, we can try to start finding some structure in this stream of data. We do this by computing the self-similarity matrix (SSM) [1], whose elements are the pairwise distances between our feature vectors. We can visualise the resulting matrix as a heat map where the intensity of the colour corresponds to the distance (with black colour signifying that the features are identical), see Figure 3.

Showing how the self-similarity matrix is obtained from the feature vectors via pairwise distances.
Figure 3. The (i, j)-entry of the self-similarity matrix (on the right) is given by the distance between the feature vectors at times ti and tj. Black colour corresponds to zero distance i.e. the vectors being equal.
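
Computing the self-similarity matrix from the PCA features is essentially a one-liner with scipy, sketched below under the same assumptions as before.

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# Pairwise distances between the feature vectors of all time steps.
ssm = cdist(features, features, metric="euclidean")

plt.imshow(ssm, cmap="gray")     # black = identical feature vectors, lighter = more different
plt.xlabel("time step")
plt.ylabel("time step")
plt.title("Self-similarity matrix")
plt.show()
```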

In Figure 4 you can see a part of an SSM for one of the soil compactors. By definition, time flows along its main diagonal (blue arrow). Short segments of the audio that are self-similar (i.e. the nature of the sound doesn’t change) appear as dark rectangles along the diagonal. For each rectangle on the main diagonal, the remaining rectangles on the same row show how similar the other segments are to the one in question. If you pause here for a moment and gather your thoughts, you might notice that there are two types of alternating segments (of varying duration) in this particular SSM.

Self-similarity matrix for a soil compactor showcasing a repeating pattern of two alternating segments.
Figure 4. Self-similarity matrix for one of the soil compactors. Time flows down the main diagonal on which the dark rectangles signify self-similar segments.

Do machines speak?

We have covered a lot of technical material, but we are almost done! Now we understand how to uncover patterns in audio, but how can we use this information to tell apart our four machines? The more ML-savvy readers might be tempted to classify the SSMs with e.g. convolutional neural networks. This might certainly work well, but we would lose sight of one of our aims which was to keep the method computationally light and simple. Hence we proceed with a more traditional approach.

Recall that we have constructed a separate SSM for each machine. For each of the resulting matrices, we can now look at small blocks along the diagonal (see Figure 5) and figure out what they typically look like. If we scale the results to [-1, 1], we obtain a small set of fingerprints (we also refer to these as kernels) for each machine. Just like you (hopefully) have ten fingers each with its own unique fingerprint, a machine can also have more than one acoustic fingerprint. We have visualised a few of these for one of the machines in Figure 5.

Figure 5. Fingerprints (on the right) for a single machine computed from its self-similarity matrix on the left.

We are now ready to return back to the machine rental shop to test if our solution works! Once we arrive, we follow the set of instructions below in order to determine which machine is which (see Figure 6 for an animation of this process):

  1. Turn on the machine and record its sound.
  2. While the machine is running, compute the self-similarity matrix on the fly.
  3. Slide the fingerprints for each machine along the diagonal and compute their activations (by summing the elementwise product).
  4. The fingerprint which reacts to the sound the most tells you which machine is running.
Figure 6. By computing the activations of each fingerprint on freshly recorded audio, we can find out which machine has been returned to the rental shop.
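
A rough sketch of steps 3 and 4 is given below; it assumes `fingerprints` is a dict mapping each machine name to a list of small square kernels scaled to [-1, 1], and `ssm` is the self-similarity matrix of the freshly recorded audio. The scoring convention here is a simplification of what we used.

```python
import numpy as np

def activation(ssm: np.ndarray, kernel: np.ndarray) -> float:
    """Slide a kernel along the diagonal of the SSM and average its responses."""
    k = kernel.shape[0]
    scores = [np.sum(ssm[t:t + k, t:t + k] * kernel)      # elementwise product, summed
              for t in range(ssm.shape[0] - k + 1)]
    return float(np.mean(scores))

def identify_machine(ssm: np.ndarray, fingerprints: dict) -> str:
    """Return the machine whose fingerprints react to the recording the most."""
    totals = {name: np.mean([activation(ssm, kernel) for kernel in kernels])
              for name, kernels in fingerprints.items()}
    return max(totals, key=totals.get)
```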

And that’s it! We saw how something seemingly natural, the sounds surrounding us, can produce very complex signals. We learned how to begin to understand this mess via spectral analysis, which led us to uncover structure hidden in the data—something our brain does with ease. Finally, we used this structure to produce a solution to our original business use case of classifying machine sounds.

I hope you have enjoyed this little excursion into the mathematical world of audio data and colourful graphs. Maybe next time you start your car (or your soil compactor) you might wonder whether you could recognise its sound from your neighbour’s identical one and what it is about their sounds that lets your brain achieve that.

If you are interested in applying advanced sensor data (audio or otherwise) analytics in your business context please reach out to me or the Solita Industrial team.

References

[1] J. Foote, Visualizing Music and Audio using Self-Similarity, MULTIMEDIA ’99, pp. 77-80 (1999) http://www.musanim.com/wavalign/foote.pdf

 

Connecting IoT fleets with LoRaWAN

For connecting IoT devices over the internet there are several network protocols available, like ZigBee, Bluetooth, BLE, WiFi, LTE-M, NB-IoT, Z-Wave, LoRa and LoRaWAN. Each one serves its own purpose and brings its own combination of features. In this blog post I go through a very interesting low-power, long-range protocol: LoRaWAN.

Explaining the concepts

LoRa (Long Range) is a wireless radio modulation technology based on Chirp Spread Spectrum (CSS). It encodes information on radio waves using frequency-modulated chirp pulses. It is ideal for transmitting data in small chunks, at low bit rates and over a longer range than WiFi, ZigBee or Bluetooth. The typical range is 2-8 km depending on the network environment, which makes it a good fit for applications that need to operate in low-power mode.

LoRaWAN is a wide area networking protocol built on top of LoRa. It defines the bi-directional communication protocol and the network system architecture: how devices connect to gateways, how gateways process packets and how packets find their way to network servers. LoRa, in turn, provides the physical layer and enables the long-range communication link.

Looking at this from the perspective of the OSI (Open Systems Interconnection) model of computer networking, LoRaWAN is a Media Access Control (MAC) protocol on OSI layer 2, whereas LoRa defines the physical layer at the bottom, i.e. the transmission of raw bits over a physical data link. LoRaWAN defines three device classes, A, B and C, for different power needs; Class A is the lowest-power class and supports bi-directional communication.

LoRa and LoRaWAN sitting on OSI model

 

Now that we understand the difference between LoRa and LoRaWAN, we can take a look at a typical network architecture. It consists of LoRaWAN-enabled devices (sensors or actuators), which connect wirelessly to the LoRaWAN network using LoRa. The gateway receives LoRa RF messages and forwards them to the network server. All network traffic can be bi-directional (depending on the LoRaWAN device class), so the gateway can also deliver messages to the device. Devices are not associated with a specific gateway; instead, the same sensor can be served by multiple gateways in the area.

The network server is responsible for managing the entire network. It forwards payloads to application servers, queues payloads coming from the application server to connected devices, and forwards join-request and join-accept messages between devices and the join server. Application servers are responsible for securely handling, managing and interpreting device data and for generating payloads towards connected devices. The join server is responsible for the OTA (Over-The-Air) activation process for adding devices to the network.

Typical LoRaWAN network architecture

 

LoRaWAN is deployed widely and globally. There are public network operators in many countries, like here in Finland, Sweden and Norway. Take a look at public network operators and open community networks.

LoRaWAN is globally deployed

Where is it used?

Low power, long range and low-cost connectivity are the top LoRaWAN benefits. These make new use cases possible, just to mention a few:

  • Asset tracking – Track the location and condition of business critical equipment like containers location or cargo temperature or other equipment condition. 
  • Supply chain monitoring – For example monitor food, medicine and other goods that need to be stored in a certain temperature through the entire supply-chain from production to storage and delivery.
  • Smart Water and Energy management – Monitor water and energy consumption
  • Smart environment – Air condition, loudness, air pressure, space optimization, building security, failure prediction.

Read more from the LoRa Alliance pages and also from our data-driven initiatives and solutions, like building everyday tools for EU citizens to combat climate change, circular economy, Fortum electricity retail business and Edge computing starts new era of intelligence in forest harvesting.

 

Do I have to do all this by myself?

You can find the LoRaWAN network server as an open source product and deploy it to any cloud environment. But deploying, maintaining and operating the network server, join server and application servers can be a pain and is not so easy to get started with.

A hyperscaler like AWS can help with this. AWS IoT Core has LoRaWAN capability, which is a fully managed solution for connecting and managing LoRaWAN-enabled devices with the AWS Cloud. With IoT Core for LoRaWAN you can set up a private network by connecting devices and gateways to the AWS Cloud, and there is no need to develop or operate the network server. Using AWS technologies, the LoRaWAN network architecture looks like this:

Private LoraWAN network using AWS IoT Core

 

How about the real devices?

For asset tracking, for example, there are plenty of devices available on the market. I recently bought a LoRaWAN-capable GPS tracking device and an indoor LoRaWAN gateway. The tracker is a small pocket/keychain-sized device, and the gateway is easy to register to the AWS cloud, as sketched below.

LoRaWAN GPS tracker and gateway
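
For a taste of what registering such a device to AWS IoT Core for LoRaWAN can look like, here is a heavily hedged boto3 sketch using the iotwireless client; the EUIs, keys and profile IDs are placeholders, and the exact request fields depend on your device's LoRaWAN version and activation mode.

```python
import boto3

iotw = boto3.client("iotwireless")

# All identifiers below are placeholders; real values come from the device vendor
# and from the destination and profiles created earlier in IoT Core for LoRaWAN.
iotw.create_wireless_device(
    Type="LoRaWAN",
    Name="gps-tracker-01",
    DestinationName="tracker-destination",
    LoRaWAN={
        "DevEui": "0000000000000000",
        "DeviceProfileId": "device-profile-id",
        "ServiceProfileId": "service-profile-id",
        "OtaaV1_0_x": {
            "AppKey": "00000000000000000000000000000000",
            "AppEui": "0000000000000000",
        },
    },
)
```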

 

The power of low power is powerful

LoRaWAN is not ideal in all environments, like where you need low latency, high bandwidth and continuous availability.

But if you need low power (for example, running on a battery for a few years), long range and cost-efficient data transfer, then LoRaWAN might be your choice.

Check out our Connected Fleet Kickstart for boosting development for Fleet management and LoRaWAN:

https://www.solita.fi/en/connected-fleet/

And take a look at other blog posts related to the IoT scene, like M2M Meets IoT.

 

 

Solita Health researched: Omaolo online symptom checker helps to predict national healthcare admissions related to COVID-19

In our recently published study [1], my Solita colleagues Kari Antila and Vilma Jägerroos and I examined the possibility of predicting the burden on healthcare using machine learning methods. We used data on symptoms and past healthcare utilization collected in Finland. Our results show that COVID-19-related healthcare admissions could be predicted one week ahead with an average accuracy of 76% during the first wave of the pandemic. Similar symptom checkers could be used in other societies and for future epidemics, and they provide an opportunity to collect data on symptom development very rapidly and at a relatively low cost.

The rapid spread of the SARS-CoV-2 virus in March 2020 presented challenges for nationwide assessment of the progression of the COVID-19 pandemic. In Finland, Solita helped to add a COVID-19 symptom checker to a pre-existing national, CE-marked medical symptom checker service, ©Omaolo. The Omaolo COVID-19 symptom checker achieved considerable popularity immediately after its release, and the city of Helsinki, for example, has estimated annual savings of 2.5 million euros from its use. Although there have been studies about how well symptom checkers perform as clinical tools, the potential of their data for predicting epidemic progression has, to our knowledge, not yet been studied.

For this purpose, Solita developed a machine learning pipeline in the Finnish Institute for Health and Welfare’s (THL) computing environment for automated model training and comparison. The models created by Solita were retrained every week using time-series nested cross-validation, allowing them to adapt to the changes in the correlation of the symptom checker answers and the healthcare burden. The pipeline makes it easy to try new models and compare the results to previous experiments. 

We decided to compare linear regression, a simple and traditional method, to XGBoost regression, a modern option with many hyperparameters that can be learned from the data. The best linear regression model and the best XGBoost model (shown in the figure) achieved mean absolute percentage error of 24% and 32%, respectively. Both models get more accurate over time, as they have more data to learn from when the pandemic progresses.

COVID-19–related admissions predicted by linear regression and XGBoost regression models, together with the true admission count during the first wave of the pandemic in 2020.
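
The sketch below illustrates the general idea of comparing the two model families with a time-series split; it runs on random placeholder data and is not the study's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

# Placeholder data: weekly symptom-checker features and next week's admission counts.
rng = np.random.default_rng(0)
X, y = rng.random((52, 10)), rng.random(52) * 100

for name, model in [("linear regression", LinearRegression()),
                    ("xgboost", XGBRegressor(n_estimators=200, max_depth=3))]:
    errors = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        model.fit(X[train_idx], y[train_idx])          # train only on past weeks
        preds = model.predict(X[test_idx])             # predict the following weeks
        errors.append(mean_absolute_percentage_error(y[test_idx], preds))
    print(name, "MAPE:", np.mean(errors))
```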

Our results show that a symptom checker is a useful tool for making short-term predictions on the health care burden due to the COVID-19 pandemic. Symptom checkers provide a cost-effective way to monitor the spread of a future epidemic nationwide and the data can be used for planning the personnel resource allocation in the coming weeks. The data collected with symptom checkers can be used to explore and verify the most significant factors (age groups of the users, severity of the symptoms) predicting the progression of the pandemic as well.

You can find more details in the publication [1]. The research was done in collaboration with University of Helsinki, Finnish Institute for Health and Welfare, Digifinland Oy, and IT Centre for Science, and we thank everyone involved.

If you have similar register data and would like to perform a similar analysis, get in touch with me or Solita Health and we can work on it together!

Joel Röntynen, Data Scientist, joel.rontynen@solita.fi

References

[1] ​​Limingoja L, Antila K, Jormanainen V, Röntynen J, Jägerroos V, Soininen L, Nordlund H, Vepsäläinen K, Kaikkonen R, Lallukka T. Impact of a Conformité Européenne (CE) Certification–Marked Medical Software Sensor on COVID-19 Pandemic Progression Prediction: Register-Based Study Using Machine Learning Methods. JMIR Form Res 2022;6(3):e35181, doi: 10.2196/35181, PMID: 35179497

Reading the genomic language of DNA using neural networks

Neural networks are powerful tools in natural language processing (NLP). In addition, they can also learn the language of DNA and help in genome annotation. Annotated genes, in turn, play a key role in finding the causes and developing treatments for many diseases.

I have been finishing my studies while working at Solita and got the opportunity to do my master’s thesis in the ivves.eu research program in which Solita is participating. The topic of my thesis consisted of language, genomics and neural networks, and this is a story of how they all fit into the same picture.

When I studied Data Science at the University of Helsinki, courses in NLP were my favorites. In NLP, algorithms are taught to read, generate, and understand language, in both written and spoken forms. The task is difficult because of the characteristics of language: words and sentences can have many interpretations depending on the context. Therefore, language is far from the exact calculations and rules that algorithms are good at. Of course, such challenges only make NLP more attractive!

Neural networks

This is where neural networks and deep learning come into play. When a computational network is allowed to process a large amount of text over and over again, the properties of the language will gradually settle into place, forming a language model. A good model seems to “understand” the nuances of language, although the definition of understanding can be argued, well, another time. Anyways, these language models taught with neural networks can be used for a wide variety of NLP problems. One example would be classifying movie reviews as positive or negative based on the content of the text. We will see later how the movie reviews can be used as a metaphor for genes.

In recent years, a neural network architecture called transformers has been widely used in NLP. It utilizes a method called attention, which is said to pay attention to emphases and connections of the text (see the figure below). This plays a key role in building the linguistic “understanding” for the model. Examples of famous transformers (other than Bumblebee et al.) include Google’s BERT and OpenAI’s GPT-3. Both are language models, but transformers are, well, transformable and can also be used with inputs other than natural language.

An example of how transformers self-attention “sees” the connections in a sentence. The difference of the last word completely changes what the word “it” most refers to.

 

DNA-language

And here DNA and genomes come into the picture (also literally in the picture below). You see, DNA has its own grammar, semantics, and other linguistic properties. At its simplest, genes can be thought of as positive movie reviews, and non-coding sequences between genes as negative reviews. However, because the genomes of organisms are part of nature, genes are a little more complex in reality. But this is just one more thing that language and genomics have in common: the rules do not always apply and there is room for interpretation.

Simplification of a genome and a gene. Genomic data is a long sequence of characters A, T, C, and G representing four nucleotide types. Genes are coding parts of the genome. At their simplest, they consist of start and end points and the characters between them.

 

Since both text and genomic data consist of letters, it is relatively straightforward to teach the transformer model with DNA sequences instead of text. Like the classification of movie reviews in an NLP-model, the DNA-model can be taught to identify different parts of the genome, such as genes. In this way, the model gains the understanding of the language of DNA.

In my thesis, I used DNABERT, a transformer model that has been pre-trained with a great amount of genomic data. I did my experiments with one of the most widely known genomes, that of the E. coli bacterium, and fine-tuned the model to predict its gene locations.
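
As a hedged sketch of what such fine-tuning generally looks like with the Hugging Face transformers library (this is not my thesis code): the checkpoint path is a placeholder for a pre-trained DNA language model, the random toy sequences stand in for real labelled windows of the E. coli genome, and note that real DNA models may expect k-mer tokenised input rather than raw sequences.

```python
import random
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/pretrained-dnabert"   # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-in data: ~100-character sequences labelled gene (1) or non-gene (0).
def random_seq(n=100):
    return "".join(random.choice("ATCG") for _ in range(n))

data = Dataset.from_dict({
    "sequence": [random_seq() for _ in range(64)],
    "label": [random.randint(0, 1) for _ in range(64)],
})
data = data.map(lambda row: tokenizer(row["sequence"], truncation=True,
                                      padding="max_length", max_length=128))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dnabert-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```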

Example of my experiments: the Receiver operating characteristic (ROC) curves helped me to find the most optimal input length for the genome data. Around 100 characters led to the highest curve and thus the best results, whilst 10 was obviously too short and 500 too long.

After finding the optimal settings and network parameters, the results clearly showed the potential. An accuracy of 90.15% shows that the model makes “wise” decisions instead of just guessing the locations of the genes. The method therefore has the potential to assist in a basic task of bioinformatics: new genomes are sequenced at a rapid pace, but their annotation is slower and more laborious. Annotated genes are used, for example, to study the causes of diseases and to develop treatments tailored to them.

There are also other methods for finding genes and other markers in DNA sequences, but neural networks have some advantages over more traditional statistics and rule based systems. Rather than human expertise in genomics, the neural network based method relies on the knowledge gathered by the network itself, using a large amount of genomic data. This saves time and expert hours in the implementation of the neural network. The use of the pre-trained general DNA language model is also environmentally friendly. Such a model can be fine-tuned with the task-specific data and settings in just a few iterations, saving computational resources and energy. 

There is a lot of potential in further developing the link between transformer networks and DNA to study what else the genome language has to tell us about the life around us. Could this technology contribute to the understanding of genetic traits, the study of evolution, the development of medicine or vaccines? These questions are closely related to the healthcare field, in which Solita has strong expertise, including in research. If you are interested in this type of research, I and other Solita experts will be happy to tell you more!

Venla Viljamaa (Data Scientist) venla.viljamaa@solita.fi linkedin.com/in/venlav/

How to choose your next machine learning project

Three steps to be intentionally agnostic about tools. Reduce technical debt, increase stakeholder trust and make the objective clear. Build a machine learning system because it adds value, not because it is a hammer to problems.

As data enthusiasts we love to talk, read and hear about machine learning. It certainly delivers value to some businesses. However, it is worth taking a step back. Do we treat machine learning as a hammer to problems? Maybe a simple heuristic does the job with substantially lower technical debt than a machine learning system.

Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

Google developers. Rules of ML.

In this article, I look at a structured approach for choosing the next data science project so that it aligns with business goals. It combines objectives and key results (OKR), value-feasibility and other suggestions to stay focused. It is especially useful for data science leads, business intelligence leads and data consultants.

Why data science projects require a structured approach

ML solves complex problems with data that has a predictive signal for the problem at hand. It does not create value by itself.

So, we love to talk about Machine learning (ML) and artificial intelligence (AI). On the one hand, decision makers get excited and make it a goal: “We need to have AI & ML”. On the other hand, the same goes for data scientists who claim: “We need to use a state-of-the-art method”. Being excited about technology has its upsides, but it is worth taking a step back for two reasons.

  1. Choosing a complex solution without defining a goal creates more issues than it solves. Keep it simple, minimize technical debt. Make it easy for a future person to maintain it, because that person might be you.
  2. A method without a clear goal fails to create business value and erodes trust. Beyond the hype around machine learning, we do data science to create business value. Ignoring this lets executives reduce funding for the next data project.

This is nothing new. But, it does not hurt to be reminded of it. If I read about an exciting method, I want to learn and apply it right away. What is great for personal development, might not be great for the business. Instead, start with what before thinking about how.

In the next section, I give some practical advice on how to structure the journey towards your next data project. The approach helps me to focus on what is next up for the business to solve instead of what ML method is in the news.

How to choose the next data science project

“Rule #1: Don’t be afraid to launch a product without machine learning.”

Google developers. Rules of ML.

Imagine you draft the next data science cases at your company. What project to choose next? Here are three steps to structure the journey.

Photo by Leah Kelley from Pexels

Step 1: Write data science project cards

The data science project card helps to focus on business value and lets you be intentionally agnostic about methodologies in the early stage

Summarize each idea in a data science project card which includes some kind of OKR, data requirements, value-feasibility and possible extensions. It covers five parts which contain all you need to structure project ideas, namely an objective (what), its key results (how), ideal and available data (needs), the value-feasibility diagram (impact) and possible extension. What works for me is to imagine the end-product/solution to a business need/problem before I put it into a project card.

Find the project card templates as markdown or powerpoint slides.

I summarize the data science project in five parts.

  1. An objective addresses a specific problem that links to a strategic goal/mission/vision, for example: “Enable data-driven marketing to get ahead of competitors”, “Automate fraud detection for affiliate programs to make marketing focusing on core tasks” or “Build automated monthly demand forecast to safeguard company expansion”.
  2. Key results list measurable outcomes that mark progress towards achieving the objective, for example: “80% of marketing team use a dashboard daily”, “Cover 75% of affiliate fraud compared to previous 3 month average” or “Cut ‘out-of-stock’ warnings by 50%, compared to previous year average”.
  3. Data describes properties of the ideal or available dataset, for example: “Transaction-level data of the last 2 years with details, such as timestamp, ip and user agent” or “Product-level sales including metadata, such as location, store details, receipt id or customer id”.
  4. Extensions explores follow-up projects, for example: “Apply demand forecast to other product categories” or “Take insights from basket analysis to inform procurement.”
  5. The value-feasibility diagram puts the project into a business perspective by visualizing value, feasibility and uncertainties around it. The smaller the area, the more certain is the project’s value or feasibility.

To provide details, I describe a practical example of how I use these parts for exploring data science projects. The journey starts by meeting the marketing team to hear about their work, needs and challenges. If a need can be addressed with data, they become the end users and the project target group. Already here, I try to sketch the outcome and ask the team how valuable it is, which gives a first estimate of the value.

Next, I take the company’s strategic goals and formulate an objective that links to them following OKR principles. This aligns the project with mid-term business goals, makes it part of the strategy and increases buy-in from top-level managers. Then I get back to the marketing team to define key results that let us reach the objective.

A draft of an ideal dataset gets compared to what is available with data owners or the marketing team itself. That helps to get a sense for feasibility. If I am uncertain about value and feasibility, I increase the area in the diagram. It is less about being precise, but about being able to compare projects with each other.

Step 2: Sort projects along value and feasibility

Value-feasibility helps to prioritize projects, takes a business perspective and increases stakeholder buy-in.

Ranking each project along value and feasibility makes it easier to see which one to prioritize. The areas visualize uncertainties on value and feasibility. The larger they stretch along an axis, the less certain I am about either value or feasibility. If they are more dot-shaped, I am confident about a project’s value and its feasibility.

Projects with their estimated value and feasibility

Note that some frameworks evaluate adoption and desirability separately from value and feasibility. But you get low value when you score low on either adoption or desirability. So, I estimate the value with business value, adoption and desirability in mind without explicitly mentioning them.

Data science projects tend to be long-term with low feasibility today and uncertain, but potentially high future value. Breaking down visionary, less feasible projects into parts that add value in themselves could produce a data science roadmap. For example, project C which has uncertain value and not feasible as of today, requires project B to be completed. Still, the valuable and feasible project A should be prioritized now. Thereafter, aim for B on your way to C. Overall, this overview helps to link projects and build a mid-term data science roadmap.

Related data science projects combined to a roadmap

Here is an example of a roadmap that starts with descriptive data science cases and progresses towards more advanced analytics such as forecasting. That gives a prioritization and helps to draft a budget.

Step 3: Iterate around the objective, method, data and value-feasibility

Be intentionally agnostic about the method first, then opt for the simplest one, check the data and implement. Fail fast, log rigorously and aim for the key results.

Implementing data science projects has so many degrees of freedom that it is beyond the scope of this article to provide an exhaustive review. Nevertheless, I collected some statements that can help through the project.

  1. Don’t be afraid to launch a product without machine learning. And do machine learning like the great engineer you are, not like the great machine learning expert you aren’t. (Google developers. Rules of ML.)
  2. Focus on few customers with general properties instead of specific use cases (Zhenzhong Xu, 2022. The four innovation phases of Netflix’ trillions scale real-time data infrastructure.)
  3. Keep the first model simple and get the infrastructure right. Any heuristic or model that gives quick feedback suits the early project stages. For example, start with linear regression or a heuristic that predicts the majority class for imbalanced datasets (see the baseline sketch after this list). Build and test the infrastructure around those components and replace them when the surrounding pipelines work (Google developers. Rules of ML. Mark Tenenholtz, 2022. 6 steps to train a model.)
  4. Hold the model fixed and iteratively improve the data. Embrace a data-centric view where data consistency is paramount. This means, reduce the noise in your labels and features such that an existing predictive signal gets carved out for any model (Andrew Ng, 2021. MLOps: From model-centric to data-centric AI).
  5. Each added component also adds a potential for failure. Therefore, expect failures and log any moving part in your system.
  6. Test your evaluation metric and ensure you understand what “good” looks like (Raschka, 2020. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning.)
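
To illustrate point 3, here is a minimal baseline sketch on random placeholder data: a majority-class heuristic first, then a simple linear model, and only after that anything heavier.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder data for an imbalanced binary problem (~10% positive class).
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (rng.random(1000) > 0.9).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [("majority-class baseline", DummyClassifier(strategy="most_frequent")),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    print(name, "F1:", f1_score(y_test, model.predict(X_test), zero_division=0))
```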

There are many more best practices to follow and they might work differently for each of us. I am curious to hear yours!

Conclusion

In this article, I outlined a structured approach for data science projects. It helps me to channel efforts into projects that fit business goals and choose appropriate methods. Applying complex methods like machine learning independent of business goals risks accruing technical debt and at worst jeopardizes investments.

I propose three steps to take action:

  1. Write a project card that summarizes the objective of a data science case and employs goal-setting tools like OKR to engage business-oriented stakeholders.
  2. Sort projects along value and feasibility to reasonably prioritize.
  3. Iterate around the objective, method, data and value-feasibility and follow some guiding industry principles that emerged over the last years.

The goal is to translate data science use cases into something more tangible, bridging the gap between business and tech. I hope that these techniques empower you for your next journey in data science.

Happy to hear your thoughts!

Materials for download

Download the data science project template, structure and generic roadmap as Power Point slides here. You can also find a markdown of a project template here.

Data Governance is your way from Data minor leagues to major leagues

In the spirit of Valentine’s Day, this post celebrates my love of Data Governance. It is also a teaser for a future series of Data Governance related blog posts by me and other members of the Solita Data Governance team.

I will be copying the trend of using sports analogies, but rather than focusing on explaining the basics I want to explain what Data Governance brings to the game – why Data Governance is something for organisations to embrace, not to fear.

Data Governance can seem scary and to be all about oversight and control, but the aim of governance is never to be constricting without a purpose!

Data Governance is established for the people and is done by people.

Think about the football players on the field during a game: they should all be aware of the goal and of their individual roles. But can they also pass the ball to each other efficiently? Do they even know why they are playing all the games, or are they running around without a plan?

Data Governance as the Backroom staff

In football it is rarely the case that players run around aimlessly, because the team spends a lot of time not just playing, but training, strategizing, going through tactics, game plays and so on. All that work done outside the actual game is just as important. The team has a manager, a coach, trainers – the Backroom staff. The staff and players work together as a team to achieve progress.

In organisations, Data Management should have Data Governance as its Backroom staff to help improve its “game”.

A playbook exists to make sure the players have the guidance needed to perform at their optimal level. The playbook states the rules that need to be followed: some are general laws from the outside, then there are the rules of the game, and there are detail-level rules for the team itself. Players need to learn their playbook and understand it.

The Playing field

Before getting to the roles and the playbook, think about: Who needs a playbook? Where to start? Did you think “from the area where there are most issues”? Unfortunately that is the road most are forced to take, because the wake-up call to start building governance comes when big issues have already appeared.

Don’t wait for trouble – take the easy road first.

Instead of getting yourself into trouble by choosing the problematic areas, think about a team or function of which you can already say: these are the players on that field, and this is their common goal. Even better if you know the owner and the captain of the team, because then you already have the people who can start working on the playbook.

If you are now thinking about the players as just the people in IT and data functions – think again! Data management is also done by people in business processes who handle, modify and add to the data. Once there is a running governance in at least part of the organisation, you can take that as an example and use the lessons learned to start widening the scope to the problematic areas.

Conclusion

Organisations are doing data management and perhaps already doing data governance, but how good their Data Management is depends on their governance.

Data Management without governance is like playing in the minors not in the major leagues.

In the next posts on this theme, we will dive into figuring out who the coach and the other members of the Backroom staff are, and what their responsibilities are. We will have a closer look at the content of the playbook and how you can start building a playbook that is the right fit for your organisation. Let the journey to the major leagues begin!

 #ilovedatagovernance

Microchips and fleet management

The ultimate duo for smart product at scale

We have seen how cloud based manufacturing has taken a huge step forward and you can find insights listed in our blog post The Industrial Revolution 6.0. Cloud based manufacturing is already here and extends IoT to the production floor. You could define a connected factory as a manufacturing facility that uses digital technology to allow seamless sharing of information between people, machines, and sensors.

If you haven’t read it yet, there is a great article: Globalisation and digitalisation converge to transform the industrial landscape.

There is still much more than factories. Looking around you will notice a lot of smart products such as smart TVs, elevators, traffic light control systems, fitness trackers, smart waste bins and electric bikes. In order to control and monitor the fleet of devices we need rock solid fleet management capabilities that we will cover in another blog post.

This movement towards digital technologies, autonomous systems and robotics will require the most advanced semiconductors to deliver even more high-performance, low-power, low-cost microcontrollers that can carry out complicated actions and operations at the Edge. The rise of the Internet of Things and the growing demand for automation across end-user industries are fueling growth in the global microcontroller market.

As software has eaten the world and every product is becoming a data product, eventually there will only be SaaS companies.

Devices in the field must be robust to connectivity issues, in some cases withstand operating temperatures of -30 to 70°C, be built for resilience and be able to work in isolation most of the time. Data is secured on the device; it stays there and only the relevant information is ingested into other systems. Machine-to-machine communication is a crucial part of these solutions, and it's nothing new, as explained in the blog post M2M has been here for decades.

Microchip powered smart products

A very fine example of world-class engineering is the Oura Ring. At this scale it's typical to have a dual-core ARM Cortex based ultra-low-power MCU with limited memory to store data for up to 6 weeks. Even at this size it's packed with sensors such as an infrared PPG (photoplethysmography) sensor, a body temperature sensor, a 3D accelerometer and a gyroscope.

Smartwatches use, for example, the Exynos W920, a wearable processor built on a 5nm node that packs two Arm Cortex-A55 cores and an Arm Mali-G68 GPU. Even at this small size it includes a 4G LTE modem and a GNSS L1 sensor to track speed, distance and elevation when watch wearers are outdoors.

The mobile phone in your pocket may be powered by the Qualcomm Snapdragon 888, with 8 cores running at 1.8–3 GHz and a Cortex-X1 prime core with 3 MB of cache.

Another example is Tesla, famous for its self-driving FSD Chip designed in-house for autonomous driving. The FSD Chip incorporates 3 quad-core Cortex-A72 clusters for a total of 12 CPUs operating at 2.2 GHz, a Mali G71 MP12 GPU operating at 1 GHz, 2 neural processing units operating at 2 GHz, and various other hardware accelerators. Infotainment systems can be built on the seriously powerful AMD Ryzen APU with RDNA 2 graphics, so you can play The Witcher 3 and Cyberpunk 2077 while waiting inside your car.

Artificial Intelligence – where machines are smarter

Just a few years ago, executing machine learning models at the Edge on a fleet of devices was a tricky job due to the lack of processing power, hardware restrictions and the sheer amount of software work to be done. Very often the limitation is the amount of flash and RAM available to store more complex models on a particular device. Running AI algorithms locally on a hardware device using edge computing, where the algorithms work on data created on the device without requiring any connection, is a clear bonus. This allows you to process data on the device in less than a few milliseconds, which gives you real-time information.
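
As a rough sketch of what this looks like in practice, the snippet below runs a TensorFlow Lite model locally with the tflite_runtime package; the model file name and the input values are placeholders for illustration.

import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="model.tflite")   # placeholder model file deployed to the device
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy sensor reading shaped to whatever the model expects.
sample = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], sample)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]))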

Figure 1. Illustrative comparison of how many ‘cycles’ a microprocessor can do (MHz)

Raw computing power is always a product of many factors, as the Apple M1 demonstrated by delivering the same performance as other choices at a much lower cost. So far it's the most powerful mobile CPU in existence, so long as your software runs natively on its ARM-based architecture. Depending on the AI application and device category, there are various hardware options for performing AI edge processing, such as CPUs, GPUs, ASICs, FPGAs and SoC accelerators.

Prices for a microcontroller board with flexible digital interfaces start at around $4, with very limited ML capabilities. Nowadays mobile phones are actually powerful enough to run heavy compute operations, thanks to purpose-designed, super-boosted microchips.

GPU-Accelerated Cloud Services

Amazon Elastic Compute Cloud (EC2) is a great example: with P4d instances, AWS is paving the way for another bold decade of accelerated computing powered by the latest NVIDIA A100 Tensor Core GPUs. The p4d comes with dual-socket Intel Cascade Lake 8275CL processors totaling 96 vCPUs at 3.0 GHz, 1.1 TB of RAM and 8 TB of NVMe local storage. P4d also comes with 8 x 40 GB NVIDIA A100 GPUs with NVSwitch and 400 Gbps Elastic Fabric Adapter (EFA) enabled networking. In practice this means you do not have to take so many coffee breaks waiting around when running Machine Learning (ML), High Performance Computing (HPC) and analytics workloads. You can find more on P4d from AWS.

 

Top 3 benefits of using Edge for computing

There are clear benefits why you should be aware of Edge computing:

1. Reduced costs: data communication and bandwidth costs are reduced because less data is transmitted.

2. Improved security: data is processed locally, so you avoid streaming large amounts of raw data to the cloud.

3. High responsiveness: devices process data very fast compared to centralized IoT models.

 

Convergence of AI and Industrial IoT Solutions

According to a Gartner report, “By 2027, machine learning in the form of deep learning will be included in over 65 percent of edge use cases, up from less than 10 percent in 2021.” Typically these solutions have not fallen under Enterprise IT – at least not yet. Edge management is expected to become an IT focus, utilizing IT resources to optimize cost.

Take a look at the Solita AI Masterclass for Executives to see how we can help you bring business cases to life, and you might be interested in taking control of your fleet with our kickstart. Let's stay fresh-minded!

M2M meets IoT

M2M has been here for decades and is the foundation for IoT

In this blog post I continue the discussion around Industrial Connected Fleets from the M2M (machine-to-machine) point of view.

M2M and IoT. Can you do one without another?

M2M (machine-to-machine) refers to an environment where networked machines communicate with each other without human intervention.

Traffic control is one example of an M2M application. There, multiple sensing devices collect traffic volume and speed data around the city and send the data to an application that controls the traffic lights. The intelligence of this application makes traffic more fluent, opens bottlenecks and helps traffic flow from one city area to another. No human intervention is needed.

Another example is the automotive industry, where cars can communicate with each other and with the infrastructure around them. Cars create a network and enable applications that notify drivers about road or weather conditions. In-car systems also use M2M, for example rain detectors together with windshield wiper control.

There are lots of examples of where M2M can be used. In addition to the above, it is worth mentioning smart home and office applications, where, for example, one device measures direct sunlight near the window and notifies the window blind controller to close the blinds when a brightness threshold is crossed. Other very interesting M2M areas are robotics and logistics.

M2M sounds a lot like IoT. What's the difference? The difference is in the network architecture. In M2M, internet connectivity is not a must: devices and device networks can communicate without it. M2M is point-to-point communication and typically targets single devices using short-range communication (wired or wireless). IoT, in contrast, enables devices to communicate with cloud platforms over the internet and brings cloud computing and networking capabilities. The data collected by IoT devices is typically shared with other functions, processes and digital services, whereas M2M communication does not share the data.

I can say that IoT extends the capabilities of M2M.

 

Networking in M2M

M2M does not necessarily mean point-to-point communication. It can be point-to-multipoint as well. Communication can be wired or wireless, and the network topology can be ring, mesh, star, line, tree, bus, or something else that serves the application best, as M2M systems are typically built to be task- or device-specific.

Figure 1. Network topology

 

For distributed M2M networks there are a number of wireless technologies, like Wi-Fi, ZigBee, Bluetooth, BLE, 5G and WiMAX. These can also be implemented in hardware products for M2M communication. Of course, one option is to build the network with wired technology as well.

There are a few very interesting protocols for M2M communication, which I will go through at a high level. These are DDS, MQTT, CoAP and ZeroMQ.

The Data Distribution Service (DDS) is designed for real-time distributed applications. It is a decentralized publish/subscribe protocol without a broker. Data is organized into topics, and each topic can be configured individually for the required QoS. A topic describes the data, and publishers and subscribers send and receive data only for the topics they are interested in. DDS supports automatic discovery of publishers and subscribers, which is amazing! This makes it easy to extend the system and add new devices automatically in a plug-and-play fashion.

MQTT is a lightweight publish/subscribe messaging protocol. This protocol relies on a broker to which publishers and subscribers connect, and all communication is routed through the broker (centralized). Messages are published to topics, and subscribers can decide which topics to listen to and receive the messages. Automatic discovery is not supported in MQTT.
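
To make the broker-based model concrete, here is a minimal sketch using the paho-mqtt Python client; the broker host and topic names are placeholders.

import paho.mqtt.client as mqtt

def on_message(client, userdata, message):
    # Called by the client loop for every message the broker routes to us.
    print(message.topic, message.payload.decode())

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.com", 1883)        # central broker (hypothetical host)
client.subscribe("factory/line1/temperature")
client.publish("factory/line1/temperature", "21.5")
client.loop_forever()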

CoAP (Constrained Application Protocol) is for low-power electronic devices, “nodes”. It uses an HTTP REST-like model where servers make resources available under a URL, and clients can access resources using GET, PUT, POST and DELETE methods. CoAP is designed for use between devices on the same network, between devices and nodes on the internet, and between devices on different networks both joined by an internet. It also provides a way to discover node properties from the network.
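
A minimal client-side sketch with the aiocoap Python library could look like the following; the device URI is a placeholder.

import asyncio
from aiocoap import Context, Message, GET

async def main():
    # Create a CoAP client context and fetch one resource from a node.
    protocol = await Context.create_client_context()
    request = Message(code=GET, uri="coap://device.local/sensors/temperature")
    response = await protocol.request(request).response
    print(response.code, response.payload)

asyncio.run(main())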

ZeroMQ is a lightweight, socket-like, sender-to-receiver message queuing layer. It does not require a broker; instead, devices can communicate directly with each other. Subscribers connect to the publisher they need and start subscribing to messages from their area of interest. A subscriber can also be a publisher, which makes it possible to build complex topologies as well. ZeroMQ does not support automatic discovery.
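
For comparison, here is a brokerless PUB/SUB sketch with pyzmq; both sides are shown in one script only for brevity, and the port and payload are illustrative.

import time
import zmq

ctx = zmq.Context()

# Publisher side (e.g. a sensor node)
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

# Subscriber side (e.g. a controller), in practice running on another device
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "temperature")

time.sleep(0.5)                        # allow the subscription to propagate
pub.send_string("temperature 21.5")
print(sub.recv_string())               # -> "temperature 21.5"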

As you can see, there is a variety of protocols with different features. Choose the right one based on your system requirements.

 

Make Fleet of Robots work together with AWS

DDS is great for distributed M2M networks. For robotics there is the open-source framework ROS (Robot Operating System), and version 2 (ROS 2) is built on top of DDS. With the help of DDS, ROS nodes can communicate easily within one robot or between multiple robots. For example, 3D visualization for distributed robotics systems is one of the ROS-enabled features.

Figure 2. Robot and ROS

 

I recommend you check out the AWS IoT RoboRunner service. It makes it easier to build and deploy applications that help fleets of robots work together. With RoboRunner, you can connect your robots and work management systems, which enables you to orchestrate work across your operation through a single system view. Applications you build in AWS RoboMaker are based on ROS, and with RoboMaker you can simulate first without needing real robotics hardware.

Our tips for you

It’s very clear that M2M communication brings advantages like:

  • Minimum latency, higher throughput and lower energy consumption
  • It is for mobile and fixed networks (indoors and outdoors)
  • Smart device communication requires no human intervention
  • Private networks bring extra security

And together with IoT, the advantages are at the next level.

Supercharge your system with a distributed M2M network and make it planet-scale with AWS IoT services. The technology supports very complex M2M networks where you can have distributed intelligence spread across tiny low-power devices.

Check out our Connected Fleet Kickstart for boosting development for Fleet management and M2M: 

https://www.solita.fi/en/connected-fleet/

 

 

SQL Santa for Factory and Fleet

Awesome SQL Is coming To Town

We have a miniseries coming before Christmas where we talk S-Q-L, /ˈsiːkwəl/ “sequel”. Yes, the 47-year-old domain-specific language used in programming and designed for managing data. It's very nice to see how old faithful SQL is going stronger than ever, for stream processing as well as its original relational database management purposes.

What is data, then, and how should it be used? Take a look at the article written in Finnish, “Data ei ole öljyä, se on lantaa” (“Data is not oil, it is manure”).

We will show you how to query and manipulate data across different solutions using the same SQL programming language.

The Solita Developer survey has become a tradition here at Solita, so please check out the latest survey. It's easy to see how SQL is dominating in a pool of many cool programming languages. It might take an average learner about two to three weeks to master the basic concepts of SQL, and this is exactly what we will do with you.

Data modeling and real-time data

Operational technology (OT) solutions have been real time from day one, although there is also an element of the illusion of real time when it comes to IT systems. We could say that a network latency of 5-15 ms towards the cloud, combined with data processing at single-digit-millisecond latency irrespective of scale, is considered near real time. This is important for Santa Claus and for Industry 4.0, where autonomous fleets, robots and real-time processing in automation and control are a must-have. Imagine a situation where Santa's autonomous sleigh, with smart safety systems boosted by computer vision (CV) to bypass airplanes and make smart decisions, operated on a time scale of seconds or minutes – that would be a nightmare.

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.

It’s easy to identify at least conceptual, logical and physical data models; in this exercise we are most interested in the last one, for storing and querying data.

Back to the Future

The dimensional model, heavily developed by Ralph Kimball, was a breakthrough in 1996 and introduced concepts like fact tables and dimensions, ultimately leading to the star schema. The challenge of this modeling is keeping conformed dimensions across the data warehouse, and the data processing can create unnecessary complexity.

One of the main driving factors behind using Data Vault is audit and historical tracking. This methodology was developed by Daniel (Dan) Linstedt in the early 2000s. It has gained a lot of traction because it supports modern cloud platforms with massively parallel processing (MPP) of data loading, without worrying so much about which entity should be loaded first. The possibility to even create a data warehouse from scratch and just load the data in is pretty powerful when designing an idempotent system.

A quite typical data flow looks like the picture above, and as you have already noticed, this has an impact on how fast data lands in applications and reaches users. The Theses for Successful Modern Data Warehousing are useful to read when you have time.

Data Mesh's ultimate promise is to eliminate friction, so that producers can deliver quality data and consumers can discover, understand and use the data at rapid speed. You could imagine this as data products in their own sandboxes with a common control plane and governance. In any case, to be successful you need expertise from different areas such as business, domain and data. At the end of the day, Data Mesh does not take a strong position on data modeling.

Wide Tables / One Big Table (OBT), basically nested and denormalized tables, is perhaps the most controversial modeling approach. Shuffling data between compute instances when executing joins has a negative impact on performance (yes, you can e.g. replicate dimensional data to nodes and keep the fact table distributed, which improves performance), and very often the operational data structures produced by microservices and exchanged over APIs are closer to this “nested” structure. Having the same structure and logic for batch SQL as for streaming SQL will ease your work.

Breaking down OT data items into multiple suboptimal data structures inside IT systems loses the single, atomic data entity. Having said that, it is possible to ingest e.g. Avro files into an MPP system, keeping the structure the same as the original file and using evolving schemas to discover new attributes. That can then be used as a baseline to load target layers such as Data Vault.

One interesting concept is the Activity Schema, which is sold to us as being designed to make data modeling and analysis substantially simpler and faster.

Contextualize data

For our industrial Santa Claus case, one very important thing is how to create an inventory and contextualize data. One very promising path is an augmented data catalog, which we will cover a bit later. For some reason there is material out there claiming that IoT data has no structure, which is simply incorrect. The only reason I can think of is that this kind of data asset did not fit traditional data warehouse thinking.

Something to take a look at is Apache Avro, a language-neutral data serialization system developed by Doug Cutting, the father of Hadoop. The other one is JSON, an open standard file and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. These are not solutions for data modeling; rather, you will notice later in this blog post how valuable they are for streaming data, having a schema compared to formats like CSV.

Business case for Santa

As always, everything starts with the Why and a solution discovery phase: what do we actually want to build, and would it have business value? At Christmas time our business is around gifts and how to deliver them on time. Our model is a bit simplified and includes operational technology systems such as asset (Santa's workshop) and fleet (sleighs) operations. There is always something broken, so a few maintenance needs are pushed to the technicians (elves). A distributed data platform is used for supply chain and logistics analytics to remove bottlenecks, so that the business owners (Santa Claus and the team) are satisfied and all gifts are delivered to the right address just in time.

Case Santa’s workshop

We can later use OEE (Overall Equipment Effectiveness) to calculate the workshop's performance in producing high-quality, nice gifts. Data is ingested in real time and contextualized, so once in a while Santa and the team can check how we are doing. In this specific case we know that, using Athena, we can find the relevant production line data just by querying the S3 bucket where all the raw data is already stored.

Day 1 – creating a Santa’s table for time series data

Let’s create a very basic table to capture all data from Santa’s factory floor. You will notice there are different data types, like bigint and string. You can even add comments to help others later understand what kind of data a field should contain. In this case the raw data is Avro, but you do not have to worry about that, so let’s go.

CREATE EXTERNAL TABLE `raw`(
  `seriesid` string COMMENT 'from deserializer',
  `timeinseconds` bigint COMMENT 'from deserializer',
  `offsetinnanos` bigint COMMENT 'from deserializer',
  `quality` string COMMENT 'from deserializer',
  `doublevalue` double COMMENT 'from deserializer',
  `stringvalue` string COMMENT 'from deserializer',
  `integervalue` int COMMENT 'from deserializer',
  `booleanvalue` boolean COMMENT 'from deserializer',
  `jsonvalue` string COMMENT 'from deserializer',
  `recordversion` bigint COMMENT 'from deserializer'
) PARTITIONED BY (
  `startyear` string, `startmonth` string,
  `startday` string, `seriesbucket` string
)

Day 2 – query Santas’s data

Now we have a table – how do we query it? That is easy with SELECT, taking all fields using an asterisk. It's even possible to limit the result to 10 rows, which is always a good practice.

SELECT * FROM "sitewise_out"."raw" limit 10;

Day 3 – Creating a view from query

A view is a virtual presentation of data that helps to organize assets more efficiently. One golden rule is not to create many views on top of other views, and to keep the solution simple. You will notice that CREATE VIEW works nicely, and now we have timeinseconds and the actual factory floor value (doublevalue) captured. You can even drop the view using the DROP command.

CREATE OR REPLACE VIEW "v_santa_data"
AS SELECT timeinseconds, doublevalue FROM "sitewise_out"."raw" limit 10;

Day 4 – Using functions to format dates to Santa

You noticed that timeinseconds is in epoch format, so let's use functions to get a more human-readable output. We add a small from_unixtime function and combine it with date_format to get the output formatted the way we want. Perfect – now we know from which date Santa's manufacturing data originated.

SELECT date_format(from_unixtime(timeinseconds),'%Y-%m-%dT%H:%i:%sZ') , doublevalue FROM "sitewise_out"."raw" limit 10;

 Day 5 – CTAS creating a table

Using CTAS (CREATE TABLE AS SELECT) you can easily create a new physical table. You will notice that an Athena-specific format clause has been added, which you do not need on relational databases.

CREATE TABLE IF NOT EXISTS new_table_name
WITH (format='Avro') AS
SELECT timeinseconds, doublevalue FROM "sitewise_out"."raw" limit 10;

Day 6 – Limit the result sets

Now I want to limit the results to only those rows where the quality is GOOD. By adding a WHERE clause I get only those rows in my output – that is cool!

SELECT * FROM "sitewise_out"."raw"  where quality='GOOD' limit 10;

 


Case Santa’s fleet

Now we jump into Santa's fleet, meaning the sleighs, and there are a few attributes we are interested in, like SleighID, IsSmartLock, LastGPSTime, SleighStateID, Latitude and Longitude. This is time series data that is ingested into our platform in near real time. Let's use the Amazon Timestream service, which is a fast, scalable, serverless time series database service for IoT and operational applications. A time series is a data set that tracks a sample over time.

Day 7 – creating a table for fleet

You will notice very quickly that the data model looks different from relational database cases. There is no need to define the table structure beforehand; just executing CreateTable is enough.
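
For illustration, a minimal sketch of that CreateTable call with boto3 could look like the following; it assumes the movementdb database already exists, and the retention values are only examples.

import boto3

ts = boto3.client("timestream-write", region_name="eu-west-1")
ts.create_table(
    DatabaseName="movementdb",
    TableName="tbl_movement",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,     # recent data kept in the memory store
        "MagneticStoreRetentionPeriodInDays": 365,   # older data kept in the magnetic store
    },
)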

 

Day 8- query the latest record

You can override the time field using e.g. LastGPSTime; in this example we use the time when the data was ingested, so getting the last movement of a sleigh looks like this.

SELECT * FROM movementdb.tbl_movement
ORDER BY time DESC
LIMIT 1

Day 9- let’s check the last 24 hours movement

We can use time to filter our results and order them in descending order at the same time.

SELECT *
FROM "movementdb"."tbl_movement" 
WHERE time > ago(24h) 
ORDER BY time DESC

Day 10- latitude and longitude

We can easily find the latitude and longitude information, and please note we are using the IN operator to get both into the query result.

SELECT measure_name,measure_value::double,time 
FROM "movementdb"."tbl_movement" 
WHERE time > ago(24h) 
and measure_name in ('Longitude','Latitude')
ORDER BY time DESC LIMIT 10

Day 11- last connectivity info

Now we do two things: we group the data based on sleigh id and find the maximum time value. This tells us when each sleigh was last connected and sending data to our platform. There are plenty of functions to choose from, so please check the documentation.

SELECT max(time) as last_time, sleighId
FROM "movementdb"."tbl_movement" 
WHERE time > ago(24h) 
and measure_name = 'LastGPSTime'
group by sleighId

Day 12- using conditions for smart lock data

CASE is very powerful for manipulating query results, so in this example we use it to indicate more clearly whether a sleigh has a smart lock.

SELECT time, measure_name,
CASE 
WHEN measure_value::boolean = true THEN 'Yes we have a smart lock'
ELSE 'No we do not that kind of fancy locks'
END AS smart_lock_info
FROM "movementdb"."tbl_movement"
WHERE time between ago(1d) and now() 
and measure_name='IsSmartLock'

Day 13- finding the latest battery level on each fleet equipment

This one is a bit more complex: one query finds the time of the latest battery reading, and we then join that back to the base data, so for each sleigh we know the latest battery level in the past 24 hours. Please notice we are using an INNER join in this example.

WITH latest_battery_time as (
select 
d_sleighIdentifier, 
max(time) as latest_time 
FROM 
"movementdb"."tbl_movement" 
WHERE 
time between ago(1d) 
and now() 
and measure_name = 'Battery' 
group by 
d_sleighIdentifier
) 
SELECT 
b.d_sleighIdentifier, 
b.measure_value :: double as last_battery_level 
FROM 
latest_battery_time a 
inner join "movementdb"."tbl_movement" b on a.d_sleighIdentifier = b.d_sleighIdentifier 
and b.time = a.latest_time 
WHERE 
b.time between ago(1d) 
and now() 
and b.measure_name = 'Battery'

Day 14- distinct values

The SELECT DISTINCT statement is used to return only distinct (different) values. This is great, and also often misused for removing duplicates when the actual problem lies in the JOIN conditions.

SELECT 
DISTINCT (d_sleighIdentifier) 
FROM 
"movementdb"."tbl_movement"

Day 15- partition by is almost magic

The PARTITION BY clause is a subclause of the OVER clause. It divides a query's result set into partitions, and the window function operates on each partition separately, recalculating for each partition. This is almost magic and can be used in several ways – in this example, to identify the last sleigh id per sleigh type.

select 
d_sleighIdentifier, 
SUM(1) as total 
from 
(
SELECT 
*, 
first_value(d_sleighIdentifier) over (
partition by d_sleighTypeName 
order by 
time desc
) lastaction 
FROM 
"movementdb"."tbl_movement" 
WHERE 
time between ago(1d) 
and now()
) 
GROUP BY 
d_sleighIdentifier, 
lastaction

Day 16- interpolation (values of missing data points)

Timestream and a few other IoT services support linear interpolation, enabling you to estimate and retrieve the values of missing data points in time series data. This comes in very handy when our fleet is not connected all the time; in this example we use it for our smart sleigh's battery level.

WITH rawseries as (
select 
measure_value :: bigint as value, 
time as d_time 
from 
"movementdb"."tbl_movement" 
where 
measure_name = 'Battery'
), 
interpolate as (
SELECT 
INTERPOLATE_LINEAR(
CREATE_TIME_SERIES(d_time, value), 
SEQUENCE(
min(d_time), 
max(d_time), 
1s
)
) AS linear_ts 
FROM 
rawseries
) 
SELECT 
time, 
value 
FROM 
interpolate CROSS 
JOIN UNNEST(linear_ts)

Case Santa’s  master data

Now that the factory and fleet are covered, we jump into master data. In this very complex supply chain system, customer data is typical transactional data, and in this exercise we keep it very atomic, storing only very basic info in DynamoDB, a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. We use this data on top of IoT data streams for joins, filtering and other purposes in a fast manner. It is good to remember that DynamoDB is not built for complex query patterns, so it is at its best with its original key-value query pattern.

Day 17- adding master data

We upload our customer data into DynamoDB as so-called “items”, based on the list received from Santa.

{
"customer_id": {
"S": "AJUUUUIIIOS"
},
"category_list": {
"L": [
{
"S": "Local Businesses"
},
{
"S": "Restaurants"
}
]
},
"homepage_url": {
"S": "it would be here"
},
"founded_year": {
"N": "2021"
},
"contract": {
"S": "NOPE"
},
"country_code": {
"S": "FI"
},
"name": {
"S": ""
},
"market_stringset": {
"SS": [
"Health",
"Wellness"
]
}
}

Day 18- query one customer item

Amazon DynamoDB supports PartiQL, a SQL-compatible query language, to select, insert, update, and delete data in Amazon DynamoDB. That is something we will use to speed things up. Let's first query one customer's data.

SELECT * FROM "tbl_customer" where customer_id='AJUUUUIIIOS'

Day 18- update kids information

Using the same PartiQL, you can update an item with new attributes in one go.

UPDATE "tbl_customer" 
SET kids='2 kids and one dog' 
where customer_id='AJUUUUIIIOS'

Day 19- contains function

Now we can easily check from the marketing data who was interested in Health, using CONTAINS. Many modern database engines have native support for semi-structured data, including flexible-schema data types for loading semi-structured data without transformation. If you are not already familiar with them, please take a look at AWS Redshift and Snowflake.

SELECT * FROM "tbl_customer" where contains("market_stringset", 'Health')

Day 20- inserting a new customer

Using familiar SQL-like syntax, it's very straightforward to add one new item.

INSERT INTO "tbl_customer" value {'name' : 'name here','customer_id' : 'A784738H'}

Day 21- missing data

Using the special MISSING keyword, you can easily find items where some attribute is not present.

SELECT * FROM "tbl_customer" WHERE "kids" is MISSING

Day 22- export data into s3

With one command you can export data from DynamoDB to S3, so let's do that based on the documentation. AWS and others also have support for something called federated query, where you can run SQL queries across data stored in relational, non-relational, object, and custom data sources. We will cover this federated feature with you later.
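
As a sketch, the export can be triggered with boto3 roughly like below; the table ARN, account id and bucket name are placeholders, and point-in-time recovery must be enabled on the table.

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/tbl_customer",
    S3Bucket="santa-export-bucket",      # hypothetical bucket
    S3Prefix="data",                     # lands under the /data folder used below
    ExportFormat="DYNAMODB_JSON",
)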

Day 23- using S3 select feature

Now you have the data stored in an S3 bucket, under a folder called /data, so you can even use SQL (S3 Select) to query the data stored in S3. This will find the relevant customer_id information.

Select s.Item.customer_id from S3Object s

Day 24- s3 select to find right customer

You can even use customer Id to restrict data returned to you.

Select s.Item.customer_id from S3Object s where s.Item.customer_id.S ='AJUUUUIIIOS'

 

That's all. I hope you got a glimpse of how useful SQL is, even when you have different services where you might at first think it would never be possible to use the same language of choice. Please remember that some day you might be building a next-generation artificial intelligence and analytics platform with us, and knowing a few data modeling techniques and SQL is a very good start.

You might be interested in Industrial equipment data at scale for the factory floor, or in managing your fleet at scale, so let's keep a fresh mind and have a very nice week!

 


The Industrial Revolution 6.0

Strength of will, determination, perseverance, and acting rationally in the face of adversity

The Industrial Revolution

The European Commission has taken a very active role in defining Industry 5.0, which complements Industry 4.0 in the transformation towards a sustainable, human-centric and resilient European industry.

Industry 5.0 provides a vision of industry that aims beyond efficiency and productivity as the sole goals, and reinforces the role and the contribution of industry to society. https://ec.europa.eu/info/research-and-innovation/research-area/industrial-research-and-innovation/industry-50_en

Finnish industry has been affected by the pandemic, the fragmentation of global supply chains and dependency on suppliers all around the world. Finns have something called “sisu”. It's a Finnish term that can be roughly translated into English as strength of will, determination, perseverance, and acting rationally in the face of adversity. That might be one reason why a group of people in Finland is already defining Industry 6.0, and also one of the reasons we wanted to share our ideas in blog posts such as:

  1. Smart and Connected Factories
  2. Factory Floor and Edge computing
  3. Industrial data contextualization at scale
  4. AWS SageMaker Pipelines – Making MLOps easier for the Data Scientist
  5. Productivity and industrial user experience
  6. Cloud data transformation
  7. Illusion of real-time
  8. Manufacturing security hardening

It's not well defined where the boundaries of each industrial revolution really are. We can argue that Industry 1.0 started around 1760, when the transition to new manufacturing processes using water and steam was happening. From roughly 1870, the second industrial revolution, referred to as “The Technological Revolution”, followed; one component was superior electrical technology, which allowed for even greater production. Industry 3.0 introduced more automated systems onto the assembly line to perform human tasks, e.g. using Programmable Logic Controllers (PLCs).

Present 

The Fourth Industrial Revolution (Industry 4.0) will incorporate storage systems and production facilities that can autonomously exchange information. How any service or product is delivered and purchased will span two categories: physical and digital.

IoT has suffered a bit of inflation as a word, and the biggest hype cycles are behind us – which is a good thing. The Internet of Things (IoT) plays a very important role in enabling smart connected devices and extending the possibilities of cloud computing. Companies are already creating cyber-physical systems where machine learning (ML) is built into product-centered thinking. A few of these companies have a digital twin that serves as the real-time digital counterpart of a physical object or process.

In Finland, with its long history of factory, process and manufacturing companies, this is reality, and the bigger companies are targeting faster time to market, quality and efficiency. Rigid SAP processes combined with yearly budgets are no longer blocking future-looking products and services – we are past that time. There are great initiatives for sensor networks and edge computing for environmental analysis. Software-enabled intelligent products, new and better offerings based on real usage, and how to differentiate in the market are everyday business to many of us in the industrial domain.

Future

“When something is important enough, you do it even if the odds are not in your favor.” Elon Musk

World events have pushed industry to rethink how to build and grow business in a sustainable manner. Industry 5.0 is said to be the revolution in which man and machine reconcile and find ways to work together to improve the means and efficiency of production. Being on stage or watching your fellow colleagues, you can hear words like human-machine co-creative resilience, mass customization, sustainability and circular economy. Product complexity is increasing at the same time as customer expectations keep rising.

Industry 6.0 exists only in whitepapers, but that does not mean that “customer-driven virtualized antifragile manufacturing” could not become real some day. Hyperconnected factories and dynamic supply chains would most probably benefit all of us. Some refer to this industrial change in the same way as hyperscalers such as AWS sell cloud capacity. There are challenges for sure, like making “Lot Size One” economically feasible. One thing is certain: all models and things will merge, blur and converge.

 

Building the builders

“My biggest mistake is probably weighing too much on someone’s talent and not someone’s personality. I think it matters whether someone has a good heart.” – Elon Musk

One fact is that industrial life is not super interesting for millennials. It looks old-fashioned, so attracting future professionals is a must. The factory floor might not be as interesting as it was a few decades ago. Technology possibilities and cloud computing will help attract more, and more diverse, people to industrial solutions. A lot of ecosystems exist with little collaboration, and we think it's time to change that by reinventing business models and solutions and by onboarding more fresh-minded people into industrial solutions.

That is one reason we have packaged kickstarts for our customers, and anyone interested can grow with us.

 

 

 

 

Manufacturing security hardening

Securing IT/OT integration

 

Last time, my colleague Ripa and I discussed industrial UX and productivity. This time I focus on factory security, especially in situations where factories are connected to the cloud.

Historical habits 

As we know, for a long time manufacturing OT workloads were separated from IT workloads. Digitalization, IoT and edge computing have enabled IT/OT convergence and made it possible to take advantage of cloud services.

The security model at manufacturing factories has been based on isolation, where the OT workload could be isolated and even fully air-gapped from the company's other private clouds. I recommend you take a look at the Purdue model from the 1990s, which was and still is the basis for many factories, giving guidance on industrial communications and integration points. It was so popular and accepted that it became the basis for the ISA-95 standard (the triangle I drew in an earlier blog post).

Now with new possibilities with the adoption of cloud, IoT, digitalization and enhanced security we need to think: 

Is the Purdue model still valid, or is it just slowing down the move towards smart and connected factories?

Purdue model presentation aligned to industrial control system

 

Especially now that edge computing (manufacturing cloud) is becoming more sensible, we can process the data already at level 1 and send the data to the cloud using existing secured network topology. 

Does the Purdue model slow down new thinking? Should we have an industrial edge computing platform that can connect to all layers?

 

Well architected

Thinking about the technology stack from the factory floor up to AWS cloud data warehouses or visualizations – it is huge! It's not straightforward to apply all the relevant security principles at every level of your stack. It might even be that the stack has been developed over the last 20 years, so there will be legacy systems and technical debt, which slow down applying modern security principles.

In the following I summarize 4 main security principles you can use in hybrid manufacturing environments:

  • Is data secured in transit and at rest ? 

Use encryption and if possible enforce it. Use key and certificate management with scheduled rotation. Enforce access control to data, including backups and versions as well. For hardware, use Trusted Platform Module (TPM) to store keys and certificates.

  • Are all the communications secured ? 

Use TLS or IPsec to authenticate all network communication. Implement network segmentation to make networks smaller and tighten trust boundaries. Use industrial protocols like OPC-UA.

  • Is security taken in use in all layers ? 

Go through all layers of your stack and verify that you cover all layers with proper security control.

  • Do we have traceability ? 

Collect log and metric data from hardware and software, network, access requests and implement monitoring, alerting, and auditing for actions and changes to the environment in real time.

 

Secured data flow 

The following picture is a very simplified version of the Purdue model aligned to the manufacturing control hierarchy and adopting AWS cloud services. It focuses on how manufacturing machinery data can connect to the cloud securely. The most important thing to note from the picture is that the network traffic from on-prem to the cloud is private and encrypted. There is no reason to route this traffic through the public internet.

Purdue model aligned to manufacturing control hierarchy adopting AWS cloud

 

You can establish a secure connection between the factory and the AWS cloud by using AWS Direct Connect or AWS Site-to-Site VPN. In addition to this, I recommend using VPC endpoints so you can connect to AWS services without a public IP address. Many AWS services support VPC endpoints, including AWS IoT SiteWise and AWS IoT Core.

Manufacturing machinery sits on layers 0-2. Depending on the equipment trust levels, it's a good principle to divide the machinery into cells / subnetworks to tighten trust boundaries; machinery with different trust levels can be categorized into its own cells. Using industrial protocols like OPC-UA brings authentication and encryption capabilities near the machinery. I'm very excited about the possibility of server-initiated connections (reverse connect) in OPC-UA, which make it possible for clients to communicate with the server without opening inbound firewall ports.

As you can see from the picture, data is routed through all layers, and it looks like the IDMZ (Industrial Demilitarized Zone) and layers 4 and 5 are almost empty. As discussed earlier, when merely connecting machinery to the cloud via secure tunneling we could bypass some layers, but for other use cases the layers are still needed. If for some reason we need to route factory network traffic to the AWS Cloud through the public internet, we need a TLS proxy in the IDMZ to encrypt the traffic and protect the factory from DDoS (Distributed Denial of Service) attacks.

The edge computing unit on Layer 3 is an AWS Greengrass device, which ingests data from factory machinery, processes the data with ML and sends only the necessary data to the cloud. The unit can also communicate with and ingest data from Supervisory Control and Data Acquisition (SCADA), Distributed Control System (DCS) and other systems at manufacturing factories. AWS Greengrass uses x.509 certificate based authentication to the AWS cloud. The idea is that the private key never leaves the device and is protected and stored in the device's TPM module. All the certificates are stored in AWS IoT Core and can be integrated with a custom PKI. For storing your custom CAs (Certificate Authorities) you can use AWS ACM. I strongly recommend designing and building certificate lifecycle policies and enforcing certificate rotation to reach a good security level.

One great way of auditing your cloud IoT security configuration is with AWS IoT Device Defender. You can also analyse the factory traffic in real time, find anomalies and trigger security incidents automatically when needed.

 

Stay tuned

Security is our best friend; you don’t need to be afraid of it.

Build it into all layers, from bottom to top, at as early a phase as possible. AWS has the security capabilities to connect private networks to the cloud and to do edge computing and data ingestion in a secure way.

Stay tuned for the next posts, and check out our Connected Factory Kickstart if you haven’t yet:

https://www.solita.fi/en/solita-connected/

 

Illusion of real-time

Magic is the only honest profession. A magician promises to deceive you and he does.

Cloud data transformation

Tipi shared thoughts on how data assets could be utilized in the cloud. We had a few questions after the blog post, and one of those was “how to tackle real-time requirements?”

Let’s go real time ?

Real-time business intelligence is a concept describing the process of delivering business intelligence or information about business operations as they occur. Real time means near to zero latency and access to information whenever it is required.

We all remember those nightly batch loads and preprocessing of data – waiting a few hours before the data is ready for reports. Someone is checking whether the sales numbers have dropped, and the manager asks for quality reports from production. A report is evidence to some other team of what is happening in our business.

Let's go back to the definition that says “information whenever it is required”: for some teams even one week or one day can be real time. Business processes and humans are not software robots, so taking action based on any data will take more than a few milliseconds – so where is this real-time requirement coming from?

Marko had a nice article related to OT systems, Factory Floor and Edge computing. Any factory issue can be a major pain and downtime is not an option; he explained how most data assets, like metrics and logs, must be available immediately in order to recover and understand the root cause.

Hyperscalers and real time computing

In March 2005, Google acquired the web statistics analysis program Urchin, later known as Google Analytics. That was one of the customer-facing solutions gathering massive amounts of data. Industrial protocols like Modbus, from 1979, were designed to work in real time in their own time and era. Generally speaking, real-time computing has three categories:

  • Hard – missing a deadline is a total system failure.
  • Firm – infrequent deadline misses are tolerable, but may degrade the system’s quality of service. The usefulness of a result is zero after its deadline.
  • Soft – the usefulness of a result degrades after its deadline, thereby degrading the system’s quality of service.

So it’s easy to understand that an airplane turbine and a rolling 12-month sales forecast have different requirements.

What is the cost of (data) delay ?

“A small boat that sails the river is better than a large ship that sinks in the sea.”― Matshona Dhliwayo

We can simply estimate the value a specific feature would bring in after its launch and multiply this value by the time it will take to build. That tells us the economic impact that postponing a task will have.

High-performing teams can do cost-of-delay estimation to understand which task to take on first. Can we calculate and understand the cost of delayed data? How much will it cost your organization if a service or product must be postponed because you are missing data or cannot use it?
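
As a toy illustration with made-up numbers, the arithmetic is simply the value per unit of time multiplied by the delay:

# Hypothetical figures: a data product expected to bring in 20 000 € per month
# is postponed by three months because the data it needs is not usable.
value_per_month = 20_000
months_delayed = 3

cost_of_delay = value_per_month * months_delayed
print(f"Cost of delay: {cost_of_delay} €")   # -> Cost of delay: 60000 €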

Start defining real-time

You can easily start by discussing what kind of data is needed to improve the customer experience. Real-time requirements might be different for each use case, and that is totally fine. It's good practice to specify near-real-time requirements in factual numbers with a few examples. It's also good to remember that “end to end” can have totally different meanings. When working with OT systems, for example, the term First Mile is used for protecting and connecting OT systems with IT.

Any equipment failure must be visible to technicians at site in less than 60 seconds. ― Customer requirement

Understand team topologies

An incorrect team topology can block any near-real-time use case. Getting each component and team deliverable to work together might end up causing unexpected data delays. Or, in the worst-case scenario, a team is built so tightly around one product or feature that it becomes a bottleneck later when building new services.

Data as a Product refers to the idea that the job of the data team is to provide the data that the company needs. A Data as a Service team partners with stakeholders, has more functional experience and is responsible for providing insight as opposed to rows and columns. Data Mesh is about the logical and physical interconnections of the data from producers through to consumers.

Team topologies have a huge impact on how data-driven services are built and on whether data lands for business purposes at just the right time.

Enable Edge streaming and APIs capabilities

In the cloud, services like AWS Kinesis are great: Kinesis is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second. Apache Kafka is a framework implementation of a software bus using stream processing. Apache Spark is an open-source unified analytics engine for large-scale data processing.
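
To make the streaming idea concrete, here is a minimal Kafka producer/consumer sketch using the kafka-python client; the broker address and topic name are placeholders.

from kafka import KafkaConsumer, KafkaProducer

# Producer side: push factory floor events onto a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("factory-events", b'{"line": 1, "temperature": 21.5}')
producer.flush()

# Consumer side: read the same events, typically in another service.
consumer = KafkaConsumer("factory-events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)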

I am sure you are already familiar with at least one of these. In order to control the data flow we have two parameters: the number of messages and time. Whichever limit is reached first will be served.

 Is your data solution idempotent and able to handle data delays ? ― Customer requirement

Modern purpose-built databases have the capability to process streaming data. Any extra layer of data modeling adds a delay to data consumption. At the edge we typically run purpose-built, robust database services in order to capture all factory floor events with industry-standard data models.

Site and cloud APIs are a contract between different parties and will improve connectivity and collaboration. API calls at the edge work nicely, and you can have data available in roughly 70-300 ms from a cloud endpoint (example below). The same data is available on the edge endpoint, where the client response is even faster, so building factory floor applications is easy.

curl --location --request GET 'https://data.iotsitewise.eu-west-1.amazonaws.com/properties/history?assetId=aa&maxResults=1&propertyId=pp' --header 'X-Amz-Date: 20211118T152104Z' --header 'Authorization: AWS4-HMAC-SHA256 Credential=xxx, SignedHeaders=host;x-amz-date, Signature=xxxx'

Quite a few databases have a built-in data API. It's still good to remember that the underlying engine, the data model and many other factors determine how scalable the solution really is.

AWS Greengrass stream manager is a component that enables you to process data streams and transfer them to the AWS Cloud from Greengrass core devices. Other services like Firehose are supported using the specific aws.greengrass.KinesisFirehose component. These components also support building Machine Learning (ML) features at the edge.

 

Conclusion

The business case will define the real-time requirement. Build your near-real-time capabilities into a future-proof architecture – adding real-time capabilities later might be almost impossible.

If the business case is not clear enough, what should you do? Maybe have a cup of tea, relax and read the blog post from Johannes, The gap between design thinking and business impact.

You might be interested in our kickstarts Accelerate cloud data transformation and Industrial equipment data at scale.

Let’s stay fresh-minded !

 

A complete list of new features introduced at the Tableau Conference 2021

The Tableau Conference 2021 is over and yet again it was a lot of fun with all the not-so-serious music performances, great informative sessions, excellent Iron Viz competition, and of course demonstrations of many new features coming in the future releases. In general my first thoughts about the new capabilities revealed in TC21 are very positive. Obviously some of the details are still a bit blurry but the overall topics seem to be in a good balance: There are very interesting improvements coming for visual analytics, data management and content consumption in different channels, but in my opinion the most interesting area was augmented analytics and capabilities for citizen data scientists.

It’s been 2 years since Salesforce announced the acquisition of Tableau. After acquisitions and mergers, it’s always interesting to see how they affect the product roadmap and development. Now I really feel the pace for Tableau is getting faster and the scope is getting more extensive. Tableau is not only fine-tuning the current offering, but creating a more comprehensive analytics platform with autoML, easier collaboration & embedding, and action triggers that extend beyond Tableau.

Note: All the pictures are created using screenshots from the TC21 Devs on Stage and TC21 Opening Keynote sessions. You can watch the sessions at any time on Tableau site.

Update: Read our latest overview of the Tableau product roadmap based on TC22 and TC21.

The Basics – Workbook Authoring

Let’s dive into workbook authoring first. It is still the core of Tableau, and I’m very pleased to see there is still room for improvement. For workbook authoring, the biggest announcement was visualization extensions. This means you can more easily develop and use new custom visualization types (for example sunburst and flower). The feature makes it possible to adjust visualization details with the mark designer and to share these custom visualizations with others. Another very nice feature was dynamic dashboard layouts: you can use parameters and field values to dynamically toggle the visibility of dashboard components (visualizations and containers). This gives so much more power to flexibly show and hide visualizations on the dashboard.

There is also a redesigned UI to view underlying data with options to select the desired columns, reorder columns and sort data, export data etc. For map analysis the possibility to use data from multiple data sources in spatial layers is a very nice feature. Using workbook optimizer you can view tips to improve performance when publishing the workbook. In general it also seems the full web authoring for both data source and visualization authoring isn’t very far away anymore.

  • Visualization Extensions (2022 H2): Custom mark types, mark designer to fine tune the visualization details, share custom viz types.
  • Dynamic Dashboard Layouts (2022 H1): Use parameters & field values to show/hide layout containers and visualizations.
  • Multi Data Source Spatial Layers (2021.4): Use data from different data sources in different layers of a single map visualization.
  • Redesigned View Data (2022 H1): View/hide columns, reorder columns, sort data, etc.
  • Workbook Optimizer (2021.4): Suggest performance improvements when publishing a workbook.
Visualization Extensions. Create more complex visualizations (like sunburst) with ease.

Augmented Analytics & Citizen Data Science

This topic has been in Gartner’s hype cycle for some time. In Tableau we have already seen the first capabilities related to augmented analytics and autoML, but this area is really getting a lot more power in the future. Data change radar will automatically detect new outliers or anomalies in the data, and alert and visualize those to the user. Users can then apply the explain data feature to automatically get insights and explanations about the data: what has happened and why. The Explain the Viz feature will explain not only a single data point but the whole visualization or dashboard and show descriptive information about the data. All this happens automatically behind the scenes, and it can really speed up the analysis to get these insights out of the box. There were also a bunch of smaller improvements in the Ask Data feature, for example to adjust its behavior and to embed the Ask Data functionality.

One of the biggest new upcoming features was the possibility to create and deploy predictive models within Tableau with Tableau Model Builder. This means citizen data scientists can create autoML type of predictive models and deploy those inside Tableau to get new insights about the data.  The user interface for this seemed to be a lot like Tableau Prep. Another very interesting feature was Scenario Planning, which is currently under development in Tableau Labs. This feature gives the possibility to view how changes in certain variables would affect defined target variables and compare different scenarios to each other. Another use case for scenarios would be finding different ways to achieve a certain target. For me the scenario planning seemed to be a bit disconnected from the core capabilities of Tableau, but it is under development and for sure there could be some very nice use cases for this type of functionality.

  • Data Change Radar (2022 H1): Alert and show details about meaningful data changes, detect new outliers or anomalies, alert and explain these.
  • Explain the Viz (2022 H2): Show outliers and anomalies in the data, explain changes, explain mark etc.
  • Multiple Smaller Improvements in Ask Data (2022 H1): Contact Lens author, Personal pinning, Phrase builder, Lens lineage in Catalog, Embed Ask Data.
  • Tableau Model Builder: Use autoML to build and deploy predictive models within Tableau.
  • Scenario Planning: View how changes in certain variables affect target variables and how certain targets could be achieved.
Explain Data side pane with data changes and explain change drill down path.

Collaborate, embed and act

The Tableau Slack integration is getting better and more versatile. With the 2021.4 version you can use the Tableau search, Explain Data and Ask Data features directly in Slack. As it was said in the event: “it’s like having data as your Slack member”. In the future, Tableau Prep notifications can also be viewed via Slack. It was also suggested that similar integration will later be possible with, for example, MS Teams.

There were many new capabilities related to embedding content in external services. With the Connected Apps feature, admins can define trusted applications (a secure handshake) to make embedding easier. Tableau Broadcast can be used in Tableau Online to share content via external public-facing sites for everyone (for unauthenticated users). There was also a mention of 3rd-party identity and access provider support, which was not very precise, but in my opinion it suggests it will be easier to leverage identity and access management from outside Tableau. Embeddable web authoring makes it possible to create and edit content directly within the service where the content is embedded, so there is no need to use Tableau Desktop.

One big announcement was the Tableau Actions. Tableau dashboards already have great actions to create interactions between the user and the data, but this is something more. With Tableau Actions you can trigger actions outside Tableau directly from a dashboard. You could for example trigger Salesforce Flow tasks by clicking a button in the dashboard. And in the future also other workflow engines will be supported. This will provide much more powerful interactivity options for the user.

  • Tableau search, Explain Data and Ask Data in Slack (2021.4)
  • Tableau Prep notifications in Slack (2022 H1)
  • Connected Apps (2021.4): More easily embed to external apps, create secure handshake between Tableau and other apps.
  • Tableau Broadcast (2022 H2): Share content via an external public-facing site to give access to unauthenticated users, Tableau Online only.
  • 3rd party Identity & Access Providers: Better capabilities to manage users externally outside Tableau.
  • Embeddable Web Authoring: No need for desktop when creating & editing embedded contents, full embedded visual analytics.
  • Embeddable Ask Data 
  • Tableau Actions: Trigger actions outside Tableau, for example Salesforce Flow actions, later on support for other workflow engines.
Creating new Tableau Action to trigger Salesforce Flow to escalate case.

Data management & data preparation

Virtual Connections have already been introduced earlier, and they seem to be very powerful functionality for centrally managing data connections and creating centralized row-level security rules. These functionalities, and possible new features built around them, can really boost end-to-end self-service analytics in the future. The only downside is that this is part of the Data Management add-on. Data Catalog Integration will bring the possibility to sync metadata from external data catalog services, like Collibra and Alation.

Related to data preparation, there will be new Tableau Prep Extensions, so you can add more power to Prep workflows as custom steps. These new steps can be, for example, sentiment analysis, geocoding, feature engineering, etc. Other new functionality in Tableau Prep is the possibility to use parameters in Prep workflows. It was also said that in the future you can use Tableau Public to publish and share Tableau Prep flows. This might mean there is also a Public version coming for Tableau Prep. It wasn’t mentioned in the event, but it would be great.

  • Virtual Connections (2021.4): Centrally managed and reusable access points to source data with single point to define security policy and data standards.
  • Centralized row level security (2021.4): Centralized RLS and data management for virtual connections.
  • Data Catalog Integration: Sync external metadata to Tableau (from Collibra, Alation, & Informatica).
  • Tableau Prep Extensions: Leverage and build extension for Tableau Prep (sentiment analysis, OCR, geocoding, feature engineering etc.).
  • Parameters in Tableau Prep (2021.4): Leverage parameters in Tableau Prep workflows.
Content of a virtual connection and related security policies.

Server Management

Even though SaaS options like Tableau Online are getting more popular all the time, there was still a bunch of new Tableau Server specific features. New improved resource monitoring capabilities as well as time-stamped log file zip generation were mentioned. Backgrounder resource limits can cap the resources consumed by backgrounder processes, and auto-scaling for backgrounders in containerized deployments can help the environment adjust to different workloads at different times of the day.

  • Resource Monitoring Improvements (2022 H1): Show view load requests, establish new baseline etc.
  • Time Stamped log Zips (2021.4)
  • Backgrounder resource limits (2022 H1): Set limits for backgrounder resource consumption.
  • Auto-scaling for backgrounder (2022 H1): Set backgrounder auto-scaling for container deployments.

Tableau Ecosystem & Tableau Public

Tableau is building Tableau Public to better serve the data family in different ways. There is already a possibility to create visualizations in Tableau Public using the web edit. There is also a redesigned search and a better general user interface to structure and view content as channels. Tableau Public will also get Slack integration and more data connectors, for example to Dropbox and OneDrive. As already mentioned, Tableau Prep flows can be published to Tableau Public in the future, and that might also mean a release of Tableau Prep Public, who knows.

In the keynote it was also mentioned that Tableau Exchange will contain all the different kinds of extensions, connectors, datasets and accelerators in the future. The other content types are already there, but the datasets will be a very interesting addition. This would mean companies could publish, use and possibly sell and buy analysis-ready data content. The accelerators are dashboard starters for certain use cases or source data.

  • Tableau Public Slack Integration (2022 H1)
  • More connectors to Tableau Public (2022 H1): Box, Dropbox, OneDrive.
  • Publish Prep flows to Tableau Public: Will there be a Public version for Tableau Prep?
  • Tableau Public custom Channels (2022 H1):  Custom channels around certain topics.
  • Tableau exchange: Search and leverage shared extensions, connectors, datasets and accelerators.
  • Accelerators: Dashboard starters for certain use cases and source data (e.g. call center analysis, Marketo data, Salesforce data etc.).

Want to read or hear more?

If you are looking for more info about Tableau read our blog post: Tableau – a pioneer of modern self-service business intelligence.

More info about the upcoming features can be found on the Tableau Coming Soon page.

You can also read about our visual analytics services and contact us to hear more or to see a comprehensive end-to-end Tableau demo.

Thanks for reading!

Tero Honko, Senior Data Consultant
tero.honko@solita.fi
Phone +358 40 5878359

Accelerate cloud data transformation

Cloud data transformation

Data silos and unpredictable costs preventing innovation

Cloud database race?

One of the first cloud services was S3, launched in 2006. Amazon SimpleDB was released in 2007, and since then there have been many nice cloud database products from multiple cloud hyperscalers. Database as a service (DBaaS) has been a prominent offering for customers looking for scaling, simplicity and the advantages of the ecosystem. The cloud database and DBaaS market was estimated at USD 12,540 million by 2020, so no wonder there is a lot of activity. From a customer’s point of view this is excellent news: the cloud database service race is on, new features are popping up and at the same time usage costs are getting lower. I cannot remember a time when creating a global solution backed by a database would have been as cost efficient as it is now.

 

Why should I move data assets to the Cloud?

There are a few obvious reasons, like rapid setup, cost efficiency, scalable solutions and integration with other cloud services. Cloud services also bring stronger security enforcement in many cases, replacing the old-school username-and-password approach that some on-premises systems still rely on.

 

“No need to maintain private data centers”, “No need to guess capacity”

 

Cloud computing, unlike a typical on-premises setup, is distributed by nature, so computing and storage are separated. Data replication to other regions is supported out of the box in many solutions, so data can be stored as close as possible to end users for a best-in-class user experience.

In the last few years, more and more database services can work seamlessly across on-premises and cloud environments. Almost all data-related cases have aspects of machine learning nowadays, and the cloud empowers teams to enable machine learning in several different ways: built into database services, as purpose-built services, or through native integrations. Using the same development environment and industry-standard SQL, you can handle all the ML phases easily. Database-integrated AutoML aims to empower developers to create sophisticated ML models without having to deal with every phase of ML, which is a great opportunity for any citizen data scientist!

 

Purpose-built databases to support diverse data models

The beauty of the cloud comes from flexibility and a pay-as-you-go model with near-real-time cost monitoring. You can cherry-pick the best purpose-built database (relational, key-value, document, in-memory, graph, time series, wide column, or ledger) to suit your use case and data models, and avoid building one big monolithic solution.

Snowflake is one of the few enterprise-ready cloud data warehouses that brings simplicity without sacrificing features, and it can be operated on any major cloud platform. Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. Amazon Timestream is a nice option for serverless, very fast time series processing and near-real-time solutions. You might have a Hadoop system or be running a non-scalable relational database on premises and wonder how to get started on the journey towards improved customer experience and digital services.
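As a small illustration of how lightweight serverless time-series ingestion can be, here is a hedged sketch using the boto3 Timestream Write API; the region, database, table, dimension and measure values are all made up for illustration.

```python
import time
import boto3

# Timestream Write client (region and resource names are placeholders).
write_client = boto3.client("timestream-write", region_name="eu-west-1")

# Write a single measurement; in practice you would batch records.
write_client.write_records(
    DatabaseName="factory",
    TableName="sensor_measurements",
    Records=[
        {
            "Dimensions": [{"Name": "machine_id", "Value": "press-07"}],
            "MeasureName": "temperature",
            "MeasureValue": "72.4",
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),  # milliseconds since epoch
        }
    ],
)
```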

Success for your cloud data migration

We have worked with our customers to build a data migration strategy. That helps in understanding the migration options, creating a plan and also validating a future-proof architecture.

Here are a few tips that might help you when planning data migrations.

  1. Employee experience: embrace your team and new possibilities, and replace a purely technical approach with one that includes commitment from your developers. Domain knowledge of data assets and applications is very important, as is building trust in the new solutions from day one.
  2. Challenge your partner of choice. There are more options than lift-and-shift or building everything from scratch. It might be that not all data assets are needed or useful anymore. Our team works with a vertical slicing approach where the elephant is split into manageable pieces. Using state-of-the-art accelerator solutions, we can make an inventory based on real-life metrics. Let’s make sure you can avoid a big bang and that current systems can operate without impact even while the new systems are being built.
  3. Bad design and technical debt of legacy systems. It is very typical that old systems’ performance and design are already broken. That is not visible to all stakeholders, and during the first cloud transformation all of it will pop up. Prepare yourself for surprises and take them as an opportunity to build a more robust architecture. Do not try to fix all problems at once!
  4. Automation to the bone. To be able to experiment and replay data, make sure everything is fully automated, including the database, data loading and integrations. This way, making a change is fun and not something to be afraid of. It is very hard to build DataOps on top of on-premises systems because of the nature of operating models, contracts and hardware limitations. In the cloud, those are no longer blockers.
  5. Define workloads and scope (not only low-hanging fruit). Taking one database and moving it to the cloud cannot be used as a baseline when you have hundreds of databases. Metrics from the first one should not simply be multiplied by the number of databases when estimating the whole project scope. Take a variety of different workloads and solutions into the first sprint, even some of the hard ones. It is better to start immediately and not wait for target systems, because in the cloud that waiting is totally redundant.
  6. Welcome ops model improvements. In the cloud, database performance metrics (and any other kind) and audit trails are all visible, so creating a more proactive and risk-free ops model is at your fingertips. My advice is not to copy the existing ops model and its current SLA as they are. High availability and disaster recovery are different things, so do not mix them up.
  7. Going for a metadata-driven DW. In some cases, choosing a state-of-the-art automated warehouse like Solita Agile Data Engine (ADE) will boost your business goals when you are ready to take the next step.

 

Let’s kick off the cloud data transformation!

Take advantage of the cloud to build digital services faster and with less money with our Accelerate cloud data transformation kickstart.

You might also be interested in Migrating to the cloud isn’t difficult, but how to do it right?

Productivity and industrial user experience

A digital employee is not a software robot

 

The last post was about data contextualisation, and today, in this video blog post, we talk about the importance of user experience in an industrial environment.

UX versus employee experience

User Experience (UX) design is the process design teams use to create products that provide meaningful and relevant experiences to users. 

Employee experience is a worker’s perceptions about his or her journey through all the touchpoints at a particular company, starting with job candidacy through to the exit from the company. 

Using modern, digital tools and platforms can support employee experience and create a competitive advantage. Especially when working with factory systems and remote locations, it is important to maintain good productivity, and one option is cloud-based manufacturing.

Stay tuned for more and check our Connected Factory kickstart:

https://www.solita.fi/en/solita-connected/

AWS SageMaker Pipelines – Making MLOps easier for the Data Scientist

SageMaker Pipelines is a machine learning pipeline creation SDK designed to make deploying machine learning models to production fast and easy. I recently got to use the service in an edge ML project, and here are my thoughts about its pros and cons. (For more about the project, refer to the Solita data blog series about IIoT and connected factories: https://data.solita.fi/factory-floor-and-edge-computing/)

Example pipeline

Why do we need MLOps?

First there were statistics, and then came the emperor’s new clothes: machine learning, a rebranding of old methods accompanied by new ones. Fast forward to today and we are constantly talking about this thing called “AI”. The hype is real, and it is palpable because of products like Siri and Amazon Alexa.

But from a Data Scientist point of view, what does it take to develop such a model? Or even a simpler model, say a binary classifier? The amount of work is quite large, and this is only the tip of the iceberg. How much more work is needed to put that model into the continuous development and delivery cycle?

For a Data Scientist, it can be hard to visualize what kind of systems you need to automate everything your model needs to perform its task. Data ETL, feature engineering, model training, inference, hyperparameter optimization, performance monitoring etc. Sounds like a lot to automate?

(Hidden technical debt in machine learning https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf)

 

This is where MLOps comes into the picture, bridging DevOps CI/CD practices into the data science world and bringing in some new aspects as well. You can find more information about MLOps in previous Solita content, such as https://www.solita.fi/en/events/webinar-what-is-mlops-and-how-to-benefit-from-it/

Building an MLOps infrastructure is one thing but learning to use it fluently is also a task of its own. For a Data Scientist at the beginning of his/her career, it could seem too much to learn how to use cloud infrastructure as well as learn how to develop Python code that is “production” ready. A Jupyter notebook outputting predictions to a CSV file simply isn’t enough at this stage of the machine learning revolution.

(The “first” standard on MLOps, Uber Michelangelo Platform https://eng.uber.com/michelangelo-machine-learning-platform/)

 

A Jupyter notebook outputting predictions to a CSV file simply isn’t enough at this stage of the machine learning revolution.

Usually, companies that have a long track record of Data Science projects have a few DevOps and Data Engineer/Machine Learning Engineer roles working closely with their Data Science teams to distribute the different tasks of production machine learning deployment. Maybe they have even built the tooling and the infrastructure needed to deploy models into production more easily. But there are still quite a few Data Science teams and data-driven companies figuring out how to do this MLOps thing.

Why should you try SageMaker Pipelines?

AWS is the biggest cloud provider ATM so it has all the tooling imaginable that you’d need to build a system like this. They are also heavily invested in Data Science with their SageMaker product and new features are popping up constantly. The problem so far has been that there are perhaps too many different ways of building a system like this.

AWS tries to tackle some of the problems with the technical debt involved in production machine learning with their SageMaker Pipelines product. I’ve recently been involved in a project building and deploying an MLOps pipeline for edge devices using SageMaker Pipelines, and I’ll try to provide some insight into why it is good and what is lacking compared to a completely custom-built MLOps pipeline.

The SageMaker Pipelines approach is an ambitious one. What if, as a Data Scientist, instead of having to learn to use this complex cloud infrastructure, you could deploy to production just by learning how to use a single Python SDK (https://github.com/aws/sagemaker-python-sdk)? You don’t even need the AWS cloud to get started; it also runs locally (to a point).

SageMaker Pipelines aims at making MLOps easy for Data Scientists. You can define your whole MLOps pipeline in, for example, a Jupyter notebook and automate the whole process. There are a lot of prebuilt containers for data engineering, model training and model monitoring that have been custom-built for AWS. If these are not enough, you can use your own containers, enabling you to do anything that is not supported out of the box. There are also a couple of very niche features like out-of-network training, where your model is trained in an instance that has no access to the internet, mitigating the risk of somebody from the outside trying to influence your model training with, for example, altered training data.
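To give a feel for the SDK, here is a minimal sketch of a two-step pipeline (a prebuilt scikit-learn processing container followed by a training step). This is not the pipeline from the project; the role ARN, S3 paths, script name and training image are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Feature engineering with a prebuilt scikit-learn container.
processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
)
preprocess = ProcessingStep(
    name="Preprocess",
    processor=processor,
    code="preprocess.py",  # your own script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
)

# Training step using a built-in or custom training image.
estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",
    sagemaker_session=session,
)
train = TrainingStep(
    name="Train",
    estimator=estimator,
    inputs={"train": TrainingInput(
        preprocess.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri)},
)

pipeline = Pipeline(name="demo-pipeline", steps=[preprocess, train],
                    sagemaker_session=session)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # run it
```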

You can version your models via the model registry. If you have multiple use cases for the same model architecture, with the differences being in the datasets used for training, it is easy to select the suitable version from the SageMaker UI or the Python SDK and refactor the pipeline to suit your needs. With this approach, the aim is that each MLOps pipeline has a lot of components that are reusable in the next project. This enables faster development cycles and reduces the time to production.
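Continuing the hypothetical sketch above, registering the trained model into a model package group could look roughly like this; the group name, content types and instance types are assumptions.

```python
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.pipeline import Pipeline

# Register the trained model into a (hypothetical) model package group so
# versions can be compared and approved before deployment.
register = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="demo-model-group",      # placeholder
    approval_status="PendingManualApproval",
)

# Add the step to the pipeline definition alongside the earlier steps.
pipeline = Pipeline(name="demo-pipeline",
                    steps=[preprocess, train, register],
                    sagemaker_session=session)
```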

SageMaker Pipelines automatically logs every step of the workflow, from training instance sizes to model hyperparameters. You can seamlessly deploy your model to a SageMaker endpoint (a separate service), and after deployment you can also automatically monitor your model for concept drift in the data or, for example, latencies in your API. You can even deploy multiple versions of your models and do A/B testing to select which one is proving to be the best.
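Outside the pipeline definition itself, deploying a trained model to a real-time endpoint with data capture enabled for monitoring might look roughly like this. This is a sketch reusing the role and session from above; the inference image, model artifact path, endpoint name and capture bucket are placeholders.

```python
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

model = Model(
    image_uri="<inference-image-uri>",                # placeholder
    model_data="s3://my-bucket/models/model.tar.gz",  # placeholder
    role=role,
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,        # instance count is fixed per deployment
    instance_type="ml.m5.large",
    endpoint_name="demo-endpoint",
    # Capture request/response payloads to S3 for model monitoring.
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,
        destination_s3_uri="s3://my-bucket/datacapture/",
    ),
)
```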

And if you want to deploy your model to the edge, be it a fleet of Raspberry Pi 4s or something else, SageMaker provides tooling for that too, and it integrates seamlessly with Pipelines.

You can recompile your models for a specific device type using SageMaker Neo compilation jobs (basically, if you are deploying to an ARM or similar device, you need to do certain conversions for everything to work as it should) and deploy to your fleet using SageMaker fleet management.
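As a rough illustration of the compilation part, a Neo compilation job can be started through boto3 along these lines; the job name, role, S3 locations, framework, input shape and target device are all assumptions and need to match your actual model and hardware.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="demo-neo-compile",  # placeholder
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/models/model.tar.gz",
        "DataInputConfig": '{"input": [1, 3, 224, 224]}',  # your model's input shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "rasp3b",  # pick the value matching your edge hardware
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```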

Considerations before choosing SageMaker Pipelines

By combining all of these features into a single service, usable through an SDK and a UI, Amazon has managed to automate a lot of the CI/CD work needed for deploying machine learning models into production at scale with agile project development methodologies. You can also leverage the other SageMaker products, for example Feature Store, or related services like Amazon Forecast, if you happen to need them. If you’re already invested in AWS, you should give this a try.

While it is a great product to get started with machine learning pipelines, it isn’t without its flaws. It is quite capable in batch learning settings, but there is no support as of yet for streaming/online learning tasks.

And for the so-called Citizen Data Scientist, this is not the right product since you need to be somewhat fluent in Python. Citizen Data Scientists are better off with BI products like Tableau or Qlik (which use SageMaker Autopilot as their backend for ML) or perhaps with products like DataRobot. 

And in a time when software products need high availability and face heavy usage, the SageMaker Endpoints model API deployment scenario, where you have to pre-decide the number of machines serving your model, isn’t quite enough.

In e-commerce applications, you could run into situations where your API receives so much traffic that it cannot handle all the requests because you did not select a big enough cluster to serve the model. The only way to increase the cluster size in SageMaker Pipelines is to redeploy a new revision with a bigger cluster. It is pretty much a no-brainer to use a Kubernetes cluster with horizontal scaling if you want to be able to serve your model as the traffic to the API keeps increasing.

Overall it is a very nicely packaged product with a lot of good features. The problem with MLOps in AWS has been that there are too many ways of doing the same thing, and SageMaker Pipelines is an effort to streamline and package those different methodologies together for machine learning pipeline creation.

It’s a great fit if you work with batch learning models and want to create machine learning pipelines really fast. If you’re working with online learning or reinforcement learning models, you’ll need a custom solution. And if you are adamant that you need autoscaling, then you need to do the API deployments yourself; SageMaker endpoints aren’t quite there yet. For a reference to a “complete” architecture, see the AWS blog https://aws.amazon.com/blogs/machine-learning/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected/

 


Industrial data contextualization at scale

Shaping the future of your data culture with contextualization

 

My colleague and good friend Marko had interesting thoughts on smart and connected factories and how to get data out of complex factory floor systems and enable machine learning capabilities on the edge and in the cloud. In this blog post I will open up data modeling a bit more and how to overcome a few typical pitfalls that are not always only data related.

Creating super powers

Research and development (R&D) includes the activities that companies undertake to innovate and introduce new products and services. In many cases, if the company is big enough, R&D is separate from other units, and in some cases R is separated from D as well. We could call this separation of concerns, so every unit can focus 100% on its goals.

What separates R&D from a business unit? Let’s first pause and think about what the business is doing. A business unit is an organizational structure, such as a department or team, that produces revenue and is responsible for costs. Perfect, so now we have company-wide functions (R&D and business) to support being innovative and producing revenue.

Hmmm, something is still missing: how do we scale digital solutions in a cost-efficient way so that profit (row 80) stays in good shape? The term information technology (IT) was first used way back in 1978. The Merriam-Webster Dictionary defines information technology as “the technology involving the development, maintenance, and use of computer systems, software, and networks for the processing and distribution of data.” One of IT’s functions is to provide services cost efficiently on a global scale.

Combining these superpowers (business, R&D and IT), we should produce revenue, be innovative and have the latest IT systems up and running to support company goals. In real life this is much more complex; welcome to the era of data-driven products and services.

 

Understanding your organizational structure

To be data driven, the first thing is to actually look around and see at which maturity level your team and company are. There are many models to choose from: functional, divisional, matrix, team, and networking. Organizational structure can easily become a blocker to getting new ideas to market quickly enough. Quite often Conway’s law kicks in and software or automated systems end up “shaped like” the organizational structure they are designed in or designed for.

One example of Conway’s law in action, identified back in 1999 by UX expert Nigel Bevan, is corporate website design: companies tend to create websites with structure and content that mirror the company’s internal concerns.

When you look at your car dashboard, a company website or the circuit board of an embedded system, you can quite often see Conway’s law in action. Feature teams, tribes, platform teams, enabler teams or component teams: I am sure you have at least one of these to somehow try to tackle the problem of how an organization can produce good enough products and services to market on time. Calling the same thing a squad will not solve the core issue, and neither will copying a top-down model from Netflix to your industrial landscape.

 

Why does data contextualization matter?

Based on the facts mentioned above, creating industrial data-driven services is not easy. Imagine you push a product out to the market that is not able to gather usage data. Another team is building a subscription-based service for the same customers. Maybe someone has already started selling it to customers. This will not work, because now we have a product out there and are not able to invoice customers based on usage. Refactoring of organizations, code and platforms is needed to accomplish common goals together. A new data platform as such does not automatically improve the speed of development or make customers more engaged.

Contextualization means adding related information to any data in order to make it more useful. That does not mean a data lake, our new CRM or MES. Industrial data is not just another data source on slides; creating contextual data enables different parties, such as business and IT, to speak the same language.

A great solution will help you better understand what you have and how things work. It is like a car you have never driven, yet you still feel this is exactly how it should be, even if it is nothing like your old vehicle. Industrial data assets are modeled in a certain way, and that enables common data models from floor to cloud, enabling scalable machine learning without constantly varying data schema changes.

Our industrial AWS SiteWise data models, for example, are 100% compatible out of the box with modern data warehousing platforms like Solita Agile Data Engine. General blueprints of data models have failed in this industry many times, so please always look at your use case bottom-up as well, and not only the other way round.

Curiosity and an open mind

I have been working with data for the last 20 years, and in the industrial landscape for half of that time. Now it is great to see how Nordic companies are embracing company culture change, talking about competence-based organizations, asking consultants for more than just a pair of hands and creating teams with superpowers.

How to get started on data contextualization?

  1. Gather your team and check how much time it takes to get one idea to the customer (production). Does your current organization model support it?
  2. Look at models and approaches that you might find useful, like an intro to data mesh or a deep dive into the new paradigm you might want to mess with (and remember that what works for someone else might not be perfect for you).
  3. We can help with AWS IoT SiteWise for data contextualization. That service is used to create virtual representations of your industrial operations with AWS IoT SiteWise asset models and assets (see the sketch after this list).
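As a small, hypothetical illustration of what such a virtual representation looks like in code, an asset model with a single measurement property could be created with boto3 roughly as follows; the model name, property and unit are made up.

```python
import boto3

sitewise = boto3.client("iotsitewise")

# Create a simple asset model with one measurement property.
# The names and unit here are illustrative only.
response = sitewise.create_asset_model(
    assetModelName="ConveyorLine",
    assetModelDescription="Virtual representation of a conveyor line",
    assetModelProperties=[
        {
            "name": "MotorTemperature",
            "dataType": "DOUBLE",
            "unit": "Celsius",
            "type": {"measurement": {}},
        }
    ],
)
print(response["assetModelId"])

# Individual assets are then instantiated from the model with create_asset(...).
```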

I have been working on all major cloud platforms, currently focusing on AWS. Stay tuned for the next blog post explaining how SiteWise is used for data contextualization. Let’s keep in touch and stay fresh-minded.

Our Industrial data contextualization at scale Kickstart