SQL Santa for Factory

Awesome SQL Is coming To Town

We have a miniseries coming before Christmas where we talk S-Q-L, /ˈsiːkwəl/ “sequel”. Yes, the 47-year-old domain-specific language used in programming and designed for managing data. It’s very nice to see how old faithful SQL is going stronger than ever for stream processing as well as for its original relational database management purposes.

What is data, then, and how should it be used? Take a look at the article written in Finnish, “Data ei ole öljyä, se on lantaa” (“Data is not oil, it is manure”).

We will show you how to query and manipulate data across different solutions using the same SQL programming language.

 

The Solita Developer survey has become a tradition here at Solita, so please check out the latest survey. It’s easy to see how SQL is dominating in a pool of many cool programming languages. It might take an average learner about two to three weeks to learn the basic concepts of SQL, and this is exactly what we will do with you.

Business case

At Christmas time our business is about gifts and how to deliver them on time. Our model is a bit simplified and includes operational technology systems such as asset (Santa’s workshop) and fleet (sleigh) operations. There is always something broken, so a few maintenance tasks are pushed to technicians (elves). A distributed data platform is used for supply chain and logistics analytics to remove bottlenecks, so that the business owners (Santa Claus and the team) are satisfied and all gifts are delivered to the right address just in time.

Case Santa’s workshop

We can later use OEE (Overall Equipment Effectiveness) to verify that workshop performance stays within limits in order to produce high-quality gifts. Data is ingested in real time and contextualized, and every once in a while Santa and the team check how we are doing.

In this specific case we know that, using Athena, we can find the relevant production line data simply by querying the S3 bucket where all the raw data is already stored.

Day 1 – creating Santa’s table for time series data

Let’s create a very basic table to capture all data from Santa’s factory floor. You will notice there are different data types like bigint and string. You can even add comments to help others later understand what kind of data each field should contain. In this case the raw data is Avro, but you do not have to worry about that, so let’s go.

CREATE EXTERNAL TABLE `raw`(
  `seriesid` string COMMENT 'from deserializer',
  `timeinseconds` bigint COMMENT 'from deserializer',
  `offsetinnanos` bigint COMMENT 'from deserializer',
  `quality` string COMMENT 'from deserializer',
  `doublevalue` double COMMENT 'from deserializer',
  `stringvalue` string COMMENT 'from deserializer',
  `integervalue` int COMMENT 'from deserializer',
  `booleanvalue` boolean COMMENT 'from deserializer',
  `jsonvalue` string COMMENT 'from deserializer',
  `recordversion` bigint COMMENT 'from deserializer'
) PARTITIONED BY (
  `startyear` string, `startmonth` string,
  `startday` string, `seriesbucket` string
)

Day 2 – querying Santa’s data

Now we have a table – how do we query it? That is easy with SELECT, taking all fields using the asterisk. It’s even possible to limit the result to 10 rows, which is always a good practice.

SELECT * FROM "sitewise_out"."raw" limit 10;

Day 3 – Creating a view from a query

A view is a virtual presentation of data that helps to organize assets more efficiently. One golden rule is still not to create many views on top of other views, and to keep the solution simple. You will notice that CREATE VIEW works nicely, and now we have timeinseconds and the actual factory floor value (doublevalue) captured. You can even drop the view using the DROP VIEW command.

CREATE OR REPLACE VIEW "v_santa_data"

AS SELECT timeinseconds, doublevalue FROM "sitewise_out"."raw" limit 10;

Day 4 – Using functions to format dates for Santa

You noticed that timeinseconds is in Epoch format, so let’s use functions to get a more human-readable output. We add a small from_unixtime function and combine it with date_format to format the output the way we want. Perfect – now we know from which date Santa Claus’s manufacturing data originated.

SELECT date_format(from_unixtime(timeinseconds),'%Y-%m-%dT%H:%i:%sZ') , doublevalue FROM "sitewise_out"."raw" limit 10;

Day 5 – Creating a table with CTAS

Using CTAS (CREATE TABLE AS SELECT) you can even create a new physical table easily. You will notice that an Athena-specific format clause has been added that you would not need on relational databases.

CREATE TABLE IF NOT EXISTS new_table_name

WITH (format='Avro') AS

SELECT timeinseconds , doublevalue FROM "sitewise_out"."raw" limit 10;

Day 6 – Limit the result sets

Now I want to limit the results to only those rows where the quality is GOOD. By adding a WHERE clause I can have only those rows printed to my output – that is cool!

SELECT * FROM "sitewise_out"."raw"  where quality='GOOD' limit 10;

 

You might be interested in Industrial equipment data at scale. Let’s keep a fresh mind and stay tuned for the next episode, where we might go from Factory to Fleet!

 


The Industrial Revolution 6.0

Strength of will, determination, perseverance, and acting rationally in the face of adversity

The Industrial Revolution

The European Commission has taken a very active role in defining Industry 5.0, which complements Industry 4.0 in the transformation towards a sustainable, human-centric and resilient European industry.

Industry 5.0 provides a vision of industry that aims beyond efficiency and productivity as the sole goals, and reinforces the role and the contribution of industry to society. https://ec.europa.eu/info/research-and-innovation/research-area/industrial-research-and-innovation/industry-50_en

Finnish industry is affected by the pandemic, the fragmentation of global supply chains and dependency on suppliers all around the world. Finns have something called “sisu”. It’s a Finnish term that can be roughly translated into English as strength of will, determination, perseverance, and acting rationally in the face of adversity. That might be one reason why in Finland a group of people is already defining Industry 6.0, and it is also one of the reasons we wanted to share our ideas in blog posts such as:

  1. Smart and Connected Factories
  2. Factory Floor and Edge computing
  3. Industrial data contextualization at scale
  4. AWS SageMaker Pipelines – Making MLOps easier for the Data Scientist
  5. Productivity and industrial user experience
  6. Cloud data transformation
  7. Illusion of real-time
  8. Manufacturing security hardening

It’s not well defined where the boundaries of each industrial revolution really are. We can argue that Industry 1.0 started around 1760, when the transition to new manufacturing processes using water and steam was happening. Around 1840 the second industrial revolution, referred to as “The Technological Revolution”, began; one key component was superior electrical technology which allowed for even greater production. Industry 3.0 introduced more automated systems onto the assembly line to perform human tasks, i.e. using Programmable Logic Controllers (PLC).

Present 

The Fourth Industrial Revolution (Industry 4.0) will incorporate storage systems and production facilities that can autonomously exchange information. The way any service or product is delivered and purchased will span two categories: physical and digital.

IoT has suffered a bit of inflation as a word, and the biggest hype cycles are behind us – which is a good thing. The Internet of Things (IoT) plays a very important role in enabling smart connected devices and extends the possibilities of cloud computing. Companies are already creating cyber-physical systems where machine learning (ML) is built into product-centered thinking. A few companies even have a digital twin that serves as the real-time digital counterpart of a physical object or process.

In Finland, with its long history of factory, process and manufacturing companies, this is reality, and bigger companies are targeting faster time to market, quality and efficiency. Rigid SAP processes combined with yearly budgets are no longer blocking future-looking products and services – we are past that time. There are great initiatives for sensor networks and edge computing for environmental analysis. Software-enabled intelligent products, new and better offerings based on real usage, and how to differentiate on the market are everyday business to many of us in the industrial domain.

Future

“When something is important enough, you do it even if the odds are not in your favor.” Elon Musk

World events have pushed industry to rethink how to build and grow business in a sustainable manner. Industry 5.0 is said to be the revolution in which man and machine reconcile and find ways to work together to improve the means and efficiency of production. Being on stage or listening to your fellow colleagues you can hear words like human-machine co-creative resilience, mass customization, sustainability and circular economy. Product complexity is increasing at the same time as customer expectations keep rising.

Industry 6.0 exists only in whitepapers, but that does not mean that “customer driven virtualized antifragile manufacturing” could not become real some day. Hyperconnected factories and dynamic supply chains would most probably benefit all of us. Some compare this industrial change to the way hyperscalers such as AWS sell cloud capacity. There are challenges for sure, like making “Lot Size One” economically feasible. One thing is certain: all models and things will merge, blur and converge.

 

Building the builders

“My biggest mistake is probably weighing too much on someone’s talent and not someone’s personality. I think it matters whether someone has a good heart.” – Elon Musk

One fact is that industrial life is not super interesting to millennials. It looks old-fashioned, so attracting future professionals is a must. The factory floor might not be as interesting as it was a few decades ago. The possibilities of technology and cloud computing will help draw a more diverse group of people towards industrial solutions. A lot of ecosystems exist with little collaboration, and we think it’s time to change that by reinventing business models and solutions and onboarding more fresh-minded people to industrial solutions.

That is one reason we have packaged kickstarts for our customers, so anyone interested can grow with us.

 

 

 

 

Manufacturing security hardening

Securing IT/OT integration

 

Last time my colleague Ripa and I discussed industrial UX and productivity. This time I focus on factory security, especially in situations where factories are connected to the cloud.

Historical habits 

As we know, for a long time manufacturing OT workloads were separated from IT workloads. Digitalization, IoT and edge computing enabled IT/OT convergence and made it possible to take advantage of cloud services.

The security model at manufacturing factories has been based on isolation, where the OT workload could be isolated and even fully air-gapped from the company’s other private clouds. I recommend taking a look at the Purdue model from the 1990s, which was and still is the basis many factories use for guidance on industrial communications and integration points. It was so popular and accepted that it became the basis for the ISA-95 standard (the triangle I drew in an earlier blog post).

Now with new possibilities with the adoption of cloud, IoT, digitalization and enhanced security we need to think: 

Is the Purdue model still valid, or is it just slowing down the move towards smart and connected factories?

Purdue model presentation aligned to industrial control system

 

Especially now that edge computing (the manufacturing cloud) is becoming more sensible, we can process the data already at level 1 and send it to the cloud using the existing secured network topology.

Does the Purdue model slow down new thinking? Should we have an industrial edge computing platform that can connect to all layers?

 

Well architected

Thinking about the technology stack from the factory floor up to AWS cloud data warehouses or visualizations – it is huge! It’s not so straightforward to apply all the possible security principles to every level of your stack. It might even be that the whole stack has been developed over the last 20 years, so there will be legacy systems and technical debt, which slow down applying modern security principles.

In the following, I summarize four main security principles you can use in hybrid manufacturing environments:

  • Is data secured in transit and at rest?

Use encryption and, if possible, enforce it. Use key and certificate management with scheduled rotation. Enforce access control to data, including backups and versions as well. For hardware, use a Trusted Platform Module (TPM) to store keys and certificates. (A small sketch of enforcing encryption at rest in the cloud follows after this list.)

  • Are all the communications secured?

Use TLS or IPsec to authenticate all network communication. Implement network segmentation to make networks smaller and tighten trust boundaries. Use industrial protocols like OPC-UA.

  • Is security applied in all layers?

Go through all layers of your stack and verify that each one is covered with proper security controls.

  • Do we have traceability?

Collect log and metric data from hardware, software, the network and access requests, and implement monitoring, alerting and auditing for actions and changes to the environment in real time.
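As a concrete example of the first principle (data secured at rest), here is a minimal sketch, assuming the raw factory data lands in an S3 bucket – the bucket name and KMS key alias are placeholders, not from the original setup:

import boto3

s3 = boto3.client("s3")

# Enforce SSE-KMS as the default encryption for every new object in the bucket.
s3.put_bucket_encryption(
    Bucket="my-raw-factory-data",  # placeholder bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/factory-data-key",  # placeholder KMS key alias
                }
            }
        ]
    },
)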

 

Secured data flow 

The following picture is a very simplified version of the Purdue model aligned to the manufacturing control hierarchy and adopting AWS cloud services. It focuses on how manufacturing machinery data can connect to the cloud securely. The most important thing to note from the picture is that network traffic from on-prem to the cloud is private and encrypted. There is no reason to route this traffic through the public internet.

Purdue model aligned to manufacturing control hierarchy adopting AWS cloud

 

You can establish a secure connection between the factory and the AWS cloud by using AWS Direct Connect or AWS Site-to-Site VPN. In addition to this, I recommend using VPC endpoints so you can connect to AWS services without a public IP address. Many AWS services support VPC endpoints, including AWS IoT SiteWise and AWS IoT Core.

Manufacturing machinery is on layers 0-2. Depending on the equipment trust levels, it’s a good principle to divide the machinery into cells / sub-networks to tighten trust boundaries. Machinery with different trust levels can be categorized into its own cells. Using industrial protocols, like OPC-UA, brings authentication and encryption capabilities near the machinery. I’m very excited about the possibility of doing server-initiated connections (reverse connect) with OPC-UA, which makes it possible for clients to communicate with the server without opening inbound firewall ports.

As you can see from the picture, data is routed through all layers, and it looks like the IDMZ (Industrial Demilitarized Zone) and layers 4 and 5 are almost empty. As discussed earlier, it is only for connecting machinery to the cloud via secure tunneling that we can bypass some layers; for other use cases the layers are still needed. If for some reason we need to route factory network traffic to the AWS Cloud through the public internet, we need a TLS proxy in the IDMZ to encrypt the traffic and protect the factory from DDoS (Distributed Denial of Service) attacks.

The edge computing unit on Layer 3 is an AWS Greengrass device which ingests data from factory machinery, processes the data with ML and sends only the necessary data to the cloud. The unit can also communicate with and ingest data from Supervisory Control and Data Acquisition (SCADA), Distributed Control System (DCS) and other factory systems. AWS Greengrass uses x509 certificate based authentication to the AWS cloud. The idea is that the private key never leaves the device and is protected and stored in the device’s TPM module. All the certificates are stored in AWS IoT Core and can be integrated with a custom PKI. For storing your custom CAs (Certificate Authorities) you can use AWS ACM. I strongly recommend designing and building certificate lifecycle policies and enforcing certificate rotation to reach a good security level.

A great way of auditing your cloud IoT security configuration is with AWS IoT Device Defender. You can also analyse the factory traffic in real time, find anomalies and trigger security incidents automatically when needed.
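For example, an on-demand audit can be started through the AWS SDK. A minimal sketch follows; the check names below are illustrative examples, so pick the ones relevant to your setup:

import boto3

iot = boto3.client("iot", region_name="eu-west-1")

# Run an on-demand Device Defender audit for a couple of certificate-related checks.
task = iot.start_on_demand_audit_task(
    targetCheckNames=[
        "DEVICE_CERTIFICATE_EXPIRING_CHECK",
        "REVOKED_DEVICE_CERTIFICATE_STILL_ACTIVE_CHECK",
    ]
)

# The task id can be used later to poll the audit status and findings.
print(iot.describe_audit_task(taskId=task["taskId"]))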

 

Stay tuned

Security is our best friend, you don’t need to be afraid of it.

Build it into all layers, from bottom to top, in as early a phase as possible. AWS has the security capabilities to connect private networks to the cloud and to do edge computing and data ingestion in a secure way.

Stay tuned for the next posts, and check out our Connected Factory Kickstart if you haven’t yet:

https://www.solita.fi/en/solita-connected/

 

Illusion of real-time

Magic is the only honest profession. A magician promises to deceive you and he does.

Cloud data transformation

Tipi shared thoughts on how data assets could be utilized in the cloud. We had a few questions after the blog post, and one of them was “how to tackle real-time requirements?”

Let’s go real time ?

Real-time business intelligence is a concept describing the process of delivering business intelligence or information about business operations as they occur. Real time means near to zero latency and access to information whenever it is required.

We all remember those nightly batch loads and data preprocessing – waiting a few hours before data is ready for reports. Someone is checking whether sales numbers have dropped, and the manager asks for quality reports from production. A report is evidence to some other team of what is happening in our business.

Let’s go back to the definition that says “information whenever it is required”: for some teams even one week or one day can be real time. Business processes and humans are not software robots, so taking action based on any data will take more than a few milliseconds – so where is this real-time requirement coming from?

Marko had a nice article related to OT systems, Factory Floor and Edge computing. Any factory issue can be a major pain and downtime is not an option; he explained how most data assets, like metrics and logs, must be available immediately in order to recover and understand the root cause.

Hyperscalers and real time computing

In March 2005, Google acquired the web statistics analysis program Urchin, later known as Google Analytics. That was one of the customer-facing solutions for gathering massive amounts of data. Industrial protocols like Modbus from the 1970s were designed to work in real time for that time and era. Generally speaking, real-time computing has three categories:

  • Hard – missing a deadline is a total system failure.
  • Firm – infrequent deadline misses are tolerable, but may degrade the system’s quality of service. The usefulness of a result is zero after its deadline.
  • Soft – the usefulness of a result degrades after its deadline, thereby degrading the system’s quality of service.

So it’s easy to understand that an airplane turbine and a rolling 12-month sales forecast have different requirements.

What is the cost of (data) delay ?

“A small boat that sails the river is better than a large ship that sinks in the sea.”― Matshona Dhliwayo

We can simply estimate the value a specific feature would bring in after its launch and multiply this value by the time it will take to build. That tells us the economic impact that postponing a task will have.

High-performing teams can do cost-of-delay estimation to understand which task to take on first. Can we calculate and understand the cost of delayed data? How much will it cost your organization if a service or product must be postponed because you are missing data or cannot use it?

Start defining real-time

You can easily start by discussing what kind of data is needed to improve customer experience. Real-time requirements might be different for each use case, and that is totally fine. It’s good practice to specify near real-time requirements as factual numbers together with a few examples. It’s also good to remember that “end to end” can have totally different meanings. When working with OT systems, for example, the term First Mile is used for protecting and connecting OT systems with IT.

Any equipment failure must be visible to technicians at site in less than 60 seconds. ― Customer requirement

Understand team topologies

An incorrect team topology can block any near real-time use case. Getting each component and team deliverable to work together might end up introducing unexpected data delays. Or, in the worst-case scenario, a team is built too much around one product or feature and becomes a bottleneck later when building new services.

Data as a Product refers to the idea that the job of the data team is to provide the data that the company needs. A Data as a Service team partners with stakeholders, has more functional experience and is responsible for providing insight as opposed to rows and columns. Data Mesh is about the logical and physical interconnections of the data from producers through to consumers.

Team topologies have a huge impact on how data-driven services are built and on whether data lands for business case purposes at just the right time.

Enable Edge streaming and API capabilities

In the cloud, services like AWS Kinesis are great: Kinesis is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second. Apache Kafka is a framework implementation of a software bus using stream processing. Apache Spark is an open-source unified analytics engine for large-scale data processing.

I am sure you are already familiar with at least one of these. In order to control data flow we typically have two parameters: the amount (or size) of messages and time. Whichever limit is reached first will be served.
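A minimal sketch of how that “size or time, whichever comes first” rule shows up in practice, using boto3 and Kinesis Data Firehose – the stream name, ARNs and limits are placeholders, not from a real setup:

import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

firehose.create_delivery_stream(
    DeliveryStreamName="factory-floor-events",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-raw-data-bucket",             # placeholder
        # Flush to S3 when 5 MB has accumulated OR 60 seconds have passed,
        # whichever happens first.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
    },
)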

 Is your data solution idempotent and able to handle data delays ? ― Customer requirement

Modern purpose-built databases have the capability to process streaming data. Any extra layer of data modeling will add delay to data consumption. At the Edge we typically run purpose-built, robust database services in order to capture all factory floor events with industry-standard data models.

A Site and Cloud API is a contract between different parties and will improve connectivity and collaboration. API calls at the Edge work nicely, and you can have data available in roughly 70-300 ms from the Cloud endpoint (example below). The same data is available on the Edge endpoint, where the client response is even faster, so building factory floor applications is easy.

curl --location --request GET 'https://data.iotsitewise.eu-west-1.amazonaws.com/properties/history?assetId=aa&maxResults=1&propertyId=pp' --header 'X-Amz-Date: 20211118T152104Z' --header 'Authorization: AWS4-HMAC-SHA256 Credential=xxx, SignedHeaders=host;x-amz-date, Signature=xxxx'
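The same call can be made with the AWS SDK instead of a hand-signed HTTP request; a minimal boto3 sketch, with the asset and property IDs as placeholders just like in the curl example:

import boto3

sitewise = boto3.client("iotsitewise", region_name="eu-west-1")

# Equivalent of the signed HTTP request above: fetch the latest history entry
# for one asset property.
response = sitewise.get_asset_property_value_history(
    assetId="aa",      # placeholder asset id
    propertyId="pp",   # placeholder property id
    maxResults=1,
)
print(response["assetPropertyValueHistory"])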

Quite a few databases have a built-in Data API. It’s still good to remember that the underlying engine, the data model and many other factors will determine how scalable the solution really is.

AWS Greengrass Stream Manager is a component that enables you to process data streams and transfer them to the AWS Cloud from Greengrass core devices. Other services like Firehose are supported using the specific aws.greengrass.KinesisFirehose component. These components also support building Machine Learning (ML) features at the Edge.

 

Conclusion

The business case will define the real-time requirement. Build your near real-time capabilities according to a future-proof architecture – adding real-time capabilities later might turn out to be almost impossible.

If the business case is not clear enough, what should I do? Maybe have a cup of tea, relax and read the blog post from Johannes, The gap between design thinking and business impact.

You might be interested in our kickstarts Accelerate cloud data transformation and Industrial equipment data at scale.

Let’s stay fresh-minded !

 

A complete list of new features introduced at the Tableau Conference 2021

The Tableau Conference 2021 is over and yet again it was a lot of fun with all the not-so-serious music performances, great informative sessions, excellent Iron Viz competition, and of course demonstrations of many new features coming in the future releases. In general my first thoughts about the new capabilities revealed in TC21 are very positive. Obviously some of the details are still a bit blurry but the overall topics seem to be in a good balance: There are very interesting improvements coming for visual analytics, data management and content consumption in different channels, but in my opinion the most interesting area was augmented analytics and capabilities for citizen data scientists.

It’s been two years since Salesforce announced the acquisition of Tableau. After acquisitions and mergers it’s always interesting to see how they affect the product roadmap and development. Now I really feel the pace for Tableau is getting faster and the scope is getting more extensive. Tableau is not only fine-tuning the current offering, but creating a more comprehensive analytics platform with autoML, easier collaboration & embedding, and action triggers that extend beyond Tableau.

Note: All the pictures are created using screenshots from the TC21 Devs on Stage and TC21 Opening Keynote sessions. You can watch the sessions at any time on Tableau site.

The Basics – Workbook Authoring

Let’s dive into workbook authoring first. It is still the core of Tableau and I’m very pleased to see there is still room for improvement. For the workbook authoring the biggest announcement was the visualization extensions. This means you can more easily develop and use new custom visualization types (for example sunburst and flower). The feature makes it possible to adjust visualization details with mark designer and to share these custom visualizations with others. Another very nice feature was dynamic dashboard layouts, you can use parameters and field values to dynamically toggle the visibility of dashboard components (visualizations and containers). This gives so much more power to flexibly show and hide visualizations on the dashboard.

There is also a redesigned UI to view underlying data with options to select the desired columns, reorder columns and sort data, export data etc. For map analysis the possibility to use data from multiple data sources in spatial layers is a very nice feature. Using workbook optimizer you can view tips to improve performance when publishing the workbook. In general it also seems the full web authoring for both data source and visualization authoring isn’t very far away anymore.

  • Visualization Extensions (2022 H2): Custom mark types, mark designer to fine tune the visualization details, share custom viz types.
  • Dynamic Dashboard Layouts (2022 H1): Use parameters & field values to show/hide layout containers and visualizations.
  • Multi Data Source Spatial Layers (2021.4): Use data from different data sources in different layers of a single map visualization.
  • Redesigned View Data (2022 H1): View/hide columns, reorder columns, sort data, etc.
  • Workbook Optimizer (2021.4): Suggest performance improvements when publishing a workbook.
Visualization Extensions. Create more complex visualizations (like sunburst) with ease.

Augmented Analytics & Citizen Data Science

This topic has been in the Gartner’s hype cycle for some time. In Tableau we have already seen the first capabilities related to augmented analytics and autoML, but this area is really getting a lot more power in the future. Data change radar will automatically detect new outliers or anomalies in the data, and alert and visualize those to the user. Then users can apply the explain data feature to automatically get insights and explanations about the data, what has happened and why. Explain the viz feature will not explain only one data point but the whole visualization or dashboard and show descriptive information about the data. All this happens automatically behind the scenes and it can really speed up the analysis to get these insights out-of-the-box. There were also a bunch of smaller improvements in the Ask Data feature for example to adjust the behavior and to embed the ask data functionality.

One of the biggest new upcoming features was the possibility to create and deploy predictive models within Tableau with Tableau Model Builder. This means citizen data scientists can create autoML type of predictive models and deploy those inside Tableau to get new insights about the data.  The user interface for this seemed to be a lot like Tableau Prep. Another very interesting feature was Scenario Planning, which is currently under development in Tableau Labs. This feature gives the possibility to view how changes in certain variables would affect defined target variables and compare different scenarios to each other. Another use case for scenarios would be finding different ways to achieve a certain target. For me the scenario planning seemed to be a bit disconnected from the core capabilities of Tableau, but it is under development and for sure there could be some very nice use cases for this type of functionality.

  • Data Change Radar (2022 H1): Alert and show details about meaningful data changes, detect new outliers or anomalies, alert and explain these.
  • Explain the Viz (2022 H2): Show outliers and anomalies in the data, explain changes, explain mark etc.
  • Multiple Smaller Improvements in Ask Data (2022 H1): Contact Lens author, Personal pinning, Phrase builder, Lens lineage in Catalog, Embed Ask Data.
  • Tableau Model Builder: Use autoML to build and deploy predictive models within Tableau.
  • Scenario Planning: View how changes in certain variables affect target variables and how certain targets could be achieved.
Explain Data side pane with data changes and explain change drill down path.

Collaborate, embed and act

The Tableau Slack integration is getting better and more versatile. With the 2021.4 version you can use Tableau search, Explain Data and Ask Data features directly in Slack. As it was said in the event: “it’s like having data as your Slack member“. In the future also Tableau Prep notifications can be viewed via Slack. It was also suggested that later on similar integration will be possible for example with MS Teams.

There were many new capabilities related to embedding contents to external services. With Connected Apps feature admins can define trusted applications (secure handshake) to make embedding more easy. Tableau Broadcast can be used in Tableau Online to share content via external public facing sites for everyone (for unauthenticated users). There was also a mention about 3rd party identity and access provider support which was not very precise but in my opinion it suggests the possibility to more easily leverage identities and access management from outside Tableau. Embeddable web authoring makes it possible to create and edit contents directly within the service where contents are embedded using the web edit, so no need to use Tableau Desktop.

One big announcement was the Tableau Actions. Tableau dashboards already have great actions to create interactions between the user and the data, but this is something more. With Tableau Actions you can trigger actions outside Tableau directly from a dashboard. You could for example trigger Salesforce Flow tasks by clicking a button in the dashboard. And in the future also other workflow engines will be supported. This will provide much more powerful interactivity options for the user.

  • Tableau search, Explain Data and Ask Data in Slack (2021.4)
  • Tableau Prep notifications in Slack (2022 H1)
  • Connected Apps (2021.4): More easily embed to external apps, create secure handshake between Tableau and other apps.
  • Tableau Broadcast (2022 H2): Share content via an external public-facing site to give access to unauthenticated users, only in Tableau Online.
  • 3rd party Identity & Access Providers: Better capabilities to manage users externally outside Tableau.
  • Embeddable Web Authoring: No need for desktop when creating & editing embedded contents, full embedded visual analytics.
  • Embeddable Ask Data 
  • Tableau Actions: Trigger actions outside Tableau, for example Salesforce Flow actions, later on support for other workflow engines.
Creating new Tableau Action to trigger Salesforce Flow to escalate case.

Data management & data preparation

Virtual Connections have already been introduced earlier and those seem to be very powerful functionality to centrally manage data connections and create centralized row level security rules. These functionalities and possible new future features build around them can really boost end-to-end self-service analytics in the future. The only downside is that this is part of the data management add-on. Data Catalog Integration will bring the possibility to sync metadata from external data catalog services, like Collibra and Alation.

Related to the data preparation there will be new Tableau Prep Extensions so you can get more power to the prep workflows as a custom step. These new steps can be for example sentiment analysis, geocoding, feature engineering etc. Other new functionality in Tableau Prep is the possibility to use parameters in the Prep workflows. It was also said that in the future you can use Tableau Public to publish and share Tableau Prep flows. This might mean there is also a Public version coming for Tableau Prep. It wasn’t mentioned in the event, but it would be great.

  • Virtual Connections (2021.4): Centrally managed and reusable access points to source data with single point to define security policy and data standards.
  • Centralized row level security (2021.4): Centralized RLS and data management for virtual connections.
  • Data Catalog Integration: Sync external metadata to Tableau (from Collibra, Alation, & Informatica).
  • Tableau Prep Extensions: Leverage and build extension for Tableau Prep (sentiment analysis, OCR, geocoding, feature engineering etc.).
  • Parameters in Tableau Prep (2021.4): Leverage parameters in Tableau Prep workflows.
Content of a virtual connection and related security policies.

Server Management

Even though SaaS options like Tableau Online are getting more popular all the time there was still a bunch of new Tableau Server specific features. New improved resource monitoring capabilities as well as time stamped log file zip generation were mentioned. Backgrounder resource limits can limit the amount of resources consumed by backgrounder processes and auto-scaling for backgrounders for containerized deployments can help the environment to adjust for different workloads during different times of the day.

  • Resource Monitoring Improvements (2022 H1): Show view load requests, establish new baseline etc.
  • Time Stamped log Zips (2021.4)
  • Backgrounder resource limits (2022 H1): Set limits for backgrounder resource consumption.
  • Auto-scaling for backgrounder (2022 H1): Set backgrounder auto-scaling for container deployments.

Tableau Ecosystem & Tableau Public

Tableau is building Tableau Public to better serve the data family in different ways. There is already a possibility to create visualizations in Tableau Public using the web edit. There is also redesigned search and better general user interface to structure and view contents as channels. Tableau Public will also have Slack integration and more data connectors for example to Dropbox and OneDrive. As already mentioned, Tableau Prep flows can be published to Tableau Public in the future and that might also mean a release of Tableau Prep Public, who knows.

In the keynote there was also mention that Tableau exchange would contain all the different kinds of extensions, connectors, datasets and accelerators in the future. The other contents are already there but the datasets will be a very interesting addition. This would mean companies could publish, use and possibly sell and buy analysis ready data contents. The accelerators are dashboard starters for certain use cases or source data.

  • Tableau Public Slack Integration (2022 H1)
  • More connectors to Tableau Public (2022 H1): Box, Dropbox, OneDrive.
  • Publish Prep flows to Tableau Public: Will there be a Public version for Tableau Prep?
  • Tableau Public custom Channels (2022 H1):  Custom channels around certain topics.
  • Tableau exchange: Search and leverage shared extensions, connectors, datasets and accelerators.
  • Accelerators: Dashboard starters for certain use cases and source data (e.g. call center analysis, Marketo data, Salesforce data etc.).

Want to read or hear more?

If you are looking for more info about Tableau read our blog post: Tableau – a pioneer of modern self-service business intelligence.

More info about the upcoming features on the Tableau coming soon page.

You can also read about our visual analytics services and contact us to hear more or to see a comprehensive end-to-end Tableau demo.

Thanks for reading!

Tero Honko, Senior Data Consultant
tero.honko@solita.fi
Phone +358 40 5878359

Accelerate cloud data transformation

Cloud data transformation

Data silos and unpredicted costs preventing innovation

Cloud database race ?

One of the first cloud services was S3, launched in 2006. Amazon SimpleDB was released in 2007, and since then there have been many nice cloud database products from multiple cloud hyperscalers. Database as a Service (DBaaS) has been a prominent offering for customers looking for scaling, simplicity and taking advantage of the ecosystem. The cloud database and DBaaS market has been estimated at USD 12,540 million by 2020, so no wonder there is a lot of activity. From a customer point of view this is excellent news: the cloud database service race is on, new features are popping up and at the same time usage costs are getting lower. I cannot remember a time when creating a global solution backed by a database was as cost-efficient as it is now.

 

Why should I move data assets to the Cloud ?

There are a few obvious reasons, like rapid setup, cost efficiency, scaling and integration with other cloud services. In many cases this also brings better security enforcement than the old-school username-and-password approach that some on-premises systems still use.

 

“No need to maintain private data centers”, “No need to guess capacity”

 

Cloud computing, unlike a typical on-premises setup, is distributed by nature, so computing and storage are separated. Data replication to other regions is supported out of the box in many solutions, so data can be stored as close as possible to end users for a best-in-class user experience.

In the last few years, ever more database services can work seamlessly across on-premises and cloud. Almost all data-related cases have aspects of machine learning nowadays, and the cloud empowers teams to enable machine learning in several different ways: built into database services, as purpose-built services, or through native integrations. Just using the same development environment and industry-standard SQL you can carry out all ML phases easily. Database-integrated AutoML aims to empower developers to create sophisticated ML models without having to deal with all the phases of ML – that is a great opportunity for any citizen data scientist!

 

Purpose-built databases to support diverse data models

The beauty of the cloud comes from flexibility and a pay-as-you-go model with near real-time cost monitoring. You can cherry-pick the best purpose-built database (relational, key-value, document, in-memory, graph, time series, wide column, or ledger) to suit your use case and data models, and avoid building one big monolithic solution.

Snowflake is one of the few enterprise-ready cloud data warehouses that brings simplicity without sacrificing features and can be operated on any major cloud platform. Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. Amazon Timestream is a nice option for serverless, super fast time series processing and near real-time solutions. You might have a Hadoop system or be running a non-scalable relational database on premises and be wondering how to get started on a journey towards improved customer experience and digital services.

Success for your cloud data migration

We have worked with our customers to build a data migration strategy. That helps in understanding the migration options, creating a plan and also validating a future-proof architecture.

Today we share with you here a few tips that might help you when planning data migrations.

  1. Employee experience – embrace your team and the new possibilities, and replace a purely technical approach with one that includes commitment from your developers. Domain knowledge of data assets and applications is very important, as is building trust in the new solutions from day one.
  2. Challenge your partner of choice. There are more options than just lift-and-shift or creating everything from scratch. It might be that not all data assets are needed or useful anymore. Our team works with a vertical slicing approach where the elephant is split into manageable pieces. Using state-of-the-art accelerator solutions we can make an inventory based on real-life metrics. Let’s make sure that you can avoid a big bang and that current systems can operate without impact even while new systems are being built.
  3. Bad design and technical debt of legacy systems. It’s very typical that an old system’s performance and design are already broken. That is not visible to all stakeholders, and when doing the first cloud transformation all of it will pop into view. Prepare yourself for surprises – take them as an opportunity to build a more robust architecture. Do not try to fix all problems at once!
  4. Automation to the bone. In order to be able to experiment and replay data, make sure everything is fully automated, including the database, data loading and integrations. That way making a change is fun and not something to be afraid of. It’s very hard to build DataOps into on-premises systems because of the nature of operating models, contracts and hardware limitations. In the cloud those are no longer blockers.
  5. Define workloads and scope (not only low-hanging fruit). Taking one database and moving it to the cloud cannot be used as a baseline when you have hundreds of databases. Metrics from the first one should not simply be multiplied by the number of databases when estimating the whole project scope. Take a variety of different workloads and solutions, even some hard ones, into the first sprint. It’s better to start immediately and not wait for any target systems, because in the cloud that is totally redundant.
  6. Welcome Ops model improvements. In the cloud, database performance metrics (and any other kind) and audit trails are all visible, so creating a more proactive and risk-free Ops model is at your fingertips. My advice is not to copy the existing Ops model with the current SLA as it is. High availability and recovery are different things – so do not mix them up.
  7. Go for a metadata-driven DW. In some cases choosing a state-of-the-art automated warehouse like Solita Agile Data Engine (ADE) will boost your business goals when you are ready to take the next step.

 

Let’s get the cloud data transformation going!

Take advantage of the cloud to build digital services faster and with less money with our Accelerate cloud data transformation kickstart.

You might also be interested in Migrating to the cloud isn’t difficult, but how to do it right?

Productivity and industrial user experience

A digital employee is not a software robot

 

The last post was about data contextualisation, and today in this video blog post we talk about the importance of user experience in an industrial environment.

UX versus employee experience

User Experience (UX) design is the process design teams use to create products that provide meaningful and relevant experiences to users. 

Employee experience is a worker’s perception of his or her journey through all the touchpoints at a particular company, starting with job candidacy and continuing through to the exit from the company.

Using modern digital tools and platforms can support employee experience and create competitive advantage. Especially when working with factory systems and in remote locations it’s important to maintain good productivity, and one option is cloud-based manufacturing.

Stay tuned for more and check our Connected Factory kickstart:

https://www.solita.fi/en/solita-connected/

AWS SageMaker Pipelines – Making MLOps easier for the Data Scientist

SageMaker Pipelines is a machine learning pipeline creation SDK designed to make deploying machine learning models to production fast and easy. I recently got to use the service in an edge ML project and here are my thoughts about its pros and cons. (For more about the said project refer to Solita data blog series about IIoT and connected factories https://data.solita.fi/factory-floor-and-edge-computing/)

Example pipeline

Why do we need MLOps?

First there were statistics; then came the emperor’s new clothes – machine learning, a rebranding of old methods accompanied by new ones. Fast forward to today and we talk all the time about this thing called “AI”. The hype is real and palpable because of products like Siri and Amazon Alexa.

But from a Data Scientist’s point of view, what does it take to develop such a model? Or even a simpler model, say a binary classifier? The amount of work is quite large, and this is only the tip of the iceberg. How much more work is needed to put that model into a continuous development and delivery cycle?

For a Data Scientist, it can be hard to visualize what kind of systems you need to automate everything your model needs to perform its task. Data ETL, feature engineering, model training, inference, hyperparameter optimization, performance monitoring etc. Sounds like a lot to automate?

(Hidden technical debt in machine learning https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf)

 

This is where MLOps comes into the picture, bridging DevOps CI/CD practices to the data science world and bringing in some new aspects as well. You can find more information about MLOps in previous Solita content such as https://www.solita.fi/en/events/webinar-what-is-mlops-and-how-to-benefit-from-it/

Building an MLOps infrastructure is one thing but learning to use it fluently is also a task of its own. For a Data Scientist at the beginning of his/her career, it could seem too much to learn how to use cloud infrastructure as well as learn how to develop Python code that is “production” ready. A Jupyter notebook outputting predictions to a CSV file simply isn’t enough at this stage of the machine learning revolution.

(The “first” standard on MLOps, Uber Michelangelo Platform https://eng.uber.com/michelangelo-machine-learning-platform/)

 

A Jupyter notebook outputting predictions to a CSV file simply isn’t enough at this stage of the machine learning revolution.

Usually, companies that have a long track record of Data Science projects have a few DevOps, Data Engineer/Machine Learning Engineer roles working closely with their Data Scientists teams to distribute the different tasks of production machine learning deployment. Maybe they even have built the tooling and the infrastructure needed to deploy models into production more easily. But there are still quite a few Data Science teams and data-driven companies figuring out how to do this MLOps thing.

Why should you try SageMaker Pipelines?

AWS is the biggest cloud provider at the moment, so it has all the tooling imaginable that you’d need to build a system like this. It is also heavily invested in Data Science with the SageMaker product, and new features are popping up constantly. The problem so far has been that there are perhaps too many different ways of building a system like this.

AWS tries to tackle some of the problems with the technical debt involved in production machine learning with its SageMaker Pipelines product. I’ve recently been involved in a project building and deploying an MLOps pipeline for edge devices using SageMaker Pipelines, and I’ll try to provide some insight into why it is good and what is lacking compared to a completely custom-built MLOps pipeline.

The SageMaker Pipelines approach is an ambitious one. What if, instead of having to learn to use this complex cloud infrastructure, Data Scientists could deploy to production just by learning how to use a single Python SDK (https://github.com/aws/sagemaker-python-sdk)? You don’t even need the AWS cloud to get started; it also runs locally (to a point).

SageMaker Pipelines aims to make MLOps easy for Data Scientists. You can define your whole MLOps pipeline in, for example, a Jupyter Notebook and automate the whole process. There are a lot of prebuilt containers for data engineering, model training and model monitoring that have been custom-built for AWS. If these are not enough, you can use your own containers, enabling you to do anything that is not supported out of the box. There are also a couple of very niche features, like out-of-network training, where your model is trained in an instance that has no access to the internet, mitigating the risk of somebody from the outside trying to influence your model training with, for example, altered training data.
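To give an idea of what this looks like in practice, below is a minimal, illustrative sketch of a one-step pipeline defined with the SDK – the role ARN, container image and S3 paths are placeholders, not the actual project setup:

# A minimal, illustrative SageMaker Pipelines definition (placeholders throughout).
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.pipeline import Pipeline

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# Pipeline parameter: lets you swap the training dataset per execution.
input_data = ParameterString(
    name="InputData",
    default_value="s3://my-bucket/train/",  # placeholder S3 path
)

estimator = Estimator(
    image_uri="<training-image-uri>",       # placeholder training container
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model/",    # placeholder S3 path
    sagemaker_session=session,
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=input_data)},
)

pipeline = Pipeline(
    name="edge-ml-pipeline",
    parameters=[input_data],
    steps=[train_step],
    sagemaker_session=session,
)

pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # run it

In a real project you would chain processing, training, evaluation and model registration steps in the same way, which is where the reusability mentioned below comes from.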

You can version your models via the model registry. If you have multiple use cases for the same model architecture, with the differences being in the datasets used for training, it’s easy to select the suitable version from the SageMaker UI or the Python SDK and refactor the pipeline to suit your needs. With this approach, the aim is that each MLOps pipeline has a lot of components that are reusable in the next project. This enables faster development cycles, and the time to production is reduced.

SageMaker Pipelines logs every step of the workflow, from training instance sizes to model hyperparameters, automatically. You can seamlessly deploy your model to a SageMaker Endpoint (a separate service), and after deployment you can also automatically monitor your model for concept drift in the data or, for example, latencies in your API. You can even deploy multiple versions of your models and do A/B testing to select which one is proving to be the best.

And if you want to deploy your model to the edge, be it a fleet of RaspberryPi4s or something else, SageMaker provides tooling for that also and it seamlessly integrates with Pipelines.

You can recompile your models for a specific device type using SageMaker Neo Compilation jobs (basically if you’re deploying to an ARM etc. device you need to do certain conversions for everything to work as it should) and deploy to your fleet using SageMaker fleet management.

Considerations before choosing SageMaker Pipelines

By combining all of these features into a single service usable through an SDK and a UI, Amazon has managed to automate a lot of the CI/CD work needed for deploying machine learning models into production at scale with agile project development methodologies. You can also leverage all of the other SageMaker products, for example Feature Store or Forecast, if you happen to need them. If you’re already invested in AWS, you should give this a try.

While it is a great product to get started with machine learning pipelines, it isn’t without its flaws. It is quite capable for batch learning settings, but there is no support as of yet for streaming/online learning tasks.

And for the so-called citizen data scientist, this is not the right product, since you need to be somewhat fluent in Python. Citizen data scientists are better off with BI products like Tableau or Qlik (which use SageMaker Autopilot as their backend for ML) or perhaps with products like DataRobot.

And in a time when software products require high availability and face heavy usage, the SageMaker Endpoints model API deployment scenario, where you have to pre-decide the number of machines serving your model, isn’t quite enough.

In e-commerce applications, you could run into situations where your API is receiving so much traffic that it can’t handle all the requests because you didn’t select a big enough cluster to serve the model. The only way to increase the cluster size in SageMaker Pipelines is to redeploy a new revision with a bigger cluster. It is pretty much a no-brainer to use a Kubernetes cluster with horizontal scaling if you want to be able to serve your model as the traffic to the API keeps increasing.

Overall it is a very nicely packaged product with a lot of good features. The problem with MLOps on AWS has been that there are too many ways of doing the same thing, and SageMaker Pipelines is an effort to streamline and package all those different methodologies together for machine learning pipeline creation.

It’s a great fit if you work with batch learning models and want to create machine learning pipelines really fast. If you’re working with online learning or reinforcement learning models, you’ll need a custom solution. And if you are adamant that you need autoscaling, then you need to do the API deployments yourself; SageMaker Endpoints aren’t quite there yet. For a reference to a “complete” architecture, see the AWS blog https://aws.amazon.com/blogs/machine-learning/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected/

 


Industrial data contextualization at scale

Shaping the future of your data culture with contextualization

 

My colleague and good friend Marko had interesting thoughts on Smart and Connected Factories, on how to get data out of complex factory floor systems, and on enabling machine learning capabilities at the Edge and in the Cloud. In this blog post I will open up data modeling a bit more and show how to overcome a few typical pitfalls – which are not always only data related.

Creating super powers

Research and development (R&D) includes activities that companies undertake to innovate and introduce new products and services. In many cases, if the company is big enough, R&D is separate from other units, and in some cases R is separated from D as well. We could call this separation of concerns – so every unit can focus 100% on its goals.

What separates R&D from a business unit? Let’s first pause and think about what the business is doing. A business unit is an organizational structure, such as a department or team, that produces revenue and is responsible for costs. Perfect – so now we have company-wide functions (R&D, business) to support being innovative and producing revenue.

Hmmm, something is still missing – how do we scale digital solutions in a cost-efficient way so the profit (row 80) stays in good shape? Way back in 1978 the term information technology (IT) was first used. The Merriam-Webster Dictionary defines information technology as “the technology involving the development, maintenance, and use of computer systems, software, and networks for the processing and distribution of data.” One of the IT function’s roles is to provide services cost-efficiently on a global scale.

Combining these superpowers – business, R&D and IT – we should produce revenue, be innovative and have the latest IT systems up and running to support company goals. In real life this is much more complex; welcome to the era of data-driven products and services.

 

Understanding your organization structure 

To be data driven, the first thing is to actually look around and see at which maturity level my team and company are. There are many models to choose from: functional, divisional, matrix, team, and networking. The organizational structure can easily become a blocker to getting new ideas to market quickly enough. Quite often Conway’s law kicks in, and software or automated systems end up “shaped like” the organizational structure they are designed in or designed for.

One example of Conway’s law in action, identified back in 1999 by UX expert Nigel Bevan, is corporate website design: companies tend to create websites whose structure and content mirror the company’s internal concerns.

When you look at your car dashboard, company websites or the circuit board of an embedded system, you can quite often see Conway’s law in action. Feature teams, tribes, platform teams, enabler teams or component teams – I am sure you have at least one of these to somehow try to tackle the problem of how an organization should be able to get good enough products and services to market on time. Calling the same thing a Squad will not solve the core issue, and neither will copying a top-down-driven model from Netflix to your industrial landscape.

 

Why does data contextualization matter?

Based on the facts mentioned above, creating industrial data-driven services is not easy. Imagine you push a product out to the market that is not able to gather data from usage. Another team is building a subscription-based service for the same customers. Maybe someone has already started to sell it to customers. This will not work, because now we have a product out that cannot invoice customers based on usage. Refactoring of organizations, code and platforms is needed to accomplish common goals together. A new data platform as such does not automatically improve the speed of development or make customers more engaged.

Contextualization means adding related information to any data in order to make it more useful. That does not mean a data lake, our new CRM or MES. Industrial data is not just another data source on slides; creating contextual data enables a shared language between different parties such as business and IT.

A great solution will help you understand better what you have and how things work. It’s like a car you have never driven, and still you feel that this is exactly how it should be, even if it’s nothing like your old vehicle. Industrial data assets are modeled in a certain way, and that enables common data models from floor to cloud, enabling scalable machine learning without constantly varying data schemas.

Our industrial AWS SiteWise data models, for example, are 100% compatible out of the box with modern data warehousing platforms like Solita Agile Data Engine. General blueprints of data models have failed in this industry many times, so please always look at your use case from the bottom up as well, not only the other way round.

Curiosity and an open mind

I have been working with data for the last 20 years, and in the industrial landscape for half of that time. It’s great to see how Nordic companies are embracing company culture change, talking about competence-based organizations, asking consultants for more than just a pair of hands and creating teams with superpowers.

How to get started on data contextualization?

  1. Gather your team and check how much time it takes to get one idea to the customer (production) – does our current organization model support it?
  2. Look at models and approaches that you might find useful, like an intro to data mesh or a deeper dive – the new paradigm you might want to mess with (and remember that what works for someone else might not be perfect for you)
  3. We can help with AWS SiteWise for data contextualization. That specific service is used to create virtual representations of your industrial operation with AWS IoT SiteWise assets – a minimal sketch of this follows the list.
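
As a rough illustration of what that looks like in practice, the sketch below creates a tiny asset model and one asset with boto3. The model name, property and unit are hypothetical and only show the shape of the calls.

import boto3

sitewise = boto3.client("iotsitewise")

# Define a simple asset model for a production line (names are hypothetical)
model = sitewise.create_asset_model(
    assetModelName="ProductionLine",
    assetModelDescription="Virtual representation of one production line",
    assetModelProperties=[
        {
            "name": "Temperature",
            "dataType": "DOUBLE",
            "unit": "Celsius",
            "type": {"measurement": {}},
        }
    ],
)

# In practice, wait until the model status is ACTIVE before creating assets
asset = sitewise.create_asset(
    assetName="Line 1",
    assetModelId=model["assetModelId"],
)
print(asset["assetId"])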

I have been working on all major cloud platforms and am focusing on AWS. Stay tuned for the next blog post explaining how SiteWise is used for data contextualization. Let’s keep in touch and stay fresh minded.

Our Industrial data contextualization at scale Kickstart

 

Factory Floor and Edge computing

Happened last time

In the first part of this blog series I discussed the Industry 4.0 phenomenon: the smart and connected factory, what benefits it brings, what IT/OT convergence is, and gave a short intro to Solita’s Connected Factory Kickstart.

This part focuses more on the data at the factory floor and how AWS services can help in ingesting data from factory machinery.

Access the data and gain benefits from Edge computing

So what is the data at the factory floor? It is generated by machinery systems using many sensors and actuators. See the following picture: on the left there is the traditional ISA-95 pyramid for factory data, integrating each layer with the next. The right side represents the new thinking where we can ingest data from each layer and take advantage of IT/OT convergence using AWS edge and cloud services.

PLC (Programmable Logic Controller) typically has dedicated modules for inputs and dedicated modules for outputs. An input module can detect the status of input signals like switches and an output module controls devices such as relays and motors.

Sensors are typically connected to PLCs. To access the data and use it in other systems, PLCs can be connected to an OPC-UA server, which can then provide access to the data. One traditional use case is to connect PLCs to factory SCADA systems for high-level supervision of machines and processes. OPC-UA defines a generic object model; each object can be associated with a data type, timestamp, data quality and current value, and objects can form a hierarchy. Every kind of device, function and system information can be described using this meta model.
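
To make the OPC-UA side concrete, here is a small sketch using the open source python-opcua client library; the server address and node id are placeholders for whatever your own server exposes.

from opcua import Client

# Placeholder endpoint of the factory OPC-UA server
client = Client("opc.tcp://192.168.1.10:4840")
client.connect()
try:
    # Node id is hypothetical; browse the server address space to find real ones
    node = client.get_node("ns=2;s=Line1.Temperature")
    print("Current value:", node.get_value())
finally:
    client.disconnect()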

 

AWS services that ease data access at the factory

AWS Greengrass is open source edge software that integrates with the AWS Cloud. It enables local processing, messaging, machine learning (ML) inference, device mesh and many pre-baked software components to speed up application development.

AWS Sitewise is a cloud service for collecting and analyzing data from factory environments. It provides Greengrass-compatible edge components, for example for collecting data from an OPC-UA server and for streaming data to AWS Sitewise. Sitewise has a built-in time-series database, data modeling capabilities, an API layer and a portal, which can be deployed and run at the edge as well (which is amazing!).

AWS Sitewise asset and data modeling is for making a virtual representation of industrial equipment or a process. The data model supports hierarchies, metrics and real-time calculations, for example for calculating OEE (Overall Equipment Effectiveness). Each asset is enforced to use a data model that validates incoming data and its schema.
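
As an example of those real-time calculations, an OEE-style metric can be expressed as a metric property in the asset model definition. The snippet below is only a sketch of the property structure passed to create_asset_model; the variable names and property id placeholders are hypothetical.

# Hypothetical metric property computing OEE once per hour from three other properties
oee_metric = {
    "name": "OEE",
    "dataType": "DOUBLE",
    "type": {
        "metric": {
            # availability, performance and quality refer to other asset model properties
            "expression": "availability * performance * quality * 100",
            "variables": [
                {"name": "availability", "value": {"propertyId": "<availability-property-id>"}},
                {"name": "performance", "value": {"propertyId": "<performance-property-id>"}},
                {"name": "quality", "value": {"propertyId": "<quality-property-id>"}},
            ],
            "window": {"tumbling": {"interval": "1h"}},
        }
    },
}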

Why industrial use cases with AWS?

I prefer hands-on work to reading Gartner papers; anyhow, AWS has been named a Leader for the eleventh consecutive year and has secured the highest and furthest position on the ability-to-execute and completeness-of-vision axes in the 2021 Magic Quadrant for Cloud Infrastructure and Platform Services. It’s very nice to see how AWS is taking industrial solutions seriously and packaging them into a model that is easy to take into use when building digital services for the factory floor and cloud.

1. AWS Sitewise – The power of the data model: ingest, analyze and visualize

Sitewise packages many nice features; the ones I feel are the greatest are the data and asset modeling, near-real-time metric calculations (even on the edge), visualization and the built-in time series database. Sitewise is nicely supported by CloudFormation, so you can automate the deployment and even build data models automatically according to your OPC-UA data model (a metadata-driven, industry-standard data model). The fact that there are edge processing and monitoring capabilities with a portal available makes Sitewise a really competitive package.

2. AWS Greengrass – Edge computing and secure cloud integration

Greengrass speeds up edge application development with public components, like the OPC-UA collector, StreamManager and the Kinesis Firehose publisher. The latest Greengrass version 2.x has evolved and has lots of great features. You can provision and run a solution on real hardware or simulate it on an EC2 instance or in Docker, as you wish. One way to provision Greengrass devices to the AWS cloud is to use IoT Fleet Provisioning, where certificates for the device are created on the first connection attempt to the cloud. Applications are easy to deploy from cloud IoT Core to edge Greengrass instances. You can also run serverless AWS Lambdas at the edge, which is really superb! All in all, the complete Greengrass 2.0 package will speed up development.
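
As a sketch of how those deployments can be automated, the snippet below uses boto3 to push a couple of public components to a Greengrass target; the target ARN and component versions are placeholders and need to match your own account and the versions available in your region.

import boto3

greengrass = boto3.client("greengrassv2")

# Placeholder: a single core device thing ARN or a thing group ARN
target_arn = "arn:aws:iot:eu-west-1:123456789012:thinggroup/FactoryEdgeDevices"

deployment = greengrass.create_deployment(
    targetArn=target_arn,
    deploymentName="factory-edge-baseline",
    components={
        # Public component names; versions are examples only
        "aws.greengrass.Cli": {"componentVersion": "2.4.0"},
        "aws.greengrass.StreamManager": {"componentVersion": "2.0.13"},
    },
)
print(deployment["deploymentId"])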

3. Cloud and Edge – Extra layer of Security

Sitewise and Greengrass use AWS IoT Core security features, like certificate-based authentication, IoT policies, TLS 1.2 on transport and Device Defender, which bring security to a new level. It’s also possible to use custom Certificate Authorities (CAs) to issue edge device certificates. Custom CAs can be stored in AWS CloudHSM and AWS Certificate Manager. Now I can really say that security is our best friend.
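
To illustrate the certificate-based authentication, here is a minimal sketch of a device connecting to AWS IoT Core over mutual TLS with the AWS IoT Device SDK v2 for Python (awsiotsdk); the endpoint, file paths and client id are placeholders.

from awsiot import mqtt_connection_builder

# Placeholders: your account-specific IoT endpoint and the device certificates
mqtt_connection = mqtt_connection_builder.mtls_from_path(
    endpoint="xxxxxxxxxxxxxx-ats.iot.eu-west-1.amazonaws.com",
    cert_filepath="certs/device.pem.crt",
    pri_key_filepath="certs/private.pem.key",
    ca_filepath="certs/AmazonRootCA1.pem",
    client_id="edge-gateway-01",
    clean_session=False,
    keep_alive_secs=30,
)

connect_future = mqtt_connection.connect()
connect_future.result()  # blocks until the TLS handshake and MQTT connect succeed
print("Connected to AWS IoT Core")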

4. Agile integration to other solutions

An easy way to integrate data with other solutions is to use the Sitewise edge and cloud APIs. If you deploy Sitewise to the edge, the API is usable there as well, and you can use the data in other factory systems, like MES (Manufacturing Execution System). At least I think this will combine the IT and OT worlds like never before.
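
For example, reading the latest value of an asset property through the Sitewise API takes only a few lines of boto3; the asset and property ids below are placeholders.

import boto3

sitewise = boto3.client("iotsitewise")

# Placeholders: ids of the asset and property you want to read
response = sitewise.get_asset_property_value(
    assetId="11111111-2222-3333-4444-555555555555",
    propertyId="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
)
latest = response["propertyValue"]
print(latest["value"], latest["timestamp"])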

5. AutoML for Edge computing

AutoML is for people like me and citizen data scientists – something that speeds up business insights, because creating a lot of notebooks or Python code is not needed anymore.

These AutoML services are used to organize, track and compare machine learning training runs. When auto-deploy is turned on, the best model from the experiment is deployed to the endpoint, and the best model is automatically selected using a bandit algorithm. Besides this, Amazon SageMaker Model Monitor will continuously monitor the quality of your machine learning models in real time, so I can focus on talking with people and not only with machines.
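
As a rough sketch of how such an AutoML experiment can be started programmatically, the snippet below launches a SageMaker Autopilot job with boto3; bucket names, the target column and the IAM role are placeholders.

import boto3

sm = boto3.client("sagemaker")

# Placeholders: S3 locations, target column and IAM role
sm.create_auto_ml_job(
    AutoMLJobName="quality-prediction-automl",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-factory-data/training/",
                }
            },
            "TargetAttributeName": "defect_flag",
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-factory-data/automl-output/"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)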

 

Stay tuned for more

I think that AWS is making it easier to combine cloud workloads with edge computing. Stay tuned for the next blog post, where we dig more into the cloud side of this, including Sitewise, the asset and data model, visualization and alarms. And please take a look at the “Predictive maintenance data kickstart” if you haven’t yet:

https://www.solita.fi/en/solita-connected/

 


Building the Builders

Strive for excellence not perfection

Sounds like a typical day?

It’s another morning. Senior Developer Taavi wakes up and tears the sleep out of his eyes. His long-running personal project was finished late at night. “What a Python god I am”, he thinks. Taavi makes coffee, puts on his old seedy college pants and starts a new work day. A half-eaten frozen pizza lies on the desk. The smell of empty energy drink cans wafts into his nose. Fatigue hits, but no can do. The same old tool set is opened: DevOps, Git, VSCode, terminal, CI/CD. The first meeting begins. Proudly he showcases his project to colleagues and all current topics are discussed. A new junior software developer, Ville, has just started in the project, and Taavi is forced to be his mentor. He doesn’t really bother helping Ville. “I’d do that in a couple of hours; for Ville it takes a week”, Taavi fumes and opens up a new can of energy drink. There seems to be no end to the day, as Ville is making new problems all the time and those need to be solved. Deployment to production is in a week. “Why we gotta have those juniors, it just causes costs, deadlines aren’t met, customer satisfaction decreases and quality sinks”, Taavi thinks.

 

Young and fresh minds to boost culture

Sounds familiar? Well, luckily that is not the case in Solita. We want to break those prejudices related to people working in the IT industry. At Solita, we think that the young and fresh minds have the ability to become industry-leading gurus and develop new work methods while also taking care of others in its various aspects: mentally, technically, physically. No one is ready in their early years, we admit that, nor will they ever be. Life is continuous learning and developing. We fail, we learn, we iterate. Until certain goals are met. 

 

Well-being and motivation

Each Solitan also commits to our core values, and even within project teams we develop our co-operation, interaction and soft skills. One team may organize quarterly team days where the focus is on mental development and well-being in and outside work life – ways to handle tough situations and recover. Another team can have a different model, so we are very flexible in trying out and finding what works best for you, for the team and for the company.

 

Now we would like to invite YOU to join us at our team day! Yes, you read correctly. Forget the letters of application, CVs and transcripts of records, and come along with us to have discussions about motivation, development and self-leadership, and find out what kind of a company Solita is and how we work. If you are interested in joining our growing team, join us at an event organized by Solita in Toijala on Friday 19.11.2021, 15:00–17:00. Maybe you are passionate about data, or you have already been the employee of the month and are willing to face new meaningful challenges? Or you are a former Taavi, and you have a fresh attitude towards new success?

Just fill in your contact information and we will be in touch! Only 5 people will get a place at the event, where you will meet a few Solita seniors and youngsters as well. The location and agenda will be communicated to participants by email.

Rough event agenda

  • Introduction and welcoming
  • Training and sweating together to get our brains working (60 min)
  • Small break and snacks
  • Individuals in the work community – Motivation – Growth attitude – Self-management – Focus (60 min)
  • Open job positions: https://www.solita.fi/avoimet-tyopaikat/  

 

We offer you a different approach to the IT industry and are keen to hear what would make your time at Solita the best time of your life! Check out what our team is currently doing with connected factories. No senior or junior developers were harmed in the making of this story.

Smart and Connected Factories

Smart connected factories are a phenomenon of the fourth industrial revolution, Industry 4.0.

What is a connected factory?

A connected factory utilizes machinery automation systems and additional sensors to collect data from manufacturing devices and processes. The data can be analyzed and processed on site at the factory before being sent to cloud platforms for historical and real-time data analysis. A connected factory enables a holistic view of data across all customer factories. Connectivity is a key enabler of IT/OT convergence.

Operational Technology (OT) consists of software and hardware systems that control and execute processes at the factory floor. Typically these are MES (Manufacturing Execution System), SCADA (Supervisory Control And Data Acquisition) and PLCs (Programmable Logic Controllers) at manufacturing factories. 

Information Technology (IT), on the other hand, refers to the information infrastructure covering network, software and hardware components for storing, processing, securing and exchanging data. IT consists of laptops and servers, software, enterprise systems such as ERPs, CRMs and inventory management programs, and other business-related tools.

Historically, OT has been separated from IT. In recent years, industrial digitalization, connectivity and cloud computing have made it possible for OT and IT systems to join and share data with each other. In IT/OT data convergence, factory floor OT data is combined with IT data:

IT/OT Convergence

 

When IT and OT collide, we need to align on things like “How do we handle different networks and control the boundaries between them?” IT and OT networks serve totally different purposes and have different security, availability and maintainability principles. IT/OT convergence can definitely be beneficial for a company, but at the same time it may bring new challenges to the traditional OT world, like “How often and what kind of data should we upload to the cloud?” and “What are the key attributes for combining different data assets?”. Here are a few examples where IT and OT are converged:

  • Welding station monitoring combined with laboratory data. By combining it with IT data we can improve customer-specific welding quality.
  • By getting OT data from equipment and merging it with customer contract data, we can start upselling predictive maintenance solutions.
  • With real-time metrics it is also possible to create subscription-based billing. For this we need basic asset information and CRM customer contract information.
  • Creating a digital service book is easy when you have full traceability based on OT data joined to IT product lifecycle data.

I think that combining IT and OT is nowadays much easier than just a few years ago, thanks to hyperscalers like AWS and others. Now we can see in action how the cloud can enable smart manufacturing using purpose-built components like AWS Greengrass and SiteWise. Stay tuned for the next blog posts, where I will explain the basics of edge computing in a harsh factory environment.

 

Kickstart towards smart and connected factory

Solita has created a kickstart for companies to start a risk-free journey. We package pre-baked components for edge data ingestion, edge ML, AWS Sitewise data modelling, visualization, a data integration API and MLOps into one deliverable, using only four weeks of time.

Check it out from https://www.solita.fi/en/solita-connected/ and let’s connect!

 

Power BI Deep Dive

Power BI is Microsoft’s self-service business intelligence platform. Power BI Service came to life in 2015 with an ambitious vision: to bring analytics to the business, where the data is. Since then, Power BI has not stopped bringing new reporting capabilities to both users and developers. Today there are plenty of new visuals, connections, AI features, licensing options and infrastructure solutions, and it is indeed one of the preferred platforms in the market.

This is the third post in Solita’s blog series about self-service business intelligence (BI). Our first post, “Business Intelligence in the 21st century”, describes the evolution of BI over the last 20 years and introduces us to the modern BI world. More than ever, business talks about data. And although the discussions are generally dominated by big data, AI and machine learning, modern BI still has a lot to say. Thus, we aim to do a deep dive into all the main BI solutions in the market. You can already find our blog post about Tableau, one of the leading platforms and arguably the pioneer of modern self-service BI.

This blog post will focus on Power BI. We will dive into its history, functionalities, components, licensing, and more. We don’t aim to rewrite Microsoft’s own documentation, and we will most probably leave out some specific Power BI components, features and other facts. But we aim to awaken your interest in learning about this fascinating area of self-service reporting and Power BI. If this is the case, please contact us for a more detailed evaluation or a demo.

From SSRS to self-service BI

Pointing out an exact date for the launch of Power BI is rather difficult and somewhat daring. Power BI is not a single BI tool but a combination of multiple reporting and data warehousing solutions. Power BI developers can most probably notice the legacy of 15 years of continuous development. Thus, Power BI was born out of each of those independent solutions.

Some of these components are from 2004. In this year, Microsoft launched Reporting Services as an add-on of SQL Server 2000. This developed further into SQL Server Reporting Services (SSRS), a server-based reporting solution today part of the suite of Microsoft SQL Server Services. Within this decade, development projects Gemini and Crescent would lead to Power Pivot and Power View. Power Pivot was available as an Excel add-in in 2009. Power View was released in 2012 as part of SharePoint. And Data Explorer, which was launched in 2013, set the start of Power Query.  This same year, all these components and Power Map, a 3D data visualization tool, were combined under the umbrella name of Power BI. Power BI became part of the Office 365 package.

Each component performed very different tasks within the BI domain. But all of them had one thing in common: fulfilling a big business need – “data is where the business lives, so data definitely has a story to tell about it”. These tools were born with this idea in mind, at a time when Tableau was the novelty among the business users of the 2010s. In 2015 Power BI Service was finally launched. This enabled Power BI users to share their reports and took the first steps towards a complete self-service analytics solution.

What does Power BI mean?

Power BI was born with the goal of eliminating obstacles for business users to do data analysis and visualization. It is clearly targeted at the business world, which is becoming more data driven. For non-technical fellows, manipulating data might be rather intimidating. Power BI makes it easy to connect to data sources and is a playground for the business to give shape and meaning to data.

Power BI can be defined as a collection of tools that connects unrelated sources of data and brings insights through dynamic and interactive visualizations. For several reasons, Power BI is one of the leading self-service reporting products.

Readily available connections: Power BI supports data connections of all kinds, whether data is on-premises or in the cloud, structured or unstructured, within a Microsoft data warehouse or any other from top industry leaders, IoT and real-time data streams, your favourite services…

Beautiful visualizations: Since visualization is the core of Power BI, users can find multiple plug-and-play types of visuals such as line charts, bar charts, scatter charts, pie charts, matrix tables, and so on. For the most demanding users, the Microsoft platform provides third-party visualizations. And for the brave ones, Power BI provides the option to build your own visuals with Python or R.

Storytelling: Developers can build their own stories. Power BI brings flexibility with dashboards that combine tiles and reports, built on same or different datasets. The canvas and pages support pixel-based designs. All are integrated to deliver wonderful stories with buttons, tooltips and drill-through features.

Share it: Share reports and dashboards with people inside and outside the organization. This is administered through the Power BI portal and Azure Active Directory. The range of possibilities is very wide, from sharing within workspaces, to sharing through Power BI apps, to embedding reports in a company’s website.

DAX & M: Data Analysis Expressions (DAX) is a language developed by Microsoft for data processing not only in Power BI but also PowerPivot and SSAS tabular models. It supports more than 200 functions, many having similarities to the well known Excel formulas. M is the language used in Power Query. This functional language is very powerful when transforming and loading the data so that it is ready for business analysts.

Backed by Azure: The BI platform is built on top of Azure. Thus, all security and performance concerns rely on Azure capabilities. This is no small feat, considering that Azure is one of the most reliable and widespread cloud computing solutions in the world. But Power BI’s benefits from Azure don’t end here. Power BI developers can enjoy a broad range of functionalities such as Azure Machine Learning and Cognitive Services.

Be ready for some challenges

Power BI is continuously evolving. Its users are probably already familiar with its strict monthly releases. Actually, users can vote for improvements to be included in future releases. Despite being a market leader, the users have observed areas where Microsoft could put some development efforts.

One commonly criticized aspect is that product functionality depends on many factors. For instance, the Power BI SaaS options include functions not available in the on-premises solution, and vice versa. Developers might find that reporting is limited to certain functionalities depending on the connection mode or the data source. Even different scripting languages (M and DAX) are used for different purposes. Thus the starting point might be slightly overwhelming for new developers. Additionally, this wide variability of options might make it harder for developers to decide how to build their very specific use cases.

Another common discussion is the strong dependency on Azure. There is specific tool functionality, such as user administration, building dataflows or security, that is partially integrated with Azure. This can cause problems for companies not using Azure as their cloud platform: to fully deploy a new Power BI platform would force them to add Azure competencies to their teams.

When talking about Power BI challenges, it is impossible to avoid talking about DAX. Although it clearly is a very powerful analytical language, it is also hard to learn. New developers usually avoid getting fluent on it because it is still possible to build nice reports using Power BI implicit measures (automatic calculations). However, sooner or later, developers will need to master DAX to deploy more complex requests from the consumers of the reports. 

In addition, challenges might be found in content governance. This is quite a challenge in self-service reporting platforms in general. It is common to find datasets growing out of control, poor utilization of licensing and capacity, or the lack of strategy for designing workspaces, apps and templates. Managing this platform requires data expertise. This complexity is sometimes underestimated by adopters since Power BI announces to be a self-service reporting platform.

The Power BI family

The main components

Power BI mainly consists of 3 components: Power BI Desktop, Power BI Service and Power BI Mobile. A typical workflow would start with Power BI Desktop, which is a desktop application dedicated specifically to data modelling and report development. This is the main tool for Power BI developers, since it enables building queries with Power Query, modelling relationships between those queries and calculating measures for visuals.

Once the report is built, the next step in the workflow is to publish it to Power BI Service, which is Microsoft’s online SaaS offering for Power BI. Power BI Service adds a collaboration layer where both report developers and consumers interact. Power BI Service is organized mainly into workspaces, where developers and consumers share, test, further develop and consume reports, dashboards, and datasets.

The last of the components is Power BI Mobile. With the mobile app, consumers can be always connected to their favourite reports and dashboards.

Power BI main components. Source: Microsoft documentation

In addition to these three core components, Power BI features two others: Power BI Report Builder and Power BI Report Server. The first is a desktop app for designing and deploying paginated reports. These reports are different from the ones developers can build with Power BI Desktop. The main difference is that paginated reports are usually designed to be printed and are formatted to fit on an A4 page. So, for instance, all the rows in a table are fully displayed regardless of its length.

The second component, Power BI Report Server, is an on-premises report server with its own web portal. It offers reporting features similar to Power BI Services and server management similar to what users can achieve with SQL Server Reporting Services. This is what Microsoft has to offer to those who must keep their BI platform within their own infrastructure.

Building blocks

The already mentioned Power BI components are built around 3 major blocks: datasets, reports, and dashboards. These blocks are all organized by workspaces, which at the same time are created on shared or dedicated capacities. Let’s talk about all of these important Power BI elements more in depth.

Building blocks in Power BI and common workflow

Capacities are the resources that host and deliver Power BI content. They can be either shared or dedicated. By default workspaces are created on shared capacity. This means that your Power BI content shares the capacity provided by Microsoft with other Power BI customers. On the other hand, a dedicated capacity is fully reserved to a specific customer. This will require special licensing.

Workspaces are collaboration spaces that contain, among others, dashboards, reports, and datasets. As a workspace admin, you can add new co-workers and set roles to define how they can interact with the workspace content. There is one requirement: all the members need at least a Power BI Pro license, or the workspace must be placed to a dedicated Premium capacity. 

Closely related to workspaces are apps. Apps are containerized within workspaces so that an app makes use of the workspace content. This is the most common and recommended way to share information at an enterprise level.  Its consumers can interact with its visuals but cannot edit the content. Apps are also the best medium to share dashboards and reports outside the limits of your organization.

When describing Power BI it is important to write about datasets. A dataset is a collection of data (from a single or multiple sources) associated with one workspace. The dataset not only includes the data but also the tables, relationships, measures and connections to the data source.

Connecting to data sources can happen in three different connectivity modes, depending on the data source. The most common one is import mode. Importing data means loading a copy of the data into Power BI. This mode allows users to utilize the full functionality of Power BI and achieve maximum calculation speed; however, loads are limited by hardware. Another connectivity mode is DirectQuery, in which data remains within the data source and Power BI only stores metadata. A third mode is also available: Live Connection. This is similar to DirectQuery, with the advantage of using the engine of SQL Server Analysis Services Tabular.

In recent years, Power BI has enabled connections to streaming datasets for real-time reporting. There are several options for connecting to data streams, but they all have their own limitations: some restrict the size of the query, others suffer from limited visual functionality. As a particularity, connecting to a streaming dataset is only possible at the dashboard level, so developers need to use Power BI Service.

Independently of the connection mode, the user needs to use source credentials to create the connection. If data is located on-premises or behind a firewall in general, Power BI Gateway can be used to create a connection between the data and Power BI Service without creating any inbound rules to the firewall. 

Nowadays these connection modes can be combined within the same dataset. These recent developments have had a big impact on BI, since companies can share standardized datasets between workspaces. Reports can connect to multiple types of sources and to existing Power BI datasets.
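
Datasets can also be managed programmatically through the Power BI REST API. As a hedged sketch, assuming you have already acquired an Azure AD access token with the appropriate Power BI scopes, a dataset refresh in a workspace could be triggered roughly like this, with placeholder ids:

import requests

# Placeholders: workspace (group) id, dataset id and an Azure AD access token
group_id = "<workspace-id>"
dataset_id = "<dataset-id>"
access_token = "<azure-ad-access-token>"

url = f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}/datasets/{dataset_id}/refreshes"
response = requests.post(url, headers={"Authorization": f"Bearer {access_token}"})
response.raise_for_status()  # 202 Accepted means the refresh was queued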

A Power BI report is probably the building block best known to both readers and editors. It consists of pages where data comes to life through all kinds of charts, maps and interactive buttons. All these visualizations are called visuals, and their size and location can be defined at pixel level. Reports can be created from scratch with Power BI Desktop, but you can also import them from shared reports or bring them in from other platforms such as Excel. Reports have two view modes: reading and editing view. You might have access to both modes, depending on the role assigned to you when the report was shared. By default, reports always open in reading mode.

But reports are not the only way to communicate your insights. In Power BI we can also do that through dashboards. These are canvases containing tiles and widgets. Tiles are the main visuals; they can connect to real-time streaming datasets, visuals in a report, other dashboards or Q&A reports. Compared to reports, dashboards are commonly used to monitor, at a glance, the most relevant KPIs for a business, and they can only be built directly in Power BI Service. By linking them to reports, a dashboard gives flexibility in the storytelling of your data.

According to Gartner and Forrester 

Market and technology advisors such as Gartner and Forrester agree that Microsoft Power BI is a leading player among BI platforms. In 2021 Gartner published its “Critical Capabilities for Analytics and BI” report and rated Power BI above average in 11 out of 12 critical BI capabilities. Gartner recognized Power BI as a Magic Quadrant Leader once more in 2021, repeating this position for the 14th consecutive year. The same result is obtained from the Forrester Wave: Augmented BI Platforms (Q3 2021).

Power BI 2021 position and path in the Gartner MQ for Analytics and BI.
Source: Tero Honko’s report in Tableau Public

Both organizations are clear about the strengths of Power BI in the current market. Its leader position is the result of Microsoft’s large market reach and Power BI’s ambitious roadmap. Power BI’s inclusion in Office 365 E5 SKUs and its integrations with Microsoft Teams give Power BI access to tens of millions of users around the world. Thus it becomes a clear option for companies that choose Azure as their preferred cloud platform.

Additionally, Gartner suggests that Power BI has impacted the prices of its competitors, reducing the price of BI tools without limiting its own capabilities. Actually, as Gartner mentions, new Power BI releases happen every month. Among the latest releases, both technical advisors appreciate Microsoft’s efforts and ambition towards increasing augmented BI capabilities with new AI services such as smart narratives and anomaly detection. Power BI is also supporting developers with guided ML and new ML-driven automatic optimization to autotune query performance.

However, Gartner’s and Forrester’s reports make a call for action around the less popular aspects of the solution. Both organizations find functional gaps in the on-premises version of Power BI. Some functionalities of Power BI Service, such as streaming analytics and natural language Q&A (question and answer), are still not available in the on-premises offering. The lack of flexibility for customers to use an IaaS other than Azure is also spotted by both technology advisors, despite Azure’s wide global reach. Finally, Gartner highlights what many users have complained about: self-service reporting governance capabilities. Power BI’s investment has not yet resulted in better management of Power BI environments, and the catalog capability is still behind the market offering. Forrester also gives voice to consumers who complain about the inconsistency of the Q&A features.

An Infrastructure for Security

Security is at the forefront of data concerns. Microsoft has built solutions to cover the security needs of its customers. As we have mentioned, Power BI can be offered as SaaS with both shared and dedicated capacity, but also as an on-premises solution for companies that govern their own IaaS.

Power BI Service is SaaS built on Azure. For security reasons, its architecture is divided into two clusters: the web front end (WFE) and the back end. The WFE cluster manages the connections and authentication to Power BI Service. Authentication is managed by Azure Active Directory (AAD), and connections are set up with Azure Traffic Manager (ATM) and the Azure Content Delivery Network (CDN). Once the client is authenticated and connected, the back-end cluster handles all user interactions. This cluster manages data storage using Azure Blob storage, and metadata using Azure SQL Database.

For those with higher security restrictions, Microsoft offers an on-premises BI platform alternative. Companies can build their BI capabilities on top of an on-premises report server branded as Power BI Report Server. The main developer tool is still Power BI Desktop, but platform governance and report visualization reside in Power BI Report Server. Power BI Report Server is a web portal that recalls SSRS, with additional functionality for hosting .pbix files. Reports are published into folders and consumed through the web or across mobile devices.

In this case the company has total control over the IaaS, and consequently the security depends on the company’s decisions. You will need to configure the web service, the database, the web portal, the connections… and manage security. Power BI Report Server supports this by enabling three different security layers. The first one is the portal itself, where you can define who has access to the web service. The next security layer you can configure consists of folders. And finally, security can be managed at the report level.

Licensing Options – A Hard Decision

Regardless of the licensing option, Power BI Desktop is always free. You can connect to any data (when given the right access), run analyses, build your own datasets, use the available visuals and format your reports for free. The limitations come in the next step, when sharing your reports with the rest of the world. You can always send the .pbix file by email, but you cannot use Power BI Service to share it and build a company BI platform.

Once you have decided that Power BI is the right platform for your company, it is time to decide how to roll it out to your users. Microsoft licensing is very flexible, offering a large range of possibilities, but this sometimes makes the decision rather complicated. All licensing options can be bought through the Microsoft 365 Admin Portal. The Power BI admin assigns them either to users or to capacities.

User-based Licensing

There are three licensing options that are assigned directly to users. From the most basic option to the most comprehensive, these are the Power BI Free license, the Power BI Pro license and the Power BI Premium Per User license. Every user within an organization can own a free license, unless the organization disables this possibility. A free license gives you access to Power BI Service but no sharing capabilities. However, it becomes relevant when consuming reports running on Power BI Premium capacity.

The next step is the Power BI Pro license. This license is relevant for both developers and consumers. Developers can create workspaces in Power BI Service and share their reports with small audiences or for other collaborative practices. At the same time, consumers need the license to read reports either directly from the workspace or from a workspace app. Additionally, the Pro license includes multiple features such as Analyze in Excel, the use of dataflows, a 1 GB dataset size limit, 8 automatic refreshes per day, app sharing, and more. Power BI Pro is included in the Microsoft 365 E5 enterprise license. For those with other Microsoft 365 plans, a Power BI Pro license can be bought for 8.40 €. This license mode is crucial when deciding to build a self-service BI platform in your organization.

If you wish to increase the reporting capabilities with features such as paginated reports, AI, higher refresh rates and model size limits, application lifecycle management, and others, then you need Power BI Premium Per User. In the same way as with Power BI Pro, both developers and content consumers need to have the same licensing option. And in contrast to Power BI Pro, the licensing is also assigned to a specific workspace. This is the lowest entry point for Power BI Premium features.

Capacity-based licensing

The next step requires you to buy a capacity-based licensing option: Power BI Premium or Power BI Embedded. With these licensing options, developers, consumers and admins have access to the same features as with Power BI Premium Per User and more. They benefit from dedicated capacity for greater scale and steadier performance of the BI platform. This option also enables on-premises BI with the use of Power BI Report Server.

Power BI Premium includes features that your data engineers and data scientists will enjoy, such as enhanced dataflows, a broader range of storage solutions and AI cognitive services. Power BI Premium is available in two SKU (Stock-Keeping Unit) families: P SKUs and EM SKUs. The first is for embedding and enterprise features and requires a monthly or yearly commitment. EM SKUs are for organizational embedding, enabling access through internal collaboration tools such as SharePoint or Teams, and require a yearly commitment. Pricing depends on the selected SKU and starts at around 4,200 € per month (price as of October 2021).

Description of P and EM SKUs. Source: Microsoft documentation

Power BI Embedded is a capacity-based licensing option too. It is designed for developers who want to embed visuals into their own applications. It is shipped with an A SKU, which doesn’t require any commitment and can be billed hourly. This introduces flexibility for scaling up or down, as well as pausing or resuming your solutions. Pricing depends on the selected SKU; you can find more details in the following table.

A SKUs prices by October 2021. Source: Microsoft website

Now that you know all the licensing possibilities, you might have a clear idea of which license to buy. Or, most probably, you just have more doubts. This is a quite criticized aspect of the adoption of Power BI, especially when deciding which Premium capacity license to buy. Estimating which SKU is the most suitable for the solution you have in mind is very hard, and there is no other way than testing. Thus, now that you have a basic idea of licensing, our recommendation is always the same: start with small PoCs and keep upgrading until you find the right SKU for your report.

So, How do we start?

If you have already made the decision and Power BI is your BI companion, how do you start? Start testing! And Power BI makes it easy, because Power BI Desktop is free. You just need to download the latest version and install it on your machine. Build your first reports. There are plenty of things to learn at this stage. Go through the Power BI basic documentation. Why not try some Power BI learning paths and modules from Microsoft Learn? And learn the power of DAX!

The next natural step is to start setting up your own Power BI platform. At this stage you will probably need to buy your first Power BI Pro licenses, create workspaces and start sharing your reports. Solita can help you take these first steps. We can support you with the roll-out of your new platform and provide licensing consulting and training at different levels. Our specialists can help you design your first use cases and implement them. And after those first successes, we can offer maintenance and further support. In short, we are happy to be your companion on this trip towards building your own enterprise Power BI platform.


Data mesh – the new paradigm you might want to mess with

Transforming to a successful data-driven enterprise remains as one of the key strategic goals for modern companies. Data mesh entered the data industry as a paradigm shift to manage analytical data more efficiently. In this blog post, we explore data mesh through a case study.

What is Data Mesh?

We dove into the rapidly developing world of data engineering and dug a little deeper into a trendy topic called data mesh. Data mesh was introduced for the first time in 2019 by Zhamak Dehghani on Martin Fowler’s highly appreciated blog.

Data lakes and similar monolithic data architectures have been the most common way for organizations to store data. Over time, these have demonstrated their limitations when it comes to scalability and cost [1]. Data engineering usually follows a few years behind its bigger brother, software engineering. It may just be that the data industry follows the same path and abandons monolithic architectures, creating new decentralized data architectures to scale with.

Figure 1: Data Mesh is about recognizing and defining data domains in an organization.

 

To break the ice, we will briefly go through the basic concept of data mesh. Data mesh builds its existence on four principles, and these principles aim to explain its core standpoint: domain-oriented data ownership and architecture, data-as-a-product thinking, data platform capabilities as a self-service solution, and a federated computational governance model that keeps decision making as domain-specific as possible. These principles embrace the change towards a distributed data architecture, where domain teams can scale while developing services and data products. It is important to understand that data mesh is not a technological transformation [1]; it challenges the organizational and cultural operating models.

Tackling the issues:

Data mesh consists of the implementation of an architecture where data is intentionally distributed among several mesh nodes, in such a way that there is no chaos or data silos to block rapid scaling. Figure 2 shows the basic concept of a distributed architecture, similar to the popular microservices in the software world. However, this requires high data literacy, clear domain definitions, and a good understanding of data ownership to scale your data business even further. We took a closer look at our case organizations and their capabilities to apply the decentralized model.

Figure 2. Data Mesh as a Software Architecture. Adapted, Zhamak Dehghani, 2020.

 

Data mesh and its almost revolutionary suggestions regarding data management and monolithic solutions challenge the traditional standpoints. Professionals who have worked within data management, analytics, and data systems may find these suggestions bold. Coming from the distributed camp, hear me out first, and then decide which camp is for you – or maybe even something in between.

A socio-technical approach that interests me

Writing a thesis together with an organization is a massive opportunity.

I personally learned a lot and grew as a professional during our research journey. Thanks to my supervisors, this research was tackled with a courageous attitude and a passion for new things in the world of data.

The research itself:

Data mesh is a new thing, based on many battle-tested concepts from the development world, but it has not been studied in any formal context. We wanted to understand what kind of organizations it is a good fit for and what could be blocking its adoption. We did this by interviewing seven companies of different sizes, from different domains and backgrounds.

A total of seven interviews were conducted for the research. The case companies represent a wide range of industries that have a significant impact in Finland: wood/forest industry, telecommunications, oil refining and renewable products, energy generation, waste recycling, and construction. The variety of industries and data management solutions formed a strong basis for creating insights into the suitability of the data mesh framework.

These major Finnish companies, with a great view into their own industry and data-driven thinking, were chosen for the theme interview process. They are all challenging their industry to develop further, as well as their own ways of working with data. Each of these companies has a specific way of demanding, producing, and consuming the available data. The case organizations were asked a set of questions on topics such as domain definition, ways of working, maturity level, data ownership, and data as a product.

Now we can dive a little deeper into the interviews. Two out of seven organizations already had decentralized data architecture and they described their journey with distribution in the following fashion:

“We are currently fully decentralized. A common so-called “data handbook” is required for multiple data teams across domains. Business areas have benefitted directly from having decentralized teams. There must be an opportunity to make creative solutions”. (Case Organization 3).

“Decentralization has brought data closer to business. As a result, responsibility is given to business data experts. Our operations are more streamlined, and you don’t have to ask every single thing from a centralized data unit.” (Case Organization 5).

These answers provide some evidence that distributed data teams and architecture benefit organizations throughout their journey towards more efficient data utilization. All seven organizations agreed that a distributed architecture would increase their maturity level across the organization, and that they would be able to create more straightforward processes for data product development. In most case organizations, data ownership seemed to be a common question mark, and data mesh could be one solution for more explicit ownership across all the domains.

Distribution isn’t a new thing for enterprises optimizing their functions. Human resources and IT departments are classic examples of commonly distributed units. Data teams are overall very agile and adaptive towards new trends and features that advance ways of working.

“Data mesh is not a new model; it is now rebranded” (Case Organization 3).

While reading articles, blogs, or whitepapers about data mesh and distributed architecture, you can always see someone saying that their organization has done things this way and adopted certain methods years ago. It might be true that organizations have done this before, but data mesh includes precisely designed patterns that must match.

According to this research, organizations struggle with different data management challenges. Bottlenecks are a typical way to describe a part of the process which slows down the production line. The bottlenecks pointed out during our interviews were: slow development, the number of data sources (or lack of them), data teams, the complexity of the underlying domain systems, huge amounts of raw data, data quality, and workload (backlog challenges). These bottlenecks are a good example of organizations having a variety of different roadblocks. Data mesh aims to solve at least a few of them [2].

Graph 1 ties together the key points that challenge the adoption of data mesh. Having a clear domain definition ended up being one of the most important aspects of this research. All of the case organizations recognized the importance of a clear domain definition; an ambiguous domain understanding can block the distribution that data mesh requires.

Graph 1: Factors blocking data mesh adaptation.

 

Key takeaways

Overall, multidimensional organization models and higher complexity of data domains seem to create a better breathing ground for data mesh principles and implementation.

Even before this research, we had some insight that larger organizations with complex domains would fit data mesh better. Our results most definitely support this finding, and we can safely say that data mesh has certain organizational standards it requires to be completely efficient. However, data mesh is still looking for its litmus test to prove the full potential behind it [2].

This does not mean, however, that organizations with a few question marks still lying around are unsuitable for data mesh principles. This research gathered together a few points that should be taken into consideration before moving towards data mesh.

Hence, we can state that data mesh has a stormy future ahead, and it should not be disregarded by any organization willing to scale with data. This new data paradigm shift could just be the one for your organization.

Lastly, I would like to express my greatest appreciation to the wonderful people from Solita I had the privilege to work with during my master’s thesis journey: Vesa Hammarberg, Antti Loukiala, and Lasse Girs.

If you are interested in data mesh solutions in Solita, please feel free to contact simo.hokkanen@solita.fi.

References:

[1] Dehghani Zhamak, Data Mesh principles and Logical Architecture. Martin Fowler’s Blog, 2020 https://martinfowler.com/articles/data-mesh-principles.html as of February 15th, 2021.

[2] Hokkanen Simo, Utilization of Data Mesh Framework as a Part of Organization’s Data Management. University of Eastern Finland, Master’s Thesis, 2021.

Tableau – a pioneer of modern self-service business intelligence

Tableau can rightly be called a pioneer of modern data visualisation and self-service BI. Founded in 2003, the company launched the first version of its visual analytics product back in 2004. The basic principles of the tool, the way it’s used to analyze data and create visualisations, have remained similar ever since. Tableau still stands out from other tools especially in the flexibility of building visualisations and interactions, as well as the versatility of out-of-the-box map visualisations and geospatial capabilities. In addition to visualisations, Tableau is a fully-fledged analytic solution – to understand and act on data.

This is the second post in the blog series about BI tools. The first post was about the evolution of business intelligence in the 21st century. This time we delve into one of the leading tools in the market. We will describe what differentiates Tableau from its key competitors, what the platform consists of, what the licensing options are, and much more. We will try to be as comprehensive as possible, but not every feature can be considered or even mentioned – describing a BI tool thoroughly in a blog post is extremely challenging. Contact us if you need a more detailed evaluation or want to see Tableau in action with real-life data.

Update: Read our blog post about the new features introduced at the Tableau Conference 2021.

To help people see and understand their data

This is what Tableau mentions as their mission: to help people see and understand their data. Tableau aims to be easy to use so everybody can utilize it and derive usable insights out of their data. Tableau was originally built based on data visualisation research done at Stanford University; how to optimally support people’s natural ability to think visually and to intuitively understand certain graphical presentations.

Tableau Desktop did a very good job in the era of Enterprise BI dinosaurs to make data analytics easier and even fun (read the previous blog post for reference about dinosaurs). The success and market penetration with the Tableau Desktop meant the platform needed to be expanded. Tableau Server, Online, Public, Mobile and Prep have been released since then. Nowadays the Tableau offering is a comprehensive analytical platform with a certain twist compared to competitors.

The Tableau twist

Quickly and easily to insights
In general it is very fast to get from source data to valuable insights with Tableau. Analysing data and creating visuals and dashboards is mostly very easy and smooth. There are out-of-the-box time hierarchies available, drag and drop analytical templates to use and a good amount of easy to create calculations (running totals, moving averages, share of total, rank etc.). Ease of use also goes to data preparation and modeling. Both of those can be done without deep technical knowledge and coding skills. Perhaps what I’m most grateful for in this area is how new features are published and old ones deprecated: in a way it just works. For example when the new in-memory extract storage replaced the old technology in 2018 it was done with minimal effect and maintenance work to the users. Same thing happened in 2020 when a new semantic data model layer was introduced, and again, no laborious migrations from old to new, everything just worked.

Extraordinary creativity
Tableau was originally a tool for data visualisation and visual analytics, and for that it remains extremely strong. Tableau uniquely enables user creativity and ingenuity when analyzing data and developing content. What does this mean? In other tools you usually first select the desired outcome you are looking for (the visualisation type e.g. line, area, bar, pie, etc.) and then assign the fields to the roles the visualisation type supports (e.g. values, legend, axis, tooltip, etc.). If the visualisation doesn’t support something you would need (e.g. size or small-multiples) then there isn’t much you can do.

Tableau works very differently: you can drag and drop fields to the canvas and Tableau will visualize the data in a suitable way. Certain properties of a field can be changed on the fly: dimensions can be changed to measures, discrete fields converted to continuous, and vice versa. Almost any field can be assigned into any role, and different types of visualisations can be combined. This approach is more flexible than in any other tool I have used. However, this can seem complicated at first. Fortunately, Tableau has a Show Me menu to help you to create different visualisations and to understand how the tool works. Once you get the hang of it, you can do powerful visual analytics like never before.

A bunch of different Tableau dashboards and visualisations. All of these are available in Tableau Public.

Maps and spatial capabilities
As mentioned earlier, the different types of visualisations are very diverse and flexible in Tableau, but especially maps and spatial analytics are top notch. Here’s a short list of what makes Tableau’s spatial capabilities so great:

  • Tableau is able to read spatial data from many different data sources. Point, line and polygon geometries can be used directly from Snowflake, SQL Server, PostgreSQL and Oracle databases. Spatial data can also be ingested from different files, like GeoJSON, KML, TopoJSON, Esri Shapefile etc.
  • An unlimited number of layers can be defined on the same Tableau map. Different layers can display various kinds of data and geometries, and users can toggle layer visibility on/off.
  • Data on a single layer can be visualised in various ways: as points (symbols), lines, polygons (filled areas), heatmaps, pies, paths etc.
  • Tableau supports geocoding (transforming location related attributes to a location on a map). Attributes that can be geocoded are for example: country, state, city and postal code.
  • Tableau supports spatial joins and functions. These enable location based data joins and calculations for example to make lines between points, calculate the distance between points, recognize if lines intersect or if a point is inside a polygon etc.
  • WMS (Web Map Service) maps and Mapbox are supported as background maps.
  • There is no limit to the number of data points on the maps in Tableau. Many tools can have a limit of 3500 or 10000 points, but Tableau can visualize hundreds of thousands of points with good performance.
  • With map tools, the user can interact with the map in many ways, e.g., zoom in/out, measure distance, calculate areas, select points, toggle layer visibility, search locations, and more.
  • All of this mentioned above is available out-of-the-box, no additional components required.
Detailed city centre map with street map as a background, building layer containing dark grey polygons on the bottom and point layer on the top showing floor area (size) & heating fuel (color).

Interactions between user and visualisations
The third strength of Tableau is the ability for users to interact with visualisations and for developers to precisely define where and how these interactions take place. Interactions can be used, for example, to filter data, highlight data, show and hide layout objects, show tooltips, define values for parameters and set objects, drill up and down, and drill through to another dashboard or to an external URL. Interactions especially enable non-technical business users who consume pre-made content to get more information and insights from a single dashboard, without the need to create multiple dashboards or go into full self-service mode.

Flexibility of infrastructure and governance
Tableau is exactly the same tool regardless of how and where you choose to deploy it (on-premises, public cloud or SaaS). You can use Windows or Linux servers (or containers) and Windows or Mac computers for the desktop. You can use different authentication options, user directories and data sources without any mandatory dependency on a particular cloud vendor.

The same flexibility applies when creating content. Data models can be created in exactly the same way, with the same functionality, whether in extract or live mode, and you can also combine extract and live mode content on the same dashboard. The same scripting language is used when preparing the data and building the visualisations, and it is quite a powerful, yet easy and straightforward language to use. The flexibility carries on when publishing the content to Server/Online: you can structure the content into folders exactly as you like and apply security policies at the level of detail you need.

Active and passionate user community
The Tableau user community is more active and passionate compared to other corporate tool user communities. For example, Tableau Public has more than 3.7 million published visualisations from more than 1.5 million users. Anyone can browse and use these visualisations to learn about the data and how to use Tableau. The community supports and helps with issues and problems related to the tool, but I personally appreciate the work they do to spread data understanding and share best visualisation practices and examples.

Main functionalities & workflow

Tableau contains everything that a modern analytics platform can be expected to contain. There are no major deficiencies, but obviously there are some areas for improvement, especially related to the newer features. Tableau can be used to master the whole visual analytics pipeline, from data preparation to various ways of consumption across multiple channels. This is how the Tableau workflow usually goes.

Tableau platform core functionalities, components and related user roles.

Data Preparation
If you need data preparation capabilities, Tableau offers them within Tableau Prep. This tool can be used as a desktop client or directly within Tableau Server or Online. Tableau Prep is built around the same easy-to-use mentality as the other components in the platform. Creating data manipulation steps and the whole workflow is very visual, the process is easy to understand and it’s easy to see what’s happening to the data along the way. Tableau Prep offers standard data wrangling capabilities to join, union, pivot, clean and aggregate data. You can also add new rows to the data and use custom R or Python scripts to calculate new insights. The resulting dataset can be pushed to a file, to a database or to a Tableau data extract. Existing data preparation workflows can be shared and reused, and their scheduling and execution can be monitored via the Prep Conductor add-on.
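To give an idea of the script support, below is a minimal sketch of the kind of Python script a Tableau Prep Script step (running against TabPy) could call: a function that receives the step’s input table as a pandas DataFrame and returns the transformed table. The column names are hypothetical and only for illustration.

import pandas as pd

def add_row_total(df: pd.DataFrame) -> pd.DataFrame:
    # Tableau Prep passes the step's input table in as a pandas DataFrame
    # and uses the returned DataFrame as the step's output.
    df['row_total'] = df['quantity'] * df['unit_price']
    return df

# Quick local check with made-up data; in Prep you would only point the
# Script step at this file and at a running TabPy instance.
if __name__ == "__main__":
    sample = pd.DataFrame({'quantity': [2, 5], 'unit_price': [9.9, 4.0]})
    print(add_row_total(sample))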

Data modeling
Most commonly, data modeling is done using the Tableau Desktop client. The exceptions are when you use Tableau Prep or an external tool with the Tableau API to create and refresh the data extracts. With Tableau Desktop you connect to the data sources, select the objects you want and define joins and relationships between the objects. Nowadays Tableau data models include two layers: a physical layer and a logical (semantic) layer. The separation of the two makes it possible to reuse the same Tableau data model for different purposes. The logical layer functionality was published with version 2020.2 and is a crucial update to the data model.

While modeling the data you select whether to use a live connection or to extract data to Tableau’s columnar in-memory data storage. Whatever you choose, you have the exact same functionalities and capabilities in use, and you can also change the connection type later on. One possibility is to use incremental refresh so only new rows are inserted into the data extract. The best practice is to verify and define each field’s data type, default formatting & aggregation, geographical role etc. directly when modeling the data, even though these can be altered later on while doing visual analytics. Row-level security filters can also be added to the data model to define different data visibility for different groups. While building the data model you usually create the first visualizations in parallel, to better understand the data and to make sure it is what you are expecting. When the data model is ready you can publish it to Tableau Server/Online to enable reusability.

Visual analytics
Then we get to the fun part, doing visual analytics. This and the following steps can be done either with Tableau Desktop or via Tableau Server/Online using the browser. There are so many ways to do this. You can drag and drop the fields to the canvas and let Tableau pick the proper visualisation type. Or drag and drop the fields to the exact roles and define the exact settings, filters and parameters you want.

When you get insights from the data and new questions arise you just modify the visualisation to also get the new questions answered. Perhaps create quick table calculations or various types of other calculations to get new insights. Sometimes it’s a good idea to try the Show Me menu to get some new perspectives. Or use the Ask Data functionality to write the questions you have and let Tableau build the vizzes. As previously mentioned, this is where Tableau truly shines. When you have individual visualisations ready you can start building a dashboard.

Dashboards
If you want you can create the dashboard very quickly: just drag and drop the visualisations to the canvas, enable visual filtering, show filter selections, legends and some descriptive headers, and you are ready. On the other hand, you can also plan and finetune the layout and interactions to great detail. Create objects with conditional visibility controlled via show/hide buttons or selections in other visualisations, add multiple tabs and drill-throughs to other contents etc.

Nowadays you can even have fully customizable objects via Tableau Extensions, for example new types of visualisations, predictive analytics, interactions, write-back, etc. If the dashboard will be consumed via different devices you can define distinct layouts and contents tailored to for example tablets and phones. In addition to dashboards, users can also create stories with multiple steps/slides containing different visualisations and comments, a bit like PowerPoint presentations with interactive visuals.

Example screenshot of Tableau Dashboard Extensions offering.

Metrics (KPI’s)
You can create many kinds of KPIs and metrics within a dashboard, but there is also a distinct Metrics feature in Tableau. Metrics objects can be created in Tableau Server/Online folders so the most important figures are visible already while navigating the contents. Metrics are an easy way to gather key figures from different dashboards into a single place. And if there’s a date field available in the data, the metric can also contain a small trend graph.

Other ways to consume contents
There are still many ways to consume content in Tableau that I haven’t yet written about. Dashboard users can subscribe to the content, set alerts to get notifications when thresholds are exceeded, save filter & parameter combinations as bookmarks, export data, comment on and discuss the dashboards etc. In addition to Tableau Server/Online, content can be consumed with mobile apps (with an offline possibility), integrated with Slack or embedded into external services.

With the Ask Data functionality, Tableau data models can be queried using written questions. Someone might ask for “top 20 customers in Europe by sales in 2021”, and Tableau would show the answer as a graph. A few years ago I was very sceptical about this kind of feature, thinking it wouldn’t work. But after using it a couple of times during this year I think it is actually quite neat, although I still have my doubts about more complex use cases. Another nice automated-insights feature is Explain Data, which can show fairly basic information about the selected data point from a statistical perspective.

Administration and Governance
One crucial part of the workflow is governance and monitoring. Most of the governance definitions are created before the development work even starts. The administrator sets up authentication and creates appropriate user groups, either manually or from the user directory. Administrators can mandate domain owners to control their own contents while still retaining visibility into the contents across the platform. Administrators have a variety of tools to monitor and govern the environment, down to a very detailed level if needed.

There are also a few add-on components available to enhance the use of Tableau Server/Online. The Tableau Data Management add-on contains Tableau Prep Conductor, to orchestrate and monitor Tableau Prep workflows, and Tableau Catalog, to view more details about the contents, data lineage and impact analysis. The Tableau Server Management add-on gives more power for managing the Tableau Server environment: enhancing performance, scalability, content migration, resource usage etc.

Several APIs are also available to control and use Tableau programmatically. These include ways to manage Tableau Server environments via code, connect to data, create and use Tableau data sources, use external analytical capabilities like R and Python, create and use dashboard extensions, and embed Tableau content into external services and mobile apps.
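As a concrete example, below is a minimal sketch of listing the workbooks on a site with the tableauserverclient Python library, which wraps the Tableau REST API. The server URL, token name and site name are placeholders you would replace with your own values.

import tableauserverclient as TSC

# Authenticate with a personal access token (placeholder values, not real credentials).
auth = TSC.PersonalAccessTokenAuth("my-token-name", "my-token-secret", site_id="my-site")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    # Fetch the first page of workbooks on the site and print their names and projects.
    workbooks, pagination = server.workbooks.get()
    for wb in workbooks:
        print(wb.name, "-", wb.project_name)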

Room for improvements

Even though Tableau data models nowadays contain a semantic layer and are far more versatile than before, there is still something to improve. Better multi-fact support, the possibility for secondary relationships and refined incremental refresh would be nice, but of course those might sometimes complicate the models quite a lot. The good thing about the current state is that models are still easy to understand and use. A bigger data-model-related improvement would be the ability to reuse existing data models when creating new ones, a bit like what you can already do with the data flows in Tableau Prep. This would really improve the ability to do end-to-end bimodal BI on the data model layer: the most important data models could be built centrally, and decentralised content development could then add its own data to its own models without duplicating the model and the data of the centralised model.

Some augmented analytics and autoML features have been released during this year, but those still feel very basic and a bit disconnected from the core platform. This capability relies partly on Salesforce Einstein Analytics and is not (at least yet) fully built into the Tableau platform. The current Explain Data feature is able to show basic details about the selected data point, but I would like it to emphasize the most interesting data points and related insights (anomalies, trends etc.) automatically.

The history of being originally a desktop tool is still quite visible. Contents are somewhat workbook- and visualisation-oriented. This is not necessarily bad, because it can help to structure the contents in a logical way, but there are a few things to improve. I would really love to be able to more easily create dashboard navigation and drill-throughs between contents in distinct workbooks. Within the same workbook it’s very easy, but across different workbooks it gets a bit clunky.

Desktop tooling can create pressure for IT or whoever needs to maintain, deliver and update the client software on a regular basis. Keeping up with major updates (4 times a year) and possible minor updates can be a hassle. Tableau is moving towards a browser-based approach, but for now some of the functionalities are still only available via the Tableau Desktop client installed on users’ laptops.

Building visualisations and doing visual analytics is a somewhat manual process in Tableau. After all, it wouldn’t be visual analytics if the outcome just appeared, without the journey of seeing different viewpoints and learning the insights along the way. The Ask Data and Explain Data features are one way of creating visualisations faster and in a more automated manner, but I would also like to see more code-driven options to build and manage the contents. This would make it possible to use the visual power of Tableau in a more DataOps-oriented way: to build visualisations and dashboards on the fly already in the data pipelines and to deploy the contents automatically to different environments.

Then I have to mention the pricing, even though the importance of the licence price is commonly exaggerated over the other components affecting the total cost of ownership (TCO). What I do like about Tableau pricing is the fact that there are no hidden costs to be discovered later. With the default price you get the capabilities, and there rarely is a need to buy something more expensive later on. You just buy more licenses if you want to increase the number of users. And here lies my criticism. Normally you use per-user licensing when the number of users is rather small (something like 10-300 users). With Tableau Server you can switch to core-based licensing when the number of users gets bigger or you want to enable guest access etc. But when using Tableau Online there is no possibility to select core- or node-based licensing; you just have to stick with the user-based license model. Of course Tableau might offer you some discounts if you have a lot of users within Tableau Online, but that’s just something I really don’t know nor can promise.

Greetings from Gartner and Forrester

Gartner has placed Tableau as a leader for 9 consecutive years in the Magic Quadrant for Analytics and Business Intelligence. In the latest report, Gartner recognizes the analytics user experience, the very strong community and customers’ fan-like attitude towards the product as core strengths of Tableau. Gartner also mentions the potential within the Salesforce product family to integrate Tableau more tightly into different solutions and to easily embed Tableau visualisations with the Tableau Viz Lightning web component. As a caution, Gartner mentions Tableau’s non-cloud-native history and install base as well as premium pricing and possible integration challenges with Salesforce products.

Tableau 2021 position and path in the Gartner MQ for Analytics and Business intelligence. Check out the visualisation in Tableau Public.

In the Critical Capabilities for Analytics and Business Intelligence Platform 2021 report Gartner focuses more on the actual capabilities and functionalities. In the report Gartner rates Tableau as excellent in data preparation, which is simple and visual to use and easy to publish, schedule and monitor. Also more complex tasks can be executed via R & Python scripts. Gartner also praises the Tableau governance capabilities to promote and certify contents as well as control the workflows and view data lineage to better understand data assets. Gartner says Tableau is the clear leader in the area of data visualisation, but there are things to improve in the augmented analytics area, partly because of the lack of integration with Einstein Analytics. This however has improved since the publication of the Gartner report with the Einstein Discovery Extension and other functionalities.

The Forrester Wave for Augmented BI Platforms Q3 2021 names Tableau (Salesforce in the report) as a leader. Forrester recognizes visual and geospatial analytics as core strengths. The Forrester report, being published later than the two Gartner reports, rates Tableau much better in augmented analytics. Forrester mentions the Einstein Discovery functionality and out-of-the-box ML models that significantly boost Tableau capabilities beyond descriptive and diagnostic analytics towards guided ML. Forrester sees room for improvement among business application connectors.

Infrastructure options

Tableau offers a wide variety of deployment options and doesn’t favor any cloud or infrastructure provider. Tableau Desktop is available for both Mac and Windows. It is used to connect to data in databases, services or files and to visualise that data in charts and dashboards. Tableau also offers a web authoring mode where no software installation is required.

In order to share visualisations with a wider audience, Tableau Server is used. Tableau Server is available as a server application and as a cloud service (Tableau Online). If you want to host your own server, you can run it on premises, in a private cloud, or house it in a public cloud such as AWS, Azure or GCP. Tableau Server can be installed on Windows or Linux operating systems, and on Linux it can also run inside a Docker container.

In its basic form Tableau Server can be installed on a single node. For more complex solutions, the installation can be scaled out for specific scenarios such as high availability or high performance. Using your own server allows for total control over settings and customisations of the server, but then of course you have the extra effort to maintain and monitor the environment and take care of the infrastructure costs.

Tableau Online is the software-as-a-service offering for those not hosting their own servers. The Online service is divided into pods located all over the world, and customers can select which pod should house their Tableau site. Tableau Online obviously doesn’t provide as much control over the environment, but in return it’s much more straightforward to use and deploy. The portal in Tableau Server or Online can be accessed using all major browsers. There are also mobile viewer apps for iOS and Android.

Licensing and publicly available pricing

The default way of licensing Tableau is a per-user subscription model. Additionally, there is a core-based licensing option available for Tableau Server (but not for Tableau Online) and the possibility to license a specific embedding use case at a discounted price. Tableau licenses can be purchased from Tableau partners; Solita can help you find the optimal license combination, get the licenses, and anything else you might need.

Tableau licensing is divided into the usage roles Creator, Explorer and Viewer. Capabilities depend on the role, and Creators are the most capable of the lot. They can connect to data sources, prepare and model data, create visualisations and publish both visualisations and data models to Server/Online. Explorers can do visual analytics and use existing data models and reports to build and extend visualisations and dashboards. Viewers can browse and interact with content. All roles can set up favourites, subscriptions and alerts to personalise their experience in the service.

All three roles are available for both Tableau Server and Tableau Online. Subscriptions are priced in USD per user per month, and license fees are billed yearly. You can use the license price calculator in Tableau Public to calculate the total price for certain role combinations (notice: the calculator contains only publicly available pricing information): Data Viz tool license pricing. A small cost sketch also follows the price list below.

  • Tableau Online (Oct/2021, per user per month)
    • Creator: $70
    • Explorer: $42
    • Viewer: $15
  • Tableau Server (Oct/2021, per user per month)
    • Creator: $70
    • Explorer: $35
    • Viewer: $12
  • Add-on modules (custom pricing from Tableau)
    • Data Management
    • Server Management
    • Einstein Discovery
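
As a rough illustration of how the per-user subscription model adds up, here is a minimal sketch that turns the public Tableau Online list prices above into a yearly cost estimate; the role mix in the example is made up.

# Public Tableau Online list prices (Oct/2021), USD per user per month.
ONLINE_PRICES = {"Creator": 70, "Explorer": 42, "Viewer": 15}

def yearly_cost(user_counts, prices=ONLINE_PRICES):
    # Subscriptions are priced per user per month and billed yearly.
    monthly = sum(prices[role] * count for role, count in user_counts.items())
    return monthly * 12

# Example role mix: 5 Creators, 20 Explorers, 100 Viewers.
print(yearly_cost({"Creator": 5, "Explorer": 20, "Viewer": 100}))  # 32280 USD per year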

Server licenses are also offered as the Tableau Embedded Analytics license type, with a 25% reduction on license prices, for organisations that want to offer Tableau content as an analysis service to external parties.

For students and academic institutions there is a possibility to get a free 1-year license and access to eLearning contents.

There’s also a free version called Tableau Public. Tableau Public offers Tableau’s visual analytics power, but results can be saved and shared only via the Tableau Public service. It is used by visualisation enthusiasts all over the world and is an excellent source for finding creative ways to use Tableau. But be sure not to publish any non-public data to the Tableau Public service, since the contents can be found via URL even when the content is not searchable or listed within your profile.

Sometimes you might also hear about a tool called Tableau CRM. Tableau CRM is actually a rebranded Salesforce Einstein Analytics. It is not originally part of the Tableau platform, but Salesforce has plans to tighten the integration between the two in the future.

How to test and start with Tableau

  • Tableau Desktop trial: 14-days trial to try the capabilities in the Tableau Desktop.
    • Download and install the product from the Tableau site
    • Fill in your email when launching the tool for the first time
  • Tableau Online trial: Test the Tableau Online capabilities to share and analyse information.
    • Request the Tableau Online trial in the Tableau Online site
    • Activate the trial account with the link in your email
  • Tableau Public: To analyse and visualise primarily open and public data for free.
    • Create an account and download the app from Tableau Public site
    • Notice that you can also create visualisations directly in the Tableau Public service using the browser
  • Other relevant contents
  • Solita Tableau and visual analytics related offering
    • Tool evaluations and recommendations
    • License consulting and sales
    • Extensive training options
    • Analytics solution kickstart
    • Solution implementation and rollouts
    • Maintenance and support

Until next time

Thanks for reading and scrolling down here. In the next post of the series we will take a look at what Microsoft and Power BI have to offer. If you have questions or any kind of consulting needs about Tableau, you can contact us:

Tero Honko, Senior Data Consultant, Finland
tero.honko@solita.fi
Phone +358 40 5878359

Aron Saläng, Visual Analytics Tech Lead, Sweden
aron.salang@solita.se
Phone +46 70 144 67 87

Business intelligence in the 21st century

It's been interesting to follow and live through the evolution of business intelligence and data visualisation tools over the last 20 years. Leading vendors have changed, a lot of acquisitions have taken place, cloud became the de facto choice, the big data hype came and went, self-service became possible, and the data culture & processes are evolving – little by little.

We are starting a blog series to go through the BI and data visualisation market. We will uncover each leading vendor in detail, take a look at the key challengers and anticipate where the market is going in the future. In this first post, we are going to delve into the world of business intelligence tools in the 21st century, and review the market and product changes over time.

Occasionally, this blog series tackles our personal experiences and views in relation to tools. Still, the actual assessments have been made objectively and technology agnostically – just like tool assessments are supposed to. If you wish to go through the interactive visualisation based on the content of “Gartner Magic Quadrant for Analytics & BI”, from where the attached figures have been taken, you can do so at Tableau Public: Gartner MQ for Analytics & BI visualisation

Current kings of the hill

For a long time now, the leaders in the data visualisation tool market have been Tableau, Microsoft, and Qlik. These vendors entered Gartner’s Magic Quadrant Leader section in 2008 (Microsoft), 2011 (Qlik), and 2013 (Tableau). And they have held their position ever since. Tableau and Qlik have remained quite stable within a small area, whereas Microsoft has bounced around the quadrant (possibly due to their transfer from the old SSRS/SSAS stack to Power BI).

Visualization about the Gartner MQ for Analytics and BI and the history paths of current market leaders.
“Gartner Magic Quadrant for Analytics & BI” 2021 and the paths of current market leaders.

 

These tools have gained a stable market position, and each of them has their own strengths and users. Various rivals are regularly knocking on the door in the hope of attending the party, but, for the moment, they have always come away disappointed and been forced to gain new momentum in other quadrants. Before going into more detail about these kings of the hill, let’s review how the current situation has come about in terms of vendors and tool evolution.

Acquisitions and Bitcoins

Previous kings of the hill, i.e., vendors in the leaders quadrant, were IBM/Cognos, SAP/BusinessObjects, Oracle/Hyperion, SAS and MicroStrategy. During the first decade of the 21st century, especially in 2007, the BI reporting market was consolidating fast. The IT giants of that time acquired the long-term market leaders: Oracle announced its acquisition of Hyperion in March 2007; SAP announced its acquisition of BusinessObjects in October 2007; and IBM announced its acquisition of Cognos in November 2007. The acquired market leaders had themselves previously been purchasing industry rivals and minor companies (such as Crystal Decisions, Applix and Acta Technologies).

Based on Gartner’s Magic Quadrant, the leaders were still going strong about four years after these acquisitions. But then they started to slip down the slippery slope. Well, to be precise, SAP/BusinessObjects started its decline a bit earlier. Maybe the strong identification with the SAP family did not promote success. I cannot say whether the decline of the leaders was more due to the uncertainty caused by these business acquisitions: difficulty to integrate the organisations and the products, or due to the fact that renewal is always hard for market leaders. Development stalls because companies don’t want to cannibalise their own market, and when customers abandon the ship and start rooting for more innovative rivals, companies complicate their licensing model and push up the prices. And this really gets the rest of the customers going!

Visualization of the downhill of prior market leaders in the Gartner MQ for Analytics and BI.
Prior market leaders positions in Gartner MQ over the years, based on Gartner Magic Quadrant for Analytics & BI data from 2006–2021.

 

MicroStrategy and SAS didn’t immerse themselves as much in business acquisitions, but they still shared the same fate as their rivals that ruled the market at the turn of the 2010s. Their offerings stalled, at least in the area of data visualisation, and MicroStrategy is probably more famous today for its Bitcoins than for its product offering.

OLAP-cubes

Let’s forget the vendors for a moment and start looking at product evolution. The first BI tools emerged at the end of the 1980s, but they started to flourish in the 1990s. Data warehouses were rare in those days, and most BI tools included features that allowed users to obtain data directly from operative systems and download it into the tool’s own data model. One popular data storage format was OLAP cubes, which were easy to use and view from different perspectives by filtering down to the most interesting slice of information.

The most popular presentations were crosstabs and various pixel perfect listings, so the content was still not that visual. The users were mostly from finance departments, so for the end users, this numeric presentation was surely just the perfect one. Some example products from the 1990s worth mentioning include Cognos PowerPlay Transformer, Crystal Reports, and Oracle Discoverer. Qlikview also has its roots in the ‘90s, but let’s not go there yet.

OLAP-cube and report-centred solutions built directly on top of operative systems were often quite fragmented. Different departments could have made their own solutions in which each separate cube or report might have had its own data models and data refresh tasks straining the source database. This made the solution complex to maintain and caused unnecessary load to data sources. Partially due to these reasons, data warehouses increased in popularity and there was a demand for more centralised reporting solutions.

From a novelty to a dinosaur in 10 years

In the early 21st century, comprehensive Enterprise BI systems started to emerge in the market. They enabled the creation of extensive solutions covering various departments and functions. The development work often required very specific competence, and it was mostly concentrated in a BI competence centre under the IT or finance department. In the competence centre, or as subcontractors, BI developers tried their best to understand the needs of the end users, created metamodels, built OLAP cubes, and produced reports. More graphs and KPI indicators started to appear in the solutions. Some even created dashboards containing the most essential data. In those times, graphic elements included speed gauge charts, 3D effects, gradient colors, pie charts, and other “fantastic” visual presentations. It’s not really surprising that users often wanted numeric data and these early graphs were not a hit.

New functionalities were added to these Enterprise BI tools as vendors acquired other companies and their products were integrated into existing systems. Existing components or functionalities were rarely discontinued and these newly integrated functionalities often seemed to be flimsy stick-and-bubble-gum contraptions. Over the years, Enterprise BI solutions became so fragmented and complicated that even experienced specialists struggled to make out what each component or “studio” was for (or maybe it was just me who didn’t always understand this).

Visual self-service

The clumsiness and difficulty of centralised BI organisations and Enterprise tools accelerated the rise of agile and easy-to-use self-service BI and data visualisation. At the turn of the 2010s, Tableau – established almost ten years earlier – started to gain a reputation as a new kind of visual analytics tool that could be used for data analysis even by people without much technical knowledge. Tableau wasn’t marketed to IT departments but directly to business operations. It didn’t try to replace existing Enterprise BI tools in companies but positioned itself alongside them, directly in the business units, which now had the chance to create their own reporting content either without a data warehouse or partially on top of one.

Gradually, other similar tools started to appear on the market: Microsoft Power BI, Qlik Sense, SAP Lumira, Oracle Data Visualisation Desktop etc. Also enterprise BI vendors started to include more features directed at business users in their solutions. In an evaluation of self-service BI tools I did a few years ago, already 13 different tools were included, so there were plenty of tools available at the time. However, when the tools were examined in detail, it was clear that some of them had resorted to shortcuts or had taken the easy way out. Most of these tools haven’t become hugely popular, and some might even be discontinued by now.

Dashboards from a self-service data visualization tool evaluation.
A glimpse to the Self-service BI tools evaluation a few years back.

New rivals

In the early 2010s, brand new start-ups were aiming to enter the data visualisation market with slightly different approaches. The big data hype brought along a bunch of Hadoop-based platforms, such as Platfora, Datameer and Zoomdata. Another trend was SaaS (Software as a Service) type reporting and visualisation services offered only in the cloud. These services included Clearstory Data, GoodData, Chartio, Domo, and Bime. The third trend was AI- and search-based solutions in which the user could analyse and retrieve data in a very automated manner, a bit like using a Google search. Some examples include Beyondcore and ThoughtSpot. Some new tools were very heavily relying on the performance of cloud databases, and they didn’t offer the possibility to extract and store data within the tool. A lighter version of this approach is Periscope Data, while a more versatile version is Looker.

Guess what has happened to most of these new rivals? Around 70% of the tools mentioned above have already been acquired by another company. So again, consolidation lives strong in the market. The biggest business acquisitions in the industry in recent years have been Salesforce’s acquisition of Tableau ($15.7B) and Google’s acquisition of Looker ($2.6B). Both of these acquisitions were announced in June 2019.

A union between decentralised and centralised

Perhaps the biggest problem of self-service tools has been the limited possibilities to control and monitor the environment and the published content in a centralised manner. On several occasions, I’ve seen how a self-service environment has been filled with hundreds of data sets and thousands of reports and no one has had a clear visibility of which content is relevant and which is not. As governance is not enforced in the tools, they have to be created and implemented separately for each organisation. Luckily, the self-service BI tools of today are already offering better features to centrally control and monitor the environment and contents.

Another important aspect to consider when self-service tools and centrally controlled solutions are approaching each other is bimodal BI. This means that both centrally controlled content (often predefined and stable) and more agile self-service content (often more exploratory) can be flexibly developed and utilised in parallel. Current BI tools mostly support both of these modes but there are still gaps in how different types of contents can be infused together. A bigger challenge, however, is how to change the data culture, processes and governance practicalities to make the bimodal way of working easier and more flexible.

The death of data warehouses and dashboards 

In the past ten years, it has been repeatedly predicted that data warehouses are dying. A ton of Qlikview solutions based on strong internal data storage have been implemented without the use of a data warehouse, and this might be well justified on a smaller scale. Virtualisation, Hadoop, data lakes and the like have each taken their turn at killing the data warehouse, but it is still going strong. This is more marketing hype than reality. It is true that building data warehouses has changed irrevocably. The ETL tools leading the market 10 to 15 years ago, as well as the manual and slow way of building data warehouses, have died. There have never been as many ways to implement and use a data warehouse as today. So data warehouses are alive and kicking. But don’t get me wrong – they are not and never will be the solution for everything.

Some people are predicting a similar fate for dashboards. The most provocative example might be the ad by ThoughtSpot which proclaims: “Dashboards are dead”. Machine learning and AI-based visualisation and data search solutions predict hard times for dashboards and traditional BI. Data science platforms have been implying the same. Most of this is purely a marketing gimmick. Of course the tools themselves and our ways of using them are constantly changing and developing. One direction for development is certainly machine learning and NLP (Natural Language Processing), and the convergence of different kinds of tools.

It will be interesting to see how the current market leaders will act when new functionalities are developed and diversified into the tools. Will companies discontinue existing functionalities or parts of the tools when replacements are launched? Or will existing tools again turn into dinosaurs left to be trampled on by new rivals? Or will the giant vendors integrate their other offerings too tightly with their BI tools, so that they won’t be viable options in environments already using a competitor’s tech stack?

Thanks and stay tuned

In the following posts of this series, each of the key market-leading tools is covered one by one. A bit later we’ll also review some smaller rivals in detail. Leave us a comment or send an email if you want to read about a certain tool or aspect. We’ll also examine later where the Business Intelligence & data analysis tool market is going and what we can expect in the future. A preliminary schedule for the blog series is as follows:

If you are interested in data visualisation solutions or tools, please feel free to contact tero.honko@solita.fi. And finally a big thank you for reading the post!

Real-time BI with Power BI and Excel

The new composite models capability is not just an ordinary monthly Power BI update. It is the beginning of new ways to do self-service reporting. In this blog post we explore a real-time BI solution using Excel as a dancing partner for Power BI.

Why Still Talking about Excel?

Most Power BI users probably know how to get data in from Excel. This is usually how everyone starts using Power BI, and it is possibly the most used connection for building self-service reports. However, you might not be as familiar with the reverse process: getting data into Excel from a Power BI dataset. This sounds like a trip back to the BI of the 90’s. Why would I dare to write about it?

Excel is perhaps the most well-known self-service analytical tool. Its success rests on the simplicity of getting value out of data, even for non-technical folks. After the release of Power BI, some of us thought it had come to replace the king of analytical tools. I might have to accept I was wrong. Excel can still do something that Power BI can’t: act on data.

Surprisingly, this is a very common request from Power BI users. They often ask to change a forecasted value in a report to see its impact on the results. There are some newer Microsoft solutions for this type of request, such as Power Apps. But these tools are still not that well known, and their implementation requires developers to acquire specific training. Hence, I believe that these two, Power BI and Excel, are still going to be dancing together for some time.

A New Era after Composite Models

Not only are they good dancers, now the music sounds fantastic too. Good tunes have been playing since December 2020, when Microsoft announced Power BI composite models. This seems to be a great achievement in the BI world. Honestly, I am just a beginner, so I did not see this coming. But if Alberto Ferrari says it publicly, then we must believe that this is the beginning of a new BI era.


“We got used to monthly updates with Power BI, but not all the months are the same. Guys, the December 2020 version of Power BI is an historical milestone in the development of Business Intelligence. Historical. Milestone. I am not saying this lightly; I am old enough to have seen many things happen in the Business Intelligence world. Some were nice, some were cool… this is neither nice nor cool: this is huge: finally, we can seal the marriage between self-service and corporate BI” – Alberto Ferrari


With composite models, Power BI developers can connect datasets located in the cloud with new datasets saved locally on their computer. Datasets define the analytical power of our reports. Now, with composite models, developers expand the limits of their data models, and consequently their analytical power too. As Alberto said, this is a great opportunity to make self-service BI more self-service and to start doing real-time analysis. Indeed, we, as modelers, are now the obstacle to this transformation happening.

Hints on Analysing Power BI Datasets in Excel

Accordingly, I believe that a brief refresher on how to bring data from Power BI to Excel would be beneficial.

  • Copy table. As easy as it reads. The user copies data from Power BI Desktop to Excel with a right click on the desired table. This method might be useful for a quick analysis and only if the user has access to the .pbix file.
  • Export data. This is a fast way to get data from a specific visual in Power BI. You might export data to Excel when performing your own analysis on the numerical values behind a visual. These are usually one-off analyses: the data is not connected to the Power BI dataset and any update requires manual work. For a detailed description of the feature, visit the link https://docs.microsoft.com/en-us/power-bi/visuals/power-bi-visualization-export-data
  • Analyse in Excel. This option creates a pivot table connected to the Power BI dataset. Thanks to the live connection, Excel has access to the full Power BI dataset, without row limitations, secured by Microsoft account credentials and row-level security. For the same result (only available with some specific Office SKUs), Excel users can use the Get Data feature to connect to their available Power BI datasets. For more specific info, check the Microsoft documentation at https://docs.microsoft.com/en-gb/power-bi/collaborate-share/service-analyze-in-excel
  • Power BI featured tables. You can create a connection to enterprise data so that you can enrich your Excel tables. This feature is found under the name Data Types on the Data tab. Don’t forget to set “Is featured table” to Yes in Power BI Desktop, then publish the dataset to the Power BI web service and you are ready. Full documentation about this exciting feature can be found in the following links: https://docs.microsoft.com/en-us/power-bi/collaborate-share/service-excel-featured-tables and https://docs.microsoft.com/en-us/power-bi/collaborate-share/service-create-excel-featured-tables.

A Game Changer: Excel Data Types

All these possibilities might be considered in your future use case. However, among all of them, I find the last option the most relevant when seeking real-time BI. Featured tables and Data Types allow developers to combine manually input data and Power BI data in the same Excel table. Together with composite models, companies can enrich existing enterprise data models. Let me show you how with a current customer use case.

Use Case: Leveraging CMDB in M&A Projects

The Business Case

Company A is a large, international enterprise and, as such, it is involved in several mergers and acquisitions (M&A) cases at a time. It seeks to leverage its existing configuration management database (CMDB) in its M&A projects. It aims to build a resilient virtual data room (VDR) and vendor due diligence (VDD) process. So, the company needs up-to-date reporting and connections to multiple sources.

The lifecycle of the reports is long enough to fulfil the needs of the M&A project, from several months to a few years. During this time, the project scope and IT entities (i.e. applications and workstations) change continuously. And these changes are not shown in the spreadsheets that project managers and analysts work with. Currently, these Excel files are manually updated every now and then. In addition to CMDB data, the Power BI reports include the manually input data from these Excel files. With the existing capabilities, data changes pass unnoticed, analyses are never 100% certain, and manual work slows down processes.

Company A wants to increase their capacity to do analysis on actual data while speeding up the process. This way, the company aims not only to report about individual projects, but to unify the analysis and get overall conclusions from all ongoing M&A projects.

Solution architecture

Step 1: Golden Dataset

The first step has been to build a golden dataset with all available data from an on-premises database. Generally, direct access to the on-premises data has required specific IT knowledge and skill, only available in the IT department. With golden datasets, Company A lowers the barriers for business departments to have access to relevant and secured enterprise data. To build a working architecture, we have followed Matt Allington’s fabulous post  https://exceleratorbi.com.au/new-power-bi-reports-golden-dataset/

Step 2: Export to Excel

The second step is to provide project managers with tools to set up the project scope. Within the golden dataset workspace, project managers now have reports to support project scoping. Project managers don’t have rights to modify the on-premises data, so they always need to communicate their changes to the IT department for database updates. They use Excel to export a list of the IT elements in scope. For this, they use the Export to Excel feature available through the visuals in the reports.

Step 3: Setting the Workspace

The next step is setting up a new workspace for the new project. This way we restrict access to project information to the project contributors only. Only they have access to this specific workspace, which uses Teams as a collaboration environment. In this workspace, they can save their analysis tools, such as Excel workbooks with their standardized tables. Additionally, they can find ready-made reports connected to the golden dataset.

Step 4: Power BI Reports

The last step is to build the Power BI reports. The reports combine data from the golden dataset and manually input data from Excel files. This is only possible due to the composite models capability. The developer uses Get Data to connect to the golden dataset (Power BI dataset), and in the same way to connect to the Excel files shared in Teams (SharePoint folder). Power BI does the rest to establish a live and secured connection. Now the reports are ready, but not yet automatically up to date.

Bonus Step: Featured Table and Excel Data Type

For an optimally automated solution, we need to make use of Power BI featured tables. The team needs up-to-date data from the golden dataset, and they want to perform their analysis without having to open many windows. Consequently, they want to have the actual data available in their standardized Excel tables. This is where the new Data Types feature of Excel comes into use. They just need to include the row ID from the featured table; the rest of the data then automatically appears in the dedicated columns within the Excel table.

Now the always-up-to-date reports are ready. The project contributors can conduct their analysis, modify the values in Excel and see the real-time impact in the Power BI reports.

Main Take Away

As Alberto Ferrari has mentioned, composite models enable the future of real-time analysis in BI. Additionally, connecting Excel tables to golden datasets brings companies enormous flexibility for building future self-service BI reports. Although not strictly necessary, the new Power BI featured tables capability was the missing piece for an automated end-to-end process. This is key to increasing the speed and, more importantly, the integrity of the data.

This real case includes many new features that are still in preview, so we must still be careful about their impact. But do not hesitate: try it and let’s keep learning.

And why not learn together? Have you tried to build something similar? Did you find a better solution? What worked for you? Is there a step you wish to know more about? Please feel free to contact us.

Part 2: Untapped potential – analyzing Finnish esports broadcasting channels using Twitch API data

There are more than 117000 unique Twitch users following at least one of four Finnish esports channels: ElisaViihdeSport, Pelaajatcom, TES_csgo, or yleeurheilu. In this series of blog posts, I'm studying these channels and their users using data gained from Twitch API.

In the first part of this blog series, we discovered how many followers four Finnish esports channels share on Twitch. We also learned about Twitch as a platform and esports as a growing industry. If you haven’t read it, I recommend you to at least check the quick summary at the end of it, before reading this one.

In this part, we use Twitch API data to find out which Twitch channels “the average Finnish esports viewers” are following. Also, which esports channels were followed first before others were found and followed. Finally, we’ll see some top days for gaining new followers for Finnish esports channels.


Finnish esports channels that were studied in this series. Follower data fetched 2.8.2021.

Channel Owner Created Followers
ElisaViihdeSport Elisa Oyj 5.4.2018 69456
Pelaajatcom Pelaajatcom esports Oy 17.9.2018 67973
TES_csgo Telia Finland Oyj 29.3.2019 32879
yleeurheilu Yleisradio 16.1.2019 20325

At the time of the last post (data fetched 22.6.2021), four Finnish esports channels (ElisaViihdeSport, Pelaajatcom, TES_csgo, and yleeurheilu) had a total of 117049 unique followers. In this post, we are using a slightly more up-to-date dataset, as the data was fetched 2.8.2021. Just like last time, I’ll refer to the channels as follows: ElisaViihdeSport as Elisa, Pelaajatcom as Pelaajat, TES_csgo as Telia, and yleeurheilu as YLE.

Numbers didn’t change much during the summer as there were little to no broadcasts for any studied channels. This time we’re studying 117285 Twitch users, meaning that the dataset is practically the same.

Distribution of follows per studied channels among 117285 users following at least 1 of 4 of them. Users following only one channel (so called “dedicated followers”) were studied more in the previous part.

 

The average user among this group still follows only 1.62 of the four studied channels. As mentioned in the previous blog post, Finnish esports broadcasters compete more with international broadcasts instead of their Finnish competitors. Therefore, it would benefit them if they managed to share their followers, boosting everyone’s numbers.

Widening the scope

Using Twitch API, we can find out which are the most popular Twitch channels among our esports follower group of 117000 Twitch users. With that knowledge, we can draw a bigger picture of interests they might have. Of course, a follow doesn’t necessarily mean that they are intensely watching that channel, but at least they want to know if the channel is live.
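
For the curious, here is a minimal sketch of how this kind of analysis could be done against the Twitch Helix API (using the users/follows endpoint that was available at the time of writing): fetch the channels each user follows and count the most common ones. The client ID, OAuth token and user IDs are placeholders.

from collections import Counter
import requests

HEADERS = {"Client-ID": "your-client-id", "Authorization": "Bearer your-app-token"}

def followed_channels(user_id):
    # Page through all channels a given user follows and return their names.
    channels, cursor = [], None
    while True:
        params = {"from_id": user_id, "first": 100}
        if cursor:
            params["after"] = cursor
        resp = requests.get("https://api.twitch.tv/helix/users/follows",
                            headers=HEADERS, params=params).json()
        channels += [row["to_name"] for row in resp.get("data", [])]
        cursor = resp.get("pagination", {}).get("cursor")
        if not cursor:
            return channels

# Count the most commonly followed channels across the whole follower group.
follower_ids = ["12345", "67890"]  # in reality the ~117000 user IDs collected earlier
counts = Counter(ch for uid in follower_ids for ch in followed_channels(uid))
print(counts.most_common(10))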

For our broadcasters, it’s important to know their audience and recognize potential channels for marketing campaigns and other types of co-operations. Influencer marketing can be an effective tool to gain new followers and data helps to find the most suitable influencers.

Top 10 channels for our Finnish esports follower group

# Channel Followers % of 117285
1 ElisaViihdeSport 69424 59.2
2 pelaajatcom 67860 57.9
3 ESL_CSGO 61827 52.7
4 OfficialAndyPyro 41860 35.7
5 shroud 35156 30.0
6 Aleeksi 34198 29.2
7 DreamHackCS 33308 28.4
8 TES_csgo 32864 28.0
9 BLASTPremier 31513 26.9
10 ESL_CSGOb 30883 26.3
Top 20 most followed channels by followers of Elisa, Pelaajat, Telia, and YLE. YLE placed #21 with 20317 followers (17.3 % portion). Dark blue represents international CS:GO broadcasters, pink represents active professional esports players and yellow represents influencers.

 

As we already knew, followers of these accounts are – obviously, duh – interested in esports content, especially around CS:GO. The most popular channels feature international CS:GO tournament organizers such as ESL (ESL_CSGO and ESL_CSGOb), DreamHack (DreamHackCS), and BLAST (BLASTPremier). At the same time, there are a lot of active esports professionals, for example Aleksi “Aleksib” Virolainen (Aleeksi), Elias “Jamppi” Olkkonen (superjamppi) and Oleksandr “s1mple” Kostyliev (s1mple).

While it might be difficult to get these esports superstars to promote other Twitch channels than their own, the list contains many Finnish influencers, who could help to reach new audiences. There are already good examples, as Anssi “AndyPyro” Huovinen (OfficialAndyPyro) was part of Pelaajat.com bingo broadcast and Joonas-Peter “Lärvinen” Järvinen (Larvinen12) played Telia’s “aim challenge map” on his stream.

And why do these matter? Because as we take a closer look at the channels that Pelaajat followers are following, we’ll see that AndyPyro’s channel is the third most popular channel with more than 30000 shared followers. A staggering 45 % of Pelaajat followers follow his channel too.

Top 10 channels for Pelaajat followers

# Channel Followers % of 67860
1 ESL_CSGO 41593 61.3
2 ElisaViihdeSport 32076 47.3
3 OfficialAndyPyro 30564 45.0
4 Aleeksi 28123 41.4
5 DreamHackCS 23104 34.0
6 TES_csgo 22495 33.1
7 superjamppi 22387 33.0
8 BLASTPremier 21995 32.4
9 ESL_CSGOb 21810 32.1
10 shroud 20897 30.8
Top followed channels for followers of Pelaajat. Dark blue represents international CS:GO broadcasters, pink represents active professional esports players and yellow represents influencers.

 

There are only two channels with more “related followers”. Both of them are their competitors, as ESL_CSGO is the main channel of tournament organizer ESL and ElisaViihdeSport is, as you already know, a Finnish competitor. The content of AndyPyro resonates with followers of Pelaajat, meaning that his follower base contains potential followers for them.

Telia’s co-operation with Mr. Lärvinen isn’t just about numbers, as he has been a part of their other esports broadcasts as a commentator. But the numbers should be a major factor as well. He is one of the most followed influencers among Telia followers, which we can see from this top 15 list:

Top 15 channels for Telia followers

# Channel Followers % of 32864
1 pelaajatcom 22497 68.4
2 ElisaViihdeSport 21300 64.8
3 ESL_CSGO 20967 63.8
4 Aleeksi 16335 49.7
5 OfficialAndyPyro 15750 47.9
6 superjamppi 14202 43.2
7 DreamHackCS 11854 36.1
8 ESL_CSGOb 11599 35.3
9 BLASTPremier 11447 34.8
10 eeddspeaks 11142 33.9
11 allub 10465 31.8
12 yleeurheilu 10261 31.2
13 Larvinen12 10181 31.0
14 TeliaEsportsFI 9625 29.3
15 StarLadder_cs_en 9219 28.0
Top followed channels for followers of Telia. Dark blue represents international CS:GO broadcasters, pink represents active professional esports players and yellow represents influencers.

 

If we dug deeper within the realms of the Twitch API, we could make some kind of estimate of how many follows were influenced by these co-operations. Unfortunately, that’s something we have to leave out of this scope. The data is available for free, so broadcasting companies should take a look to find potential channels for campaigns like these and check how well they worked.

Has there been “following influence” from one esports channel to another?

Before we can check who followed whom first, we should take a look at how many followers these channels share. As we learned from the previous post, “dedicated followers” are the biggest chunk with more than 70000 followers, but luckily for us, there are still more than 40000 Twitch users following 2-4 of the Finnish esports channels we’re studying here.

It was important to have no more than four channels to study because now we can check all the possible combinations – six pairs, four trios, and one quartet – of shared followers.
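The pairwise counts, for example, come from a single self-join. A sketch over the same hypothetical follows table, restricted to the four channels:

SELECT a.channel_name AS channel_1,
       b.channel_name AS channel_2,
       COUNT(*) AS shared_followers
FROM follows a
JOIN follows b
  ON a.follower_id = b.follower_id
 AND a.channel_name < b.channel_name   -- count each pair only once
WHERE a.channel_name IN ('ElisaViihdeSport', 'pelaajatcom', 'TES_csgo', 'yleeurheilu')
  AND b.channel_name IN ('ElisaViihdeSport', 'pelaajatcom', 'TES_csgo', 'yleeurheilu')
GROUP BY a.channel_name, b.channel_name
ORDER BY shared_followers DESC;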


Important reminder: Twitch API numbers aren’t the absolute truth, as a follow gets wiped out of the history books when a user unfollows the channel. For example: if someone found and followed a channel three years ago, unfollowed later and followed again two weeks ago, only the latter follow is in the books.


All the possible pairs of channels and their shared followers

Elisa and Pelaajat created their channels one year earlier and have way more followers than Telia and YLE, so it was expected that their followers would have followed the other channels more than the other way around. The following order between Elisa’s and Pelaajat’s shared followers was pretty close, as the split went 51-49 in favor of Elisa.

All the possible trios of channels and their shared followers

No surprises there either, as Elisa and Pelaajat have been the most common “gateways” to follow other channels. Finally, all 24 combinations of how 8907 users have ended up following all four channels:

Five users have followed some channels simultaneously, which excludes them from this graphic.

 

There are 24 different possibilities for how a user might have followed all four channels. If we check how the 8907 four-channel-followers have ended up in their positions, 46 % of them have had Elisa and Pelaajat as their first and second channels. Only 22 % had neither Elisa nor Pelaajat as the first channel they followed out of these four.
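The follow order itself can be derived from follow timestamps. A sketch, assuming the hypothetical follows table also stores a followed_at timestamp, that counts first/second channel combinations for users following all four channels:

WITH ordered AS (
  SELECT follower_id,
         channel_name,
         ROW_NUMBER() OVER (PARTITION BY follower_id ORDER BY followed_at) AS follow_order,
         COUNT(*) OVER (PARTITION BY follower_id) AS channels_followed
  FROM follows
  WHERE channel_name IN ('ElisaViihdeSport', 'pelaajatcom', 'TES_csgo', 'yleeurheilu')
)
SELECT o1.channel_name AS first_channel,
       o2.channel_name AS second_channel,
       COUNT(*) AS users
FROM ordered o1
JOIN ordered o2
  ON o1.follower_id = o2.follower_id
WHERE o1.follow_order = 1
  AND o2.follow_order = 2
  AND o1.channels_followed = 4   -- only users following all four channels
GROUP BY o1.channel_name, o2.channel_name
ORDER BY users DESC;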

Twitch has established some kind of channel recommendations for its new users during the account creation phase, as some users have followed two or more of these channels within the same second. For existing users, Twitch has algorithms to recommend new channels at all times. These recommendations are shown automatically and they can be browsed manually too.

Twitch Android app has a separate Discover section for recommended channels. Following section shows some recommendations too.

Whether it happens through these algorithms or by other social media channels, Twitch broadcasters can and will find new followers from other channels with similar content. Even if Elisa and Pelaajat have gained significantly fewer followers from Telia and YLE than the other way around, it’s still important to “convert” all the potential followers they can. Compared to international numbers, the Finnish esports audience is relatively small, and as we’ll see in the next chapter, one-dimensional.

Finding recipes for success

Follower numbers have been great for all these channels, especially Elisa and Pelaajat. However, it’s important to remember that those numbers were reached with free-to-watch content. Pelaajat has hundreds of monthly subscribers on Twitch, but it’s a completely optional supporting fee. They need advertisement revenue too.

There’s a long way to go to make esports broadcasts sustainable, let alone profitable. YLE is the only one without any financial pressure to grow, as it’s funded by the government. Of course, usage of taxpayers’ money needs to be justified now and then, so healthy follower numbers are useful for them too.

As shown, all these channels can grow whether or not they attract completely new viewers to their esports content. Co-operation with different influencers is one way to gain more followers, but finding the right content is the most important one. One thing is certain: you can’t just buy these followers, you’ll need some luck too.

That luck could be described as “having broadcasting rights for a tournament with a CS:GO team that draws interest from Finnish followers”.

All four channels have gained most of their “new follower spikes” by broadcasting CS:GO tournaments. It’s no coincidence that the biggest gains have happened during CS:GO matches featuring Finnish teams of Finnish esports organizations ENCE and HAVU Gaming. Starting from the beginning of 2019, all broadcasters have benefited a lot from successful periods of Finnish Counter-Strike teams.

Top days for gaining new followers according to Twitch API data 2.8.2021. Exceptions to “the ENCE CS:GO rule”: the 1377 new followers for ElisaViihdeSport on 27.4.2020 were influenced by a PUBG tournament instead of CS:GO. ENCE’s PUBG team was part of that tournament though. All the other top 15 days had ENCE playing CS:GO. Both days of 2021 (March 12 and March 28) featured an ENCE vs. HAVU match-up. Telia’s best day brought them 1015 new followers on 29th of August 2019.

Conclusions

Finnish esports channels have grown their numbers during the last few years, and hopefully, they keep on growing. As broadcasts are free and rarely done simultaneously, it’s possible to get followers from competitors without hurting their business. It would be a whole new ball game if broadcasters had to constantly fight for the same concurrent viewers. It’s bound to happen sometimes with some open qualifiers, but it’s not that common because of exclusive broadcasting rights.

Different channels have different styles and it’s impossible to please everyone. Nevertheless, it’s possible to find followers from competitors, and as you saw from the previous chapters, it has already happened a lot. As their average Twitch follower follows tens of other channels, Finnish esports broadcasters compete for attention against international broadcasts and other Twitch streamers rather than each other.

If esports broadcasts weren’t free to watch, would their Twitch followers watch other, more casual Twitch streamers instead of “buying the ticket” for esports tournaments? Even if someone follows a channel, how likely is it that they watch it regularly, instead of all the other 40+ different channels on their list? Even if they watch, are they really watching and paying attention to advertised products?

Getting lots of followers is only the beginning. Hopefully, this series provided some insights on that part. Monetizing esports followers is for someone else to figure out.

Thank you for reading, I’ll gladly answer questions on Twitter or LinkedIn if anything comes to your mind!

(Cover photo: Unsplash)

Summary of this post

For those of you in a hurry, I’ll try to sum up key points of this post:

  • Which channels the Finnish esports audience follows on Twitch
    • International tournament organizers
    • Active and retired esports professionals
    • Finnish and international influencers
  • Following patterns for “shared followers” of Finnish esports channels
    • Most have followed either Elisa or Pelaajat first, before following others
  • The content that produces the biggest “new follower spikes”
    • CS:GO tournaments featuring a team that draws interest from the Finnish audience, especially ENCE

Summary of the series

In this series, Finnish esports channels were used as examples of Twitch channels with similar content and follower base. While we learned about the 117000 Twitch users following them, we also learned about different Twitch API data use cases along the way.

Different use cases for Twitch API data:

  • Finding how many followers are “shared” between channels of similar content to define the potential size of follower base
  • Discovering other interests of current followers
  • Determining the most suitable influencers to do marketing campaigns
  • Studying patterns in how users have followed different channels of similar content
  • Identifying which types of content produce the most new followers

 

Untapped potential – analyzing Finnish esports broadcasting channels using Twitch API data

There are more than 117000 unique Twitch users following at least one of four Finnish esports channels: ElisaViihdeSport, Pelaajatcom, TES_csgo, or yleeurheilu. In this series of blog posts, I'm studying these channels and their users using data gained from Twitch API.

Twitch, also known as “Twitch.tv”, is a live streaming platform where anyone can stream their live content to a worldwide audience. It’s mostly known for different types of gaming content, although non-gaming content has also gained a whole lot of popularity during the last few years. This series of blog posts focuses on gaming content – more specifically esports content – by analyzing the four biggest Finnish esports broadcasting channels on the platform: ElisaViihdeSport, Pelaajatcom, TES_csgo and yleeurheilu.

“Oh geez, that’s a lot of stuff to go through, why should I bother reading all of it?”

In this series, I’m trying to make interesting points about the most popular Finnish esports channels and their follower base. I’ll show that all those channels have a lot of room to grow, whether they keep broadcasting for a Finnish audience or expand to an international audience, which is naturally a lot bigger.

Hopefully, you’ll learn something new about esports and the broadcasting side of it, whether you have prior knowledge or not. I can assure you that I found things that I didn’t expect, even though I’m a former professional in the field. In any case, you’ll get to watch some numbers and graphs, not just a wall of text. Enjoy!

As an important disclaimer, I’ve had the pleasure to work with many of these broadcasters during my few years as an esports journalist. These channels were chosen because they consist of four different companies, they have more than 20000 followers and – to be frank – I had to narrow it down somewhere, so I ended up with these four channels. I have no strings attached to these channels and/or companies anymore, and they have not influenced anything in this text, except for doing esports broadcasts for the Finnish audience.

Before jumping to Finnish esports, it’s important to understand some things about Twitch and esports in general. This first post of the series focuses more on introducing these topics to someone who isn’t that familiar with them.

What is Twitch and why should I care about it?

Twitch is one of the most popular live streaming platforms. According to them, seven million Twitch streamers go live every month, and there are users from over 230 countries. TwitchTracker data states that the all-time maximum number of concurrent viewers on Twitch was more than 6.5 million in January 2021. It’s a huge platform globally, but it’s really popular among Finnish viewers and for Finnish content too. TwitchTracker data states that there are thousands of Finnish channels broadcasting every month.

Metaphorically speaking, it has replaced television for younger generations. While browsing Twitch at any time of day, there’s always live content to watch. You can switch effortlessly between different channels, but since the vast majority of content is completely free, you need to watch advertisements in between changing channels and sometimes during broadcasts.

Just like many other social media platforms, users of Twitch can “follow” other users, to get notified when their broadcast begins. At the moment, every Twitch user can follow 0-2000 channels. In addition to getting notifications, users can check manually which of their followed channels are live.

Screenshot of Twitch Android application. Live channels are sorted by concurrent viewers.

The more followers, the more people see that channel is live, hopefully luring them to watch it. The more concurrent viewers, the better for the broadcaster, because they appear higher in their followers’ list of live channels. The competition for viewers is fierce, and bigger channels tend to gain more views.

As expected, Twitch users are mostly just watching and chatting, instead of streaming on their own channels. There are three types of accounts on Twitch:

  • Regular: Every account is a regular account at first.
  • Affiliate: First upgrade from a regular account after reaching specific milestones of streamed content and viewing numbers. Can receive money from monthly subscriptions and donations made using Twitch’s own “Bits” currency.
  • Partner: The most wanted account status. Upgrade from Affiliate status, as the user gets “Verified” status and more perks than an affiliated user.
Analyzing user types and view counts of approximately 117000 Twitch users that follow 1-4 of the channels ElisaViihdeSport, Pelaajatcom, TES_csgo and yleeurheilu.

As a glance at the 117000 Twitch users following ElisaViihdeSport, Pelaajatcom, TES_csgo or yleeurheilu: more than 95 % are regular users, about 4 % are affiliates and less than 0.5 % are partners. Nevertheless, if we sum the view counts of all their channels, more than 80 % of the views belong to partners, 14.5 % to affiliate channels, and less than 4 % to regular accounts.

View count isn’t the best and most accurate metric for measuring a channel’s success, but it’s the only view-specific metric you’ll find from Twitch API. Luckily, it’s good enough to prove that views are heavily condensed towards partner accounts on Twitch, at least for this sample of users we are studying.
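As an illustration, the split could be computed with a short SQL query, assuming the user records fetched from the Twitch API were loaded into a hypothetical table followers(user_id, broadcaster_type, view_count) with one row per follower (the API leaves broadcaster_type empty for regular accounts, hence the COALESCE):

SELECT COALESCE(NULLIF(broadcaster_type, ''), 'regular') AS account_type,
       COUNT(*) AS users,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS share_of_users,
       ROUND(100.0 * SUM(view_count) / SUM(SUM(view_count)) OVER (), 1) AS share_of_views
FROM followers
GROUP BY 1
ORDER BY share_of_views DESC;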

And why do we concentrate on just Twitch instead of Youtube or other live streaming platforms? Because Twitch is so popular for esports content that it almost has a monopoly on being “the esports broadcasting platform”.

What is esports and why should I care about it?

Finnish esports scene has grown a lot during the last few years. Photo: Tubecon 2019, SEUL ry / Arttu Kokkonen

As a short introduction, esports is the concept of playing video games competitively. To add more details, esports consists of video games of all types, but the most important ones are competitive multiplayer games played with personal computers, gaming consoles and mobile devices.

Just like in traditional sports, some games are played individually (one versus one), some as teams of different sizes. The most popular esports titles like League of Legends and Counter-Strike: Global Offensive are team sports, played as five versus five, similarly to many traditional ball games. Esports tournaments and leagues are broadcasted like traditional sports, as they share many competitive aspects.

As a spectator sport, esports has huge business potential. According to Sponsor Insight, in the spring of 2019 esports was “the most interesting sport” among Finnish men between the ages of 18 and 29. In my opinion, the result might be a bit too flattering, because all different esports titles were combined and compared to single sports like football and ice hockey, but it’s still a sign of growing interest. Esports has a lot of potential, but there is still one huge puzzle to solve: monetization.

In traditional sports – at least for the most popular ones – viewers usually have some kind of monthly subscription to be able to watch live broadcasts. They are used to paying tens of euros per month for watching their favorite sport. As it’s a monthly subscription, they watch when they have the time for it.

Even if they can’t watch everything, they are still paying customers. The viewing statistics aren’t neglected by broadcasting companies, but at the end of the day, the amount of paying customers is more important than the highest number of concurrent viewers. Of course, traditional sports need advertisement money too, but compared to esports, it’s a whole different ball game.

Esports section of Twitch.

For esports broadcasts, the viewer numbers are everything, because all broadcasts are free to watch, and there are no signs for that to change in the future. Just like mentioned in the previous chapter, the more followers, the more concurrent viewers. The more concurrent viewers, the more money to be made from advertisements.

Some channels accept different types of optional donations (Twitch terms: “Subscriptions” and “Bits”), but at the moment they are not used as mandatory payments. Therefore, broadcasts are usually funded by partner companies who want attention for their products and services. There’s a long way to go to monetize esports broadcasts, but this blog series isn’t about that, so let’s not dig too deep into it.

One more thing to note about the competition between esports broadcasters. One could think that broadcasting companies of the same nationality would be the biggest competitors for each other. However, in the esports context, national channels battle most with the main broadcast of a tournament. As opposed to traditional sports, viewers can choose between multiple broadcasts and languages of the same tournament, even if the broadcasting company has bought exclusive rights for country or language.

Main channels – ESL_CSGO, BLASTPremier and DreamHackCS to name a few for the video game Counter-Strike: Global Offensive, or CS:GO – have roughly one million followers each, and they tend to be highlighted on Twitch during their tournaments. None of the Finnish esports channels have reached even the milestone of 100000 followers, meaning that many potential Finnish Twitch users might select English broadcasts purely because they are recommended, while Finnish channels need to be searched for manually.

This blog series analyzes data about followers of the four biggest broadcasters of Finnish esports: Elisa, Pelaajatcom, Telia and YLE. Only one Twitch channel was analyzed for each of them, even though some of them have more than 10000 followers for their secondary channels too.

Honorable mentions that weren’t in the mix, but who would deserve some analyzing: Finnish Esports League, PUBG Finland and Kanaliiga.

The major players of Finnish esports broadcasting

First of all, it’s not just these four esports channels that have had remarkable Twitch numbers in Finland, but these four are the most interesting ones in my opinion. All of them have built most of their follower base using the success and popularity of Finnish CS:GO teams, especially ENCE during 2019-2020.

Three of four channels – ElisaViihdeSport, Pelaajatcom and TES_csgo – are owned by commercial companies, while yleeurheilu is run by Finland’s national public service media company Yleisradio Oy.

Pelaajatcom ownership is under a consulting company called North Empire Oy. ElisaViihdeSport and TES_csgo are in another league, as their owners are telecommunication giants Elisa Oyj and Telia Finland Oyj.

Channel Owner Created Followers
ElisaViihdeSport Elisa Oyj 5.4.2018 69341
Pelaajatcom North Empire Oy 17.9.2018 67819
TES_csgo Telia Finland Oyj 29.3.2019 32684
yleeurheilu Yleisradio 16.1.2019 20377
Data for all graphs in this post was fetched from Twitch API 22.6.2021.

To make the text more readable, I’ll ignore the actual and factual channel names and refer to ElisaViihdeSport as Elisa, pelaajatcom as Pelaajat, TES_csgo as Telia and yleeurheilu as YLE.

All the numbers in this series have been fetched from the Twitch API, if not stated otherwise. It’s a public API that allows registered users to make 800 requests per minute to its endpoints. It’s important to note that a “follow” is wiped out of the Twitch database – or at least the API – if a user unfollows a channel. What this means is that historical Twitch API data might be missing some follows here or there, because some followers have unfollowed the channel afterward.

Twitch API data is not the absolute truth about follower statistics, but it provides good insights about the content that makes Finnish Twitch users press that purple Follow button.


Glossary

Quick little glossary to clear up some of these terms:

  • User: Twitch account made for a person or company.
    Can stream on own channel, follow other users and chat during Twitch broadcasts. In this post, users aren’t considered as broadcasters or channels, even if they technically are those too.
  • Channel: Twitch account of a broadcasting company
    Technically every user is also a channel and vice versa. In this text, I’ll present four broadcaster users as channels, while users that are following them, are presented as followers.
  • Follower: In the context of this post, a Twitch user that follows 1-4 of the esports channels that are studied. Not all users are followers, but all followers are users. One user can follow up to 2000 channels on Twitch but usually follows fewer than 50 channels.
  • Dedicated follower: A term I made up myself. It represents a user who follows only one of the four channels studied in this blog post. In other words, a dedicated follower might follow up to 2000 channels on Twitch, but only one of them is Elisa, Pelaajat, Telia or YLE. No actual “dedication” is needed; it might be pure coincidence.

Digging into user data – looking for new followers

Firstly, an important note to remember with all these numbers flying around: this chapter studies the follow counts of Twitch users that follow one to four of the Finnish esports channels. All other Twitch users are excluded, and on the other hand, all four specific channels are studied, not just the followers of one or two.

There are 117049 Twitch users that have followed at least one of these four channels. And most of them have followed only one. On average, these users follow 1.62 channels out of four.

In total, these channels had 117049 unique Twitch users as their followers, when data was fetched from Twitch API 22.6.2021. Less than 9000 users (7.6 %) followed all four channels. Almost two thirds, 74783 users (63.9 %), followed only one of these four channels.
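Both the share per group and the 1.62 average fall out of one grouping query. A sketch over the same hypothetical follows(follower_id, channel_name) table, holding one row per follow of the four studied channels:

WITH per_user AS (
  SELECT follower_id,
         COUNT(DISTINCT channel_name) AS channels_followed
  FROM follows
  WHERE channel_name IN ('ElisaViihdeSport', 'pelaajatcom', 'TES_csgo', 'yleeurheilu')
  GROUP BY follower_id
)
SELECT channels_followed,
       COUNT(*) AS users,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS share_of_users
FROM per_user
GROUP BY channels_followed
ORDER BY channels_followed;
-- SELECT AVG(channels_followed) FROM per_user gives the 1.62 average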

These numbers surprised me a lot. After all, at least from my perspective, all these channels have broadcasted similar esports content. In other words, all these channels have broadcasted both Finnish and international CS:GO tournaments featuring Finnish top teams.

Counter-Strike: Global Offensive is one of the most popular first-person shooter video games in the world. Teams of five players try to win rounds by either eliminating the enemy team or finishing their objective. Photo: Seul ry / Pekka Nummela, Vectorama 2019

If we take a closer look at this huge chunk of users following only one of four channels, we’ll find that they are mostly following either Elisa or Pelaajat. For the sake of clarity, I’ll call these users “dedicated followers”, since they aren’t following any of the other three channels that are inspected.

Distribution of “dedicated” followers and users who follow other studied channels as well. For example: 36000 followers of Elisa are also following Pelaajat, Telia or YLE. However, more than 33000 Elisa followers aren’t following any of those three.

Not only Finnish CS:GO content draws followers

Some growth has also come from other sources than the success of Finnish CS:GO teams. For Elisa, part of their dedicated followers seems to have been gained through their broadcasts of PUBG esports tournaments.

Playerunknown’s Battlegrounds, PUBG, is a first-person shooter battle royale game. In esports tournaments, it’s played with 16 teams fielding four players. Photo: PUBG

Almost none of these followers have followed Pelaajat, Telia or YLE in addition to Elisa, which is most likely influenced by these two things:

  • They followed during a PUBG tournament. Other channels aren’t broadcasting this video game that much.
  • They followed the channel when the broadcast was made in English instead of Finnish. Other broadcasters have their productions only in Finnish.
Top 10 days for gaining dedicated followers for Elisa. Two days with significantly more non-dedicated followers (30.4.2020 and 11.5.2019) were CS:GO broadcasts that featured ENCE. All other days were PUBG broadcasts with English-speaking production.
Day New dedicated followers New other followers New followers (Total) Share of dedicated (%)
27/04/20 1361 20 1381 98.6
14/05/20 703 16 719 97.8
12/05/20 703 18 721 97.5
30/04/20 700 1237 1937 36.1
19/05/20 698 16 714 97.8
11/05/19 614 987 1601 38.4
20/05/20 576 13 589 97.8
21/05/20 476 14 490 97.1
20/04/20 410 31 441 93.0
20/09/20 390 8 398 98.0

Of course, the definition of “English-speaking esports broadcasts” is not that simple when we take a look at the history books of Finnish esports broadcasting. If you’ve been following the scene for a while, you’re probably already asking: “But what about the 2019 Berlin Major rally English broadcasts by Pelaajat.com?”

And that is definitely something we should inspect more closely.

Pelaajat couldn’t do their broadcasts in Finnish, because Telia owned exclusive rights for the Berlin Major 2019 tournament’s Finnish production. Pelaajat found a loophole around the exclusive broadcasting rights by making their broadcasts using broken English, as English broadcasts were not controlled as strictly as Finnish. They became a viral hit during the tournament and “rally English” made them known outside of Finland too.

Yes, Pelaajat gained thousands of followers during the “rally English” broadcasts, but not just dedicated followers. Almost 5000 followers from those 12 days have stayed as dedicated followers, but in addition to them, more than 4000 new followers have followed other Finnish esports channels as well.

Pelaajat grew its follower count massively during the Berlin Major 2019 tournament. About half of these “rally English” followers have followed other Finnish esports channels as well. Off-days of the tournament (2-4.9.2019) were left out of this graph, but Pelaajat gained more than 200 new followers during those days too, thanks to social media hype.
Day New dedicated followers New other followers New followers (Total) Share of dedicated (%)
28/08/19 534 628 1162 46.0
29/08/19 1200 1162 2362 50.8
30/08/19 888 747 1635 54.3
31/08/19 691 448 1139 60.7
01/09/19 476 348 824 57.8
02/09/19 45 36 81 55.6
03/09/19 63 73 136 46.3
04/09/19 17 20 37 45.9
05/09/19 744 681 1425 52.2
06/09/19 74 62 136 54.4
07/09/19 86 59 145 59.3
08/09/19 92 61 153 60.1
Total 4910 4325 9235 53.2

For comparison, here are the all-time top 10 days for gaining new dedicated followers for Pelaajat. Rally English was part of four of the top 10 broadcasts, but most of their top follower-gaining broadcasts were made in Finnish, and approximately half of these new followers have found other Finnish esports channels as well.

Top 10 new dedicated followers per day for Pelaajat. The battle between CS:GO teams ENCE and HAVU on 12.3.2021 broke all kinds of records for Finnish esports broadcasting, new followers per day being one of them.
Day New dedicated followers New other followers New followers (Total) Share of dedicated (%)
12/03/21 1835 1411 3246 56.5
29/08/19 1200 1162 2362 50.8
01/02/20 1119 1597 2716 41.2
03/03/19 1077 600 1677 64.2
28/02/19 911 812 1723 52.9
30/08/19 888 747 1635 54.3
02/03/19 873 654 1527 57.2
05/09/19 744 681 1425 52.2
31/08/19 691 448 1139 60.7
23/02/19 615 792 1407 43.7

Certainly, some dedicated users won’t be following other Finnish esports channels in the future. If one follows a channel for PUBG content spoken in English, it’s not going to happen with any channel other than Elisa. However, a lot of potential followers can be found among dedicated users who have followed because of “traditional” Finnish CS:GO broadcasts.

Plenty of space for follows

Twitch users can follow a maximum of 2000 channels with their accounts. If we take a look at how many channels these dedicated users are currently following, we’ll find that there is a lot of room left. Half of them follow fewer than 40 Twitch channels. Following only one of the four Finnish esports channels is not about the limits of Twitch; it’s about something else.

When we widen our sights to all 117049 users following 1-4 Finnish esports channels, the median mark is at 46 followed channels. In other words, half of the users (50.1 % and 58688 users to be precise) follow 46 channels or fewer, while the other half follows 47 to 2000 channels. More than one third (39286 users, 34 %) of all users follow fewer than 25 channels.

Followed channels per user, grouped. For most of the users studied in this research, there’s plenty of room to follow more channels, as 2000 is the limit. The number of users following fewer than 50 channels is bigger than all the other groups combined.

The average number of followed channels is 87.3, but that is heavily influenced by users following more than 1000 channels. There are only 386 users (0.3 %) with that amount of followed channels. One could (and probably should) question if these are legitimate users at all, but with this sample size, their impact is irrelevant.
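Both figures can be computed directly from the per-user totals. A sketch assuming a hypothetical table user_follow_totals(user_id, followed_channels), using Presto/Athena-style approx_percentile for the median (the exact percentile function varies by database):

SELECT AVG(followed_channels) AS avg_followed,
       APPROX_PERCENTILE(followed_channels, 0.5) AS median_followed,
       SUM(CASE WHEN followed_channels < 25 THEN 1 ELSE 0 END) AS users_following_under_25
FROM user_follow_totals;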

Fun fact: Six users have somehow managed to break the barrier of 2000 followed channels by getting to 2001, 2002 or even 2003.

The “following amount” numbers could be a study of their own. I’m part of this group of 117000 Twitch users and I’m currently following 231 channels on Twitch. It sounds like a big number after studying these figures, but somehow it’s still possible that the only live channel on my Twitch is a rerun of some random ESL tournament.

Conclusions

This was the first blog post about Twitch API data of Finnish esports broadcasters and I hope you learned something new. In the next part, I’ll dig deeper into the follower data of these four channels.

If you somehow managed to skip all the content and stop right here, I’ll try to summarize the key points and findings of this first part:

  • Twitch is one of the most popular platforms for live streaming, especially for esports tournaments.
  • Esports is a growing industry that needs to figure out its monetization.
  • Esports broadcasters need good follower numbers on Twitch to even have a chance to succeed financially.
  • These four studied Twitch channels (ElisaViihdeSport, Pelaajatcom, TES_csgo and yleeurheilu) have about 117000 unique followers in total
    • Almost two thirds of these Twitch users follow only one of these four channels that were studied
      • They are (usually) potential followers for other channels.
      • Most of them follow either Elisa or Pelaajat.
    • Half of these users follow 1-46 channels on Twitch, so it’s not about the limits of Twitch, which is 2000 followed channels per Twitch user.

The second part will be published after the summer holidays. If you have any questions about anything in this post, you can drop me a DM on Twitter or LinkedIn!

(Cover photo: Unsplash)

Positioning data build tool, dbt, in the data tooling landscape

dbt has gained a lot of traction, and we’ve seen some projects and offers where dbt has already been chosen as the tool. But what is dbt actually? Where can you position it in the data tooling landscape? What does it actually do, what does it sort-of do, and what doesn’t it do?

There’s a lot of buzz going around the data build tool, or dbt. It has gained a lot of traction, and we’ve seen some projects and offers where dbt has already been chosen as the tool to be used. But what is dbt actually? Where can you position it in the data tooling landscape? What does it actually do, what does it sort-of do, and what doesn’t it do? This blog post focuses on how to position dbt, what its strengths and weaknesses are, and how you should compare it to other tools in the market.

What is data build tool?

Data build tool, dbt for short, is a tool to manage and, to an extent, orchestrate the transformations inside a database. It’s an open source command line tool written in Python. Every entity is defined as a SELECT statement in SQL, with the possibility to use Jinja templating. Configurations are written in yml files. It’s a “unixey” command line tool designed to do one task – the “T” (transformations) of “ETL” – do it well, and provide easy integrations to other parts of your data toolbox.

dbt fully supports the modern cloud data warehouses Snowflake, Redshift and BigQuery. Azure Synapse (and Azure SQL Server for that matter) is supported by community-created plugins. The full list of supported databases is listed at: https://docs.getdbt.com/docs/supported-databases/

dbt comes in two flavors: “dbt core”, which is the open source cli tool, and the paid “dbt cloud”. We will focus on the free open source cli version. dbt cloud offers for example ci/cd pipelines and a browser-based IDE, but runs on the same dbt core. We’ll create our own pipelines, environments and a production-ready setup in a later post.

 

dbt transform

https://www.getdbt.com/product/

What dbt does well

Data build tool really focuses on doing one thing and doing it well straight from the command line. In that sense, it can be seen as sort of an “Unixey” tool to do transformations inside a data warehouse. But dbt does much more well than just being a nice cli tool.

The already mentioned data transformations are at the heart of dbt. Data is transformed step by step by simply writing the required SQL statements as “SELECT nnn FROM <a reference syntax to another dbt entity>”. No DDL is needed, as dbt takes care of that based on configuration. The different SQL statements are stored in a project structure, one file per database entity. The project and file structure is pretty much freeform, and there obviously are some design considerations to make here. More on the project structure in a later post about creating a production-worthy dbt environment.
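As a minimal sketch of what such model files could look like (entity and column names are made up for illustration, and the source() reference assumes a matching sources definition in yml):

-- models/staging/stg_orders.sql (hypothetical)
SELECT
    order_id,
    customer_id,
    order_date
FROM {{ source('sales', 'orders') }}

-- models/transform/trf_orders_enriched.sql (hypothetical)
SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    c.customer_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c
  ON o.customer_id = c.customer_id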

As mentioned, dbt is a command line interface tool. Its setup on your own machine is really as simple as brew install dbt. Of course, you might want to run it in a virtual environment, but that’s not the point. The point is that getting dbt up and running is really simple and fast:

  • Install dbt
  • Setup project structure
  • Setup your database connection in yml profile
  • Write familiar SQL in files
  • dbt run
  • git commit

In addition to the entity .sql files in the project, there are .yml configurations. These provide a great way to have wide yet granular configuration of all aspects of your project. It is easy to tell staging entities to be tables in the “staging” schema, transformations to be views in the “transform” schema, and publish-layer entities to be incremental loads in the “publish” schema. And then, as the project progresses, you can just as easily make exceptions to these.
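Such an exception can also be made directly in a model file with a Jinja config block. A sketch for a hypothetical publish-layer model overriding the defaults:

-- models/publish/pub_orders.sql (hypothetical)
{{ config(
    materialized = 'incremental',
    schema       = 'publish',
    unique_key   = 'order_id'
) }}

SELECT *
FROM {{ ref('trf_orders_enriched') }}
{% if is_incremental() %}
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}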

To make things better, the SQL used to write the transformations is spiced with the Jinja templating language. You use a special syntax to reference your source tables, you can do for-loops, and, what makes it especially useful, you can create macros. Macros and different packages really extend what basic dbt can do out-of-the-box. More about packages in dbt (data build tool) – Explore Packages (getdbt.com). As an example, there’s a really nice small macro to do a “select * except columns a and b”, which then evaluates to include all the named columns without a and b. On the other end of the spectrum, there’s a package called dbtvault, which contains plenty of macros to aid in creating a Data Vault in Snowflake using dbt.
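One such macro is star in the dbt_utils package. A sketch of how the “all columns except a and b” case could look, assuming dbt_utils is installed and reusing the hypothetical model from above:

-- select every column of stg_orders except a and b (hypothetical column names)
SELECT
    {{ dbt_utils.star(from=ref('stg_orders'), except=['a', 'b']) }}
FROM {{ ref('stg_orders') }}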

Besides transformations, dbt does data testing. As with the transformations, you can write your own SQL or use ready-made tests like “unique” or “not_null” to check for those properties in any table or column.
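The ready-made tests are attached to models in the yml files, while custom (“singular”) tests are plain SELECT statements that return the offending rows. A sketch of such a test file, with hypothetical names, that fails if any published order has a date in the future:

-- tests/assert_no_future_order_dates.sql (hypothetical)
SELECT order_id, order_date
FROM {{ ref('pub_orders') }}
WHERE order_date > current_date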

The last thing dbt does really well is documentation. There’s built-in documentation generation, which documents the current project and its SQL as a static webpage that can be viewed locally or hosted on simple Blob Storage or S3 (since it’s a static webpage).

dbt documentation

Documentation | dbt Docs (getdbt.com)

To put it short, dbt really shines in doing the transformations. The transformation capabilities are greatly enhanced with Jinja templating, macros and hooks (different times to fire a macro or SQL, like “before a table” or “at the end of a run”). To top it off, built-in testing capabilities and the documentation generation make it really easy to start building a data warehouse.

What dbt sort-of does

In addition to the transformations, there are some other capabilities data build tool offers, but you might find them quite lacking quite quickly: orchestration and SQL generation.

The orchestration here is a bit of a mixed bag. dbt can do “runs” against a target database, running either the whole project or the transformations you select with a pretty simple syntax. But any more fine-grained workflow orchestration requires a lot more work and moving parts than just dbt. Say you want to run one source system’s data on an hourly schedule, the rest daily, and do the runs going “run stage -> test stage -> run transform -> test transform -> run publish -> test publish -> switch to new publish”. That is a pretty basic need for a larger setup, but pretty difficult, if not impossible, to achieve with the current version of dbt alone. That is not to say you can’t use dbt to orchestrate your runs; you can and probably should start with it. But you should be aware of the limitations when the business requirements come.

The other lacking part is SQL generation. This has partly to do with the SQL, Jinja, macros and packages. They do expand the capabilities, but out-of-the-box, dbt doesn’t contain any SQL generation capabilities besides DDL. Say you want to have a common structure and metadata columns for all staging and publish tables: you either need to find a package for it or create your own macro. Of course this is a double-edged sword: dbt is really expandable this way, but limited out-of-the-box.
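As an illustration of that expandability, here is a sketch of a small hypothetical macro adding common metadata columns, which any staging or publish model could then include:

-- macros/metadata_columns.sql (hypothetical)
{% macro metadata_columns() %}
    current_timestamp AS load_time,
    '{{ invocation_id }}' AS dbt_run_id
{% endmacro %}

-- usage in a model (hypothetical)
SELECT
    order_id,
    customer_id,
    {{ metadata_columns() }}
FROM {{ source('sales', 'orders') }}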

What dbt doesn’t do

Data build tool is not an integration tool. Its only focus is managing things inside a data warehouse. What this means is that you should manage your data loads from storage to the database somewhere else, and you should also account for your persistent staging somehow.

dbt is built on the assumption that you can always completely rebuild your data warehouse from the source data. This comes up here and there when you get to the details of running a dbt project, but perhaps the biggest thing to note here, together with the lack of integrations, is how to account for persistent staging.

This means that not only do you need to have persistent staging available for dbt to rebuild the database, you also need to manage it outside dbt, since you really can’t run integrations with it. This comes down to managing landing tables and populating those tables with some other tool, and possibly integrating that other tool with dbt for seamless scheduling.

As stated earlier, dbt is a cli tool using a file-based project structure to manage your data warehouse. What this means is that dbt in itself doesn’t contain any way to manage deployments to different environments. For this, you need to build your own version control and ci/cd pipelines. This, again, is something for the next part of this blog.

Things to consider

Then there are things you need to consider in your project and in the environment in which you develop: not the dev/qa/prod environments, but the size, number and capabilities of the developers and so on. As a “Unixey” tool with pretty much no guidance, there are some aspects which come into play in different-sized projects and different environments.

For larger projects and large and/or multiple teams, using dbt can have its own problems. For larger projects, you will need to design your project structure in a way that supports your development. Having a huge all-in-one dbt project has its own pros and cons: a consistent way of using dbt and common packages and macros, but also managing a huge repository, deploying consistently working versions and dbt performance (as of writing this blog post). Splitting a data warehouse project into multiple dbt projects again has its own problems: not having all of the entities available for the simple ref syntax, managing packages and macros among projects, and deploying consistently working versions, to name a few.

Another thing to consider is the fact that dbt works straight on top of the target database. There are no abstraction layers in between. At first, this might not seem like a big deal, but what it means is that you need to have some understanding of the underlying database when using dbt: even if you are writing the SQL with Jinja, what you type is what you get. There are no abstractions, no SQL generators based on the target database engine, at least not out-of-the-box. Again, you can create your own macros and use packages. Your SQL files are the entities; there are no more abstract models and relations between them: it’s “select nn from aa”.

So when to use dbt?

As presented above, dbt shines in doing (mostly) one thing and one thing only: managing the SQL transformations on your data. It’s fast to set up, it excels at creating transformations, and data lineage based load ordering is easy. You can write data quality tests. And you even have a way to execute those runs and tests.

Equally important are the parts that dbt doesn’t do anything about: it doesn’t do integrations. It doesn’t do load orchestration.

And you should also know the pitfalls: you should plan and divide your dbt project in advance, and manage testing and releasing new builds. How do you keep scaling and make sure that everything isn’t behind “the one dev that set it all up and knows everything about dbt”?

So in conclusion, dbt really is a “Unixey” tool. It does transformations and their management from the command line and provides integration points for tools before and tools after it in the load chain. To have a working DataOps toolbox in addition to just the cli dbt you would require at least:

  • Version control and the processes surrounding it
  • Managing different versions in different target environments
  • Source data integrations and data loads
  • Creating and managing data load workflows

But this is not all bad. As with all projects and business cases, you want to slice them up into smaller and more manageable parts: in a sense, dbt does just that with focusing on being only one tool in the toolbox.

This also means that you can’t easily compare just dbt to tools like Fivetran, Matillion, WhereScape, Airflow, Azure Data Factory or Solita Agile Data Engine. You need to know what capabilities are missing from your data project toolbox, what capabilities different tools bring to the table, and what the benefits and drawbacks of each are. And if you don’t know what capabilities you should have, go check out Vesa’s nice post about DataOps platforms: What to look for in a DataOps platform? – Solita Data

Tableau has removed minimum purchase requirement from their license policies

Tableau has removed the minimum purchase requirement from their licenses. The change has enabled Tableau to be deployed at a very low cost and with exactly the number of users needed for each organisation.

In February 2021, Tableau announced that they would remove minimum user count restrictions from their licensing. For example, earlier the Viewer license had a minimum sales volume of 100 users. The change has enabled Tableau to be deployed at a very low cost and with exactly the number of users needed for each organisation.

The Tableau Creator license is intended for individuals who prepare data for their own or others’ use and publish content. With this license, a user can take advantage of all of Tableau’s capabilities, from preparation and analysis to visualisation and publishing.

With the Tableau Explorer license, the user can create visualisations based on ready-made data models with a browser. 

With the Tableau Viewer license, you can view and use published visualisations and dashboards interactively in a variety of ways based on given permissions.

At Solita, we see Tableau as a visualisation platform that gives our customers the best visibility into their data. We are a gold-level Tableau partner, and through Solita you get licenses, commissioning, training, design and implementation work at a scale that suits your needs!

We will be happy to tell you more about Tableau and together we can build a solution that is the most suitable size for your organisation!

Contact:

Suvi Korhonen, Tableau Partnership Manager in Solita Finland /
Data Consultant
suvi.korhonen@solita.fi
+358503096268

Tero Honko, Data Consultant
tero.honko@solita.fi
Phone +358405878359

Jenni Linna, Data Consultant / People Lead
jenni.linna@solita.fi
Phone +358440601244

Data monster: Azure Purview is for empowering different data users and connecting with different data sources. Built for a hybrid world.

Getting started with Azure Purview

Understanding your data is driving innovation and markets around the world. Leading companies already leverage their data so well that new products and services are fully built on data. Let’s take a look at the hottest Azure service on the market, Azure Purview. First let’s see what the promise of the service is, and then let’s dive into the product itself.

Over the last few years, Microsoft has brought several different data solutions to the Azure platform, like Data Factory, Machine learning studio, Synapse, and now Azure Purview. 

 Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage. Empower data consumers to find valuable, trustworthy data. – Microsoft 

Why Azure Purview? 

What Azure Purview tries to solve is data discovery, laying down the foundation for data governance. The business point is that everyone wants to know how data is connected between systems and where the data comes from, which is a very common issue in any organization.

  • Centralized place for all the metadata
  • Track and visualize data lineages
  • Search and find answers about your data

The core problem organizations have is a lack of data ownership. Cataloging data and having a full picture of how different sources are connected will definitely provide better ownership and transparency. The solution works across on-premises, multi-cloud, and SaaS sources.

Currently, Purview can do three things:

Catalog:

  • Source registration, automated scanning, and classification and data discovery.

At the moment there are some limitations on what types of sources you can register. The majority of Microsoft products are offered in the selection. Custom sources (Snowflake, Oracle, Salesforce, etc.) will be available later, and this list will grow very fast.

  • Business glossary, lineage and lineage visualization

This area lets you see where the data is coming from and how it is connected with different systems. For example, it is possible to connect Purview with Data Factory and Power BI, so you can see the whole lineage of how data is joined, transformed, and stored in different parts of the pipeline.

Data insights:

  • Catalog insights and sensitivity insights

Combining all the metadata that you have and providing analytics and classification. This is definitely the most interesting part, where you can also label and group different parts of your data into a collection.

Getting started 

The service is in preview, so don’t expect too much. The ARM template can be found here, and there are not many things you can configure or change. If you want, you can use the default parameters and deploy it to West Europe. Hopefully there will be a Git integration similar to the one Data Factory has; until then, source integrations are done from the Azure portal.

Now that we have deployed the service, let’s open up Azure Purview Studio from the Azure portal.

Azure Purview landing page, Knowledge center, Register sources, Browse assets and Manage glossary
Azure Purview landing page

Purview 

Let’s dive into the data and see how things work. From the left side, choose Sources and register a new source. For credentials, you can select a key vault and search for the secrets there, but as we previously set up the managed identity, we already have rights to access the storage account. The best thing is that you can’t type your passwords or other credentials into the portal like in Data Factory; it forces you to use a key vault. Is this the beginning of the end for hardcoded credentials? Maybe, let’s hope so!

Under source you can find collections that include different sources
Source collection that can be created

Remember that whatever you choose as a source system, Purview requires a lot of rights. I like to call owner rights “God rights”, because they allow select * from all tables, which is essentially superuser rights.

Scanning

Now it gets interesting: some rules are applied by default. The list is long, so the more boxes you check, the longer it will take to run your scan. This helps if you are working with sensitive data and want to make sure that you comply with the regulations. Creating a new rule allows you to specify what you want to scan and what you don’t.

Available source connections: Azure blob storage, Azure Cosmos DB, Azure data explorer, Azure Data lake, Azure Synapse analytics, Power bi
Available sources

 

Remember, this isn’t a live connection. You either do a one-time-only scan or set a schedule. When running a scan you can do a full scan or an incremental one, which will search for the changes.

The assumption is that there will be some sort of an event-grid option in the future, where you can trigger scans based on data modifications.

Glossary

This is the place where you create owners for the data. You can add people who have the domain understanding and who work with the data, and describe how it is connected to different resources. This is what you connect with the scanned data: which field relates to which data, and what the definition is.

Browse assets, deep dive into how data is moved. Lineage can show a hierarchy of how data is delivered to end users
Browsing assets, example from Azure Sql server

Insights

I used two datasets, one with covid data and another with credit data. The covid data didn’t automatically give any classifications, but the credit data did. Of course, you can also do the classification yourself.

Assets insight, diagrams that contain data flows, amount of different sources and classification
Asset insight

Summary

Considering the amount of data circling in different silos, this will improve data ownership and transparency. At the moment it’s not a production-ready solution but very promising.

The administration-level rights that are needed will become a bottleneck for individuals who would like to connect to different sources. Even getting a connection to Power BI requires admin-level rights.

This needs to be implemented into your data strategy and data governance model. Services like this need planning; we at Solita have delivered more than 400 different data projects over 20 years. Ping me on LinkedIn, I’m more than happy to help your organization out!

Are you building data services for the Mastermind or the Wannabe?

As in all development projects, you need to know your audience to meet their needs. Data projects are no different.

You are looking at your data platform user adoption rate, and it is not what you expected. Are you puzzled why people aren’t jumping on board the new, more efficient data tool, ready to leave the legacy system behind?

To know where you are at, ask yourself these questions:

  • When this project was started, did I spend time on getting to know my users?
  • Am I able to tell what their skill level is and how it matches the platform capabilities?
  • Do I know what problems the users are trying to solve with their data tools?
  • Do I understand what motivates them?

If you don’t know the answers and have a clear picture of your potential crowd, there might be something you have left undone. No worries! Read on, as we present one tool to help you to drill into these questions.

Data development is too often based on presumptions

Back in the day, web services were often designed by developers, people writing the actual code. End-user needs were typically gathered from product specialists, sales or customer service. These people brought their presumptions of the end-users to the design. User insight driven UI design was only lifting its head.

Today, the situation is somewhat different. User insight and service design have been quite well internalized in digital service development. When designing a customer facing web service, no designer wants to lean on best guesses about the end-user’s needs. Designers want to understand and experience themselves what are the motivations and underlying needs of the user.

Data projects shouldn’t really differ from digital development, but in their user approach they are at the same level as web design was more than ten years back.

In data projects, what is under the hood counts often more than the surface. The projects tend to rely on assumptions on the users and to some very old organizational hearsay, instead of taking a systematic approach to user insight and service design.

The reasons for this vary. First, modern data platform development projects are a relatively new phenomenon; data used to be in the hands of a smaller crowd before (typically skilled specialists from finance). Now that self-service analytics has become popular, the users are also becoming more diverse. Second, the data end-users are generally internal users. The same emphasis is not put on their user experience as for services facing an external customer. This should of course not be the case, since you are also trying to convert them – to change their behavior towards being data driven in their decision-making.

This is where service design tools could really help you out. With some quite basic user insight you can make some very relevant findings and change the priorities of your project to ensure maximum adoption by the end users.

How can your data project benefit from a service design approach?

To give you some food for thought on how to benefit from service design, we have drafted you something based on our work on various data projects. This is only one example of what service design tools can do for you.

Our experience and discussions from different organizations have led us to believe that there are universal data personas that apply to most data organizations. They differ in how motivated they are to change their ways of working with data and how well they master modern data development and skills. These personas are meant to inspire you to turn your eyes upon your users.

User personas are a tool widely used in service design to make the users come alive and facilitate discussion about their needs.

So, we proudly present: The Wannabe, The Enabler, The Mastermind and The Skeptic.


THE WANNABE

The Wannabe is very excited about data and craves to learn more. Is a visionary but doesn’t really quite know how to get there. Has basic data skills but is eager to learn more and be more data driven in work.

Give this user support and in return, make them your data community builder. Let them spread their enthusiasm!


THE ENABLER

The Enabler is both capable and motivated. This user is well connected in the data community and sees the advantage that collaboration and shared ways of working bring.

Keep this user close. They are the change makers due to their position in the organization as well as their attitude. Allow them to help others to grow.


THE MASTERMIND

The Mastermind is the user who doesn’t reply to your emails or show up in your workshops. This user doesn’t really need anyone’s help with data, as they have the tools and the skills already. Not very motivated to share expertise either or get connected.

Make them need you by providing help to routine work. This user requires much effort but can result in significant value in return when you can channel their exceptional skills to serve your vision. To get to the Mastermind user, use the Enabler.


THE SKEPTIC

The Skeptic knows all the things that have been tried out in the past and also how many of these have failed. Is an organizational expert and has a long history with data but feels left out.

Use some empathy, spend time with this user and listen closely. If you tackle their problem, you have a strong spokesperson and an ally, because they know everyone and everything in the organization.


As said, sketching these user personas is just one example of using design tools to change your approach. There might be people who don’t fit these descriptions and people who act as different personas depending on the project in question. The idea is to make generalizations in order to make it possible to “jump into the users’ shoes” and make different points of view more concrete.

Make the data personas work for you

So, besides changing your approach to get to different data personas, what else can you do? How can you utilize the information the personas provide you?

First of all, you can start by mapping your services and tools to the needs, skills and expectations of the users. You might be surprised. How many of your users are skilled enough to use the services you provide? Five? Is this enough? What is the relevance of their work? Does their work serve a wider group of people?

Try out this matrix below for your data services. How does the potential for your services look now?

When building data platforms, you’re actually building services

It’s about time we twist our minds into thinking about people first in data projects; what is the change you wish to see in people’s behavior. To create value, your data platform needs the users as badly as your sales need customers buying their products.

So, stop thinking about data development projects technology first. Start thinking about them as a collection of services. Find out about the users and their needs and prioritize your development accordingly. At the end of the day, people performing better by making better decisions is what you’re after, not a shining data platform that sees no usage.

The data personas are a starting point for discussion. Link this post to your colleague and ask if they identify themselves. And how about you, are you the Wannabe, the Enabler, the Mastermind or the Skeptic?


Authors of this blog post are two work colleagues, who share the same interest in looking at data services from the user perspective. At least once, they have laughed out loud at the office in such volume, that it almost disturbed co-workers in the common space. They take business seriously and life lightly.

You can contact Tuuli Tyrväinen and Kirsi-Marja Kaurala via e-mail (firstname.lastname@solita.fi) for further discussion.

How to manage your SQL Server and Snowflake hybrid environment EDW with Agile Data Engine

An example of how to run a hybrid environment EDW with Agile Data Engine, where the on-prem data warehouse is SQL Server and the cloud data warehouse is Snowflake.

Problem description

Agile Data Engine (ADE) is designed to work with a single target cloud data warehouse, but sometimes you need to keep some of your data in an on-premises environment, for example for regulatory reasons. You might not be able to open connections to your on-premises infrastructure, but would still like to manage both the on-prem and cloud data warehouses in a single design infrastructure. Below is a high-level diagram of the environment in question. We are focusing on creating a system where ADE can be used to develop the on-prem environment in addition to the main cloud data warehouse.

High level view of the hybrid setup

 

The premises for this blog post are:

  • You have an ADE edition with SQL Server export
  • The on-prem database is SQL Server
  • You have Azure Blob storage and Azure DevOps usable
    • Can be switched to other build and sharing locations quite easily

Reference architecture

In this blog post, we will go through a reference architecture where you have Azure as the cloud provider and SQL Server as the local database. The architecture presented here is not completely automated as it requires manual work at least in retrieving the SQL scripts from ADE. But as the on-premises system is secondary from the ADE development point of view, the number of objects should be small compared to the whole structure and developers can easily know what should be transferred to the on-prem environment.

NOTE: If you don’t want to create on-prem tables to the cloud data warehouse, keep them in a separate package that you don’t deploy.

Hybrid setup architecture

The steps of the diagram are explained in the following chapters. The first step is in Retrieval of SQL from ADE, second and third in Correcting the data to be built to DACPAC, fourth and fifth in Building DACPAC, and lastly both sixth and seventh in Executing the DACPAC.

Retrieval of SQL from ADE

Normally ADE runs the necessary SQLs on your behalf to the target database in the correct SQL dialect. But, ADE also has a feature to export the SQL scripts in the same or different dialects for manual inspection and testing. This is possible because ADE stores metadata and creates the SQL scripts as the last step. We are using this feature to get SQL scripts in a different dialect than the cloud data warehouse. First, go to the entity page that you want to export. From the SQL export options select SQL Server and check the box to create procedure versions of the loads.

Screenshot of ADE sql export functionality

Copy the generated SQL to separate files that are named based on the table/view/load names. Keep the files in folders based on schema and type. See the following example folder structure.

.
├── test-project.sqlproj
├── README.md
└── database
    ├── rdv
    │   ├── procedures
    │   │   └── load_h_author_from_source_entity_01.sql
    │   ├── rdv.sql
    │   └── tables
    │       ├── h_author.sql
    │       ├── l_author_book.sql
    │       └── s_author.sql
    └── staging
        ├── staging.sql
        └── tables
            └── books.sql

ADE has a tagging functionality to help with managing large numbers of packages and the entities inside the packages. Tags can be used to filter lists of both packages and entities, and they are automatically created for words in description fields starting with a hashtag. In the hybrid situation, the tagging feature can optionally be used to mark which entities are deployed to on-prem and to note in the description field when the deployment was done.

Screenshot of one ADE entity in listing mode

Correcting the data to be built to DACPAC

Let’s start with a brief explanation of DACPAC. A DACPAC is a Microsoft data-tier application (DAC) package. It is meant to contain all database and instance objects used by an application. Database objects are, for example, tables and views, while instance objects include users. In our use case, it is used to deploy database object changes to SQL Server by comparing the structure of the DACPAC to the target database.

The DACPAC is created by building the Visual Studio data-tier application defined by the project file and the included SQL scripts. Small changes need to be made to the SQLs generated by ADE, and also to the project file, for the DACPAC build to succeed with all the necessary files.

SQL changes

DACPAC generation requires that all separate commands have GO statements between them. For field comments, ADE generates separate EXEC commands that need to be separated this way.
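
As a rough illustration (the table, column and comment below are made up, and the exact comment syntax ADE generates may differ), the point is simply that every command gets its own GO separator:

CREATE TABLE rdv.h_author (
    author_hash CHAR(32) NOT NULL,
    author_name VARCHAR(200) NULL
);
GO
-- The field comment is a separate command and needs its own GO after it
EXEC sys.sp_addextendedproperty
    @name = N'MS_Description', @value = N'Hash key of the author',
    @level0type = N'SCHEMA', @level0name = N'rdv',
    @level1type = N'TABLE',  @level1name = N'h_author',
    @level2type = N'COLUMN', @level2name = N'author_hash';
GO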

Project file changes

The *.sqlproj file needs to be updated with the new files and folders for them to be included in the DACPAC. This happens by adding a new <Folder Include="<FOLDERPATH>" /> and <Build Include="<FILEPATH>" /> under an <ItemGroup> element. If SQLs are only updated inside already existing files, no project file changes are needed.

  <ItemGroup>
    <Folder Include="database" />
    <Folder Include="database\rdv\procedures" />
    <Folder Include="database\staging" />
    <Folder Include="database\staging\tables" />
  </ItemGroup>
  <ItemGroup>
    <Build Include="database\rdv\rdv.sql" />
    <Build Include="database\rdv\procedures\load_h_author_from_source_entity_01.sql" />
    <Build Include="database\staging\staging.sql" />
    <Build Include="database\staging\tables\books.sql" />
  </ItemGroup>

Building DACPAC

Building is done in Azure DevOps using the MSBuild task. The required files are then copied to the artifact folder for the upload step. The demo setup uses a bash script to upload to blob storage with a Shared Access Signature (SAS) token, but if possible, the AzureFileCopy task should be used.

NOTE: The SAS token should be stored separately from the pipeline in a secure variable; an environment variable is used here only for demo purposes.

pool:
  name: Azure Pipelines
  demands: msbuild

steps:
- task: MSBuild@1
  displayName: 'Build solution *.sqlproj'
  inputs:
    solution: '*.sqlproj'

- task: CopyFiles@2
  displayName: 'Copy Files to: $(Build.ArtifactStagingDirectory)'
  inputs:
    Contents: |
     **\*.DACPAC
     **\*.publish.xml
    TargetFolder: '$(Build.ArtifactStagingDirectory)'

- bash: |
   FOLDER="$(Build.ArtifactStagingDirectory)/bin/Debug/"
   az storage blob upload -c test-blob -n test-project.DACPAC -f $FOLDER/test-project.DACPAC
  displayName: 'Bash Script'
  env:
    AZURE_STORAGE_CONNECTION_STRING: BlobEndpoint=<ENDPOINT-WITH_TOKEN>

Screenshot of Azure blob storage with dacpac file

Executing the DACPAC

There are two options for executing the DACPAC. One is for the developer to download the file from blob-storage and run it against the target database with for example Azure Data Studio (requires microsoft.schema-compare extension). This option requires that there is network connectivity to the target database for example through a VPN tunnel.

The second option is to have an execution tool in the on-prem environment, such as SSIS. The tool can download the blob file and execute it against the target database. It is recommended to have some kind of manual approval before running the file against the production database. In addition to normal production safety measures, the reason here is that a DACPAC does not guarantee that it will do the same steps every time; instead, it relies on the comparison between the DACPAC and the target database.

Dacpac schema comparison

Operations

The DDL transfer process can be used for transferring load steps as SQL Server procedures. As no on-prem connection was allowed from ADE in our use case, the load execution needs to be handled from the on-prem environment. One possibility for this is SSIS. The executor can be directed by configuration in a metadata table inside the target environment; the metadata table holds information on which procedures to run and when. The following sequence diagram explains the load execution process at a high level.
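
As a minimal sketch, a hypothetical metadata table for such an executor could look like the following (the schema, table and column names are all made up for illustration):

-- Hypothetical load configuration table read by the on-prem executor (e.g. SSIS)
CREATE TABLE meta.load_config (
    load_id        INT IDENTITY(1,1) PRIMARY KEY,
    procedure_name VARCHAR(200) NOT NULL,  -- e.g. a load procedure generated by ADE
    schedule_group VARCHAR(50)  NOT NULL,  -- which scheduled run picks this load up
    run_order      INT          NOT NULL,  -- execution order inside the group
    is_enabled     BIT          NOT NULL DEFAULT 1
);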

Example sequence diagram for load execution

Summary

In this blog post, we went through an example architecture of a hybrid EDW setup with Agile Data Engine. We looked at a technical level at how to use the SQL export functionality in Agile Data Engine and DACPACs to move DDL scripts from Agile Data Engine to an on-prem SQL Server. The operations side depends heavily on the existing on-prem infrastructure and was, therefore, only lightly touched.

DataOps machine

What to look for in a DataOps platform?

DataOps series is back at last. This time we are going through what to look for when choosing your technical approach for DataOps.

Building a resilient data capability can be a tricky task, as there are a vast number of viable options for how to proceed. Price and marketing materials don’t tell it all, and you need to decide what is valuable for your organization when building a data platform.

Before going into the actual features, I want to give a few thoughts on cloud computing platforms and their native services and components. Is it always best to just stay platform native? All those services are “seamlessly integrated” and easy for you to take into use. But to what extent do those components support building a resilient data platform? You can say that depends on how you define a “data platform”.

My argument is that a data warehouse still is a focal component in enterprise wide data platform and when building one, the requirements for the platform are broader than what the platforms themselves can offer at the moment.

But let’s return to this argument later and first go through the requirements you might have for DataOps tooling. There are some basic features like version control and control over the environments as code (Infrastructure as Code), but let’s concentrate on the more interesting ones.

Model your data and let the computer generate the rest

The beef of classic data warehousing. Even if you call it a data platform instead of a data warehouse, build it on the cloud and use development methods that originate from software development, you still need to model your data and also integrate and harmonize it across sources. There is sometimes a misconception that if you build a data platform on the cloud and utilize some kind of data lake in it, you would not need to care about data models.

This tends to lead to different kinds of “data swamps” which can be painful to use and the burden of creating a data model is inherited to the applications built on top of the lake.

Sure there are different schema-on-read type of services that sit on top of data lakes but they tend to have some shortcomings when compared to real analytical databases (like in access control, schema changes, handling deletes and updates, performance of the query optimizer, concurrent query limits, etc.).

To minimize the manual work related to data modelling, you only want the developers to do the required logical data modelling and to derive the physical model and data loads from that model. This saves a lot of time, as developers can concentrate on the big picture instead of technical details. Generating the physical model also keeps it more consistent, because there won’t be different personal implementation styles: developers don’t write the code by hand, it is generated automatically based on the model and commonly shared rules.

Deploy all the things

First of all, use several separate environments. Developers need an environment where it is safe to try out things without the fear of breaking things or hampering the functionality of the production system. You also need an environment where production ready features can be tested with production-like data. And of course you need the production environment which is there only to serve the actual needs of the business. Some can cope with two and some prefer four environments but the separation of environments and automation regarding deployments are the key.

In the past, it has been a hassle to move data models and pipelines from one environment to another but it should not be like that anymore.

Continuous integration and deployment are the heart of a modern data platform. Process can be defined once and automation handles changes between environments happily ever after.

It would also be good if your development tooling supports “late building” meaning that you can do the development in a DBMS (Database Management System) agnostic way and your target environment is taken into account only in the deployment phase. This means that you are able to change the target database engine to another without too much overhead and you potentially save a lot of money in the future. To put it short, by utilizing late building, you can avoid database lock-in.

Automated orchestration

Handling workflow orchestration can be a heck of a task if done manually: when can a certain run start, what dependencies does it have, and so on. The DataOps way of doing orchestration is to fully automate it. Orchestrations can be populated based on the logical data model and the references and dependencies it contains.

In this approach the developer can concentrate on the data model itself and automation optimizes the workflow orchestration to fully utilize the parallel performance of the used database. If you scale the performance of your database, orchestrations can be generated again to take this into account.

The orchestrations should be populated in a way that concurrency is controlled and fully utilized, so that things that can be run in parallel are run in parallel. When you add a step to the pipeline, changes to the orchestrations should be deployed automatically. It sounds like a small thing, but in a large environment it really saves time and nerves.

Assuring the quality

It’s important that you can control the data quality. At best, you could integrate the data testing in your workflows so that the quality measures are triggered every time your flow runs and possible problems are reacted to.

Data lineage can help you understand what happens with the data before it ends up in the use of end users.

It can also be a tool for the end users to better understand what the data is and where it comes from. You could also use tags for elements in your warehouse so that it’s easier to find everything that is related to for example personally identifiable information (PII).

Cloud is great but could it be more?

So, about the cloud computing platforms. Many of the previously mentioned features can be found in or built with native cloud platform components. But how about keeping at least the parts you invest work in heavily platform agnostic, in case at some point, for some reason, you have to make a move regarding the platform (reasons can be changes in pricing, corporate mergers & acquisitions, etc.)? For big corporations it’s also common to utilize more than one cloud platform, and common services and tooling across the clouds can be a guiding factor as well.

Data modelling especially is an integral part of data platform development but it still is not something that’s integrated in the cloud platforms themselves on a sufficient level.

In the past, data professionals were used to a quite seamless developer experience with a limited amount of software required to do the development. On cloud platforms this changed, as you needed more services, most of them with a slightly different user experience. This has since changed a bit as cloud vendors have started to “package” the services (such as AWS SageMaker or Azure DevOps), but we are still in the early phases of packaging that kind of tooling.

If the DataOps capabilities are something you would like to see out-of-the-box in your data platform, go check out our DataOps Platform called Agile Data Engine. It enables you to significantly reduce time to value from business perspective, minimize total cost of ownership and make your data platform future-proof.

Also check previous posts in DataOps series

DataOps – new kid on the data block

Why and how to enable DataOps in an organization?

Cover image from Azure Synapse

Azure Synapse Analytics – Unifying your data pipeline toolbox

This blog post will serve as an overview of the capabilities to give you insight on how to position Synapse Analytics to your overall data platform architecture. We won’t do a deep dive to technical implementation in this post.

Introduction

Azure SQL capabilities have evolved a lot over the years. Azure offers everything from VMs for running your own SQL Server setup, to SQL DB Hyperscale where you are getting your extremely scalable, but still traditional, DB as a service. But what about massively parallel processing data warehousing? Data integrations? Data science? What about limitless analytics? For that, there’s Azure Synapse Analytics.

In November 2019, Microsoft announced Azure Synapse as a limitless analytics service and the next evolution of SQL Data Warehouse. The only thing released in GA at that point was the renaming of SQL DW. Other Synapse features were only available in limited private preview, and for most, setting up Synapse Analytics meant setting up the former SQL DW. Nowadays Azure Synapse Analytics is the name for the whole analytics service, with the former SQL DW being part of it. At the time of writing, the former SQL DW (also known as Synapse SQL provisioned) is still the only thing in GA; the rest of the features are in public preview, and anyone can set up a Synapse Analytics workspace. Yes, the naming is confusing, but hang on, we will try to clear that up for you.

Synapse Analytics Architecture

Note: At the moment of writing this blog, Synapse Analytics can refer to both “Synapse Analytics (formerly SQL DW)” and “Synapse Analytics (workspace preview)” in Azure documentation. Here we are talking about unified experience in the workspace preview.

Azure Synapse Analytics is the common naming for integrated tooling providing everything from source system integration to relational databases to data science tooling to reporting data sets.  Synapse Analytics contains the following

"<yoastmark

  • Synapse SQL
    • This is the data warehouse part formerly known as Azure SQL DW.
    • Provides both serverless (SQL on-demand) and pre-allocated (SQL Pool) resources
    • Shares a common metastore with the Spark engine for seamless integration
  • Apache Spark
    • Seamlessly integrated big data engine.
    • Shares a common metastore with Synapse SQL
  • Data Flow and Integrations
    • Shares codebase with Azure Data Factory, so has everything you expect and more
    • Integrate to nearly a hundred data sources to ingest data
    • Orchestrate SQL Procedures and Spark Notebooks
  • Management and Monitoring
    • Familiar management and monitoring tools from Azure Data Factory are available.
  • Synapse Studio
    • Single web UI where you can create everything:
      • Integrate to source systems
      • Land data to Azure Data Lake
      • Explore the data using Spark notebooks
      • Load data to Synapse SQL using T-SQL scripts
      • Predict what needs predicting using Python, Scala, C# or SQL in Spark
      • Publish data sets to Power BI
      • Manage and orchestrate everything with pipelines

In conclusion, Synapse Analytics refers to all of the capabilities available to you and not a single included tool. Synapse SQL, as the name suggests, is perhaps the most recognizable part with the SQL DW and T-SQL. However, Synapse Analytics also contains a serverless SQL form factor named SQL On-demand. So using T-SQL no longer requires a provisioned SQL Pool.

Synapse Analytics Unique value proposal

By unifying all of the tools mentioned above, Synapse Analytics really brings something new to the table. Using Synapse Analytics you can, without ever leaving the Synapse Studio, connect to a new on-premises data source, extract, load and transform that data to Data Lake and Synapse SQL, enrich it further with ML models, and provide it for reporting usage.

Synapse Analytics provides capabilities for each of the steps in a data pipeline: ingestion, preparing data, storing, exploring, transforming and serving the data:

Azure Synapse Analytics fit

Ingest

If you have previously used Azure Data Factory, you will be right at home using the data integration tools in Synapse Analytics. Synapse Analytics even shares the same codebase with Data Factory, so everything you have grown accustomed to is already there (almost everything, you can check the complete differences from https://docs.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences). You can use an Integration Runtime running inside your on-premises network to access all your data sources inside your network, or an Azure hosted one for massive scale. The list of natively supported systems is constantly growing, and you can find up-to-date information from https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-overview#supported-data-stores-and-formats. And if your source system is not on the list, you can always use a self hosted integration runtime with ODBC or JDBC drivers.

Despite their similarities, there’s currently no tooling to migrate ADF pipelines to Synapse Analytics Pipelines, and you can’t use the same Integration Runtime for both ADF and Synapse Analytics.

Another interesting preview feature is Azure Synapse Link. It allows you to run analytical workloads on top of operational data in Azure Cosmos DB near real time without affecting the operational usage. This means that application data in Cosmos DB is available for Synapse SQL and Spark using cloud native HTAP without building ETL workflows. https://docs.microsoft.com/en-us/azure/cosmos-db/configure-synapse-link

Prepare

After connecting to your data source, you can extract your data to Azure. There are numerous ways to organize the data in the data lake. Generally, you’ll probably want to have the original untouched data in a “raw” area, and processed and more refined data in another area.

With the new COPY INTO T-SQL command, you can use gzipped CSV format to reduce the storage footprint. If using CSVs, it is advisable to split large files into smaller ones, depending on your DWU capacity. And finally, you don’t have to worry about the 1 MB row limit and hard-coded separator values as with PolyBase.
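
As a rough sketch of what such a load could look like (the storage account, container, table name and options below are made up and will vary per environment):

COPY INTO dbo.product_comments
FROM 'https://myaccount.blob.core.windows.net/datalake/raw/comments/*.csv.gz'
WITH (
    FILE_TYPE = 'CSV',
    COMPRESSION = 'GZIP',
    FIELDTERMINATOR = ';',  -- separators are configurable, unlike the PolyBase defaults
    FIRSTROW = 2            -- skip the header row
);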

To really leverage all the possibilities of Synapse Analytics, you should use the parquet format in your data lake. In addition to being a compressed format, using parquet supports predicate pushdown in Synapse SQL and Spark. This greatly speeds up your exploratory queries to data lake from Synapse SQL or Spark, as you don’t have to read all the files from a given folder structure to get the rows you want. With parquet, the COPY INTO command automatically splits your files to speed up the processing.

Using Data Flows to clean and unify your data

Data Flows provide a no-code approach to transforming your data. As integrations, Data Flows are also already familiar from Azure Data Factory. With Data Flows, you could for example combine a few columns, delete a redundant one, calculate a working unique identifier and join the data with another flow before storing back to Data Lake as a processed file and also persisting it to SQL Pool table.

Explore data

Synapse Studio gives you multiple options to explore your data. We can graphically view the data lake structure, and easily get a SQL script or Spark Notebook to view the file contents just by right clicking a file.

Select from data lake file

With SQL On-demand, you can explore the contents of files without moving or importing them anywhere, and generate simple graphs in Synapse Studio UI to give you an idea of what you are looking at.
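
As an illustration (the storage path below is made up), an exploratory SQL On-demand query over parquet files can be as simple as:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/datalake/raw/comments/*.parquet',
    FORMAT = 'PARQUET'
) AS comments;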

Graph T-SQL results in Azure Synapse Analytics

Synapse SQL, both on-demand and provisioned, can be connected to outside Synapse Studio with different clients using standard drivers such as ODBC, JDBC and ADO.NET.

In Apache Spark (example in Scala), using data from the SQL Pool is as easy as:

%%spark
val df = spark.read.sqlanalytics("SQLPool.schema.table")

This doesn’t require any configurations on Spark’s side, as the Spark and SQL are just two different runtimes operating on the same metadata and data sources.

Transform and enrich data

For transforming and enrichment, Synapse Analytics offers Spark Notebooks in addition to T-SQL. However, it has already been pretty easy to add Databricks notebooks as a part of your Azure Data Factory pipelines. With Synapse Analytics, again this integration is a bit more ready-made and easier.

One interesting possibility is SQL On-demand and its external tables. SQL On-demand doesn’t get access to SQL Pool’s tables (you don’t even need to provision a SQL Pool or Synapse SQL to use SQL On-demand), but you can create external tables using T-SQL. External tables are stored as parquet-backed files in Azure Data Lake Storage Gen 2. Compared to the current Azure General Availability offering, moving towards SQL On-demand and external tables in your transformations offers a serverless and fully scalable architecture. What the performance of external tables and SQL On-demand is going to be like remains to be seen.
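
A minimal sketch of such a transformation with CETAS (CREATE EXTERNAL TABLE AS SELECT); the schema, external data source, file format and paths below are hypothetical and would need to exist beforehand:

CREATE EXTERNAL TABLE curated.product_comments
WITH (
    LOCATION = 'curated/product_comments/',  -- target folder in the data lake
    DATA_SOURCE = my_data_lake,              -- hypothetical external data source
    FILE_FORMAT = parquet_file_format        -- hypothetical parquet file format
)
AS
SELECT comment_id, comment_text
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/datalake/raw/comments/*.parquet',
    FORMAT = 'PARQUET'
) AS src;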

Serve

Synapse Analytics offers a few new interesting features for serving enriched data: SQL On-demand and Power BI Service connection.

With SQL On-demand, it is possible to create a highly scalable data pipeline all the way from source systems to transforming and storing data to serving it to Power BI, without any predetermined service level or capacity choices:

  • Ingest and Transform data using Data Pipelines and Data Flow. Use SQL On-demand or Data Flow as a part of your pipeline where needed
  • Store the results with “CETAS”, a T-SQL command CREATE EXTERNAL TABLE AS SELECT to store the results of your final SELECT statement to Azure Data Lake Storage Gen 2
  • Create a data set in Power BI. You can either use the files directly or use Direct Query via SQL On-demand.

Lastly, Synapse Analytics has the capability to link to Power BI Service! You can create linked services with Power BI Workspaces, view datasets and also create Power BI reports without ever leaving Synapse Studio. This greatly simplifies the process of creating reports from new data.

Connect Synapse Analytics with Power BI workspace

Do note that at the moment, you can only link to ONE Power BI Workspace. After connecting to a Power BI Workspace, you can create new Power BI reports based on published data sets without leaving Synapse Studio.

Create Power BI Reports straight from Synapse Studio

Manage and orchestrate

For orchestrating your pipelines, Synapse Analytics offers pretty much the same tooling as Azure Data Factory. You can easily combine your Data Flows, Spark Notebooks, T-SQL queries and everything to form pipelines as in Azure Data Factory. The familiar triggers are also there.

Management views also share much of the same as Azure Data Factory.

Synapse Analytics excels

Synapse Analytics’ strengths lie in its unified and yet versatile tooling.

  • Starting a new data platform project
    • Synapse Analytics offers a unified experience for creating ingestion, preparation, transformations and serving of your data in one place
  • Architecture with Data Lake
    • Synapse Analytics’ unified tooling makes it easier to work with and explore the Data Lake
  • Architecture with Cosmos DB (or other future possible Synapse Link sources)
    • Near-real-time analytics based on operative data sources without any manual ETL processes
  • Explorative work on unknown data
    • Many tools and languages in one place
  • Security management in Synapse Workspace
    • Instead of setting up and configuring multiple separate tools with authentications and networks between them, you have only the Synapse Analytics workspace to set up
  • Possibility for both no-code and code approaches
    • A suitable approach probably exists for many developers to get started
  • Synapse SQL Pool as an MPP database
    • Synapse SQL is a powerful database in its own right

What we wish to see in future updates

  • While Synapse SQL Pool does pack a punch, it does require DBA work and manual maintenance. Maybe more automation regarding this in the future?
    • Different indexes, partitions and distributions will have an impact on performance and storage costs
    • Workload management: classification, importance, isolation all need to be planned for and addressed for full scale operations
  • More dynamic resource scaling
    • Scaling Synapse SQL is an offline operation
    • While SQL On-demand offers some exciting possibilities, it fails to deliver any database functionalities: no access to relational tables for example as it doesn’t require a SQL Pool. There are parquet backed external tables, but their performance remains to be seen
  • Development in the Spark-sector as Synapse’s Spark is not on par with Databricks
    • For example, one notebook locks one Spark pool
    • Different runtime versions compared to Databricks
  • Implementation of version control
    • There’s no Git or any other version control system support. The only way to save work is to publish, which also makes it visible to all other developers.
    • Managing different environments without version control system requires the use of SDK or APIs, which means a lot of work to get production ready
  • Getting from preview to GA
    • There are still instabilities and UI issues: the usual preview shenanigans, which will hopefully be taken care of before moving to general availability.

Conclusion

In conclusion, Synapse Analytics’s vision is a great one: unified tool and experience to create almost everything you require in Azure to get your data from a data source to a published report. At the moment, there are some drawbacks as the whole workspace experience is still in preview, of which the lack of version control is definitely not the smallest.

At this point, we haven’t made any performance analysis and it is too early to say, if Synapse Analytics can really be a silver bullet for combining data pipelines, warehousing and analytics. In addition to performance, some key unique features are somewhat handicapped at this point: for example you can only connect to one Power BI Workspace. The vision is definitely there, and we will be anxiously waiting for future updates.

Snowflake external functions, Part 2 – How to do Natural Language Processing and analyze product reviews stored in Snowflake

This tutorial is a hands-on tutorial for Snowflake external functions and shows how you can translate and categorize your Snowflake data using Amazon Translate and Amazon Comprehend services. This tutorial will translate Finnish product comments to English and analyze them.

This is the second part of the external functions blog post series and teaches how you can trigger Amazon services like Translate and Comprehend using Snowflake external functions. I will also go through the limitations of external functions in this blog post.

In the first blog post, I taught how you can set up your first Snowflake external function and trigger simple Python code inside AWS Lambda. In case external functions are a new concept for you, I suggest reading the first blog post before diving into this.

External functions limitations

Previously I purposely left out the limitations of external functions, but now that we are building something actually useful with them, you should understand the playground you have and where the boundaries are.

Let’s start with the basics. As all the traffic goes through AWS API Gateway, you’re bound to the limitations of API Gateway. The max payload size for AWS API Gateway is 10MB, and that can’t be increased. Assuming that you will call AWS Lambda through API Gateway, you will also face the Lambda limits, which max out at 6MB per synchronous request. Depending on the use case or the data pipeline you’re building, there are workarounds, for example ingesting the raw data directly through S3.

Snowflake also sets limitations; for example, the remote service at a cloud provider, in our case AWS, must be callable from a proxy service. External functions must be scalar functions, which means a single returned value for each input row. That doesn’t mean you can only process one row at a time, though: the data is sent as a JSON body which can contain multiple “rows”. Additional limitations set by Snowflake are that the Snowflake optimizer can’t be used with external functions, external functions can’t be shared through Secure Data Sharing, and you can’t use them in the DEFAULT clause of a CREATE TABLE statement or with COPY transformations.

Things to consider

The cloud infrastructure, AWS in this case, also sets limitations, or rather things to consider. How does the underlying infrastructure handle your requests? You should think about how your function scales, how it acts under concurrency and how reliable it is. A simple function called by a single developer usually works without any issues, but you must design your function so that it also works with multiple developers calling it numerous times within an hour, in contrast to the single call you made during development.

Concurrency is an even more important issue, as Snowflake can parallelize external function calls. Can the function you wrote handle multiple parallel calls at once, or does it crash? Another thing to understand is that with multiple parallel calls, you end up in a situation where the functions are in different states. This means that you should not write a function where the result depends on the order in which the user’s rows are processed.

Error handling is also a topic which should not be forgotten. Snowflake external functions understand only the HTTP 200 status code; all other status codes are considered an error. This means that you need to build the error handling into the function itself. External functions also surface rather poor error messages, so you need to log all those non-200 responses somewhere for later analysis.

Moneywise you’re also partly blind: calling the Snowflake SQL function hides all the costs related to the underlying AWS services. An external function which is implemented poorly can lead to high costs.

Example data format

External functions call the remote resource by issuing an HTTP POST request. This means that the sent data must be in a specific format to work, and the returned data must also conform to a specific format. Due to these factors, the data sent and returned might look unusual. For example, we must always send an integer value with each row of data; this value acts as a 0-based row index. The actual data is converted to JSON data types, i.e.

  • Numbers are serialized as JSON numbers.
  • Booleans are serialized as JSON booleans.
  • Strings are serialized as JSON strings.
  • Variants are serialized as JSON objects.
  • All other supported data types are serialized as JSON strings.

It’s also essential to recognise that this means dates, times, and timestamps are serialized as strings. Each row of data is a JSON array of one or more columns. The sent data can be compressed using the COMPRESSION syntax of the CREATE EXTERNAL FUNCTION SQL clause, though it’s good to understand that Amazon API Gateway compresses and decompresses requests automatically.

What are Amazon Translate and Amazon Comprehend?

As Amazon advertises, Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. What does that truly mean? It means that AWS Translate is a fully automated service where you can transmit text data to Translate API and you get the text data translated back in the chosen language. Underneath the hood, Amazon uses its own deep learning API to do the translation. In our use case, Amazon Translate is easy to use as the Translate API can guess the source language. Normally you would force the API to translate text from French to English, but with Translate API, we can set the source language to ‘auto’ and Amazon Translate will guess that we’re sending them French text. This means that we only need minimal configuration to get Translate API to work.

Amazon Translate can even detect and translate Finnish, which is sometimes considered a hard language to learn.


For demo purposes, Translate’s billing is also a great fit, as you can translate 2M characters monthly during your first 12 months, starting from your first translation. After that period, the bill is $15.00 per million characters.

Amazon Comprehend is also a fully managed natural language processing (NLP) service that uses machine learning to find insights and relationships in text. You can use your own models or built-in models to recognize entities, extract key phrases, detect dominant languages, analyze syntax, or determine sentiment. Like Amazon Translate, the service is called through an API.

As Finnish is not a supported language for Amazon Comprehend, the translated English text is run through the Comprehend API to get insights.

Example – Translating product comments made in Finnish to English with Amazon Translate and Snowflake external functions

As we have previously learned how to create the connection between Snowflake and AWS, we can focus in this example on the Python code and the external function itself, which is going to trigger the Amazon Translate API. As with all Amazon services, calling the Translate API is really easy: you only need to import the boto3 library and use a client session to call the translate service.

After setting up the Translate client, you call the service with a few mandatory parameters and you’re good to go. In this example, we are going to leverage the Python code which was used in the previous blog post.

Instead of doing a simple string addition to the input, we’re going to pass the input to the Translate API, translate the text to English and get the result back in JSON format for later use. You can follow the instructions in the previous example and replace the Python code with the new code stored in my Github account.

After changing the Python code, we can try it right away, because the external function does not need any changes and data input is done in the same way as previously. In this case, I have created a completely new external function, but it works in a similar way as before. I have named my Lambda function translate and I’m calling it with my Snowflake function lambda_translate_function as shown.

Calling the function is easy, as we have previously seen, but when we call the Translate API directly we will get the full JSON answer, which contains a lot of data we do not need.


Because of this, we need to parse the data and only fetch the translated text.
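
A minimal sketch of such parsing, assuming lambda_translate_function has a signature similar to the Part 1 example (an integer and a varchar) and returns the raw Translate API response as a VARIANT; the table and column names are made up:

SELECT translated:TranslatedText::string AS comment_in_english
FROM (
    SELECT lambda_translate_function(0, product_comment) AS translated
    FROM product_comments  -- hypothetical table holding Finnish comments
);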

As you can see, creating functions which do more than simple calculations is easy with external functions. We could gather a list of product comments in multiple languages and translate them into one single language for better analysis, e.g. understanding in this case that the Finnish comment means that the sneakers sold are rubbish in quality.

Example – Categorizing product comments with Amazon Comprehend and Snowflake external functions

Extending the previous example, we have now translated the Finnish product comment to English. We can extend this further by doing sentiment analysis for the comment using Amazon Comprehend. This is also a straightforward job and only requires you to either create a new Python function which calls the Comprehend API or, for demo purposes, modify the existing Python code.

The only changes needed are to the Python code and to the IAM role which the Lambda uses. The Python code is again really similar to the previous one; this time we call the Comprehend service using the same boto3 library.

Snowflake external functions

To detect sentiment, we use the similarly named operation and provide the input source language and the text to analyze. You can use the same test data which was used with the Translate demo and with the first blog post; Comprehend will, though, give a NEUTRAL analysis for that test data.

Before heading to Snowflake and trying out the new Lambda function, go to the IAM console and grant enough rights for the role that the Lambda function uses. The rights shown here are used for demo purposes; ideally, read rights for the DetectSentiment action alone are enough.

These are just example rights and contain more than is needed.

Once you have updated the IAM role rights, jump into the Snowflake console and try the function in action. As we want to stick with the previous demo data, I will analyze the outcome of the previous translation. For demo purposes, I have left out the single apostrophe, as those are used by Snowflake as string delimiters.
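
A minimal sketch of such a call, assuming a hypothetical external function lambda_comprehend_function that wraps Comprehend's DetectSentiment and returns the raw response as a VARIANT (the comment text is only a placeholder):

SELECT sentiment_response:Sentiment::string AS sentiment
FROM (
    SELECT lambda_comprehend_function(0, 'The sneakers sold are rubbish in quality') AS sentiment_response
);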

Snowflake external functions

As you can see, the instant analysis for the text was right. To be sure that we are getting correct results, we can test with new test data, i.e. a positive product comment.

Snowflake external functions

As you can see, with Snowflake external functions it’s really easy to leverage machine learning, natural language processing or AI services together with Snowflake. External functions are a new feature, so this capability will only grow in the future. Adding Azure and Google compatibility is already on the roadmap, but in the meantime, you can start your DataOps and MLOps journey with Snowflake and Amazon.

Snowflake external functions, Part 1 – Hello World tutorial for triggering AWS Lambda

This tutorial is a hands-on Hello World introduction and tutorial to external functions in Snowflake and shows how to trigger basic Python code inside AWS Lambda

External functions are a new functionality published by Snowflake, already available for all accounts as a preview feature. With external functions, it is now possible to trigger, for example, Python, C# or Node.js code, or native cloud services, as part of your data pipeline using simple SQL.

I will publish two blog posts: the first explains what external functions are in Snowflake and shows how to trigger basic Hello World Python code in AWS Lambda with the result showing in Snowflake, and the second shows how you can trigger Amazon services like Translate and Comprehend using external functions to enable concrete use cases.

In this first blog post, I will focus on showing how you can set up your first external function and trigger Python code which echoes your input back to Snowflake.

What are external functions?

In their simplest form, external functions are scalar functions which return values based on the input. Under the hood, they are much more. Compared to a traditional scalar SQL function, where you are limited to SQL, external functions open up the possibility to use, for example, Python, C# or Go as part of your data pipelines. You can also leverage third-party services and call, for example, Amazon services if they support the set requirements. To pass the requirements, the remote service must be able to accept a JSON payload and return JSON output, and it must be accessible through an HTTPS endpoint.

Example – How to trigger an AWS Lambda function

This example follows the instructions from the Snowflake site and shows you in more detail how you can trigger Python code running on AWS Lambda using external functions, as illustrated below.

Snowflake External Functions

To complete this example, you will need an AWS account where you have the necessary rights to create AWS IAM (Identity and Access Management) roles, API Gateway endpoints and Lambda functions. You will also need Snowflake ACCOUNTADMIN privileges or a role which has CREATE INTEGRATION rights.

These instructions consist of the following chapters.

  • Creating a remote service (Lambda Function on AWS)
  • Creating an IAM role for Snowflake use
  • Creating a proxy service on AWS API Gateway
  • Securing AWS API Gateway Proxy
  • Creating an API Integration in Snowflake
  • Setting up trust between Snowflake and IAM role
  • Creating an external function in Snowflake
  • Calling the external function

These instructions are written for a person who has some AWS knowledge, as they will not explain the use of the services in detail. We will use the same template as the Snowflake instructions to record authentication-related information. Having already done a few external function integrations, I highly recommend using this template.

Cloud Platform (IAM) Account Id: _____________________________________________
Lambda Function Name...........: _____________________________________________
New IAM Role Name..............: _____________________________________________
Cloud Platform (IAM) Role ARN..: _____________________________________________
Proxy Service Resource Name....: _____________________________________________
Resource Invocation URL........: _____________________________________________
Method Request ARN.............: _____________________________________________
API_AWS_IAM_USER_ARN...........: _____________________________________________
API_AWS_EXTERNAL_ID............: _____________________________________________

Creating a remote service (Lambda Function on AWS)

Before we create the Lambda function, we need to obtain our AWS platform id. The easiest way to do this is to open the AWS console and open “Support Center” under “Support” on the far right.

This will open a new window which will show your AWS platform id.

Record this 12-digit number into the template shared previously. Now we will create a basic Lambda function for our use. From the main console, search for Lambda.

Once you have started Lambda, create a new function called snowflake_test using Python 3.7 runtime. For the execution role, select the option where you create a new role with basic Lambda permissions.

After pressing the “Create function” button, you should be greeted with the following view where you can paste the example code. The example code will echo the input provided and add text to confirm that the Snowflake to AWS connection is working. You can consider this as a Hello World -type of example which can be leveraged later on.

Snowflake External Functions

Copy-paste the following Python code from my Github account into the Function code view. We can test the Python code with the following test data, which should produce the following end result:

After testing the Lambda function, we can move on to creating an IAM role which is going to be used by Snowflake.

Creating an IAM role for Snowflake use

Creating an IAM role for Snowflake use is a straightforward job. Open up the Identity and Access Management (IAM) console, select “Roles” on the right and press “Create role”.

You should be greeted with a new view where you can define which kind of role you want to create. Create a role which has Another AWS account as a trusted entity. In the box for Account ID, give the same account id which was recorded earlier in the instructions.

Name the new role snowflake_role and record the role name in the template. Also record the role ARN.

Creating a proxy service on AWS API Gateway

Create an API Gateway endpoint to be used. Snowflake will use this API endpoint to contact the Lambda service which we created earlier. To create this, choose the API Gateway service from the AWS console and select “Create API”. Call this new API snowflake_test_api and remember to select “Regional” as the endpoint type, as it is currently the only supported type.

Create a Resource for the new API. Call the resource snowflake and record it in the template as the Proxy Service Resource Name.

Create a Method for the new API from the “Actions” menu: choose POST and press the grey checkmark to create it.

During the creation, choose Lambda Function as the integration type and select “Use Lambda Proxy Integration”. Finally, choose the Lambda function created earlier.

Save your API and deploy your API to a stage.

Creating a new stage can be done at the same time as the deploy happens.

Once deployed, record the Invoke URL from POST.

Now we’re done creating the API Gateway. The next step is to secure the API Gateway so that only your Snowflake account can access it.

Securing AWS API Gateway Proxy

In the API Gateway console, go to your API method and choose Method Request.

Inside Method Request, choose “AWS_IAM” as the Authorization mode.

Record the Method Request ARN to the template to be used later on. You can get the value underneath the Method Request.

Once done, go to Resource Policy and deploy the policy from my Github account. You can also copy the policy from the Snowflake example. In the AWS Principal, replace the <12-digit number> and <external_function_role> with your AWS platform id and the IAM role created earlier. In the AWS Resource, replace the resource with the Method Request ARN recorded earlier. Save the policy once done and deploy the API again.

Creating an API Integration in Snowflake

The next steps happen in the Snowflake console, so open it up with a user who has the necessary rights to create the integration.

With the necessary rights, type in the following SQL, where <cloud_platform_role_ARN> is the ARN of the IAM role created previously and api_allowed_prefixes is the resource invocation URL.

CREATE OR REPLACE API INTEGRATION snowflake_test_api
api_provider = aws_api_gateway
api_aws_role_arn = '<cloud_platform_role_ARN>'
enabled = true
api_allowed_prefixes = ('https://')
;

The end result should look something like this:

When done, obtain the API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID values by describing the integration.
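
Describing the integration is a single statement against the integration created above:

DESCRIBE INTEGRATION snowflake_test_api;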

Setting up trust between Snowflake and the IAM role

Next steps are done in the AWS console using the values obtained from Snowflake.

In the IAM console, choose the previously created role and select “Edit trust relationships” from “Trust relationships” -tab.

In Edit Trust Relationships modify the Statement.Principal.AWS field and replace the value (not the key) with the API_AWS_IAM_USER_ARN that you saved earlier.

In the Statement.Condition field, paste "StringEquals": { "sts:ExternalId": "xxx" } between the curly brackets. Replace the xxx with the API_AWS_EXTERNAL_ID. The final result should look something like this.

Update the policy when done and go back to the Snowflake console.

Creating an external function in Snowflake

In Snowflake, create the external function as follows. The <api_integration_name> is the integration we created previously in the Snowflake console. The <invocation_url> is the resource invocation URL saved before; this time, also include the resource name.

CREATE EXTERNAL FUNCTION my_external_function(n INTEGER, v VARCHAR)
RETURNS VARIANT
API_INTEGRATION = <api_integration_name>
AS '<invocation_url>'
;

The end result should look something like this:

Calling the external function

You can now finally test that the connection is working by selecting the function with an integer value and any given string. The output should be as shown in the image. As you can see, this example is really basic and only shows that the connection is working.
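
For example (the literal values here are arbitrary placeholders):

SELECT my_external_function(1, 'Hello World');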

If you face errors during the execution, check the troubleshooting page at Snowflake for possible solutions. I can say from experience that you must follow the instructions really carefully and remember to re-deploy the API at AWS often to reflect your changes.

This blog post has now covered the basics of external functions, i.e. how you can trigger basic Python code running inside AWS Lambda. Next time, I will show how you can build something concrete using the same tools and Amazon services.

Are we there yet?

External functions are currently a preview feature open to all accounts, but they currently support only services behind the Amazon AWS API Gateway.