MLOps: from data scientist’s computer to production

MLOps refers to the practice of automating the lifecycle of machine learning models, from data preparation and model building to production deployment and maintenance. MLOps is not just a machine learning platform or technology; it requires a change in the mindset of machine learning development towards the best practices of software development. In this blog post we introduce the concept and its benefits for anyone who has, or plans to have, machine learning models running in production.

Operationalizing data platforms, DataOps, has been among the hottest topics of the past few years. Recently, MLOps has also become one of the hottest topics in data science and machine learning. Operational data platforms have made data available for analytics and enabled the development of machine learning models on a completely new scale. But while the development of machine learning models has expanded, the processes for maintaining and managing the models have not kept pace. This is where the concept of MLOps becomes relevant.

What is MLOps?

Machine learning operations, or MLOps, is a concept similar to DevOps (or DataOps), but tailored specifically to the needs of data science and, more specifically, machine learning. DevOps was introduced to software development over a decade ago. DevOps practices aim to improve application delivery by combining the entire life cycle of an application – development, testing and delivery – into one process, instead of having a separate development team hand the finished solution over to an operations team for deployment. The key benefits of DevOps are shorter development cycles, increased deployment velocity, and dependable releases.

Just as DevOps aims to improve application delivery, MLOps aims to productionalize machine learning models in a simple and automated way.

As for any software service running in production, automating the build and deployment of ML models is equally important. In addition, machine learning models benefit from versioning, monitoring, and the ability to retrain and deploy new versions of the model – not only for reliability when data is updated, but also from the transparency and AI ethics perspective.

Why do you need MLOps?

Data scientists’ work is research and development, and it essentially requires skills in statistics and mathematics as well as programming. It is iterative work of building and training various models. Many teams have data scientists who can build state-of-the-art models, but their process for building and deploying those models can be entirely manual. It might happen locally, on a personal laptop with copies of data, and the end product might be a CSV file or PowerPoint slides. These kinds of experiments don’t usually create much business value if they never go live to production. And that is where data scientists often struggle the most, since engineering and operations skills are usually not among their core competences.

In the best case, in this type of development the model ends up in production by a data scientist handing the trained model artifacts over to the ops team to deploy – even though the ops team might lack knowledge of how to best integrate machine learning models into their existing systems. After deployment, the model’s predictions and actions might not be tracked, and model performance degradation and other behavioral drifts cannot be detected. At best, your data scientist monitors model performance manually, manually retrains the model with new data, and does another manual handover for every deployment.

The described process might work for a short time when you only have a few models and a few data scientists, but it is not scalable in the long term. The disconnect between development and operations is what DevOps was originally developed to solve, and the lack of monitoring and re-deployment is where MLOps comes in.

ML model development lifecycle. The process consists of development, training, packaging and deploying, automating and managing and monitoring.

 

How can MLOps help?

Instead of going back and forth between the data scientists and the operations team, integrating MLOps into the development process enables quicker cycles of deploying and optimizing algorithms, without a huge effort every time a new algorithm is added to production or an existing one is updated.

MLOps can be divided into multiple practices: automated infrastructure building, versioning important parts of data science experiments and models, deployments (packaging, continuous integration and continuous delivery), security and monitoring.

Versioning

In software development projects it is typical to version source code, its configuration and the infrastructure code. Tracking and controlling changes to the code enables rollbacks to previous versions in case of failures and helps developers understand the evolution of the solution. In data science projects, source code and infrastructure are just as important to version, but there are additional parts that need to be versioned, too.

Typically a data scientist runs training jobs multiple times with different setups. For example, the hyperparameters and the features used may vary between runs, and they affect the accuracy of the model. If the information about the training data, the hyperparameters, the model itself and the resulting accuracy is not saved anywhere, it can be hard to compare the models and choose the best one to deploy to production.
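
As a concrete illustration, a minimal sketch of tracking a single training run is shown below, here using the open-source MLflow library as one possible tool; the dummy dataset and parameter values are placeholders:

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Dummy data standing in for the real, versioned training data
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 5}

with mlflow.start_run():
    mlflow.log_params(params)                    # hyperparameters
    mlflow.set_tag("training_data", "dummy_v1")  # reference to the data version
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", accuracy)      # model accuracy
    mlflow.sklearn.log_model(model, "model")     # the model artifact itself

With runs recorded like this, comparing combinations and picking the best model for deployment becomes a query instead of guesswork.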

Templates and shared libraries

Data scientists might lack knowledge of infrastructure development or networking, but if a ready-made template and framework exist, they only need to adapt the steps of the process. Templating and shared libraries free up data scientists’ time so they can focus on their core expertise.

Existing templates and shared libraries that abstract the underlying infrastructure, platforms and databases not only speed up building new machine learning models but also help in onboarding new data scientists.

Project templates can automate the creation of the infrastructure needed for running the preprocessing or training code. When, for example, building the infrastructure is automated with Infrastructure as Code, it is easier to build different environments and be sure they are similar. This usually also means infrastructure security practices are automated and do not vary from project to project.

Templates can also include scripts for packaging and deploying code. When the libraries used are mostly the same across projects, those scripts rarely need to be changed and data scientists don’t have to write them separately for every project.

Shared libraries mean less duplicate code and a smaller chance of bugs in repetitive tasks. They can also hide database and platform details from data scientists, who can instead use ready-made functions for, for instance, reading from and writing to the database or saving the model. Versioning can be built into the shared libraries and functions as well, so it is not up to the individual data scientist to remember what needs to be versioned.
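
As an illustration, a sketch of the kind of helper a shared team library might expose – the function name and path layout here are hypothetical, not from any specific library:

import datetime
import json
import pathlib
import pickle

def save_model(model, name, metadata, model_dir="/models"):
    """Persist a model with an automatic timestamp-based version tag,
    so individual data scientists don't need to remember versioning."""
    pathlib.Path(model_dir).mkdir(parents=True, exist_ok=True)
    version = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    path = pathlib.Path(model_dir) / f"{name}_{version}.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Store hyperparameters, data references and metrics next to the model
    with open(path.with_suffix(".json"), "w") as f:
        json.dump(metadata, f)
    return str(path)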

Deployment pipeline

When deploying either a traditional software solution or an ML solution, the steps in the process are highly repetitive but also error-prone. An automated deployment pipeline in a CI/CD service can take care of packaging the code, running automated tests and deploying the package to a selected environment. This not only reduces the risk of errors in deployment but also frees up time from deployment tasks for actual development work.

Tests are needed in the deployment of machine learning models as in any software, including the typical unit and integration tests of the system. In addition, you need to validate the data and the model, and evaluate the quality of the trained model. Adding the necessary validation creates a bit more complexity and requires automating steps that data scientists previously did manually before deployment to train and validate new models. Depending on your solution, you might need to deploy a multi-step pipeline to automatically retrain and deploy models.
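
A validation step in the pipeline can be as simple as a script that exits with a non-zero code when the freshly trained model does not meet a quality bar – a sketch with an assumed threshold and placeholder validation data:

import sys

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90  # assumed minimum quality bar for deployment

# Placeholder data and model standing in for the pipeline's real artifacts
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {accuracy:.3f}")
if accuracy < ACCURACY_THRESHOLD:
    sys.exit(1)  # non-zero exit stops the CI/CD pipeline before deployment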

Monitoring

After the model is deployed to production, some might think it will remain functional and decay like any traditional software system. In fact, machine learning models can decay in more ways than traditional software systems. In addition to monitoring the performance of the system, the performance of the models themselves needs to be monitored as well. Because machine learning models make assumptions about the real world based on the data they were trained on, the accuracy of a model may decrease when the surrounding world changes. This is especially true for models that try to model human behavior. Decreasing model accuracy means that the model needs to be retrained to better reflect the surrounding world, and with monitoring, the retraining is done neither too seldom nor too often. By tracking summary statistics of your data and monitoring the performance of your model, you can send notifications or roll back when values deviate from the expectations set at the time of the last training.
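
A minimal sketch of such a check, comparing live data against summary statistics recorded at training time; the threshold and the statistics are illustrative assumptions:

import numpy as np

def detect_drift(training_stats, live_values, max_shift=3.0):
    """Flag drift when the live mean deviates more than max_shift
    training standard deviations from the training mean."""
    live_mean = float(np.mean(live_values))
    drifted = abs(live_mean - training_stats["mean"]) > max_shift * training_stats["std"]
    if drifted:
        # here: send a notification, roll back, or trigger the retraining pipeline
        print(f"Drift detected: live mean {live_mean:.1f} vs training mean {training_stats['mean']:.1f}")
    return drifted

# Example with dummy numbers: the input distribution has clearly shifted
training_stats = {"mean": 100.0, "std": 5.0}
detect_drift(training_stats, np.random.normal(130.0, 5.0, size=1000))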

Applying MLOps

Bringing MLOps thinking to machine learning model development enables you to actually get your models to production if you are not there yet, makes your deployment cycles faster and more reliable, reduces manual effort and errors, and frees your data scientists’ time from tasks outside their core competences for actual model development work. Cloud providers (such as AWS, Azure or GCP) are especially good places to start implementing MLOps in small steps, with ready-made software components you can use. Moreover, they offer all the CPU/GPU capacity needed for model training on a pay-as-you-go basis.

If your AI journey is still in an early phase (PoCs don’t need heavy processes like this), a robust development framework and pipeline infrastructure might not be the highest priority. However, any effort invested in automating the development process from the early phases will pay off later and reduce machine learning technical debt in the long run. Start small and change the way you develop ML models towards MLOps by at least putting the development work under version control and automating the steps for retraining and deployment.

DevOps was born as a reaction to the systematic organization needed around rapidly expanding software development, and now the same problems are being faced in the field of machine learning. Take the needed steps towards MLOps, as was done successfully with DevOps before.


Recapping AI-related risks to organizations

When developing predictive models for business, data scientists often feel pressure to deliver results within a very short time span. These feelings may indicate a larger problem with risk management.

With uncertainty, the natural thing is to divest, i.e. not invest large sums in an uncertain endeavour. But AI risks are not easily disposed of in small projects either.

This might leave organizations perplexed as to what to do. On one hand, there is the call to embrace AI. On the other, the risks are real.

As a rule of thumb, a longer time perspective won’t hurt. Predictive modeling and automation are long-running investments. As such, they should be subject to risk assessment and scrutiny, and they should be managed over their entire life span.

Because of AI solutions’ partly speculative nature, their risk of failure is relatively high. A recent study underlined this, suggesting that roughly four out of five AI projects fail in the real world.

A predictive model has its particular strengths and weaknesses. But it has some recurring costs too, both implicit and explicit. Some of these costs may fall immediately to the supporting organization. And some of them might even fall outside of it.

The following (otherwise unrelated) tweet from a couple of days back pinpoints these risks neatly.

Leaving aside the social discourse, I very much agree with the observations about organizations. There is a certain mindset of DS magically fixing business perspectives and organizational shortcomings. In my personal opinion, this is naïve at best. It is not an overstatement to call it dangerous in some cases.

The use of automation requires a certain robustness from surrounding structures.

AI as part of larger systems

In classical control theory, systems are designed around the principle of stability. A continuously working system, like a production line, is regulated with the help of measured and desired outputs. The problem is to make processes optimal by making them smooth, achieving a good ratio of output to resources used.

Often, AI is part of a larger production machinery. The whole process may involve human beings and other machine actors as well. Recent examples of AI victories make a lot of sense when seen in this kind of framing.

If we look at a famous example, Google AlphaGo’s victory over human players was supported by human-maintained tournament protocols, servers, and arrangements. Not to speak of the news media that helped sculpt the event as it took place.

The AI’s job was relatively simple in terms of inputs and outputs: receive a board position and suggest the next move. How that AI learned to play Go in the first place was also the result of multiple years of engineering. Its training was enabled by human work, and its progress was assessed by humans along the way.

The case of adverse outcomes

If we look at organizations, there are always hidden costs when adopting new procedures and processes. Predictive model performance, on the other hand, is largely measured by the number of explicit mistakes the model makes. These explicit mistakes may capture part of the cost of an automated solution, but the fail rate is hardly a comprehensive measure in a complex setting.

Just like in a game some moves may be very costly as regards winning, some mistakes may be very costly to an organization.

One recent observation within the field concerns the implicit “ghost” work that goes into keeping up AI appearances: fixing and hiding AI-based errors, or even correcting AI decisions before they have time to cause harm.

Traditional production lines have fallback mechanisms, for example for turning the line off in case of emergency. Emergency protocols are in place because unexpected events occur in the real world. This is a very healthy mindset for AI development too, and we should embrace it fully. An organization should take these things into account when planning and assessing a new solution.

No matter how good the preliminary results of a solution are, it will start failing sooner or later when something unexpected happens. And it will not fix itself. Its use will probably also create unexpected side effects, even when it is doing a superb job.


Automated Code Deployment from Azure DevOps to Databricks

The target audience is data practitioners looking for a simple method to practice DataOps even in restricted environments. A walk-through of the code is detailed in the appendix.

The linked code repository contains a minimal setup to automate infrastructure and code deployment simultaneously from Azure DevOps Git repositories to Databricks.

TL;DR:

  1. Import the repo into a fresh Azure DevOps project.
  2. Get a secret access token from your Databricks workspace.
  3. Paste the token and the Databricks URL into an Azure DevOps Library’s variable group named “databricks_cli”.
  4. Create and run two pipelines referencing the YAML in the repo’s pipelines/ directory.
  5. Any Databricks-compatible (Python, Scala, R) code pushed to the remote repository’s workspace/ directory will be copied to the Databricks workspace, with an interactive cluster waiting to execute it.

Background

Azure DevOps and Databricks have one thing in common – providing industry-standard technology and offering it as an intuitive, managed platform:

  • Databricks for running Apache Spark
  • DevOps for Git repos and build pipelines

Both platforms have much more to offer than what is used in this minimal integration example. DevOps offers wikis, bug, task and issue tracking, kanban, scrum and workflow functionality, among others.

Databricks is a fully managed and optimized Apache Spark PaaS. It can natively execute Scala, Python, PySpark, R, SparkR, SQL and Bash code; some cluster types have TensorFlow installed and configured (including GPU drivers). Integration of the H2O machine learning platform is quite straightforward. In essence, Databricks is a highly performant general-purpose data science and engineering platform which tackles virtually any challenge in the Big Data universe.

Both have free tiers and a pay-as-you-go pricing model.

Databricks provides infrastructure as code. A few lines of JSON consistently deploy an optimized Apache Spark runtime.

After several projects and the increasing need to build and prototype in a managed and reproducible way, the DevOps-Databricks combination became much appreciated: it enables quick and responsive interactive runtimes and provides industry best practice for software development and data engineering. Deployment into scheduled, performant, resilient production environments is possible without changes to the platform and without any need for refactoring.

The core of the integration uses Databricks infrastructure-as-code (IaC) capability together with DevOps pipelines functionality to deploy any kind of code.

  1. the Databricks CLI facilitates programmatic access to Databricks and
  2. the managed Build Agents in DevOps deploy both infrastructure and analytic code.

Azure pipelines deploy both the infrastructure code and the notebook code from the repository to the Databricks workspace. This enables version control of both the runtime and the code in one compact, responsive repository.

All pieces of the integration are hosted in a single, compact repository, which makes all parts of a data and modeling pipeline fully reproducible.

Prerequisites

Log into Azure DevOps and the Databricks workspace. There are free tiers for both of them. Setup details are explained extensively in the canonical quick start sections of either service.

For the integration Databricks can be hosted in either the Azure or AWS cloud.

1. Import the Repository

To use this demo as a starting point for a new project, prepare an Azure DevOps project:

  • create a new project (with an empty repository by default)
  • select the repository tab and choose “Import a repository”
  • paste the URL of this demo into the Clone URL field: https://dev.azure.com/reinhardseifert/DatabricksDevOps/_git/DatabricksDevOps
  • wait for the import to complete
  • clone the newly imported repository to your local computer to start deploying your own code into the workspace directory

Then create two Azure pipelines which create the runtime and sync any code updates into it (see below).

2. Create Databricks Secret Token

Log into the Databricks workspace, open User Settings (icon in the top right corner) and select “Generate New Token”. Choose a descriptive name (“DevOps Build Agent Key”) and copy the token to a notebook or clipboard. The token is displayed just once, directly after creation; you can create as many tokens as you wish.

Databricks > User Settings > Create New Token

3. Add the token to the Azure DevOps Library

The Databricks secret token has to be added to a variable group named “databricks_cli”. Variable groups are created under Pipelines > Library. Note that the name of the variable group is referenced in both pipeline definitions (/pipelines/build-cluster.yml and /pipelines/build-workspace.yml). Two variables have to be defined: databricks_host and databricks_token.

The variable names are referenced in the .yml files – changing them in the DevOps Library requires changing them correspondingly in the .yml files. When you click the lock icon after defining a variable, it is treated as a secret and is no longer visible in the DevOps project – neither in the Library nor on the build servers (even when accidentally echoing it). Writing it to the Databricks environment would, of course, potentially expose it. This is a security concern when collaborating with non-trusted parties on a project.

Pipelines > Library > Add Variable Group

 

Azure DevOps

At a minimum, the Azure DevOps portal offers a Git repository to maintain code and pipelines to deploy the code from the repository into runtimes.

Azure Repositories

The Azure repo contains the full logic of the integration:

  1. the actual (Python) code to run,
  2. the JSON specification of the Spark-cluster which will run the code,
  3. shell build scripts which are executed in the pipeline/ build server,
  4. the YAML configurations which define the pipelines.

The complete CI/CD pipeline is contained in a single Git repository in a very compact fashion. Following Databricks’ terminology the Python code (1) is located in the workspace/ directory. The runtime specification .json (2), build scripts .sh (3) and the pipeline configuration .yml (4) are located in the pipelines/ directory according to the Azure DevOps paradigm.

Azure Pipelines

The Pipelines menu provides the following functionality:

  • Pipelines (aka build pipelines),
  • Environments (needed to group Azure resources – not used here),
  • Releases (aka release pipelines – not used here)
  • Library (containing the variable groups)

The build pipelines, which are the only kind used in this demo project, are managed under the “Pipelines > Pipelines” menu tab – not really intuitive.

Azure Build Pipelines

The pipeline’s build agents are configured via YAML files (e.g. build-cluster.yml). In this case they install the Databricks CLI on the build machine and then execute CLI commands to create runtimes and move code notebooks to the runtime. The Databricks cluster is configured by a single JSON file (see config.cluster.json).

This minimal integration requires creation of two pipelines:

  1. cluster creation – referencing pipelines/build-cluster.yml and
  2. workspace synchronization – referencing /pipelines/build-workspace.yml

After importing the repo:

  • select the Pipelines > Pipelines menu tab
  • choose Azure Repos Git YAML
  • select the imported repository from the drop-down menu
  • select Existing Azure Pipeline YAML file
  • select the YAML file from the drop-down menu
  • Run the pipeline for the first time – or just save it and run it later.

At this point, the Databricks secret access token mentioned in the prerequisites needs to be present in a “databricks_cli” variable group. Otherwise the pipeline run will fail with a warning; in this case just create the token (in Databricks) and the variable group (in DevOps) and re-run the pipeline.

After creating the pipelines and saving them (or running them initially), the default pipeline names reference the source repository that triggers them. For easier monitoring, the pipelines should be renamed according to their function, e.g. “create-cluster” and “sync-workspace” in this case.

Summary

This concludes the integration of analytic code from an Azure DevOps repository into a hosted Databricks runtime.

Any change to the config.cluster.json deletes the existing cluster and creates a new one according to the specifications in the JSON file.

Any change to workspace/ will copy the notebook file(s) (R, Python, Scala) to the Databricks workspace for execution on the cluster.

The Databricks workspace in this example was hosted on Azure. Only minor changes are required to use an AWS-hosted workspace. On all cloud platforms the host URL and security token are specific to the chosen instance and region. The only cloud-specific parameter is the node_type_id in the cluster configuration .json file.

Using this skeleton repo as a starting point, it is immediately possible to run interactive workloads on a performant Apache Spark cloud cluster – instead of “cooking” the local laptop with analytic code – transparently maintained on a professional DevOps platform.

Appendix

What follows is a detailed walk-through of the .yml pipeline configurations, .sh build scripts and .json configuration files.

In general, the YAML instructs the build server to 1. start up when a certain file is changed (trigger), 2. copy the contents of the repository to the build server and 3. execute a selection of shell scripts (tasks) from the repository.

Pipeline: Create cluster

This is a detailed walk-through of the build-cluster.yml pipeline. The .yml files have a hierarchical structure, and the full hierarchy of the DevOps build pipeline is included here, although stages could be omitted.

Trigger

The first section of the pipeline YAML specifies the trigger. Any change to the specified branch of the linked repo will automatically start a run on the Build Agent.

trigger:
  branches:
    include:
    - master
  paths:
    include:
    - pipelines/config.cluster.json
    - pipelines/databricks-library-install.sh

Without the paths: section, any change to the master branch will run the pipeline. The cluster is rebuilt when the configuration changes or when the selection of installed Python or R libraries changes.

Stages

The stage can be omitted (for a single-stage pipeline) and the pool, variables and jobs defined directly; the stage would then be implicit. It is possible to add testing steps to the pipeline and build fully automated CI/CD pipelines across environments within one .yml file.

stages:
- stage: "dev"
  displayName: "Development"
  dependsOn: []

Pool

  pool:
    vmImage: "ubuntu-latest"

Selects the type of virtual machine to start when the trigger files are changed. At the time of writing, ubuntu-latest will start an Ubuntu 18.04 LTS image.

Variables

  variables:
    - group: databricks_cli

This section references the variable group created in the prerequisites section. The secret token is transferred to the build server and authorizes the API calls from the server to the Databricks workspace.

Jobs, Steps and Tasks

A job is a sequence of steps which are executed on the build server (pool). In this pipeline only task steps are used (see the docs for all step operations).

  jobs:
    - job: CreateCluster4Dev
      steps:

        - task: UsePythonVersion@0
          inputs:
            versionSpec: "3.8"
            architecture: "x64"

The first step selects the Python version for all subsequent Python commands on the build server; the Databricks CLI is written in Python and is installed via pip in the following task.

Task: Install and configure the Databricks CLI

        - task: ShellScript@2
          inputs:
            scriptPath: pipelines/databricks-cli-config.sh
            args: "$(databricks_host) $(databricks_token)"
          displayName: "Install and configure the Databricks CLI"

Note that the path is relative to the root of the repo. The secret access token and host URL from the DevOps library are copied into environment variables which can be passed to the script in the args section.

The shell script executes the installation of the Databricks CLI and writes the necessary CLI configuration on the build server.

python -m pip install databricks-cli
echo -e "[DEFAULT]\nhost: $HOST\ntoken: $TOKEN" > $HOME/.databrickscfg

Task: “Delete previous cluster version (if existing)”

This task removes any cluster with the name provided in the args: section. This allows updating the cluster when the configuration file is changed. When no such cluster is present, the script will fail. Usually the pipeline would break at this point, but here continueOnError is true, so the pipeline continues when creating a cluster for the first time.

        - task: ShellScript@2
          inputs:
            scriptPath: pipelines/databricks-cluster-delete.sh
            args: "HelloCluster"
          continueOnError: "true"
          displayName: "Delete previous cluster version (if existing)"

The shell script called by this task is a wrapper around the Databricks CLI. First it queries for the cluster-id of any cluster with the name passed.

CLUSTER_ID=$(databricks clusters list --output json | jq -r --arg N "$CLUSTER_NAME" '.clusters[] | select(.cluster_name == $N) | .cluster_id')

It is possible to create multiple clusters with the same name. In case there are multiple, all of them are deleted.

for ID in $CLUSTER_ID
do
    echo "Deleting $ID"
    databricks clusters permanent-delete --cluster-id $ID
done

Task: Create new cluster

        - task: ShellScript@2
          inputs:
            scriptPath: pipelines/databricks-cluster-create.sh
            args: "HelloCluster"
          displayName: "Create new cluster"

The build script reads config.cluster.json and adds the cluster name passed from the pipeline .yml:

cat config.cluster.json | sed "s/CLUSTER_NAME/$CLUSTER_NAME/g" > /tmp/conf.json

Now the configuration .json file can be passed to the Databricks CLI. The complete Apache Spark infrastructure is configured in the json. CLUSTER_NAME will be replaced with the name passed from the .yml.

{
    "cluster_name": "CLUSTER_NAME",
    "spark_version": "6.0.x-scala2.11",
    "spark_conf": {
        "spark.sql.execution.arrow.enabled": "true"
    },
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "ssh_public_keys": [],
    "custom_tags": {
        "Project": "DevOpsIntegration"
    },
    "cluster_log_conf": {
        "dbfs": {
            "destination": "dbfs:/cluster_logs"
        }
    },
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "autotermination_minutes": 120,
    "enable_elastic_disk": false,
    "init_scripts": []
}

Updating the runtime to another version requires only modifying the spark_version parameter with any supported runtime.

A Spark cluster consists of one driver node and a number of worker nodes and can be scaled horizontally by adding nodes (num_workers) or vertically by choosing larger node types. The node types are cloud provider specific. The Standard_DS3_v2 node type id references the minimal Azure node.

The autotermination feature shuts the cluster down when it is not in use. Costs are billed per second of uptime per processing unit.

Any reconfiguration triggers the pipeline and rebuilds the cluster.

CLUSTER_ID=$(databricks clusters create --json-file /tmp/conf.json | jq -r '.cluster_id')

The cluster create call returns the cluster-id of the newly created instance. Since the last step of this pipeline installs additional Python and R libraries (via pip and CRAN respectively), it is necessary to wait for the cluster to be in the PENDING state.

STATE=$(databricks clusters list --output json | jq -r --arg I "$CLUSTER_ID" '.clusters[] | select(.cluster_id == $I) | .state')

echo "Wait for cluster to be PENDING"
while [[ "$STATE" != "PENDING" ]]
do
    STATE=$(databricks clusters list --output json | jq -r --arg I "$CLUSTER_ID" '.clusters[] | select(.cluster_id == $I) | .state')
done

Task: Install Python and R dependencies on the cluster

The final step is to add additional Python and R packages to the cluster. There are many ways to install packages in Databricks; this is just one way to do it.

        - task: ShellScript@2
          inputs:
            scriptPath: pipelines/databricks-library-install.sh
            args: "HelloCluster"
          displayName: "Install Python and R dependencies"

Again the shell script wraps the Databricks CLI, here the library install command. The cluster name (“HelloCluster” in this example) has to be passed again.

All CLI calls to Databricks need the cluster-id to delete, create and manipulate instances. So first fetch it with a cluster list call:

CLUSTER_ID=$(databricks clusters list --output json | jq -r --arg N "$CLUSTER_NAME" '.clusters[] | select(.cluster_name == $N) | .cluster_id')

Then install the packages – one call to library install per package:

databricks libraries install --cluster-id $CLUSTER_ID --pypi-package azure
databricks libraries install --cluster-id $CLUSTER_ID --pypi-package googlemaps
databricks libraries install --cluster-id $CLUSTER_ID --pypi-package python-tds
databricks libraries install --cluster-id $CLUSTER_ID --cran-package tidyverse

For additional Python or R packages, add a line to this build script – this triggers the pipeline and the cluster is rebuilt.

Pipeline: Import workspace

This is a detailed walk-through of the build-workspace.yml pipeline. The first part of the pipeline is identical to the build-cluster.yml pipeline, except that the trigger section differs, since this pipeline is triggered by code pushes to the workspace/ directory. The choice of the build server (pool), the variable reference to the databricks_cli variable group for the Databricks access token and the Python version task are identical, as is installing and configuring the Databricks CLI with the same build script as above.

The only build task is importing all files in the workspace/ directory to the Databricks workspace. The args parameter passes a sub-directory name for the /Shared/ folder in Databricks (/Shared/HelloWorkspace/ in the example).

        - task: ShellScript@2
          inputs:
            scriptPath: pipelines/databricks-workspace-import.sh
            args: "HelloWorkspace"
          displayName: "Import updated notebooks to workspace to dev"

The specified directory is first deleted. When the directory does not exist, the CLI prints an error in JSON format but does not break the pipeline. The args: parameter is passed to the $SUBDIR variable in the build script.

databricks workspace delete --recursive /Shared/$SUBDIR

Then the script files in the workspace/ folder of the master branch are copied into the Databricks workspace.

databricks workspace import_dir ../workspace /Shared/$SUBDIR/

Remember that the repo is copied onto the pipeline build agent/server, and that the working directory of the pipeline agent points to the location of the .yml file which defines the pipeline.

Performance of semi-structured data types in Snowflake

Snowflake extends the idea of the traditional relational database with the possibility to store and handle semi-structured data types. As shown in this post, semi-structured data types have effects on query performance and on the results of numerical operations.

On top of traditional relational data, Snowflake also supports semi-structured data with a flexible schema. This can be utilized with, for example, JSON and XML data formats. Inside Snowflake, these are stored as either variant, array or object data types. Let us take a closer look at what these mean.

Variant is a tagged universal type that can hold up to 16 MB of any data type supported by Snowflake. Variants are stored as columns in relational tables. Array is a list-like indexed data type that consists of variant values. Object, on the other hand, is a data type that consists of key-value pairs, where the key is a non-null string and the value is variant-type data.

Snowflake provides guidelines for handling semi-structured data in their documentation. As a baseline, they recommend storing semi-structured data as the variant data type if the usage of the data is uncertain. However, when the usage is known, the recommendation is to run test cases to find the best-performing solution. Let us take a look at what kind of differences might occur.

Test setup

For testing purposes, we create three different tables:

  1. Table with 500 million rows
  2. Table with a variant column having 500 million values
  3. Table with an array column having 500 million values

The traditional table includes a user identifier, which is a random value between 1 and 5 million, and an amount field, which is an integer value between 1 and 500 million. The variant and array tables are both grouped tables, created with the traditional table as their source. Their schema has the user identifier and a variant- or array-typed value field, which holds the aggregated list of values for that user identifier. The tables are created with the following queries:

CREATE TABLE T_500M (
   user_id   INT,
   amount    INT); 

INSERT INTO T_500M
   SELECT UNIFORM (1, 5000000, random())
        , UNIFORM (1, 500000000, random())
   FROM TABLE ( GENERATOR ( ROWCOUNT => 500000000 ));
 
CREATE TABLE T_500M_VARIANT AS 
   SELECT user_id
        , CAST(ARRAY_AGG(amount) AS VARIANT) AS variant_field
   FROM T_500M
   GROUP BY user_id;

CREATE TABLE T_500M_ARRAY AS 
   SELECT user_id
       ,  ARRAY_AGG(amount) AS array_field
   FROM T_500M
   GROUP BY user_id;

Evaluating performance

Storing values as variant or array might seem like a good idea if we want to aggregate sums of the amount field for every user identifier. As the query result, we want to show the user identifier, the number of occurrences for that user and the aggregated sum of the amount field. We can achieve this for each table with the following queries:

SELECT user_id
     , COUNT(*) AS value_count
     , SUM(amount) AS sum
FROM T_500M
GROUP BY user_id;

SELECT user_id
     , COUNT(flat_variant.value) AS value_count
     , SUM(flat_variant.value::INTEGER) AS sum
FROM T_500M_VARIANT
    , lateral flatten(INPUT => variant_field) flat_variant
GROUP BY user_id;

SELECT user_id
     , COUNT(flat_array.value) AS value_count
     , SUM(flat_array.value::INTEGER) AS sum
FROM T_500M_ARRAY
     , lateral flatten(INPUT => array_field) flat_array
GROUP BY user_id;

The SELECT query takes 15.6 seconds for the traditional relational table, 22.1 seconds for the variant table and 21.9 seconds for the array table. The difference is significant, with the queries being over 40 % slower for the semi-structured tables.

Another thing to consider with semi-structured formats is that queries on semi-structured data do not use the result cache. We can notice this by running the queries again. The traditional table query now takes only 0.2 seconds thanks to Snowflake’s persisted query results, but the queries on the other tables take the same circa 22 seconds as earlier to complete.

The measured difference in query time between the structured table and the semi-structured tables is real, but it may still be acceptable in cases where loading semi-structured data into variant or array columns is a lot easier. However, it needs to be noted that, as stated in the Snowflake documentation, query performance is even worse for variant or array tables holding data types that are not native to JSON. The test above included only JSON-native data types, but including for example datetime as a variant would make the difference even bigger.

Explicit data type conversion

It is important to pay attention to data types when accessing array or variant data. Consider the following example, where we query the total sum of the amount field with the following SELECT statement on the variant table:

SELECT COUNT (*) AS row_count
     , SUM(flat_variant.value) AS sum_without_cast
     , SUM(flat_variant.value)::INTEGER AS cast_after_sum
     , SUM(flat_variant.value::INTEGER) AS cast_before_sum
FROM T_500M_VARIANT
     , lateral flatten(INPUT => variant_field) flat_variant;

For the query results, we get three different sums:

  • No casting:          1.24998423949572e+17
  • Casting before sum:  124998423949572384
  • Casting after sum:   124998423949572368

The same sums are received when running the above query on the array table. The difference comes from Snowflake handling variant and array values using JavaScript, and JavaScript uses float as its data type for all numeric values. As shown in the earlier blog post, using floating-point numeric data types may lead to imprecise values. Even though the table only holds fixed-point numeric values, using variant or array converts them to floating-point unless the type is determined explicitly when querying the data.
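
The underlying limitation is easy to demonstrate outside Snowflake: Python floats are IEEE-754 doubles just like JavaScript numbers, and doubles can represent integers exactly only up to 2**53:

print(2**53 + 1 == 2**53)                # False: Python integers are exact
print(float(2**53 + 1) == float(2**53))  # True: as doubles, the two values collapse

Above 2**53 the gaps between representable doubles grow, which is why sums in the 10**17 range, like the ones above, land on slightly different nearby values depending on when the conversion happens.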

Conclusion

The possibility to store semi-structured data in relational tables in Snowflake comes in handy for many business needs where the source data is not traditionally relational. It enables loading semi-structured data straight into Snowflake and parsing the data onwards from there. Even though this is possible, it should be tested per use case whether it is the best solution.

When evaluating query performance, we noticed that queries against tables with semi-structured data types were over 40 % slower compared to a similar table with structured data types. This was with JSON-native data types; non-native types would result in an even bigger difference in execution time. Semi-structured data types also can’t utilize the result cache, so re-running the queries takes a similar time as the initial run.

Best practice is to convert data types explicitly when accessing variants or arrays. The Snowflake engine uses JavaScript to handle these data types, so, as was shown, numeric values may suffer from rounding.


Lessons learned from combining SQS and Lambda in a data project

In June 2018, AWS Lambda added Amazon Simple Queue Service (SQS) to its supported event sources, removing a lot of the heavy lifting of running a polling service or creating extra SQS-to-SNS mappings. In a recent project we utilized this functionality and configured our data pipelines to use AWS Lambda functions for processing the incoming data items and SQS queues for buffering them. The built-in functionality of SQS and Lambda provided us with a serverless, scalable and fault-tolerant basis, but while running the solution we also learned some important lessons. In this blog post I will discuss the issue of valid messages ending up in dead-letter queues (DLQs) and how to configure your DLQ correctly so that it catches only erroneous messages from your source SQS queue.

What are Amazon SQS and Lambda?

In brief, Amazon SQS is a lightweight, fully managed message queueing service that enables decoupling and scaling of microservices, distributed systems and serverless applications. With SQS, it is easy to send, store, and receive messages between software components without losing messages.

AWS Lambda is a fully managed, automatically scaling service that lets you run code in multiple languages in a serverless manner, without having to provision any servers. You can configure a Lambda function to execute in response to various events or orchestrate the invocations. Your code runs in parallel and processes each invocation individually, scaling with the size of the workload.

When you configure an SQS queue as an event source for your Lambda, Lambda functions are automatically triggered when messages arrive in the SQS queue. The Lambda service automatically scales up and down based on the number of in-flight messages in the queue. The polling, reading and removing of messages from the queue is thus handled automatically by the built-in functionality. Successfully processed messages are removed, and failed ones are returned to the queue or forwarded to the DLQ, without needing to explicitly configure these steps inside your Lambda function code.

Problems with valid messages ending up in DLQ

In the recent project we needed to process data arriving daily as well as in larger batches of historical data loadings, through the same data processing pipeline. To handle the changing data loads, SQS decouples the source system from processing and balances the load for both use cases. We used SQS for queueing metadata of new data files and a Lambda function for processing the messages and passing the metadata on to the next steps in the pipeline. When testing our solution by pushing thousands of messages rapidly to the queue, we observed many of the messages ending up in the dead-letter queue, even though they were not erroneous.

From the CloudWatch metrics we found no execution errors during the given period, but there was a peak in the Lambda throttling metric. We had configured a DLQ to catch erroneous messages, but ended up having completely valid and unprocessed messages in the DLQ. How does this happen? To understand it, let’s dive deeper into how Lambda functions are triggered and scaled when they have SQS configured as the event source.

Lambda scales automatically with the number of messages arriving to SQS – up to a limit

Let’s first introduce briefly the parameters of SQS and Lambda that are relevant to this problem.

SQS

ReceiveMessageWaitTimeSeconds: Time that the poller waits for new messages before returning a response. Your messages might arrive in the SQS queue unevenly, sometimes in bursts and sometimes with no messages arriving at all. The default value is 0, which equals constant polling of messages. If the queue is often empty and your solution allows some lag time, it might be beneficial not to poll the queue all the time and return empty responses. Instead of polling for messages constantly, you can specify a wait time between 1 and 20 seconds.

VisibilityTimeout: The length of time during which a message is invisible to consumers after it has been read from the queue. When a poller reads a message from the SQS queue, the message stays in the queue but becomes invisible for the period of the VisibilityTimeout. During this time the message is unavailable to any other poller trying to read it, which gives the initial component time to process the message and delete it from the queue.

maxReceiveCount: Defines how many times a message can be returned to being visible in the source queue before it is moved to the DLQ. If the processing of a message is successful, the consumer deletes it from the queue. Whenever an error occurs in the processing of a message and it cannot be deleted from the queue, the message becomes visible again in the queue with an increased ReceiveCount. When the ReceiveCount of a message exceeds the maxReceiveCount of the queue, the message is moved to a dead-letter queue.

Lambda

Reserved concurrency limit: The number of executions of the Lambda function that can run simultaneously. There is an account-specific limit on how many Lambda executions can run simultaneously (by default 1,000), and it is shared between all your Lambda functions. By reserving part of it for one function, other functions running at the same time can’t prevent your function from scaling.

BatchSize: The maximum number of messages that Lambda retrieves from the SQS queue in a single batch. BatchSize is part of the Lambda event source mapping, which defines what triggers the Lambda function; in this case, the function is triggered from SQS.
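
As a sketch, creating the event source mapping and its BatchSize with boto3 could look as follows; the queue ARN and function name are placeholders:

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:source-queue",
    FunctionName="my-processing-function",
    BatchSize=10,  # up to 10 messages per invocation for SQS at the time of writing
)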

Figure 1 below illustrates how Lambda actually scales when messages arrive in bursts to the SQS queue. Lambda uses long polling to poll messages from the queue, which means that it does not poll the queue constantly but at an interval between 1 and 20 seconds, depending on what you have configured as your queue’s ReceiveMessageWaitTimeSeconds. The Lambda service’s internal poller reads messages from the queue in batches and invokes the Lambda function synchronously with an event that contains a batch of messages. The number of messages in a batch is defined by the BatchSize configured in the Lambda event source mapping.

When messages start arriving in the queue, Lambda first reads up to 5 batches and invokes a function for each. If messages are still available, the number of processes reading the batches is increased by up to 60 more instances per minute (Figure 2), until it reaches 1) the reserved concurrency limit configured for the Lambda function in question or 2) the account’s limit of total concurrent Lambda executions (by default 1,000), whichever is lower (N in the figure).

By setting a reserved concurrency limit for your Lambda, you guarantee that it will get part of the account’s Lambda resources at any time, but at the same time you also prevent your function from scaling over that limit, even if there were Lambda resources available for your account to use. When that limit is reached and there are still messages available in the queue, one might assume that those messages would stay visible in the queue, waiting until free Lambda capacity is available to handle them. Instead, the internal poller still tries to invoke new Lambda functions for all the new messages, which causes throttling of the Lambda invocations (Figure 2). Why do some messages end up in the DLQ then?

Let’s look at how the workflow goes for an individual message batch when it succeeds or fails (Figure 3). First, the Lambda internal poller reads a message batch from the queue; those messages stay in the queue but become invisible for the length of the configured VisibilityTimeout. Then the poller invokes the function synchronously, meaning that it waits for a response indicating either successful processing or an error, which can be caused by e.g. a function code error, a function timeout or throttling. In the case of successful processing, the message batch is deleted from the queue. In the case of failure, however, the messages become visible again.

The SQS queue is not aware of what happens beyond the event source mapping, i.e. whether the invocations fail or are throttled. It keeps the messages in the queue invisible until they either get deleted or are turned back to visible after the VisibilityTimeout has passed. Effectively this means that throttled messages are treated like any other failed messages, so their ReceiveCount is increased every time they are made visible in the queue. If there is a huge burst of messages coming in, some of the messages might get throttled, retried, throttled again, and retried again, until they reach the maxReceiveCount limit and are then moved to the DLQ.

The automatic scaling and concurrency achieved with SQS and Lambda sounds promising, but unfortunately like all AWS services, this combination has its limits as well. Throttling of valid messages can be avoided with the following considerations:

Be careful when configuring reserved concurrency for your Lambda function: the smaller the concurrency, the greater the chance that messages get throttled. AWS suggests the reserved concurrency of a Lambda function to be 5 or greater.

Set the maxReceiveCount in your SQS queue’s properties big enough so that throttled messages eventually get processed after the burst of messages. AWS suggests setting it to 5 or greater.

By increasing the message VisibilityTimeout of the source queue, you can give your Lambda more time to retry the messages in the case of message bursts. AWS suggests setting it to at least 6 times the timeout you configure for your Lambda function.
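
Put together, the suggested settings could be applied with boto3 roughly as follows; the queue URL, ARNs, function name and timeout value are placeholders to adapt per application:

import json

import boto3

LAMBDA_TIMEOUT_SECONDS = 120  # assumed timeout of the processing function

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/source-queue",
    Attributes={
        # long polling instead of constant polling
        "ReceiveMessageWaitTimeSeconds": "20",
        # at least 6 x the Lambda timeout, as suggested by AWS
        "VisibilityTimeout": str(6 * LAMBDA_TIMEOUT_SECONDS),
        # move a message to the DLQ only after 5 failed receives
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:eu-west-1:123456789012:source-queue-dlq",
            "maxReceiveCount": "5",
        }),
    },
)

# reserved concurrency of 5 or greater, as suggested by AWS
lambda_client.put_function_concurrency(
    FunctionName="my-processing-function",
    ReservedConcurrentExecutions=5,
)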

Of course, tuning these parameters is an act of balancing with what best suits the purpose of your application.

Configuring DLQ to your SQS and Lambda combination

If you don’t configure a DLQ, you will lose all the erroneous (or valid but throttled) messages. If you are familiar with the topic this seems obvious, but it’s worth stating since it is quite important. What is confusing in this combination is that you can configure a dead-letter queue on both SQS and Lambda. The AWS documentation states:

Make sure that you configure the dead-letter queue on the source queue, not on the Lambda function. The dead-letter queue that you configure on a function is used for the function’s asynchronous invocation queue, not for event source queues.

To understand this, one needs to dive into the difference between synchronous and asynchronous invocation of Lambda functions.

When you invoke a function synchronously, Lambda runs the function and waits for a response from it. The function code returns the response, and Lambda returns this response to you with some additional data, including e.g. the function version. In the case of asynchronous invocation, however, Lambda sends the invocation event to an internal queue. If the event is successfully sent to the internal queue, Lambda returns a success response without waiting for any response from the function execution, unlike in synchronous invocation. Lambda manages the internal queue and attempts to retry failed events automatically with its own logic. If the execution of the function still fails after the retries, the event is sent to the DLQ configured on the Lambda function. With an event source mapping to SQS, Lambda is invoked synchronously; therefore there are no retries like in asynchronous invocation, and the DLQ on the Lambda function is useless.
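
The difference is visible at the plain API level as well – a sketch with a placeholder function name:

import boto3

lambda_client = boto3.client("lambda")

# Synchronous: blocks until the function finishes and returns its response
sync = lambda_client.invoke(
    FunctionName="my-function",
    InvocationType="RequestResponse",
    Payload=b'{"hello": "world"}',
)
print(sync["StatusCode"])    # 200, function result available in sync["Payload"]

# Asynchronous: the event is only queued internally, Lambda retries failures itself
queued = lambda_client.invoke(
    FunctionName="my-function",
    InvocationType="Event",
    Payload=b'{"hello": "world"}',
)
print(queued["StatusCode"])  # 202, no function response is returned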

Recently, AWS launched Lambda Destinations, which makes it possible to route asynchronous function results to a destination resource that can be SQS, SNS, another Lambda function or EventBridge. With DLQs you can handle asynchronous failure situations and catch the failing events, but with Destinations you can actually get more detailed information on function execution in both success and failure, such as code exception stack traces. Destinations and DLQs can be used together at the same time, although AWS suggests that Destinations should be considered the preferred solution.

Conclusions

The described problems are all stated in and deducible from the AWS documentation, but they are still not completely obvious. By carefully tuning the parameters of our SQS queue, mainly by increasing maxReceiveCount and VisibilityTimeout, we were able to overcome the problems with Lambda function throttling. By configuring the DLQ on the source SQS queue instead of on the Lambda function, we were able to catch erroneous or throttled messages. Adding a DLQ to your source SQS queue does not solve anything by itself, though; you also need to handle the failing messages in some way. We configured a Lambda function on the DLQ to write the erroneous messages to DynamoDB. This way we have a log of the unhandled messages in DynamoDB, and the messages can be resubmitted or investigated further from there.
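
A sketch of such a DLQ handler – a Lambda function triggered by the dead-letter queue that writes failed messages to DynamoDB; the table name and item layout are examples, not the exact ones used in the project:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("failed-messages")

def lambda_handler(event, context):
    # With an SQS event source, the batch of messages arrives in event["Records"]
    for record in event["Records"]:
        table.put_item(Item={
            "message_id": record["messageId"],        # partition key of the log table
            "body": record["body"],                   # original message for resubmission
            "source_queue": record["eventSourceARN"],
        })
    return {"handled": len(event["Records"])}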

Of course, there are always several architectural options to solve these kinds of problems in the AWS environment. Amazon Kinesis, for example, is a real-time stream processing service, but it is designed to ingest large volumes of continuous streaming data. Our data, however, comes in uneven bursts, and SQS suits that scenario better as a message broker and decoupling mechanism. One just needs to be careful to set the parameters correctly, and to make sure that the actions intended for the Lambda function will execute within the Lambda limits (including the 15-minute timeout and 3,008 MB maximum memory allocation). The built-in logic of Lambda and SQS gave us minimal infrastructure to manage and monitor, as well as high concurrency capabilities within the given limits.

To float, or not to float – Choosing correct numeric data type in Snowflake

Fixed-point numeric data types should be the default choice when designing Snowflake relational tables. Using floating-point data types has multiple downsides, which are important to understand. The effect of choosing a different numeric data type can be easily tested.

Snowflake numeric data types can be split into two main categories: fixed-point numbers and floating-point numbers. In this blog post we are going to look at what these mean from a Snowflake database design point of view, and especially whether you should use the floating-point type when dealing with numeric data.

Fixed-point numbers are exact numeric values, and include data types such as number, integer and decimal. For these data types, the developer can specify the precision (the allowed number of digits in the column) and the scale (the number of digits to the right of the decimal point). In Snowflake, all fixed-point numeric data types are actually type decimal with precision 38 and scale 0, if not specified differently. Typical use cases for fixed-point data types are natural numbers and exact decimal values, such as monetary figures, where they need to be stored precisely.

On the other side of the spectrum are floating-point numbers, which are approximate representations of numeric values. In Snowflake, floating-point columns can be created with the keywords float, double, double precision or real. However, as the Snowflake documentation states, all of these data types are actually stored as double; they do not differ in precision, but are displayed as floats. Floating-point data types are mainly used in mathematics and science to simplify calculations with scientific notation. Storing numbers with major differences in magnitude is their advantage in databases too, because zeros before or after the decimal point do not consume memory as they do in decimal format. In traditional computing, floats are considered faster for computation, but is that really the case in modern database design?
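
The distinction between exact and approximate types is easy to demonstrate outside Snowflake, for example in Python, where Decimal is an exact fixed-point type and float is an IEEE-754 double:

from decimal import Decimal

print(Decimal("0.1") + Decimal("0.2"))  # 0.3, exactly
print(0.1 + 0.2)                        # 0.30000000000000004 as doubles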

Floating-point precision

First, let us explore inserting data in Snowflake into a table with float as the numeric data type. We create a table and insert 500 million rows of generated dummy data into the table with the following query:

CREATE TABLE T_FLOAT_500M (
   id            INT,
   float_field   FLOAT );

 
INSERT INTO T_FLOAT_500M
  SELECT SEQ8()
       , UNIFORM(1, 500000000, RANDOM())::FLOAT
  FROM TABLE ( GENERATOR ( ROWCOUNT => 500000000 ) );

To see the effect of using float as the data type for big numeric values, we can run the following query:

SELECT SUM(float_field) AS NO_CONVERSION
     , SUM(float_field::INT) AS CONVERT_VALUES_SEPARATELY
     , SUM(float_field)::INT AS CONVERT_SUM_VALUE
FROM T_FLOAT_500M;

Sum without the conversion produces us a rounded number with scientific notation: 1.24997318579082e+17
Separately converted values sum produces result: 124997318579081654
Conversion made after the calculation produces sum value: 124997318579081664

From the sum results, we notice the accuracy problem of storing numeric values as floats: the sum results differ from each other. When dealing with large or extremely accurate numeric values, floats may cause discrepancies in results due to their nature as approximate representations. The same effect can be seen with WHERE clauses, as the approximate representations may not behave as intended in conditions that compare against exact numeric values.
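
Since Snowflake stores float columns as 64-bit IEEE 754 doubles, the same behaviour can be demonstrated in any language that uses doubles. Here is a quick illustration in Python, whose float is also a 64-bit double:

# 0.1 and 0.2 have no exact binary representation
total = 0.1 + 0.2
print(total)         # 0.30000000000000004
print(total == 0.3)  # False -> an equality condition would not match

# Integers above 2**53 can no longer all be represented exactly
print(float(2**53) == float(2**53 + 1))  # True -> two different values collide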

Storage size of float

Next, we create a similar table to the earlier one, but with the second field being type integer, and populate it without converting the random figures to floats.

CREATE TABLE T_INT_500M (
   id          INT,
   int_field   INT);

INSERT INTO T_INT_500M
  SELECT SEQ8()
       , UNIFORM(1, 500000000, RANDOM())::INT
  FROM TABLE ( GENERATOR ( ROWCOUNT => 500000000 ) );

SHOW TABLES LIKE '%_500M';

Looking at the Snowflake table statistics, we notice that the integer table is smaller (3.37 GB) compared to the float table (5.50 GB). The difference in table sizes is significant, with the float table being 63% bigger. This can be explained by Snowflake reserving 64 bits of memory for every float value. Integer values, on the other hand, are stored in a compressed format and take only the necessary amount of memory.

This difference also shows in SELECT queries: querying all rows with an X-Small warehouse takes only 85 seconds for the integer table compared to 160 seconds for the float table. The difference is once again major.

Summary

Floats still have their use cases with numbers that have majorly different magnitudes. But from the presented test cases, we can draw the conclusion that using floating-point data types leads to bigger storage sizes and longer query times, which result in increased data warehousing costs. Another thing to consider is the possibility of imprecise values when dealing with extremely accurate data. Whenever possible, it is recommended not to use float as a type for numeric data without a specific reason. Precise and whole numbers especially are not meant to be stored as floats and should be given appropriate data types.


Amazon’s innovation culture

At re:Invent one can only admire the number of new innovations that AWS publishes. There were 339 announcements before the event even started. I had a chance to attend some sessions at this year's re:Invent to get a better understanding of Amazon's innovation culture.

It all starts with the customer and finally ends with one. At Amazon everything stems from the company's mission: "To be earth's most customer-centric company". Start by identifying what the customer needs and innovate solutions from there. Jeff Bezos has stated in a letter to shareholders that "customers are always beautifully, wonderfully dissatisfied". It is the same thing Henry Ford noted many years ago: customers do not know what they want. Amazon is not trying to build a faster horse, but something that will delight its customers. Amazon is after the minimum lovable product (MLP).

Innovation needs structure, and at Amazon innovation is organised around four interconnected concepts: culture, mechanism, architecture and organisation. The culture is built around 14 leadership principles, and these are the core of innovation. The principles are guidelines that help people bring out the best in themselves; they challenge everyone to be curious, take responsibility, challenge the status quo, move fast and think long term. Amazon uses data and takes calculated risks, but speed matters, so they have some guidelines to support it. Some decisions can be considered one-way doors, others two-way doors. A decision can be seen as a two-way door if it can be reversed; then the risk is lower, and one should move with speed. It is also interesting to hear that Amazon pushes leaders to make decisions when they have only around 70% of the needed data.

Amazon does 198 million deployments a year

Amazon's architecture allows rapid development. Amazon went from one monolith to microservices, and this allowed them to move from a quarterly change cycle to 198 million deployments a year. The transformation from monolith to microservices was based on the idea of Conway's law, which states that "organisations which design systems … are constrained to produce designs which are copies of the communication structures of these organisations." Amazon mapped the communication structures of their organisation and came up with two-pizza teams (American pizza) of 4-8 people. These decentralised teams have freedom and responsibility and are able to fail fast. I had a chance to talk to many AWS employees, and to my surprise they all had the same feeling about working for Amazon: to them it felt like working for a startup. This is somewhat amazing when we are talking about a company with 750,000 employees.

The mechanism for innovation is the working-backwards process, a deliberately laborious process that starts with identifying the customer and the customer need. To understand the customer, there are five questions: who is the customer, what is the problem or opportunity, what is the benefit, how can it be quantified, and what does the experience look like? The customer is always a person. Once the customer and the need have been clearly understood and validated with data, one can start brainstorming boundless, big and bold ideas. After this it is time to draft a fictional press release, internal and external FAQs, and finally a visual of the customer experience. The press release is a way to democratise the idea, so that everyone has a similar opportunity to share theirs. All meetings around the idea start with all members reading the paper and then iterating over it. All this is done by the team created around the idea, before a single line of code is written. This is a heavy process, but its purpose is to make sure that what you build answers a customer need.

It was said that AWS itself came out of a press release drafted by Andy Jassy and iterated 45 times. Amazon Prime was also built from a press release. There have of course been mistakes, such as the Amazon Fire Phone, but the lessons learned from it were the basis for Alexa. There is no magic in this, just hard work. To build something for the customer, one has to start with the customer.

AWS launches major new features for Amazon SageMaker to simplify development of machine learning models

Machine learning continues to grow on AWS, and they are putting serious effort into paving the way for customers' machine learning development journey on the AWS cloud. The Andy Jassy keynote at AWS re:Invent was a fiesta for data scientists, with the newly launched Amazon SageMaker features, including Experiments, Debugger, Model Monitor, AutoPilot and Studio.

AWS really aims to make the whole development life cycle of machine learning models as simple as possible for data scientists. With the newly launched features, they are addressing common, effort-demanding problems: monitoring your data validity from your model's perspective and monitoring your model performance (Model Monitor), experimenting with multiple machine learning approaches in parallel for your problem (Experiments), enabling cost efficiency of heavy model training with automatic rules (Debugger), and following these processes in a visual interface (Studio). These processes can even be orchestrated for you with AutoPilot, which, unlike many services, is not a black-box machine learning solution but provides all the generated code for you. The announced features also included an SSO-integrated login to SageMaker Studio and SageMaker Notebooks, with the possibility to share notebooks, including the needed runtime dependencies and libraries, with other data scientists in one click (preview).

Compare and try out different models with SageMaker Experiments

Building a model is an iterative process of trials with different hyperparameters and of seeing how they affect the performance of the model. SageMaker Experiments aims to simplify this process. With Experiments, one can create trial runs with different parameters and compare them. It provides information about the hyperparameters and performance for each trial run, regardless of whether a data scientist has run training multiple times, has used automated hyperparameter tuning or has used AutoPilot. It is especially helpful when automating some steps or the whole process, because the number of training jobs run is typically much higher than with the traditional approach.

Experiments makes it easy to compare trials, see what hyperparameters were used and monitor the performance of the models, without having to set up the versioning manually. It makes it easy to choose and deploy the best model to production, but you can also always come back and look at the artifacts of your model when facing problems in production. It also provides more transparency, for example into automated hyperparameter tuning and the new SageMaker AutoPilot. Additionally, SageMaker Experiments has its own SDK, so it is possible to call the API from Python to get the best model programmatically and deploy an endpoint for it.
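
As a hedged sketch of the SDK usage (the experiment name and metric column below are placeholder assumptions), the trials of an experiment can be pulled into a pandas DataFrame for comparison:

import sagemaker
from sagemaker.analytics import ExperimentAnalytics

# Collect all trial components of an experiment into a DataFrame.
# "customer-churn-experiment" is a hypothetical experiment name.
analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker.Session(),
    experiment_name="customer-churn-experiment",
)
df = analytics.dataframe()

# The metric column name depends on what your training jobs emit;
# "validation:accuracy - Last" is an assumed example.
best_trial = df.sort_values("validation:accuracy - Last", ascending=False).head(1)
print(best_trial)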

Track issues in model training with SageMaker Debugger

During the training process of your model, many issues may occur that might prevent your model from finishing correctly or from learning patterns. You might, for example, have initialized parameters inappropriately or used inefficient hyperparameters during the training process. SageMaker Debugger aims to help track issues related to your model training (unlike the name indicates, SageMaker Debugger does not debug your code semantics).

When you enable the debugger in your training job, it starts to save the internal model state into an S3 bucket during the training process. The state consists of, for example, the weights of a neural network, accuracies and losses, the output of each layer, and optimization parameters. These debug outputs are analyzed against a collection of rules while the training job is executed. When you enable Debugger for a training job in SageMaker, it will start two jobs: a training job, and a debug processing job (powered by Amazon SageMaker Processing Jobs), which run in parallel; the latter analyzes the state data to check if the rule conditions are met. If you have, for example, an expensive and time-consuming training job, you can set up a debugger rule and configure a CloudWatch alarm for it that kills the job once your rules trigger, e.g. when the loss has converged enough.

For now, the debugging framework of saving internal model states supports only TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost. You can configure your own rules that analyse the model state during training, or use preconfigured ones such as loss not changing or exploding/vanishing gradients. You can use the debug feature visually through SageMaker Studio, or alternatively through the SDK, configuring everything yourself.
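
As an example of the SDK route, a built-in rule can be attached to a training job when creating the estimator. This is a minimal sketch; the container image, role and S3 paths are placeholders:

from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

# Attach the built-in "loss not decreasing" rule to a training job.
estimator = Estimator(
    image_name="<your-training-image>",
    role="<your-sagemaker-execution-role-arn>",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],
)
estimator.fit("s3://<bucket>/train/")

The status of the rule evaluation can then serve, for example, as the source of the CloudWatch alarm that stops the job.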

Keep your model up-to-date with SageMaker Model Monitor

Drifts in data might have a big impact on your model's performance in production, if the data your model receives starts to have different statistical properties than the data it was trained and validated on. Detecting those drifts requires effort, like setting up jobs that calculate the statistical properties of your data and keep them up to date, so that your model does not get outdated. SageMaker Model Monitor aims to solve this problem by tracking the statistics of incoming data, with the goal of ensuring that machine learning models age well.

Model Monitor forms a baseline from the training data used for creating the model. The baseline includes statistics of the data and basic information like the name and datatype of each feature. The baseline is formed automatically, but the automatically generated baseline can be changed if needed. Model Monitor then continuously collects input data from the deployed endpoint and puts it into an S3 bucket. Data scientists can then create their own rules or use ready-made validations for the data and schedule validation jobs. They can also configure alarms for deviations from the baseline. These alarms and validations can indicate that the deployed model is actually outdated and should be retrained.
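
In the Python SDK this maps roughly to two steps: capturing endpoint traffic and suggesting a baseline. Below is a minimal sketch with placeholder S3 paths and an already-built model object:

from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1) Capture the endpoint's input data to S3 when deploying the model.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,
        destination_s3_uri="s3://<bucket>/data-capture/",
    ),
)

# 2) Form the baseline from the training data used for the model.
monitor = DefaultModelMonitor(
    role="<your-sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
monitor.suggest_baseline(
    baseline_dataset="s3://<bucket>/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<bucket>/baseline/",
)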

SageMaker Model Monitor makes monitoring the model quality very easy, but at the same time data scientists keep control and can customize the rules, scheduling and alarms. The monitoring is attached to an endpoint deployed with SageMaker, so if inference is implemented in some other way, Model Monitor cannot be used. SageMaker endpoints are always on, so they can be an expensive solution for cases where predictions are not needed continuously.

Start from scratch with SageMaker AutoPilot

SageMaker AutoPilot is an autoML solution for SageMaker. SageMaker has had automatic hyperparameter tuning for a while already, but in addition to that, AutoPilot takes care of preprocessing the data and selecting an appropriate algorithm for the problem. This saves a lot of data preprocessing time and enables building models even if you're not sure which algorithm to use. AutoPilot supports linear learner, factorization machines, KNN and XGBoost at first, but other algorithms will be added later.

Running an AutoPilot job is as easy as specifying a CSV file and the response variable present in the file. AWS considers models trained by SageMaker AutoPilot white-box rather than black-box models, because it provides the generated source code for training the model, and with Experiments it is easy to view the trials AutoPilot has run.
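
Through the boto3 API the job definition is indeed compact. A sketch with placeholder names, paths and role:

import boto3

sm = boto3.client("sagemaker")

# Point AutoPilot at a CSV prefix in S3 and name the response variable.
sm.create_auto_ml_job(
    AutoMLJobName="churn-autopilot-job",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://<bucket>/autopilot/input/",
        }},
        "TargetAttributeName": "churned",
    }],
    OutputDataConfig={"S3OutputPath": "s3://<bucket>/autopilot/output/"},
    RoleArn="<your-sagemaker-execution-role-arn>",
)

# Poll the status while the job runs.
print(sm.describe_auto_ml_job(AutoMLJobName="churn-autopilot-job")["AutoMLJobStatus"])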

SageMaker AutoPilot automates machine learning model development completely. It remains to be seen how good the resulting models are, but it is a good sign that it provides information about the process. Unfortunately, the description of the process can only be viewed in SageMaker Studio (only available in Ohio at the moment). The supported algorithms are currently quite limited as well, so AutoPilot might not provide the best possible performance for some problems. In practice, running AutoPilot jobs takes a long time, so the costs of using AutoPilot might be quite high; that time is, of course, saved from the data scientist's working time. One possibility, for example when approaching a completely new data set and problem, is to start by launching AutoPilot and get a few models and all the code as a template. That could serve as a kick start for iterating on your problem: you start from tuning the generated code and continue development from there, saving time on the initial setup.

SageMaker Studio – IDE for data science development

The launched SageMaker Studio (available in Ohio) is a fully integrated development environment (IDE) for ML, built on top of JupyterLab. It pulls together the ML workflow steps in a visual interface, with the goal of simplifying the iterative nature of ML development. In Studio one can move between steps, compare results and adjust inputs and parameters. It also simplifies the comparison of models and shows runs side by side in one visual interface.

Studio seems to nicely tie the newly launched features (Experiments, Debugger, Model Monitor and Autopilot) into a single web page user interface. While these new features are all usable through SDKs, using them through the visual interface will be more insightful for a data scientist.

Conclusion

These new features enable more organized development of machine learning models, moving from notebooks to controlled monitoring and deployment and transparent workflows. Of course, several of the actions enabled by these features could be implemented elsewhere (e.g. training job debugging, or data quality control with some scheduled smoke tests), but that again requires more coding and setting up infrastructure. The whole public cloud journey of AWS has aimed to simplify development and take load away by providing reusable components and libraries, and these launches fit that agenda well.

AWS Redshift breaks bond between compute and storage

AWS Redshift took a huge leap forward with the new releases. Decoupling storage and compute is the first step towards a modern cloud data warehouse.

AWS Redshift is the world's most popular data warehouse, but it has faced some tough competition in the market. AWS Redshift has compute and storage coupled, meaning that with a specific instance type you get a fixed amount of storage, which can sometimes be limiting. At the Andy Jassy keynote, AWS released a new managed storage model for Redshift that allows you to scale compute decoupled from storage.

The storage model uses SSDs and S3 for the storage behind the scenes and utilises architectural improvements of the infrastructure. It allows users to keep hot data on SSDs and also query historical data stored in S3 seamlessly from Redshift. On top of this, you only pay for the storage you use. It also comes with new Nitro-based compute instances. In Ireland, an RA3 instance is priced at $15.578 per node/hour, but you get 48 vCPUs, 384 GB of memory and up to 64 TB of storage. You can cluster these up to 128 instances. AWS promises 3x the performance of any other cloud data warehouse service, and Redshift Dense Storage (DS2) users are promised twice the performance and twice the storage at the same cost. RA3 is available now in Europe in EU (Frankfurt), EU (Ireland), and EU (London).

Related to the decoupling of compute and storage, AWS announced AQUA, the Advanced Query Accelerator, which promises 10 times better query performance. AQUA sits on top of S3 and is Redshift compatible. For this feature we have to wait until mid-2020 to get hands-on.

While AWS Redshift is the world's most popular data warehouse, it is not practical to load all kinds of data there. Sometimes data lakes are more suitable places for data, especially for unstructured data. Amazon S3 is the most popular choice for cloud data lakes. The new Redshift features help tie structured and unstructured data together to enable even better and more comprehensive insight.

With the Federated Query feature (preview), it is now possible to query data in Amazon RDS PostgreSQL and Amazon Aurora PostgreSQL from a Redshift cluster. The queried data can then be combined with data in the Redshift cluster and in Amazon S3. Federated queries allow data ingestion into Redshift without any other ETL tool, by extracting data from the above-mentioned data sources, transforming it on the fly, and loading it into Redshift. Data can also be unloaded from Redshift to S3 in Apache Parquet format using the Data Lake Export feature. With this feature you are able to build some nice lifecycle features into your design.
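
As a sketch of what Data Lake Export looks like in practice, an UNLOAD statement writes query results to S3 as Parquet. The cluster credentials, bucket and IAM role below are placeholders:

import psycopg2

# Connect to the cluster and archive old rows to the data lake as Parquet.
conn = psycopg2.connect(
    host="<cluster>.redshift.amazonaws.com",
    port=5439, dbname="dw", user="admin", password="<password>",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        UNLOAD ('SELECT * FROM sales WHERE sale_date < ''2019-01-01''')
        TO 's3://<bucket>/sales-archive/'
        IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
        FORMAT AS PARQUET;
    """)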

“One should use the best tool for the job”, reminded Andy Jassy at the keynote. With the long-awaited decoupling of storage and compute and big improvements to the core, Redshift took a huge leap forward. It is extremely interesting to start designing new solutions with these features.

Pose detection to help seabird research – Baltic Seabird Hackathon

Team Solita participated in the Baltic Seabird Hackathon in Gothenburg last week. Based on the huge amount of material and data available, we decided to introduce pose detection as a method to understand seabird behavior and interactions. The results were promising, yet still leave room for improvement.

Baltic Seabird Hackathon

Some weeks ago we decided to participate in the Baltic Seabird Hackathon in Gothenburg. The hackathon was organised by AI INNOVATION of Sweden, Baltic Seabird Project, WWF, SLU and Chalmers University of Technology. In practice we spent a few weeks preparing ourselves, going through the massive dataset and creating some models to work with the data. Finally we travelled to Gothenburg and spent two days there finalising our models, presenting the results and, of course, just spending time with the other teams and networking with nice people. In this post we will dive a bit deeper into the process of creating the prediction model for pose detection and the results we were able to create.

Initially we didn't know that much about seagulls, but during those couple of weeks we got to learn wonderful details about the birds, their living habits and social interactions. I bet you didn't know that the oldest birds are over 45 years old! During the hackathon days in Gothenburg we had many seabird experts available to discuss with and to ask the more challenging questions about the birds. In addition, machine learning and technical experts were available to support the work in the provided data factory platform. We decided to work in an AWS sandbox environment, because it was the more natural choice for us.

Our team was selected to have cross-functional expertise in design, data, data science and software development, and to be able to work in a multi-site setup. During the hackathon we had three members working in Gothenburg and two members working remotely from Sweden and Finland.

So what did we try and achieve?

Material available

For the hackathon we received some 2,000 annotated images and 100+ hours of video from two different camera locations on the island of Stora Karlsö. The cameras were installed for the first time in 2019, so all this material was quite new. The videos started from the beginning of May, when the first birds arrive at the same ledge as they do every year, and covered the life of the birds until the beginning of August, when most of them had already left.

The images and videos were in Full HD resolution, i.e. 1920×1080, which gives a really good starting point. The cameras filmed from above, and most of the videos and images looked like the example below. The annotated birds were the ones on the top ledge. There were also videos and images from night time, which made predicting a bit harder.

Our idea and approach

The initial ideas from the seabird experts were related to identifying different events in the video clips. They were interested in finding out automatically when an egg was laid, when birds were leaving for and coming back from fishing trips, and when they were doing other activities.

We thought that implementing these requirements would be quite straightforward with the big annotation set, and thus decided to try something else and took a slightly different approach. Also, because of personal interests, we wanted to investigate what pose detection of the birds could provide to the scientists.

First some groundwork – Object detection

Before being able to detect the poses of the birds, one needs to identify where the birds are and what kind of birds there are. We were provided with over 2,000 annotated images containing annotations for adult birds, chicks and eggs. The number of annotated chicks and eggs was far smaller than that of adult birds, and therefore we decided to focus on adult birds. With the eggs there was also the issue of the ledge color being similar to the egg color, making it much harder to separate eggs from the ledge.

We decided to use the ImageAI (https://github.com/OlafenwaMoses/ImageAI) Python library for object detection. It has been built with simplicity in mind, and therefore it was fast and easy to take into use given the existing annotation set. All we had to do was transform the existing annotations into PascalVOC format. After the initial setup we trained the model with about 200 images, because we didn't want to spend too much time in the object detection phase. There is a good tutorial available on how to do it with your custom annotation set: https://github.com/OlafenwaMoses/ImageAI/blob/master/imageai/Detection/Custom/CUSTOMDETECTIONTRAINING.md
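
Following that tutorial, the training code itself stays very compact. Here is a sketch assuming a data directory with train/validation splits in PascalVOC format and a pretrained YOLOv3 model file (the directory, class name and file names are made up for illustration):

from imageai.Detection.Custom import DetectionModelTrainer

trainer = DetectionModelTrainer()
trainer.setModelTypeAsYOLOv3()
# Expects train/ and validation/ subfolders with images and PascalVOC annotations.
trainer.setDataDirectory(data_directory="seabirds")
trainer.setTrainConfig(
    object_names_array=["adult bird"],
    batch_size=4,
    num_experiments=50,
    train_from_pretrained_model="pretrained-yolov3.h5",
)
trainer.trainModel()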

Even with very lightweight training we easily got over 95% precision for the detections. This was enough for our original approach of focusing on poses rather than activities. Otherwise we would probably have continued developing the object detection model further to identify the different activities happening on the ledge, as some of the other teams decided to do.

Based on these bounding boxes we were able to create 640×640 clips of each bird. We utilised FFMPEG to crop the video clips.
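
The cropping can be scripted, for example, along these lines (the coordinates come from the detected bounding boxes; the file names here are made up):

import subprocess

def crop_clip(src, dst, x, y, size=640):
    # ffmpeg's crop filter takes width:height:x:y, where (x, y) is the
    # top-left corner of the crop area in the original 1920x1080 frame.
    subprocess.run(
        ["ffmpeg", "-i", src, "-filter:v", f"crop={size}:{size}:{x}:{y}", dst],
        check=True,
    )

crop_clip("ledge_video.mp4", "bird_01.mp4", x=600, y=180)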

Now we got some action – Pose detection

For years now there has been research on, and models for, detecting human poses from images and videos. Based on these concepts, Jake Graving and Daniel Chae have developed DeepPoseKit (https://github.com/jgraving/DeepPoseKit) for detecting the poses of animals. They have also focused on making the pose detection much faster than in previous libraries. DeepPoseKit is written in Python and uses TensorFlow and Keras in the background. You can read the paper about DeepPoseKit here: https://elifesciences.org/articles/47994

The process for utilising DeepPoseKit has 4 main steps:

  1. Create the annotation set. This will define the resolution and color of the images used as the basis for the model. The skeleton (joints and their connections to each other) also needs to be defined in a CSV as a parent-child hierarchy. For the resolution it is probably easiest if the annotation set resolution closely matches what you expect to get from the videos; that way you don't need to adjust the frames during the prediction phase. For the color scale you should at least consider whether the model works more reliably in grayscale or in RGB color space.

  2. Annotate the images in the annotation set. This is the laborious part: it requires you to go through the images one by one, marking all the skeleton keypoints. The GUI DeepPoseKit provides is pretty simple to use.

  3. Train the model. This definitely takes some time even with GPU. There is also support for augmented data, so you can really improve the model during the training.

  4. Create predictions based on the model.

You can later increase the size of the annotated set and add more images to the set. Also the training can be continued based on existing model and thus the library is pretty flexible.

Because the development of DeepPoseKit is still in its early phases, there are at least two considerable constraints to remember:

  • The library can only detect individual poses, so if you have multiple animals in the same frames, you need some additional steps to separate the animals

  • DeepPoseKit only supports image resolutions that can be repeatedly divided by 2 (e.g. 320×320, 640×640)

Because of these limitations, and considering our source material, we came up with the following process:

So we decided to create a separate clip for each identified bird, run pose detection on these clips, and then in the end combine the individual pose detection predictions back into the original video.

To get started we needed the annotation set. We decided to use the provided sample script (https://github.com/jgraving/DeepPoseKit/blob/master/examples/step1_create_annotation_set.ipynb) that takes in a video and picks random frames from it. We originally started with 100 images and increased that to 400 during the hackathon.

For the skeleton we ambitiously decided to model 16 keypoints. This turned out to be quite a task, but we managed to do it. In the end we also created a simplified version of the skeleton and the annotations, including only 3 keypoints (beak, head and tail). The original skeleton included the eyes and different parts of the wings and legs.

This is what the annotations for the complex skeleton look like:

The simplified skeleton model has only 3 keypoints:

With these 2 annotation sets we were able to create 2 models (simple with 3 keypoints and complex with 16 keypoints).

To train the model we pretty much followed the sample script provided by the developers of DeepPoseKit (https://github.com/jgraving/DeepPoseKit/blob/master/examples/step3_train_model.ipynb). Due to the limited time available we did not have time to work much with augmented data, which could have improved the accuracy of the models. Running an epoch with 45 steps took about 5-6 minutes for the complex model on an AWS p3.2xlarge instance (1 GPU). We managed to run around 45 epochs in total, ending with a final validation loss of around 25. Because development is never such a straightforward process, we had to restart the training of the model from scratch a few times during the hackathon.

The results

When the model was about ready, we ran the detections for a few different videos we had available. Once again we followed the example in the DeepPoseKit library (https://github.com/jgraving/DeepPoseKit/blob/master/examples/step4b_predict_new_data.ipynb). Basically we ran through the individual clips and created the skeleton prediction frame by frame. After we had this data together, we transformed the prediction coordinates from the clip resolution (640×640) to match the original video resolution (1920×1080). In addition to the original script, we fine-tuned the graphs a bit and included, for example, an order number for each skeleton. In the end we had a CSV file containing, for each identified bird, for each frame in the video and for each identified skeleton keypoint, the keypoint coordinates and a confidence percentage. We also added the radius and degrees between keypoints and the distances of connected keypoints. The radius could later be used to analyse, for example, in which direction a bird is moving its head. In practice, for one identified bird this generated 160 rows of data per second. Below is a sample of the generated dataset.
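
The coordinate transformation itself is simple, since each clip's position in the full frame is known from the cropping step. A simplified sketch:

# Map a keypoint predicted inside a bird's 640x640 clip back to the
# original 1920x1080 frame; (clip_x, clip_y) is the clip's top-left
# corner in the full frame, known from the cropping step.
def to_frame_coords(kx, ky, clip_x, clip_y):
    return clip_x + kx, clip_y + ky

print(to_frame_coords(320, 100, 600, 180))  # -> (920, 280)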

The results looked more promising when the area in front of the camera was less crowded. Below is an example of two birds' poses visualised with the complex model, and the results seem quite OK. The predictions follow the movement of the birds pretty well.

The challenges are more obvious when we add more birds to the frame:

The problem is clear if you look at one of the identified birds and its generated 640×640 clip. Because the birds are so close to each other, one frame contains multiple birds, and the model starts to mix parts of different birds together.

The video above also shows that the bird in the upper right corner is not correctly modeled when it spreads its wings. This is just an indication that the annotation set does not include enough varied poses of the birds, and thus the model doesn't learn those poses.

If we instead use the simpler model on the busy video, it behaves a bit better. Still, it is far from optimal.

So at this point we were puzzling over how to improve the model precision and started to look for additional methods.

Shape detection to the help?

One of the options that came to mind was to try to leave only the identified bird visible in each 640×640 frame. The core of the idea was that if only an individual bird were visible in the frame, the pose detection would not get confused by other birds. Another team had partly used this method to rule out all the birds in the distance (the upper part of the image). Due to the shape of the birds, nothing standard, such as a vignette filter, would work out of the box.

So we headed out to look for better alternatives and found Mask R-CNN (https://github.com/matterport/Mask_RCNN). It has a somewhat similar approach to the pose detection: you first have to annotate a lot of pictures and then train the model. Due to the limited time available, we had to try using Mask R-CNN with just 20 annotated images.

After very quick training it seemed as if the model had a really low validation loss, but unfortunately the results were not that good. As you can see from the video below, only parts of the birds were identified by the shape detection (shapes marked in blue).

So we think this is a relevant idea, but unfortunately we didn't have time to verify it.

Another idea we had was to detect some kind of pattern that would help identify the birds that form a couple. We played around with the idea that by estimating a density map over the individual birds we could identify the couples as areas of high density. If the birds are tracked, then the birds with a high density output would be classified as a couple. This would open up a lot of possibilities for the scientists to track the couples and do research on their patterns. The first thing to do for this task is to put a single marker on each individual bird: instead of tagging the bird as a whole, we tag the head of the bird, which is a single point. Imagine the background as black, with the top of the head of each individual bird marked with a color. The deep learning architectures used for this were U-Net and FCRN (Fully Convolutional Regression Networks), which are the common architectures for estimating density maps. We got the idea from this blog post (https://towardsdatascience.com/objects-counting-by-estimating-a-density-map-with-convolutional-neural-networks-c01086f3b3ec), which uses estimated density maps to count the number of objects. We ended up using this approach both to identify the couples and to count the birds. Sadly there was not enough time to get reasonable results, but the idea was very much appreciated by the judges and could be something for them to think about and move forward with.

Another idea would be to use some kind of tagging of the birds. That would work as long as the birds remain on the ledge, but in general it might be a bit challenging, as the videos are very long and the birds move around the ledge to some extent.

What next?

Well, with these results we won the 3rd prize in the hackathon. According to the jury, the biggest achievements were related to the pose detection and the possibilities it opens up for science. It seems that there is not much research done on the social interactions of seagulls, and our pose detection model could help with that.

It was clear to all of us that we wanted to donate the money, and we have now decided to give it as a scholarship to a student who will take the models and develop them further for the benefit of seabird science. It will be interesting to see what the models can tell us about the life of the Baltic seabirds, their social interactions and, in general, the socioeconomics of the Baltic Sea.

On behalf of our Team Solita (Mari Harju, Jani Turunen, Kimmo Kantojärvi, Zeeshan Dar and Layla Husain),

Kimmo & Zeeshan

 

Power BI Embedded web app in Azure

Use Power BI Embedded and Azure Web Apps to show your PBI report on a web page or in your intranet, where users can securely access it.

Introduction

Power BI provides easy and readily available tools for report creation and distribution. However, distributing reports outside Power BI Service and to users without a Power BI Pro account is limited. Say you have a Power BI report monitoring your Snowflake account usage (https://data.solita.fi/visualize-snowflake-warehouse-cost-using-power-bi/), and you'd like to share it with the development and maintenance team. You don't want to purchase Power BI Pro licenses, and you don't want to open the report to the whole world. Instead you simply want to show your PBI report on a web page or in your intranet, where users can securely access it – and here's where Power BI Embedded and Azure Web Apps come in handy!

In this blog post we’ll set up an environment with a Web App, and use a service principal (whose identity we’ll retrieve from Azure Key Vault) to access Power BI Service to retrieve a report for Embedding. All the work will result in a template that is easy to copy for new web applications to embed new reports. We’ll also take a look at some handy modifications to our Web App template for retrieving a desired report page and hiding navigation elements. You don’t need to be familiar with .NET, we’ll be doing only slight modifications to an existing Microsoft template!

Environment

The target environment is presented in the picture below. The goal is to set up the environment in a way that is easily reusable: for example, adding a few more web applications with other reports embedded in them shouldn't be too much trouble.

Power BI Embedded and Web App environment

App Service Plan

Azure App Service Plan provides a pool of resources for App Services running inside the Plan. You can view it a bit like a server farm in old on-premises systems. The main decisions to make are your App Service Plan location (West Europe, US East, etc…) and the amount and level of resources available inside that Plan: number and size of instances, pricing tier. Start small when developing and scale up as required. As always, refer to Azure documentation for up-to-date information on different service levels: https://docs.microsoft.com/en-us/azure/app-service/overview-hosting-plans

App Service

Azure App Service provides infrastructure for a single application inside an App Service Plan. Each App Service receives a URL pointing to that particular App Service: for example https://ourprojectsnowflakemonitoring.azurewebsites.net. Our embedded report is a web application running in App Service. We’ll be using App Service managed identity to identify our application to Azure Key Vault. And lastly we can use App Service network configurations to restrict access to our application.

Web application

A web application running in Azure App Service can be written in a number of languages.  We’ll be using a template provided by Microsoft from https://github.com/microsoft/PowerBI-Developer-Samples/tree/master/App%20Owns%20Data. We’ll modify the template so that we can retrieve configurations from Azure Key Vault, namely our Service Principal information. We will also use application settings so we can easily copy our web app for new embedded reports.

Azure Key Vault

Key Vault provides a secure place for accessing and storing secrets. In this example, we’ll be storing our Service Principal credentials there, and allowing access to our App Service.

Service Principal

A Service Principal can be seen as a technical account in Azure AD. It offers a way to authenticate as a service, not as a user. Using Service Principals to authenticate your applications to Power BI Service is the more scalable way: say you have multiple workspaces and multiple applications using the reports; creating, managing and authorising multiple service principals is simpler than managing multiple "Master Users". For much more detailed documentation, please refer to https://docs.microsoft.com/en-us/power-bi/developer/embed-service-principal

Power BI Embedded

Power BI Embedded provides Power BI capacity outside Power BI Service. We can set up workspaces to use the Embedded capacity and provide the reports to outside users via web apps. Power BI Embedded comes in three capacities: A for capacity in Azure, EM for volume licensing, and P for Premium licence. Unless you already have volume licensing or Premium capacity, you would most probably want to use A capacity bought from Azure.

Power BI Service, Workspace and Report

The Embedded report resides as usual in Power BI Service in a Workspace that is assigned to use Embedded capacity. As we are using Service Principal, our Workspace needs to be V2, and we need to allow Service Principals to use it.

Environment setup

Terraform

I'm used to creating my environments using Terraform (https://www.terraform.io/). There are a lot of blog posts out there discussing why you should or shouldn't use Terraform in Azure. I personally like its easy readability and reusability.

Using Terraform in Azure and setting up a CI/CD pipeline from Azure DevOps is a topic for another blog post. Here, we'll concentrate on setting up our environment and the different configurations we require. So don't worry right now about the Terraform syntax; let's focus on the output.

We are creating an App Service Plan sized D1, in the Shared tier. This is suitable for development and testing.

resource "azurerm_app_service_plan" "service_plan" {
  name                = "blogpowerbiembedded"
  location            = "${module.defaults.resource_location}"
  resource_group_name = "${azurerm_resource_group.rg.name}"
  tags = "${module.defaults.tags}"
  sku {
    tier = "Shared"
    size = "D1"
  }
}

Then we add an App Service to the previously created App Service Plan. We set the App Service to get a "System Assigned" identity. With a SystemAssigned identity, our newly created web app receives an Azure AD identity, which we then authorize to access Azure Key Vault to retrieve the Service Principal secrets. This way we don't have to authorize anything else (for example another service principal) besides the application itself to retrieve the secret.

We also set an app_settings block. These are application configurations, which will override the configurations set in code. These will become handy later on.

resource "azurerm_app_service" "snowflake_monitor" {
  name                = "example-snowflake-monitor"
  location            = "${module.defaults.resource_location}"
  resource_group_name = "${azurerm_resource_group.rg.name}"
  app_service_plan_id = "${azurerm_app_service_plan.service_plan.id}"
  https_only          = "true"
 
 identity {
    type = "SystemAssigned"
  } 

  app_settings {
    "reportId" = "nnnnnn"
    "workspaceId" = "nnnnn"
  }
}

Creating a key vault is pretty straightforward. What is special here is that we grant "get" and "list" permissions to the system-assigned identity of our App Service: the tenant_id and object_id dynamically reference the corresponding outputs of the system-assigned identity. No need to hardcode anything.

resource "azurerm_key_vault" "keyvault" {
  name                        = "${module.defaults.name_prefix_safe}pbiemb"
  resource_group_name         = "${azurerm_resource_group.rg.name}"
  location                    = "${module.defaults.resource_location}"
  enabled_for_disk_encryption = true
  tenant_id                   = "${module.secrets.tenant_id}"
  sku_name                    = "standard"

  tags = "${module.defaults.tags}"
}

resource "azurerm_key_vault_access_policy" "snowflake_monitor" {
  key_vault_id          = "${azurerm_key_vault.keyvault.id}"
  tenant_id = "${azurerm_app_service.snowflake_monitor.identity.0.tenant_id}"
  object_id = "${azurerm_app_service.snowflake_monitor.identity.0.principal_id}"
  secret_permissions = [
    "get",
    "list"
  ]
}

Remember to restrict access to your web app if required, for example to only your office and VPN IPs.

Service Principal

Next, we need a service principal: a technical account which accesses Power BI Service and the REST APIs. It doesn't require any licenses and it can't log in via app.powerbi.com. More info about Service Principals in the Power BI Embedded context can be found at https://docs.microsoft.com/en-us/power-bi/developer/embed-service-principal

Create the Service Principal any way you like: az cli, PowerShell, Azure Portal, have your Azure Admin do it. We then need to add permissions to our Service Principal to use Power BI REST APIs. This is pretty straightforward in Azure Portal: https://docs.microsoft.com/en-us/power-bi/developer/register-app#apply-permissions-to-your-application-within-azure-ad

What you need are the Service Principal Client ID and Client Secret. Add them to your Key Vault:

resource "azurerm_key_vault_secret" "sp_id" {
  name      = "servicePrincipalClientId"
  value     = "${module.secrets.powerbi_embedded_sp_client_id}"
  key_vault_id  = "${azurerm_key_vault.keyvault.id}"
  tags = "${module.defaults.tags}"
}

resource "azurerm_key_vault_secret" "sp_secret" {
  name      = "servicePrincipalClientSecret"
  value     = "${module.secrets.powerbi_embedded_sp_client_secret}"
  key_vault_id  = "${azurerm_key_vault.keyvault.id}"
  tags = "${module.defaults.tags}"
}

Here the actual values are retrieved from another place. We don’t want to store sensitive information in plaintext in Terraform.

Power BI Service and Workspace

Setting up Power BI Embedded can be a bit of a hassle, and the documentation is spread around. A more complete guide is a topic for its own blog post, so here's a short summary of what needs to be done:

  1. Create Power BI Embedded capacity in Azure. That means A-tier capacity (P-tier is for Premium, and EM for volume licensing). Note that Embedded capacity should be in the same region as your Power BI Service is. Don’t forget to add capacity admin.
  2. Add Embedded capacity to your Power BI Service 
  3. Assign a Workspace V2 to the Embedded capacity. As we are using Service Principals, we need to use a Workspace V2.
  4. Allow Service Principals in Tenant settings
  5. Allow service principals in the Workspace containing the reports you want to embed.

Power BI Report for Embedding

Now that everything else is ready we still need the actual report and the web page to display and embed the report. The embedded report resides as usual in Power BI Service, published in our Workspace V2 added to Embedded capacity.

Open your report, and publish it to the correct workspace. Note that you will need a Power BI Pro license to do so.

After publishing, the parameters we require for embedding are Report and Workspace Id. These you can retrieve from your published report. Open your report and look at the url:

Here the part after “groups/” is your workspace ID. The part after “reports/” is your report ID. 

You might remember we had an app settings block in our web app:

  app_settings {
    "reportId" = "nnnnnn"
    "workspaceId" = "nnnnn"
  }

Update your application settings.

  app_settings {
    "reportId" = "nnnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnn"
    "workspaceId" = "nnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnn"
  }

Now we have an empty web app which has access to Azure Key Vault to retrieve the Service Principal ID and secret in order to access our Power BI. All that is left is the web application itself.

Web App

Our web app will be a simple web page which shows the report from Power BI. We’ll be using the template with no further authentication. The template is called “app owns data”. It is advisable to restrict web app access to your office IPs for example. 

For Web App, we’ll use the app owns data sample provided by Microsoft: https://github.com/microsoft/PowerBI-Developer-Samples/tree/master/App%20Owns%20Data

Download a copy of the project, and open up the PowerBIEmbedded_AppOwnsData.sln in Visual Studio. The template requires some modifications:

  1. Reference the Key Vault for secrets
  2. Use the Key Vault as a source for configurations

Besides Key Vault, we’ll also use application settings (set before, report ID and workspace ID) as a source for configurations. This happens automatically, and doesn’t require any modifications.

Setup Azure Key Vault

Let’s add our Key Vault as a connected service:

Select the Key Vault we just created. Note, you might need to add access rights to yourself depending on your setup!

Use Key Vault as a source for configurations

Update and add NuGet packages:

You’ll want to update the already existing packages. We also need to add one NuGet package in order for us to use Azure Key Vault as a config source. Add NuGet package:

Microsoft.Configuration.ConfigurationBuilders.Azure

Now we want to modify our configurations. Open the web.config file. At the start of the file, add a configBuilders section (modify "yourkeyvaultname" to match your key vault). Also add configBuilders="AzureKeyVault" to the appSettings block.

<configuration>
  <configSections>
    <section name="entityFramework" type="System.Data.Entity.Internal.ConfigFile.EntityFrameworkSection, EntityFramework, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
    <section name="ServicePrincipal" type="System.Configuration.NameValueSectionHandler" />
    <section name="MasterUser" type="System.Configuration.NameValueSectionHandler" />
    <!-- For more information on Entity Framework configuration, visit http://go.microsoft.com/fwlink/?LinkID=237468 -->
    <section name="configBuilders" type="System.Configuration.ConfigurationBuildersSection, System.Configuration, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" restartOnExternalChanges="false" requirePermission="false" />
  </configSections>
  <configBuilders>
    <builders>
      <add name="AzureKeyVault" vaultName="yourkeyvaultname"
           type="Microsoft.Configuration.ConfigurationBuilders.AzureKeyVaultConfigBuilder, Microsoft.Configuration.ConfigurationBuilders.Azure, Version=1.0.0.0, Culture=neutral"
           vaultUri="https://yourkeyvaultname.vault.azure.net" />
    </builders>
  </configBuilders>
  <appSettings configBuilders="AzureKeyVault">
    <add key="webpages:Version" value="3.0.0.0" />
    <add key="webpages:Enabled" value="false" />
    <add key="ClientValidationEnabled" value="true" />
….

In the same web.config, find appSettings, and change your application to use ServicePrincipal.

 <add key="AuthenticationType" value="ServicePrincipal" />

For local debugging, you can add workspaceId and reportId. However, the app_settings values we previously set for web application in Azure will override anything you set in here.

    <add key="workspaceId" value=""/>
    <add key="reportId" value=""/>

Fill your Azure AD tenant:

    <!-- Fill Tenant ID in authorityUrl-->
    <add key="authorityUrl" value="https://login.microsoftonline.com/common/f4nnnnn-nnnn-nnnn-nnnn-nnnnnnnnnnnn" />

You need to add empty key-value pairs for every Key Vault secret you want to import! Let’s say our ServicePrincipal secret is called “servicePrincipalClientSecret”. To import it, we need to add the following line in app settings:

 <add key="servicePrincipalClientSecret" value="" />

We are also retrieving the Service Principal ID from Key Vault, so it needs an empty key-value pair as well:

 <add key="servicePrincipalClientId" value="" />

The actual values are fetched from Azure Key Vault.

Use configuration values in our application

Check which configuration values map to which values in the application by modifying Services/EmbedService.cs. In our example, we need to configure ApplicationSecret to point to servicePrincipalClientSecret, and ApplicationId to point to servicePrincipalClientId. This is at the very start of the file:

public class EmbedService : IEmbedService
{
    ...
    private static readonly string ApplicationId = ConfigurationManager.AppSettings["servicePrincipalClientId"];
    ...
    private static readonly string ApplicationSecret = ConfigurationManager.AppSettings["servicePrincipalClientSecret"];
    ...



Start debugging:

A web page should open. Click on Embed report, and it should open a report for you!

And that's it! You've successfully set up a Web App which connects to a Key Vault using a managed identity to get client secrets, and then uses those secrets to identify itself as a Service Principal to embed a Power BI report on your web page! That was a mouthful; in other words, you are using application-specific secrets the correct way, instead of some hard-coded, human-impersonating Power BI account.

Publish our Web App

All that is left is to publish our web app:

Let’s use our previously created web app:

Click OK and then Publish, and you are good to go! You might want to restrict access to your web app to only your office and VPN IPs.

Some modifications

Disable filter and navigation panes from report

Say you are showing your report in a non-interactive way (on a TV screen, for example). You might want to hide the redundant filter and navigation panes from the view:

Modify Views/Home/EmbedReport.cshtml. Find the following:

    var config = {
        type: 'report',
        tokenType: models.TokenType.Embed,
        accessToken: accessToken,
        embedUrl: embedUrl,
        id: embedReportId,
        permissions: models.Permissions.All,
        settings: {
            filterPaneEnabled: true,
            navContentPaneEnabled: true
        }
    };

Change filterPaneEnabled and navContentPaneEnabled in the settings block to false. You can find other interesting settings, such as locale, and their documentation at https://github.com/Microsoft/PowerBI-JavaScript/wiki/Embed-Configuration-Details.

Go to report page based on URL parameter

You have a nice report showing on your TV with navigations and filters disabled. Nice. Now you’d like to show different pages from that report based on a URL.

Modify Controllers/HomeController.cs

Add ReportPage to EmbedReport:

public async Task<ActionResult> EmbedReport(string username, string roles, int ReportPage = 0)

The zero in ReportPage means that by default we’ll open the first page.

Modify Views/Home/EmbedReport.cshtml.

Change the end of the file from:

    // Embed the report and display it within the div container.
    var report = powerbi.embed(reportContainer, config);

To

    // Get a reference to the embedded report HTML element
    var reportContainer = $('#embedContainer')[0];

    // call load() instead of embed() to load the report while delaying the rendering process
    var report = powerbi.load(reportContainer, config);

    // when loaded event occurs, set current page then call render()
    report.on("loaded", function () {
       // console.log("loaded event executing");
        // call to get Pages collection
        report.getPages().then(
            function (pages) {
                // inspect pages in browser console
                //console.log(pages);
                // display specific page in report
                var startPage = pages[@ViewBag.ReportPage];
                config.pageName = startPage.name;
                // Call report.render() to display report
                report.render(config);
            });
    });

 

Now you can set the report page in the URL. Use ReportPage=<page_number_starting_from_0> as the parameter: for example, "EmbedReport?ReportPage=1" would take you to the second page.

Further customization:

Embed container size and other style related: Content/style/style.css

Webpage customization: Views/Home/EmbedReport.cshtml

For much more information, check .NET MVC Framework: https://docs.microsoft.com/fi-fi/aspnet/mvc/overview/getting-started/introduction/adding-a-view

It is possible to require users to authenticate to your Azure AD: this option is called "users own data", and the template is found in the same GitHub repo as our app-owns-data example.

Publishing a new report

Now that we have a working template, publishing a new web app pointing to a new report is extremely fast and easy.

  1. Get your new report’s report ID. Use the same workspace, or remember to add the service principal to the new Power BI workspace.
  2. Create a new web application in Azure. It's just copying in your Terraform: update the reportId and workspaceId, and add a key vault access policy, as shown in the snippet below. Here we are creating a new web app called "Solita-blog-datapipeline-monitor".
  3. Copy our web application solution.
  4. Open the new copy, and publish it to the new web app in Azure. 
resource "azurerm_app_service" "datapipeline_monitor" {
  name                = "solita-blog-datapipeline-monitor"
  location            = "${module.defaults.resource_location}"
  resource_group_name = "${azurerm_resource_group.rg.name}"
  app_service_plan_id = "${azurerm_app_service_plan.service_plan.id}"
  https_only          = "true"
  identity {
    type = "SystemAssigned"
  } 

  app_settings {
    "reportId" = "Some-New-value-nnnn-xxxxxxx"
    "workspaceId" = "nnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnn"
  }
}

resource "azurerm_key_vault_access_policy" "datapipeline_monitor" {
  key_vault_id          = "${azurerm_key_vault.keyvault.id}"
  tenant_id = "${azurerm_app_service.datapipeline_monitor.identity.0.tenant_id}"
  object_id = "${azurerm_app_service.datapipeline_monitor.identity.0.principal_id}"
  secret_permissions = [
    "get",
    "list"
  ]
}

Wrap up

You now know what kind of environment you need to securely use Web App, Power BI Service and Power BI Embedded in Azure. You know how to do some modifications to your web app or at least where to look for more information. And after the first application, publishing a new one is just a simple matter of creating a new copy with two new parameter values.

After working with Power BI Embedded for a while, I've grown to appreciate its capabilities. This was just about embedding a complete report from Power BI Service, and I'm looking forward to doing some more complex work combining web applications with Power BI Embedded!

An HTTP proxy through AWS EC2 Ubuntu for a fake IP address.

Creating an http proxy server to cloud – A hobby project

Introducing an approach to reroute web traffic through a virtual machine in the cloud. This was a personal competence development project for which Solita allows their employees to spend some working hours.

Why did I want to create a proxy server in the cloud?

While I was on a business trip in Sweden I had some lazy time to watch a documentary from a video streaming service. Unfortunately the web service was available only in Finland, so I had to come up with other ways to spend my evening. The obvious choice was to try whether it would be theoretically possible to hack my location to watch such programs abroad.

To stay on the brighter side of the law, I decided to only validate the approach on a conceptual level. I never tried the hack in the actual service, and neither should you.

And yes, there are tons of software products that make this easier. Rather, my goal was to learn new skills and sharpen my developer competence. The focus was more on the functionality than on cyber security.

Disclaimer: use this article only for legally valid business purposes, such as rerouting traffic in your own web service. Always read the rules of the web services that you are using.

Choosing the cloud provider for the proxy server

I needed a virtual server located physically in Finland. Initially I planned to use Google Cloud virtual machines, as Google has a data center on Finnish soil. Microsoft Azure and Amazon Web Services do not have data centers in Finland.

Well, the first attempt failed quickly, because Google Cloud did not assign the proxy server an IP address that geolocates to Finland. So I switched my focus to creating a general purpose HTTP proxy server on AWS to fake the location for web services.

Next come the instructions to replicate my approach. The examples are primarily for Windows users.

Creating a virtual server in AWS for the HTTP proxy

Create an AWS account if you don’t have one, and log in to the AWS console in the browser.

From Services select EC2. Select the preferred region from the top right corner. I usually choose Ireland because it has one of the most comprehensive service selections in Europe. By this choice the traffic will be rerouted through Ireland.

Click Launch instance.

Select Ubuntu 18.04 LTS as the image for the EC2 instance.

Click next until you are prompted to create an SSH key. Name the key as you wish, download it and launch the instance.

Go back to EC2 instance view and note the IP address. In my case it was 18.203.111.131. It is safe to publish the info here, as the virtual machine is already destroyed.

Connect to AWS EC2 instance and create a tunnel

You need to have PuTTY Key Generator and PuTTY installed. The key file was downloaded in pem format from AWS. Convert the pem file to ppk using PuTTY Key Generator. Load the pem file and click Save private key.

Normally you would never want to show the private key to anyone. The key and the EC2 instance for this tutorial have already been destroyed.

Open PuTTY and give it the username and IP address of the remote machine. For an AWS EC2 Ubuntu instance the default user is ubuntu.

Go to Connection > SSH > Auth and browse to the ppk file that you just saved.

In Connection > SSH > Tunnels, create a tunnel that routes all traffic from your local machine’s port 8080 (Source port) to port 3128 of the remote EC2 Ubuntu instance (Destination). Port 3128 is the default port of the squid proxy tool for Linux, which we will install soon.

Click Open from the bottom of PuTTY. The terminal window appears.

Install squid in the virtual machine to make it an http proxy server

Install squid to the remote Ubuntu machine.

sudo apt update
sudo apt install squid

Find the line in the squid configuration file where HTTP access is denied by default.

grep -n 'http_access deny all' /etc/squid/squid.conf

Open the squid configuration file for editing.

sudo nano /etc/squid/squid.conf

Press CTRL and _ (underscore) simultaneously and enter the line number to jump to the correct line.

Change the value to:

http_access allow localhost

Press CTRL and x to exit. Choose y to save the changes. Hit Enter to overwrite.

Finally restart the squid service.

sudo service squid restart

The proxy server is now successfully configured.

Configure Google Chrome browser for http proxy

Go to the Google Chrome settings on your laptop. Find the settings by searching for the keyword proxy.

Click Open proxy settings. The browser opens Windows settings for the Internet Properties. Click LAN settings.

Route all traffic through your local machine’s (127.0.0.1) port 8080. That port again was tunneled to AWS EC2 port 3128 where the squid proxy server is running on top of the Ubuntu operating system.

Click Ok to see the magic happening.

Checking if the http proxy server works in the browser

I went to a page which detects the client’s IP address and shows the geographical location. The web service thinks I’m in Ireland where the AWS data center is located.

Once I switch off the browser proxy from Chrome/Windows my IP address points to my actual location. At the moment of writing I was in Gothenburg, Sweden.
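
If you’d rather check this from a script than from the browser, a small Python snippet does the same verification. This is a sketch of my own making, not part of the original setup: it assumes the PuTTY tunnel is open on local port 8080, and the requests library and the ipify service are my choices for the check.

import requests

# Route both plain and TLS traffic through the local end of the SSH tunnel.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

# api.ipify.org simply returns the public IP address the request comes from.
ip_via_proxy = requests.get("https://api.ipify.org", proxies=proxies, timeout=10).text
ip_direct = requests.get("https://api.ipify.org", timeout=10).text

print("Via proxy:", ip_via_proxy)  # should geolocate to the AWS region, e.g. Ireland
print("Direct:   ", ip_direct)     # your actual location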

PySpark looks like regular Python code, but the distributed nature of the execution requires a whole new way of thinking to optimize the code.

PySpark execution logic and code optimization

PySpark looks like regular Python code. In reality the distributed nature of the execution requires a whole new way of thinking to optimize the PySpark code.

This article focuses on understanding PySpark execution logic and performance optimization. PySpark DataFrames play an important role here.

To try PySpark in practice, get your hands dirty with this tutorial: Spark and Python tutorial for data developers in AWS

DataFrames in pandas as a PySpark prerequisite

PySpark needs a totally different kind of engineering compared to regular Python code.

If you are going to work with PySpark DataFrames it is likely that you are familiar with the pandas Python library and its DataFrame class.

Here comes the first source of potential confusion: despite their similar names, PySpark DataFrames and pandas DataFrames behave very differently. It is also easy to confuse them in your code. You might want to use suffix like _pDF for pandas DataFrames and _sDF for Spark DataFrames.

The pandas DataFrame object stores all the data represented by the data frame within the memory space of the Python interpreter. All of the data is easily and immediately accessible. The operations on the data are executed immediately when the code is executed, line by line. It is easy to print intermediate results to debug the code.

However, these advantages are offset by the fact that you are limited by the local computer’s memory and processing power constraints – you can only handle data which fits into the local memory. But since the operations are done in memory, with a basic data processing task you do not need to wait more than a few minutes at maximum.

PySpark DataFrames and their execution logic

The PySpark DataFrame object is an interface to Spark’s DataFrame API and a Spark DataFrame within a Spark application. The data in the DataFrame is very likely to be somewhere else than the computer running the Python interpreter – e.g. on a remote Spark cluster running in the cloud.

There are two distinct kinds of operations on Spark DataFrames: transformations and actions. Transformations describe operations on the data, e.g. filtering a column by value, joining two DataFrames by key columns, or sorting data. Actions are operations which take DataFrame(s) as input and output something else. Some examples of actions would be showing the contents of a DataFrame or writing a DataFrame to a file system.

The key to understanding how Spark works is that transformations are lazy. Executing a Python command which describes a transformation of a PySpark DataFrame into another does not actually require any calculations to take place. Ordering by a column and calculating aggregate values, returning another PySpark DataFrame, would be such a transformation. Rather, the operation is added to the graph describing what Spark should eventually do.
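
As a minimal sketch of the difference (the file path and column name are made up for illustration), everything below runs instantly except the last line, which is the only action:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Transformations: Spark only extends the execution plan here.
events_sDF = spark.read.csv("s3://my-bucket/events.csv", header=True)  # hypothetical path
big_events_sDF = events_sDF.filter(F.col("amount") > 100).orderBy("amount")

# Action: only now is the plan optimized and executed on the cluster.
big_events_sDF.show(10)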

When an action is requested – e.g. return the contents of this Spark DataFrame as a pandas DataFrame – Spark looks at the processing graph and then optimizes the tasks which need to be done. This is the job of the Catalyst optimizer, and it enables Spark to optimize the operations to a very high degree.

Also, the actual computation tasks run on the Spark cluster, meaning that you can have huge amounts of memory and processing cores available for the actual computation, even without resorting to the top-of-the-line virtual machines offered by cloud providers.

Consider caching to speed up PySpark

However, the highly optimized and parallelized execution comes at a cost: it is not as easy to see what is going on at each step. Looking at the data after some transformations means that you have to gather the data, or its subset, to a single computer. This is an action, so Spark has to determine the computation graph, optimize it, and execute it.

If your dataset is large, this may take quite some time. This is especially true if caching is not enabled and Spark has to start by reading the input data from a remote source – such as a database cluster or cloud object storage like S3.

You can alleviate this by caching the DataFrame at some suitable point. Caching causes the DataFrame partitions to be retained on the executors and not be removed from memory or disk unless there is a pressing need. In practice this means that the cached version of the DataFrame is available quickly for further calculations. However, playing around with the data is still not as easy or quick as with pandas DataFrames.
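
In code, the idea is simply this (a sketch with generated stand-in data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sDF = spark.range(10_000_000)  # stand-in data

# Mark the DataFrame for caching; the call itself is lazy.
sDF.cache()

# The first action materializes the partitions into the cache...
print(sDF.count())

# ...and later actions reuse the cached partitions instead of recomputing them.
print(sDF.filter(sDF.id % 2 == 0).count())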

Use small scripts and multiple environments in PySpark

As a rule of thumb, one PySpark script should perform just one well defined task. This is due to the fact that any action triggers the transformation plan execution from the beginning. Managing and debugging becomes a pain if the code has lots of actions.

The normal flow is to read the data, transform the data and write the data. Often the write stage is the only place where you need to execute an action. Instead of debugging in the middle of the code, you can review the output of the whole PySpark job.

With large amounts of data this approach would be slow. You would have to wait a long time to see the results after each job.

My suggestion is to create environments that have different sizes of data. In the environment with little data you test the business logic and syntax. The test cycle is rapid as there’s no need to process gigabytes of data. Running the PySpark script with the full dataset then reveals the performance problems.

This goes well together with the traditional dev, test, prod environment split.

Favor DataFrame over RDD with structured data

An RDD (Resilient Distributed Dataset) can be any collection of items – for example, a shopping list.

["apple", "milk", "bread"]

RDD is the low-level data representation in Spark, and in earlier versions of Spark it was also the only way to access and manipulate data. However, the DataFrame API was introduced as an abstraction on top of the RDD API. As a rule of thumb, unless you are doing something very involved (and you really know what you are doing!), stick with the DataFrame API.

DataFrame is a tabular structure: a collection of Columns, each of which has a well defined data type. If you have a description and amount for each item in the shopping list, then a DataFrame would do better.

+-------+-----------+------+
|product|description|amount|
+-------+-----------+------+
|apple  |green      |5     |
|milk   |skimmed    |2     |
|bread  |rye        |1     |
+-------+-----------+------+

This is also a very intuitive representation for structured data, something that can be found from a database table. PySpark DataFrames have their own methods for data manipulation just like pandas DataFrames have.
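
For example, the shopping list DataFrame above could be created like this (a sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

shopping_sDF = spark.createDataFrame(
    [("apple", "green", 5), ("milk", "skimmed", 2), ("bread", "rye", 1)],
    schema=["product", "description", "amount"],
)
shopping_sDF.show()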

Avoid User Defined Functions in PySpark

As a beginner I thought PySpark DataFrames would integrate seamlessly to Python. That’s why I chose to use UDFs (User Defined Functions) to transform the data.

A UDF is simply a Python function which has been registered to Spark using PySpark’s spark.udf.register method.
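
For illustration, a trivial UDF could be registered and used like this (a sketch; the function itself is made up):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def describe_amount(amount):
    return "many" if amount >= 3 else "few"

# Register the Python function for use in both Spark SQL and the DataFrame API.
describe_udf = spark.udf.register("describe_amount", describe_amount, StringType())

sDF = spark.createDataFrame([("apple", 5), ("bread", 1)], ["product", "amount"])
sDF.select("product", describe_udf("amount")).show()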

With the small sample dataset it was relatively easy to get started with UDF functions. When running the PySpark script with more data, Spark threw an OutOfMemory error.

Investigating the issue revealed that the code could not be optimized when using UDFs. To Spark’s Catalyst optimizer, a UDF is a black box. This means that Spark may have to read in all of the input data, even though the data actually used by the UDF comes from small fragments of the input. In other words, filtering the data at the read step, close to the data – known as predicate pushdown – cannot be used.

Additionally, there is a performance penalty: on the Spark executors, where the actual computations take place, data has to be converted (serialized) in the Spark JVM to a format Python can read, a Python interpreter spun up, the data deserialized in the Python interpreter, the UDF executed, and the result serialized and deserialized again to the Spark JVM. All of this takes significant amounts of time!

The recommendation is to stick to native PySpark DataFrame functions whenever possible, since they are translated directly to native Scala functions running on Spark.

If you absolutely, positively need to do something with UDFs in PySpark, consider using the pandas vectorized UDFs introduced in Spark 2.3 – the UDFs are still a black box to the optimizer, but at least the performance penalty of moving data between the JVM and the Python interpreter is a lot smaller.
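
A hedged sketch of the same idea with a pandas vectorized UDF (requires Spark 2.3+ and the pyarrow package; the VAT multiplier is an arbitrary example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# The function receives and returns whole pandas Series, so data moves between
# the JVM and the Python interpreter in large Arrow batches instead of row by row.
@pandas_udf(DoubleType())
def add_vat(price):
    return price * 1.24

sDF = spark.createDataFrame([(10.0,), (25.5,)], ["price"])
sDF.select(add_vat("price")).show()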

Number of partitions and partition size in PySpark

In order to process data in a parallel fashion on multiple compute nodes, Spark splits data into partitions, smaller data chunks. A DataFrame of 1,000,000 rows could be partitioned to 10 partitions having 100,000 rows each. Additionally, the computation jobs Spark runs are split into tasks, each task acting on a single data partition. A Spark cluster has a driver that distributes the tasks to multiple executors. This means that the datasets can be much larger than what fits into the memory of a single computer – as long as the partitions fit into the memory of the computers running the executors.

In one of the projects our team encountered an out-of-memory error that we spent a long time figuring out. Finally we found out that the problem was a result of too large partitions. The data in a partition simply could not fit into the memory of a single executor node.

Too few partitions also make the execution inefficient. Some of the executor cores idle while others are working at full steam if there are not as many partitions as there are available cores (or, technically, available slots).

However, having a large amount of small partitions is not optimal either – shuffling the data in the small partitions is inefficient. Also reading and writing to disk (not to mention a network destination) in small chunks potentially increases the total execution time.

The Spark programming guide recommends 128 MB partition size as the default. For 128 GB of data this would mean 1000 partitions. Without going too deep in the details, consider partitioning as a crucial part of the optimization toolbox. If your partitions are too large or too small, you can use the coalesce() and repartition() methods of DataFrame to instruct Spark to modify the partition distribution. The number of partitions in a DataFrame sDF can be checked with sDF.rdd.getNumPartitions().
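
In code, checking and adjusting the partitioning looks like this (a sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sDF = spark.range(1_000_000)

print(sDF.rdd.getNumPartitions())  # current number of partitions

more_sDF = sDF.repartition(200)  # full shuffle; can increase or decrease the count
fewer_sDF = sDF.coalesce(4)      # narrow operation; only decreases, avoids a full shuffle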

Summary – PySpark basics and optimization

PySpark offers a versatile interface for using powerful Spark clusters, but it requires a completely different way of thinking and an awareness of the differences between the local and distributed execution models. The functionality offered by the core PySpark interface can be extended by creating User-Defined Functions (UDFs), but as a tradeoff the performance is not as good as for native PySpark functions due to a lesser degree of optimization. Partitioning the data correctly and with a reasonable partition size is crucial for efficient execution – and as always, good planning is the key to success.


How to visualize Snowflake Warehouse costs in euros using Power BI

In this blog post, I'll demonstrate how you can create a Power BI report to visualize monthly Snowflake virtual warehouse costs in euros and make Snowflake costs visible even to those who don't have access to the Snowflake console.

At the core of Snowflake are the virtual warehouses, which enable unlimited scalability for computing. Data load, integration and visualization tools can have their own virtual warehouses. You can even provide a dedicated virtual warehouse for a business analyst who wants to query the raw data by themselves.

You do want to ensure, though, that Snowflake costs won’t skyrocket through misuse of the computational power that Snowflake provides so easily. The easiest way to do this is to visualize the virtual warehouse credit costs in Power BI (or a similar tool) using Snowflake’s internal data dictionary and to share the dashboard with the relevant parties who don’t have access to the Snowflake console.

If you open any Snowflake database, you see that it comes with a schema called INFORMATION_SCHEMA. INFORMATION_SCHEMA is Snowflake’s own data dictionary of everything inside the database.

The LOAD_HISTORY view, for example, contains information about all the files that have been loaded into Snowflake. Do you want to know who has logged into Snowflake recently and what queries they have run? Just visit the LOGIN_HISTORY and QUERY_HISTORY views. The insides of INFORMATION_SCHEMA have been illustrated greatly in Kent Graziano’s “Using Snowflake Information Schema”, so I won’t go into more detail.

Our use case is to visualize the credit cost of virtual warehouses in euros, and for this we will need:

1) A Snowflake user who has access either to the ACCOUNT_USAGE schema or to the desired INFORMATION_SCHEMAs
2) Power BI or a similar BI tool to visualize the warehouse credit usage

Finding the correct schema and the values we need

As we want to visualize the credit usage of warehouses, we want to use the WAREHOUSE_METERING_HISTORY view. The view is located in two places: in the ACCOUNT_USAGE schema and in the INFORMATION_SCHEMA of every database (historical and current usage information is shared across the account in every INFORMATION_SCHEMA). This time we’re going to use the ACCOUNT_USAGE schema. The values that interest us are START_TIME, WAREHOUSE_NAME and CREDITS_USED.
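
If you want to sanity-check those columns before building the report, you can also run the query outside Power BI. Here is a hedged sketch using the Snowflake Python connector; the connection parameters are placeholders, and the connector itself is my choice rather than part of the Power BI setup:

import snowflake.connector

# Placeholder credentials – use your own account identifier, user and role.
conn = snowflake.connector.connect(
    account="xy12345",
    user="REPORTING_USER",
    password="***",
    role="ACCOUNTADMIN",
)

cur = conn.cursor()
cur.execute("""
    SELECT START_TIME, WAREHOUSE_NAME, CREDITS_USED
    FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
    ORDER BY START_TIME DESC
""")
for start_time, warehouse_name, credits_used in cur.fetchmany(10):
    print(start_time, warehouse_name, credits_used)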

Creating Power BI data sources

Power BI supports Snowflake, but before you can access Snowflake data you need to download the Snowflake ODBC driver. Once you have the setup running, create a new data source using the “Get Data” button. In my case, I have chosen the “DV_WAREHOUSE_METERING_HISTORY_S” table, which is the same as the WAREHOUSE_METERING_HISTORY table but is loaded weekly to provide historical data beyond one year.

The costs of the warehouses are stored in the CREDITS_USED column of the WAREHOUSE_METERING_HISTORY table, but the Snowflake credits need to be converted into dollars and euros and grouped accordingly.

Add two new columns to the data set using the Modeling tab and the “New Column” button. Name the first column “Dollar” and give it the current value of one dollar in euros. At the moment one dollar is valued at 0.89 euros, so we will use that value. In the future we could make this value update automatically, but for a good cost estimate a fixed value works fine.

Remember to give the column enough decimals so that Power BI won’t round the value up to one.

Create a second column and name it “Euro”. For this column, use the following calculation, which sums the used credits and multiplies the result by the value of the dollar. The outcome is the used credits in euros.

EURO = CALCULATE(SUM([CREDITS_USED])*CALCULATE(SUM(DV_WAREHOUSE_METERING_HISTORY_S[Dollar])))

Creating Power BI report

Once you have the data sources ready, creating the final report is really easy. Add START_TIME (on month level) to AXIS, WAREHOUSE_NAME to LEGEND and finally EURO as the value.

For this visualization, I have chosen a clustered column chart. You can add SUM of EURO to the dashboard to give it a final touch. I have also added a note that this does not include storage costs. The final result should look something like below.

Hopefully this short guide will help you visualize your Snowflake costs and ease the task of proving that Snowflake is really cheap if you use it correctly.

For more information
https://docs.snowflake.net/manuals/sql-reference/functions/warehouse_metering_history.html
https://docs.snowflake.net/manuals/sql-reference/account-usage/warehouse_metering_history.html
https://docs.snowflake.net/manuals/sql-reference/info-schema.html
https://docs.snowflake.net/manuals/sql-reference/account-usage.html


AWS Glue tutorial with Spark and Python for data developers

This AWS Glue tutorial is a hands-on introduction to create a data transformation script with Spark and Python. Basic Glue concepts such as database, table, crawler and job will be introduced.

In this tutorial you will create an AWS Glue job using Python and Spark. You can read the previous article for a high level Glue introduction.

In the context of this tutorial Glue could be defined as “A managed service to run Spark scripts”.

In some parts of the tutorial I refer to this GitHub code repository.

Create a data source for AWS Glue

Glue can read data either from a database or from an S3 bucket. For this tutorial I created an S3 bucket called glue-blog-tutorial-bucket. You will have to come up with another name on your own AWS account.

In the S3 console, create two folders called read and write.

The S3 bucket has two folders. In AWS a folder is actually just a prefix for the file name.

 

Upload this movie dataset to the read folder of the S3 bucket.

The data for this Python and Spark tutorial in Glue contains just 10 rows of data. Source: IMDB.

Crawl the data source to the data catalog

Glue has a concept of a crawler. A crawler sniffs metadata from the data source, such as file format, column names, column data types and row count. The metadata makes it easy for others to find the needed datasets. The Glue catalog enables easy access to the data sources from the data transformation scripts.

The crawler will catalog all files in the specified S3 bucket and prefix. All the files should have the same schema.

In Glue crawler terminology the file format is known as a classifier. The crawler identifies the most common classifiers automatically, including CSV, JSON and Parquet. It would also be possible to create a custom classifier where the schema is defined in grok patterns, which are close relatives of regular expressions.

Our sample file is in the CSV format and will be recognized automatically.

Instructions to create a Glue crawler (a scripted equivalent is sketched after the list):

  1. In the left panel of the Glue management console click Crawlers.
  2. Click the blue Add crawler button.
  3. Give the crawler a name such as glue-blog-tutorial-crawler.
  4. In Add a data store menu choose S3 and select the bucket you created. Drill down to select the read folder.
  5. In Choose an IAM role, create a new role. Name it, for example, glue-blog-tutorial-iam-role.
  6. In Configure the crawler’s output add a database called glue-blog-tutorial-db.
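
The same crawler can also be created programmatically. Below is a hedged boto3 sketch that mirrors the console steps above; the region is an assumption, and the IAM role must already exist:

import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # region is an assumption

# Mirrors the console steps: name, role, target database and the S3 path to crawl.
glue.create_crawler(
    Name="glue-blog-tutorial-crawler",
    Role="glue-blog-tutorial-iam-role",
    DatabaseName="glue-blog-tutorial-db",
    Targets={"S3Targets": [{"Path": "s3://glue-blog-tutorial-bucket/read/"}]},
)

glue.start_crawler(Name="glue-blog-tutorial-crawler")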

 

Summary of the AWS Glue crawler configuration.

 

When you are back in the list of all crawlers, tick the crawler that you created. Click Run crawler.

Note: If your CSV data needs to be quoted, read this.

The crawled metadata in Glue tables

Once the data has been crawled, the crawler creates a metadata table from it. You’ll find the results in the Tables section of the Glue console. The database that you created during the crawler setup is just an arbitrary way of grouping the tables.

Metadata for the Glue table. You can see properties as well as column names and data types from this view.

 

Glue tables don’t contain the data but only the instructions how to access the data.

Note: For large CSV datasets the row count seems to be just an estimation.

AWS Glue jobs for data transformations

From the Glue console left panel go to Jobs and click the blue Add job button.

Follow these instructions to create the Glue job (again, a boto3 equivalent is sketched after the list):

  1. Name the job as glue-blog-tutorial-job.
  2. Choose the same IAM role that you created for the crawler. It can read and write to the S3 bucket.
  3. Type: Spark.
  4. Glue version: Spark 2.4, Python 3.
  5. This job runs: A new script to be authored by you.
  6. Security configuration, script libraries, and job parameters:
    1. Maximum capacity: 2. This is the minimum and costs about 0.15$ per run.
    2. Job timeout: 10. Prevents the job from running longer than expected.
  7. Click Next and then Save job and edit the script.
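
For completeness, here is a hedged boto3 sketch of the same job definition; the region and the script location are placeholders you would supply yourself:

import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # region is an assumption

glue.create_job(
    Name="glue-blog-tutorial-job",
    Role="glue-blog-tutorial-iam-role",
    Command={
        "Name": "glueetl",  # a Spark job
        "ScriptLocation": "s3://glue-blog-tutorial-bucket/scripts/job.py",  # placeholder
        "PythonVersion": "3",
    },
    GlueVersion="1.0",  # Spark 2.4, Python 3
    MaxCapacity=2.0,    # the 2 DPU minimum
    Timeout=10,         # minutes
)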

Editing the Glue script to transform the data with Python and Spark

Copy this code from Github to the Glue script editor.

Remember to change the bucket name for the s3_write_path variable.

Save the code in the editor and click Run job.

The Glue editor to modify the Python flavored Spark code.

 

The detailed explanations are commented in the code. Here is the high level description:

  1. Read the movie data from S3
  2. Get movie count and rating average for each decade
  3. Write aggregated data back to S3

The execution time with 2 Data Processing Units (DPU) was around 40 seconds. The relatively long duration is explained by the start-up overhead.

The data transformation script creates summarized movie data. For example, the 1990s had 4 movies in the IMDB top 10, with an average score of 8.95.

 

You can download the result file from the write folder of your S3 bucket. Another way to investigate the job would be to take a look at the CloudWatch logs.

The data is stored back to S3 as a CSV in the “write” prefix. The number of partitions equals the number of output files.

Speeding up Spark development with Glue dev endpoint

Developing Glue transformation scripts is slow if you just run one job after another. Provisioning the computation cluster takes minutes, and you don’t want to wait after each change.

Glue has a dev endpoint functionality where you launch a temporary environment that is constantly available. For development and testing it’s both faster and cheaper.

The dev endpoint provides the processing power, but a notebook server is needed for writing your code. The easiest way to get started is to create a new SageMaker notebook by clicking Notebooks under the Dev endpoint section in the left panel.

About Glue performance

In the code example we first read the data into Glue’s DynamicFrame and then converted that to a native PySpark DataFrame. This method makes it possible to take advantage of the Glue catalog but at the same time use native PySpark functions.

However, our team has noticed Glue performance to be extremely poor when converting from DynamicFrame to DataFrame. This applies especially when you have one large file instead of multiple smaller ones. If the execution time and data reading become the bottleneck, consider using the native PySpark read function to fetch the data from S3.

Summary about the Glue tutorial with Python and Spark

Getting started with Glue jobs can take some time with all the menus and options. Hopefully this tutorial gave you some idea of the roles of the database, table, job and crawler.

The focus of this tutorial was on a single script, but Glue also provides tools to manage larger groups of jobs. You can schedule jobs with triggers or orchestrate relationships between triggers, jobs and crawlers with workflows.

Learning the Glue console is one thing, but the actual logic lies in the Spark scripts. Tuning the code has a significant impact on execution performance. That will be the topic of the next blog post.

AWS Glue works well for big data processing. This is a brief introduction to Glue including use cases, pricing and a detailed example.

Introduction to AWS Glue for big data ETL

AWS Glue works well for big data processing. This is a brief introduction to Glue including use cases, pricing and a detailed example.

AWS Glue is a serverless ETL tool in the cloud. In brief, ETL means extracting data from a source system, transforming it for analysis and other applications, and then loading it back to a data warehouse, for example.

In this blog post I will introduce the basic idea behind AWS Glue and present potential use cases.

The emphasis is on big data processing. You can read more about Glue catalogs here and data catalogs in general here.

Why use AWS Glue?

Replacing Hadoop. Hadoop can be expensive and a pain to configure. AWS Glue is simple. Some say that Glue is expensive, but it depends on what you compare it with. Because of the on-demand pricing you only pay for what you use. This fact might make AWS Glue significantly cheaper than a fixed-size on-premise Hadoop cluster.

AWS Lambda can not be used. A wise man said: use lambda functions in AWS whenever possible. Lambdas are simple, scalable and cost efficient. They can also be triggered by events. For big data, however, lambda functions are not suitable because of the 3 GB memory limitation and the 15 minute timeout. AWS Glue is specifically built to process large datasets.

Apply DataOps practices. Drag and drop ETL tools are easy for users, but from the DataOps perspective code-based development is a superior approach. With AWS Glue both code and configuration can be stored in version control. The data development becomes similar to any other software development. For example, the data transformation scripts written in Scala or Python are not limited to the AWS cloud. Environment setup is easy to automate and parameterize when the code is scripted.

An example use case for AWS Glue

Now for a practical example of how AWS Glue works.

A production machine in a factory produces multiple data files daily. Each file is 10 GB in size. The server in the factory pushes the files to AWS S3 once a day.

The factory data is needed to predict machine breakdowns. For that, the raw data should be pre-processed for the data science team.

Lambda is not an option for the pre-processing because of the memory and timeout limitations. Glue seems to be a reasonable option when work hours and costs are compared to alternative tools.

The simplest way to get started with the ETL process is to create a new Glue job and write the code in the editor. The script can be written in either the Scala or the Python programming language.

Extract. The script first reads all the files from the specified S3 bucket into a single data frame. You can think of a data frame as a table in Excel. The reading can be just a one-liner.

Transform. This is where most of the code goes. Let’s say that the original data had 100 records per second. The data science team wants the data to be aggregated per 1-minute interval with a specific logic. This could be just tens of code lines if the logic is simple.

Load. Write data back to another S3 bucket for the data science team. It’s possible that a single line of code will do.

The code runs on top of the Spark framework, which is configured automatically in Glue. Thanks to Spark, the data will be divided into small chunks and processed in parallel on multiple machines simultaneously.

What makes AWS Glue serverless?

Serverless means you don’t have machines to configure. AWS provisions and allocates the resources automatically.

The processing power is adjusted by the number of data processing units (DPU). You can do additional configuration, but it’s likely that a proof of concept works out of the box.

In an on-premise environment you would have to make a decision about the computation cluster size. A big cluster is expensive but fast. A small cluster would be cheaper but slow to run.

With AWS Glue your bill is the result of the following equation:

[ETL job price] = [Processing time] * [Number of DPUs]
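
As a quick sanity check of the equation in Python, using the 0.44$ per DPU-hour list price and the 10-minute minimum mentioned later in this post:

dpu_price_per_hour = 0.44  # USD per DPU-hour
dpus = 2                   # the minimum for a Spark job
runtime_minutes = 10       # Glue bills at least 10 minutes

job_price = dpu_price_per_hour * dpus * runtime_minutes / 60
print(f"Minimum price per run: ${job_price:.2f}")  # about $0.15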

 

The on-demand pricing means that an increase in processing power does not necessarily increase the price of the ETL job. At least in theory, as too many DPUs might cause overhead in processing time.

When is AWS Glue a wrong choice?

This is not an advertisement, so let’s give some critique for Glue as well.

Lots of small ETL jobs. Glue has a minimum billing of 10 minutes and 2 DPUs. With the price of 0.44$ per DPU hour, the minimum cost for a run becomes around 0.15$. The starting price makes Glue an unappealing alternative for processing small amounts of data often.

Specific networking requirements. If you spin up a standard EC2 virtual machine, an IP address will be attached to it. The serverless nature of Glue means you have to put more effort into network planning in some cases. One such scenario would be whitelisting a Glue job in a firewall to extract data from an external system.

Summary about AWS Glue

The most common argument against Glue is “It’s expensive”. True, in the sense that the first few test runs can already cost a few dollars. In a nutshell, Glue is cost efficient for infrequent big data workloads.

In the big picture AWS Glue saves a lot of time and unnecessary hardware engineering. The costs should be compared against alternative options such as on-premise Hadoop cluster or development hours required for a custom solution.

As the Glue pricing model is predictable, the business cases are straightforward to calculate. It might be enough to test just the critical parts of the ETL pipeline to become confident about the performance and costs.

I feel that optimizing the code for distributed computing has been more of a challenge than the Glue service itself. The next blog post will focus on how data developers get started with Glue using Python and Spark.

A curated list of new Snowflake features released at Snowflake Summit 2019

Snowflake coming to Google Cloud, Data Replication, Snowflake Organizations, external tables, Data Pipelines, Data Exchange. The list goes on. Read on for a curated list of new features released at the Snowflake Summit keynote in San Francisco.

Trying to go through all the announcements for a product can sometimes be overwhelming. It takes time, as you need to read through all the individual press announcements. To ease the pain, I’ve gathered a curated list of new Snowflake features released at the first annual Snowflake Summit, held at the Hilton in sunny California.

So let’s begin.

Snowflake released new features which can be broken down into the following categories: Core Data Warehouse, Data Pipelines, Global Snowflake and Secure Data Sharing. I’ll list the features and give a more precise description of each (where possible).

Core Data Warehouse

Core Data Warehouse features are the bread and butter of Snowflake. Everybody is already familiar with Snowflake features such as virtual warehouses and the separation of storage and compute, so Snowflake didn’t make any announcements about those. They did say, though, that they are working on decreasing concurrency latency and making large ad-hoc queries work even better. On a larger scale this means that the boundaries between operational databases and operational data platforms are narrowing.

On the security side, Snowflake said that they are working with multiple vendors on enabling identity passing from BI tools to Snowflake. Basically this means that the user ID you use to log on to, for example, Tableau is passed straight to Snowflake. This will enable even better row-level security and secure view possibilities. Snowflake also reminded that, as of today, it already supports OAuth 2.0 as an authentication method.

Under the hood, they reminded us of JavaScript stored procedure support and teased a possible geospatial capability (nothing released, rather teased).

The folks at Snowflake are also keen on providing a richer experience for end users. They said they had gathered feedback from end users to build better worksheets, a Worksheet 3.0 so to speak. During the investigation phase they encountered a company called Numeracy, which was working on an even better UI for Snowflake, and eventually decided to make it their first acquisition. Now the features of the Numeracy UI are being ported into the Snowflake UI. The Numeracy UI provides better editor suggestions, visualizations and worksheet sharing. On paper this doesn’t sound like anything new, but my colleague took a video of the new UI and it is awesome.

Data Pipelines

Data Pipelines is a new feature coming to Snowflake. Basically Snowflake now has the capability to auto-ingest data arriving in the cloud storage layer and the possibility to transform that data based on user-defined tasks. This means that you no longer need an external scheduler to trigger small ELT or scheduled jobs.

This is possible due to the following new features:
– Auto-Ingest
– Streams and Tasks

During the Data Pipelines presentation a Snowflake Connector for Kafka was also announced.

Under the Data Pipelines headline, the concept of Snowflake as a data lake was introduced in more detail, because Snowflake will now support external tables and SQL over external tables. This means that you don’t need to load the data into Snowflake to get insight into the data (and the data structure). To make things interesting, Snowflake will now also support materialized views on external tables.

As a surprise effect, all the features listed above are available today in public preview.

Global Snowflake

Under the title of Global Snowflake the new regions were introduced. AWS got Canada Central, US-Central (Ohio) and Japan. For Azure, new regions will open in US-West-2, Canada Central, Government and Singapore.

The most anticipated release was the Google Cloud version of Snowflake, which will be in preview in Q3 2019. Google Cloud stages are already possible, though, as noted earlier.

Snowflake Organizations was introduced as a new way to control your Snowflake instances within a large corporation. You now have the possibility to deploy Snowflake instances in your chosen cloud and region through the Snowflake UI, and you can act as the organization admin. Organizations will provide a dashboard of the warehouse and storage costs of all your accounts. So, if your company has multiple Snowflake accounts, this is the new way to go.

Finally, Database Replication was announced. Database replication offers the possibility to replicate your data, at the database level, to another Snowflake account regardless of whether the account is on Azure, AWS or Google Cloud. It doesn’t even matter which region you’re planning to replicate your data to, as database replication supports cross-region replication.

Database replication is a point-and-click version of the traditional Oracle Data Guard or Microsoft SQL Server AlwaysOn Cluster, where you define a database which you want to replicate to a different location for business continuity purposes, or just to provide read-only data nearer to the end users. The difference to the Oracle and Microsoft versions is that the Snowflake implementation works out of the box without any hassle. Database replication also supports failover and failback from the client application’s point of view, meaning that if you have Tableau Server connected to Snowflake and you fail over the database, Tableau Server will reconnect to the new location.

Secure Data Sharing

On the Secure Data Sharing front, Snowflake introduced their concept of a data marketplace, called Data Exchange. Basically, Data Exchange is a marketplace for data that can be used through Snowflake Data Sharing. You can buy data shares (for example consumer travel data, weather data etc.) through Data Exchange and use them straight away in your SQL queries.

There are two options for this, Standard and Personalized. Standard data is a normal data set, available for everybody. Personalized data requires authentication between you and the data provider, but it grants you the possibility to ingest your own data (if possible) straight from the provider using the Data Sharing functionality.
As if that weren’t all, Snowflake even advertised, mid-sentence, a simpler way to deploy Tableau into AWS and configure it to use Snowflake – now in 30 minutes.

Hopefully, this list helps you get a grasp of things to come in the Snowflake landscape. I will update this list if new features are launched tomorrow or on Thursday.

Link list in raw format.
https://www.snowflake.com/blog/snowflake-data-pipelines/
https://www.snowflake.com/blog/numeracy-investing-in-our-query-ui/
https://www.snowflake.com/blog/using-oauth-2-0-with-snowflake/
https://www.snowflake.com/blog/snowflake-introduces-javascript-stored-procedures/
https://www.snowflake.com/news/snowflake-and-google-cloud-announce-strategic-partnership/
https://www.snowflake.com/blog/snowflake-launches-database-replication-and-failover/
https://www.snowflake.com/news/snowflake-announces-data-exchange-to-break-down-data-barriers/
https://aws.amazon.com/about-aws/whats-new/2019/06/new-quick-start-deploys-tableau-and-snowflake-on-the-aws-cloud/

How to use Active Directory users and groups for Snowflake user and role management

In this blog post, I will tell you how to create an asynchronous replication from Microsoft Active Directory to the Snowflake Cloud Data Warehouse so that you can use your existing AD users and groups (including nested groups) to manage Snowflake.

I’ve been using Snowflake for a few years now and although Snowflake is an excellent product, it lacks the possibility to use existing directories (Microsoft Active Directory, OpenLDAP, Google) for managing Snowflake users and roles. It doesn’t though mean that it’s impossible.

As of today, you can already create federated authentication against Okta, ADFS, or another (i.e. custom) SAML 2.0-compliant Identity Provider (IdP) which enables users to do a single sign-on into Snowflake using their (for example) Azure Active Directory information. We’ll come back to this later on, as federated authentication is a key piece for using AD accounts with Snowflake.

In the following, I will tell you how to use and sync Microsoft Active Directory users and groups into Snowflake and what components are needed for this. Let’s start by going through the requirements.

You’ll need following for making the synchronization possible:

1) Rights to modify Snowflake account parameters and create new users
2) Microsoft Azure -subscription for enabling AzureAD Single Sign-On to Snowflake
3) Windows Server connected to Active Directory and snowsql installed
4) This script which sync’s users from AD. Huge thanks to James Weakley.

We will create the following linking between Snowflake, Azure and Microsoft Active Directory.

Active Directory synchronization illustrated

The idea is to create the same structure within Active Directory as you want to have inside Snowflake. Users and groups are provisioned and synced into Snowflake using a PowerShell script running on a Windows Server joined to the AD. Single sign-on authentication happens against Azure AD. So, let’s start configuring.

Enabling Single Sign-On

The steps can be done in any order, but you can start by enabling single sign-on. With single sign-on enabled, you can already start using AD accounts, as long as you create the needed accounts manually. Enabling single sign-on is documented really well by Microsoft at the following site: https://docs.microsoft.com/en-us/azure/active-directory/saas-apps/snowflake-tutorial, so I won’t go through all the steps. In short, the idea is to configure Azure AD as the IdP for Snowflake and eventually configure the same in Snowflake, as detailed below.

USE ROLE ACCOUNTADMIN ;
ALTER ACCOUNT SET SAML_IDENTITY_PROVIDER = '{
"certificate": "longhash",
"ssoUrl": "https://login.microsoftonline.com/hashedtext/saml2",
"type" : "ADFS",
"label" : "AzureAD"
}';

Set the login page (this is the page you see when you log on to Snowflake) to have a new button that users can use to log in with their Active Directory account.

ALTER ACCOUNT SET SSL_LOGIN_PAGE = TRUE ;

Finally, alter any desired test users to have a LOGIN_NAME matching their AD account information. This does the actual mapping. You may also remove the users.

ALTER USER mika.heino@solita.fi SET LOGIN_NAME = 'mika.heino@solita.fi' DISPLAY_NAME = 'Mika Heino, Solita '

Creating Snowflake service account for script use

Once you have single sign-on working and tested, we can move on to automating the user and group (role) provisioning. For this we need a separate Snowflake service account, as we don’t want to run stuff with personal accounts.

Creating roles and users in Snowflake requires the SECURITYADMIN role. You can either create a service account with SECURITYADMIN granted, or create your own security admin role for the service account with the required rights. Either way is fine, as long as it’s documented properly.

For example, create the following user:

CREATE USER S_AUTOMATION_AD_SYNC
  PASSWORD = 'example_password'
  COMMENT = 'Service Account, Used to sync Active Directory users and roles'
  LOGIN_NAME = 'S_AUTOMATION_AD_SYNC'
  DEFAULT_ROLE = "SERVICE_ACCOUNTS_SECURITYADMIN"
  DEFAULT_WAREHOUSE = 'WH_ADE_DEV'
  DEFAULT_NAMESPACE = 'PUBLIC'
  MUST_CHANGE_PASSWORD = FALSE ;

GRANT ROLE "SERVICE_ACCOUNTS_SECURITYADMIN" TO USER S_AUTOMATION_AD_SYNC ;

GRANT ROLE "SECURITYADMIN" TO ROLE "SERVICE_ACCOUNTS_SECURITYADMIN";

Creating OU and Snowflake role structure

Choose or create a new Organizational Unit (OU) for Snowflake users. This will be the container for the roles that you can use to grant rights inside Snowflake. Although this is the most straightforward phase in the whole process, take time to consider which roles (groups) to have and use. Consider at least the following structure:

1) A limited role for persons with ACCOUNTADMIN rights. Keep the number of people here limited, as this role can do anything inside the Snowflake account.
2) A DV_ADMIN/DEVELOPER role. Ideally, this role will have the rights to do anything inside one or more databases, and it is meant for the persons creating the data models and loads, i.e. at Solita the persons doing the Data Vault.
3) A DATA SCIENTIST role. Ideally, this role would have the rights to read anything from all the schemas inside the database containing the business data, and rights to create anything inside a separate development database.
4) An ANALYST role. This role is meant for the business users and should have read rights to the Publish schema, or ideally only to the dimension and fact tables with business rules applied.

Same structure inside Active Directory.

Example Snowflake role set

Once you have the general structure ready, create it inside Active Directory, note the OU structure and move to the next step.

Scheduling the script

Scheduling the PowerShell script is really straightforward. You can use the trusted Task Scheduler (ideally this would run, for example, inside an Azure Function, but this is the first version). You need a Windows Server joined to Active Directory, snowsql installed and the necessary firewall openings done. You’ll also need to provide a few details for the script. The following is an example invocation.

.\snoad.ps1 -snowflakeAccount 'ly12345' -snowflakeUser 'usersync' -snowflakeRole 'ACCOUNTADMIN' -snowflakeRegion 'ap-southeast-2' -ouIdentity 'OU=AsiaPacific,OU=Sales,OU=UserAccounts,DC=FABRIKAM,DC=COM' -createAnyMissingUsers $true

The script will retrieve all security groups and all the users from the OU you define, check whether they exist in Snowflake, and create any missing users without any password to enforce AD authentication. For each security group immediately within the OU, a role in Snowflake is matched or created and all Snowflake users are granted it. The process also goes the other way: users who have subsequently been removed from the AD group will be revoked from the corresponding role. Detailed documentation can be found in the GitHub project’s README file.

The following is example output from the script; it creates the missing roles and grants them to the correct persons. In this example you can also see a user being removed. In Snowflake the removal means disabling the user and revoking his or her roles – the user account itself is not deleted.

As you can see from the output, you can map the role names using prefixes, so that the AD group snowflake-role-ADMIN is mapped to ADROLE_ADMIN. This will also help you when creating the roles in Active Directory, as you can use the existing company naming standards.

** Retrieving list of current snowflake users
Total current Snowflake SSO users: 30
**Retrieving AD security groups in OU OU=Snowflake,OU=Solutions,DC=FI,DC=Solita,DC=int
. AD Group : snowflake-role-ADMIN
. Snowflake Role : ADROLE_ADMIN
Fetching users immediately in group whose accounts aren't disabled
Fetching users nested in non-snowflake groups under group , whose accounts aren't disabled
Fetching nested role group
. AD Group : snowflake-role-DEVELOPER
. Snowflake Role : ADROLE_DEVELOPER
Fetching users immediately in group whose accounts aren't disabled
Fetching users nested in non-snowflake groups under group , whose accounts aren't disabled
Fetching nested role group
. AD Group : snowflake-role-Scientist
. Snowflake Role : ADROLE_SCIENTIST
Fetching users immediately in group whose accounts aren't disabled
Fetching users nested in non-snowflake groups under group , whose accounts aren't disabled
Fetching nested role group
. AD Group : snowflake-role-Analyst
. Snowflake Role : ADROLE_ANALYST
Fetching users immediately in group whose accounts aren't disabled
Fetching users nested in non-snowflake groups under group , whose accounts aren't disabled
Fetching nested role group
Disabling 1 users as they have no roles mapped in AD
**Retrieving list of current snowflake roles
Checking membership of current roles
Checking role ADROLE_ADMIN exists in Snowflake
Role does not exist, creating
Checking role ADROLE_DEVELOPER exists in Snowflake
Checking role ADROLE_SCIENTIST exists in Snowflake
Checking role ADROLE_ANALYST exists in Snowflake
Role does not exist, creating
. Checking role ADROLE_ADMIN has the appropriate users
missing grantees: Mika.Heino@solita.fi
superfluous grantees:
. Checking role ADROLE_DEVELOPER has the appropriate users
missing grantees:
superfluous grantees:
. Checking role ADROLE_SCIENTIST has the appropriate users
missing grantees:
superfluous grantees:
. Checking role ADROLE_ANALYST has the appropriate users
missing grantees: pappa@solita.fi seppo@solita.fi
superfluous grantees:
. Checking role ADROLE_ADMIN has the appropriate roles
missing grantees:
superfluous grantees:
. Checking role ADROLE_DEVELOPER has the appropriate roles
missing grantees:
superfluous grantees:
. Checking role ADROLE_SCIENTIST has the appropriate roles
missing grantees:
superfluous grantees:
. Checking role ADROLE_ANALYST has the appropriate roles
missing grantees:
superfluous grantees:
Without the -WhatIf flag, the script will execute the following SQL Statement:
BEGIN TRANSACTION;
ALTER USER \"teppo@solita.fi\" SET DISABLED=TRUE;
CREATE ROLE \"ADROLE_ADMIN\";
CREATE ROLE \"ADROLE_ANALYST\";
GRANT ROLE \"ADROLE_ADMIN\" TO USER \"mika.heino@solita.fi\";
GRANT ROLE \"ADROLE_ANALYST\" TO USER \"pappa@solita.fi\";
GRANT ROLE \"ADROLE_ANALYST\" TO USER \"seppo@solita.fi\";
REVOKE ROLE \"ADROLE_ANALYST\" FROM USER \"teppo@solita.fi\";
COMMIT;

Final words

I encourage you to test this script a couple of times before leaving it to run by itself. I also want to highlight that although I’ve been using the script in production for a while, it comes without any support at all; it is a community-developed script, not an official Snowflake offering.

I also recommend that you keep a few Snowflake accounts with ACCOUNTADMIN rights, as the AD accounts do not have a password and you might get locked out in situations where SSO authentication does not work. Keep in mind, though, that as a general practice you should not have large numbers of users with ACCOUNTADMIN rights, as those accounts have the highest rights in Snowflake.

For more information:
https://github.com/jamesweakley/snoad
https://docs.snowflake.net/manuals/user-guide/admin-security-fed-auth.html
https://docs.microsoft.com/en-us/azure/active-directory/saas-apps/snowflake-tutorial

Visualizing location data with Excel 3D Map including a time dimension. 3D Map utility is very effective for simple and occasional use cases.

Visualizing location data by time – Excel 3D Map

Visualizing location data with Excel 3D Map including the time dimension. 3D Map utility is very effective for quick demonstrations.

The good old Microsoft Excel has an effective utility called 3D Maps. It’s great for simple geo-animations. My original use case was to visualize geographical master data for one of Solita’s customers.

See the demo video (00:35 min) created from OECD asylum seeker inflow data from years 2000-2017. The asylum data is not related to the customer project.

Next I will show how to create a 3D animation yourself.

Preparing the data for Excel 3D Map

I downloaded OECD asylum data from International Migration Database in CSV format.

After saving the CSV to my laptop I imported the data to Excel from Data > New Query > From file > From CSV.

Data after the Excel Power Query import. Asylum seeker data from the OECD database, years 2000–2017.

 

The data had only a Year column, so I needed to create a Date column to fulfill the 3D Map requirements. For example, the year 2018 became the date 31.12.2018.

The data set had lots of information about different types of asylum data in each country. For the demo I only took the Inflows of asylum seekers by nationality category and the Total for each country.

Excel Power Query editor. Added a custom column to convert the year to a date.

Creating a new tour in Excel 3D Map

You can create a new 3D map tour in Excel from Insert > 3D Map.

Launching Excel 3D Map. You can either create a new tour or open an existing one.

 

A tour contains multiple scenes. In my demo the tour had five scenes.

Transitions and animations between the scenes are automatic. You just zoom the map to the right position for each scene. When all the scenes in the tour are in place, the tour can be played or you can create a video file from it.

Excel 3D Map has relatively few editing options. This makes the use cases somewhat limited but keeps things simple.

Creating a 3D Map with a time dimension

This is what you see when entering the 3D Map editing view.

Tour editing view in Excel 3D Map. Scenes on the left. Country as the location. Number of asylums as the value. Date as the time. Color can be changed from Layer Options.

 

Here are the settings I adjusted in the panel on the right side of the screen.

Layer name. In the demo there’s only one layer. It would be possible to add multiple data layers per scene. You could have different data series for the number of asylums and number of immigrants.

Visualization type. There are five icons in the right panel from which you can choose. My selection was Region. The other options are Stacked Column, Clustered Column, Bubble and Heat Map.

Location. The column in the data that represents the location. I chose the Country column. It could also be an address or coordinates.

Value. The column in the data that represents the magnitude, height or intensity of the location. In this example it’s the number of asylum seekers in a specific year and country.

Time. It would be simple to show the asylum numbers in different countries for a single year. To make the demo more interesting, I added a timeline to animate the asylum numbers annually. From the clock icon, select Data stays until it is replaced to avoid showing cumulative numbers.

Layer options. I changed the color from blue to red.

Finally, you can set the scene transition duration by clicking the gear icon on a specific scene in the left panel.

For some scenes I added a Text Box from the top menu (the Ribbon) to explain what is happening.

Creating a video from the 3D Map tour

Click Create Video from the ribbon.

Select the quality. Background music often makes the video more vivid, so you can add music from Soundtrack Options. I downloaded the soundtrack from here.

The Value of MDM – 4 Drivers For Managing Your Data

The term MDM (Master Data Management) may often have a negative echo due to the price tag it presents. This blog post aims to clarify why MDM is important by breaking down the actual value of MDM from four different perspectives.

The term MDM may often have a negative echo, as it can be understood as yet another system to buy or as a very expensive and complex process to be implemented without clear tangible value or measurable ROI (Return on Investment). Both of those interpretations could not be more wrong if you really understand what you are doing.

When talking about MDM, the first thing to understand is that MDM is not only a system – it's a combination of roles, responsibilities and processes enabled by technology. Or, as I like to think of it, MDM is a way of thinking and of seeing data management in the big picture.

Secondly, it’s true that the value of MDM is hard to prove but it does not mean that there is no value at all. To understand better the concept of MDM you can read our previous blog post Master Data Management explained. But now, lets dig deeper into the four drivers for managing your data.

1. MDM reduces costs by optimizing and automating data management processes

One of the key features of master data is that it’s used in several systems all around the organization either for operational or analytical purposes. But how is the data managed in those systems? Is it created manually or is it integrated from somewhere else?

Every time a person creates a new data record or updates an existing record it takes time. Time is money. If a new customer record is needed for five different systems, e.g. marketing, prospects, CRM, financial, data warehouse, there is a huge difference in the time it takes and costs it generates if the update is done manually in each system or once in one system that is integrated with the others. Also if the creation or update process is very complex, e.g. if approval or data enrichment is required from different persons, it takes more time, which again generates more costs.

The example in Table 1 below highlights two different scenarios for master data maintenance (incl. creation, update, deletion) costs. In the first scenario the data maintenance is done only in one central system using a simplified process. The maintained data is automatically exported to other systems that need the same data through automatic integrations.

In the second scenario there aren't any automatic integrations, so the data maintenance is done separately in all systems that need the data. As these are operational systems, data maintenance often requires more or less complex processes. The time used for data maintenance includes the time required from all persons that take part in the process (e.g. creator, acceptor). The cost of the time is based on a calculation where an employee costs the employer about 50 € per hour, including salary and other costs.

Scenario 1: Data maintenance in one system, an optimized process and automated data integrations.
Scenario 2: Data maintenance in several systems, complex processes and manual data integrations.

                                             Scenario 1    Scenario 2
Systems where data maintenance is needed          1             10
Time used for one maintenance task              5 min         10 min
Cost of that time                              4,17 €        8,33 €
Data maintenance transactions in a year         1000          1000
Total cost in a year                           4 170 €      83 300 €

Table 1. Two scenarios for master data maintenance costs.
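
For the technically minded, the arithmetic behind Table 1 is easy to verify. A minimal Python sketch, using the 50 € hourly employee cost assumption from the text above:

# Reproduce the Table 1 figures. The 50 €/h employee cost and the
# transaction volumes come from the scenarios described above.
HOURLY_COST = 50.0
TRANSACTIONS_PER_YEAR = 1000

def yearly_cost(systems, minutes_per_task):
    cost_per_task = HOURLY_COST * minutes_per_task / 60
    return systems * TRANSACTIONS_PER_YEAR * cost_per_task

print(yearly_cost(systems=1, minutes_per_task=5))    # ~4 170 € (scenario 1)
print(yearly_cost(systems=10, minutes_per_task=10))  # ~83 300 € (scenario 2; the table rounds the per-task cost first)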

The example above does not consider the licensing and development costs of building the MDM capability and the integrations that support the automated data flows.

However, the example clearly shows that by minimizing the time you need for creating or updating a data record and the amount of manual work you need for the whole data management operation you can save a huge amount of money in the long run.

And imagine what other valuable work employees could be doing if they didn't have to spend time on complex data maintenance tasks in multiple systems.

On the other hand, complex data management tasks may also affect people's mindset about data maintenance itself. If all data maintenance tasks are really complicated, people tend to avoid them, which may result in poor data quality. Easy and intuitive processes mean happy employees who can concentrate on their actual work rather than spending time on data maintenance.

2. MDM reduces costs by decreasing the friction in the operative processes

One of the clearest indicators of poor master data management is bad data quality. If the data quality is bad, it tends to cause errors in operative processes such as sales, manufacturing, shipping, etc. If for example the same customer record is saved in several systems, do you know which system has the latest and most accurate information about the customer? If the data is managed separately in each system, the operative processes may utilize different data.

As an example, a customer company has bought a product and requested direct delivery to one of its plants. The delivery address is saved to the system where sales are handled. However, the shipping is done based on the data in the ERP system, where supply chain management is handled. If the customer data is not in sync between the sales and ERP systems, there is a risk that the delivery address is incorrect in the ERP. The product can be accidentally sent to the customer's headquarters instead of the plant to which they requested the delivery in the first place.

This kind of error obviously generates extra costs. First, the bad end result of the process needs to be fixed, e.g. the wrong delivery is re-delivered to the correct address. Second, the bad data needs to be fixed in order to prevent the error from happening again. Let's say that fixing the error costs about 500 €, including the time the employees use, re-delivering the product, compensation for the client, etc. If this is a one-time thing, it's not a big issue and the cost can be handled. But if the root cause is not fixed, it is most likely that the error will be repeated. If similar errors occur, let's say, 100 times a year, the costs are already 50 000 €.

In addition to the direct costs mentioned above, there are also indirect costs, which are often misjudged. For example, wrong deliveries may affect the customer's attitude so that the next time they need something, they will turn to some other provider who can handle the delivery to the correct address. Additionally, bad experiences are often shared, and in the worst-case scenario other customers or prospects may also begin to doubt the trustworthiness of the company and turn to competitors. At the end of the day, this may have a major impact on sales and income.

By managing the master data, refining the data management processes and utilizing suitable technologies to support them, you can decrease the amount of errors, i.e. friction, in the operative processes caused by data issues.

For example, if the customer master data is managed only in one system and then automatically integrated to other systems utilizing that data, you can be sure that the data is in sync between the systems.

3. MDM enhances decision making by providing better visibility on high-quality data

Nowadays, business decisions are usually based on data. It can be public data which is available from news, publications, statistics, social media, etc. or it can be analysis made from operational processes (sales, deliveries, invoicing, etc.).

When talking about decisions, it's very important that one can trust the data on which the decisions are based. And more importantly, one should be able to connect the data to other important data domains in order to see the bigger picture. The most important domains to connect with are usually master data. For example, what do you do with sales data if you cannot connect it to a customer, a product or a certain area, e.g. a country?

When master data management is handled poorly, even simple reports can be difficult to build reliably. For example, if you want a report including all customers, do you know in which systems the customer data is stored? And how do you know in which system the data is most reliable? Again, if you want a report showing which products or product types the customers have bought, are you able to reliably combine the different data domains with some key information? Or do you have visibility into the life cycle and validity of customer, product or other master data? Can you actually trust the data?

It’s true that in traditional data warehouse you can pretty much combine any data that is wanted as long as you find the correlation between the data objects. You can also build simple automatic match & merge logic based on decisions which sources are more trustworthy than others. However, data warehouse is used mainly for analytical purposes and it doesn’t take into account the data management aspect and the actual user who creates, updates or validates the data. Data warehouse does not either fix the data quality.

What if you could focus your efforts on fixing the root problem instead of the symptoms, and actually enhance and simplify the processes related to creating and updating the data?

With MDM you can harness data management to enhance the quality of the data. The master data creation and update processes can be simplified so that it's easy and pleasant to keep the master data updated. Depending on the architectural approach, there can still be various sources for the master data if that is the best fit for the business process, but nevertheless the management is optimized and supports the business processes and the end-user experience. Overlapping processes are removed, and it's clear who updates what data and in which system. This leads to a situation where data quality issues can be pointed out and fixed at the root, and there is no need to extinguish the small fires that would jeopardize the data quality used for decision making.

Many organizations are also thinking about undertaking big data, data science and artificial intelligence (AI) projects in order to enhance their business and gain competitive advantage. However, every data science and AI project (as well as every traditional analytics project) is only as good as the data used, and the true value comes only when one is able to connect the data to the master data domains. If the data is of poor quality and the connections to master data are not reliable, the results and the achieved advantage are poor as well.

With MDM one can create a rock solid foundation for the business critical data which enhances all the data ventures and thus enables decision making, value creation and even new business opportunities by providing better visibility on the business through data.

The actual value that MDM brings to decision making is very hard to calculate, as the benefits come indirectly and depend on how well the organization uses the MDM ecosystem. The more the valuable high-quality data is used, the greater the benefit. For example, by offering valuable master data for machine learning purposes one can save money by optimizing maintenance processes based on production data. Or one can find cross-selling opportunities for existing customers and create a whole new business.

The decisions made are as good as the data. Hence, with high-quality master data one can enable high-quality decisions. The actual benefits can be from tens of thousands of euros to millions. 

4. MDM helps to comply with regulations by creating a solid foundation for data management

The fourth aspect of the value of MDM is the requirements that different regulations set for data and data management. Such regulations are, for example, the General Data Protection Regulation (GDPR) in the EU and the International Financial Reporting Standards (IFRS). If organizations fail to comply with those requirements, they may end up paying millions of euros in fines.

As an example, the key requirements that GDPR sets for personal data processing are transparent data processing, limitations based on purpose, data subject rights, consent for data processing, notifications for breaches, privacy by design, data protection impact assessment, data transfer protection, assigning a data protection officer, and increasing the awareness and training of GDPR among the employees [1]. Hence, organizations need to have strict control of the technologies and processes that they use to collect, manage and share information about their customers, employees, suppliers and other parties that they do business with [2].

MDM helps organizations achieve the control over data processing required by regulations such as GDPR. When MDM is implemented at the core of data management, it's clear who has access to the master data, how it can be modified, where it flows, what the lifecycle of the data is, etc. In addition, with an MDM system one can connect separate data silos into one centralized repository, providing full visibility into the master data, e.g. customer or employee data. Also, technical requirements such as privacy by design can be implemented directly in modern MDM tools to support the regulations.

Even though MDM cannot fully fulfill all the requirements that different regulations, such as GDPR, set for data processing, it creates a strong basis for data management and can make it significantly easier to achieve compliance.

Hence, MDM saves time and money, as one doesn't have to start over every time new regulations are introduced. And it can potentially save millions of euros by avoiding the huge fines for non-compliance.

Conclusions

Above we went through four different perspectives on why MDM is important. Of course, when thinking of the value, and especially the ROI, in the big picture, one also needs to consider the investment that goes, for example, into licensing and development costs when building the MDM capability and the integrations that support automated data flows.

As with many data-related projects, the major costs of MDM projects are usually realized at the very beginning of the MDM journey. Most costs are related to setting up the MDM capability, building the integrations, handling the change management, and so on. The benefits and savings, however, come over time, once the master data is under control. This means that calculating a short-term ROI is difficult. However, a longer time frame, let's say 3-5 years, should be enough to get a reasonable ROI in any master data project.

Even though MDM has clear advantages, one should still remember that large-scale MDM is not for everyone. Especially small organizations can handle their master data without launching separate MDM projects. However, when the organization gets bigger and the architecture and ecosystem get more complex, one needs to start looking into how the data is actually managed. Otherwise, one will end up running the business with poor data and inefficient processes, and will surely fall behind in the race of digitalization.

By correctly deploying a continuous MDM capability, one can create a supporting ecosystem of people, processes and technologies which enables data initiatives by bringing structure, visibility and endurance to data management.

References

[1] Bhatia, P. 2019. A summary of 10 key GDPR requirements. Advisera Expert Solutions Ltd. Available: https://advisera.com/eugdpracademy/knowledgebase/a-summary-of-10-key-gdpr-requirements/

[2] Assefa, B. 2018. Master Data Management for Achieving GDPR Compliance. DATAVERSITY Education, LLC. Available: https://www.dataversity.net/master-data-management-achieving-gdpr-compliance/#

Master Data Management explained

In this blog series we dive into the world of master data management and explain what it is, what its future trends are, and why it is so important.

Imagine sending a confidential email to the wrong address, having a factory standing idle due to a malfunction, or even crashing a rocket into a planet instead of safely reaching its orbit. All of these cases are real-life examples caused by poor master data quality. In the first case the problem was customer data containing an erroneous address for a person. The second case was caused by poor master data regarding the factory maintenance schedule. The third case happened due to invalid reference data about the metrics used. The saying "the devil is in the details" applies really well in the world of master data. Seemingly small issues will quickly grow into major problems impacting the whole business. Most of these data quality related issues could have been easily avoided with proper master data management.

But what exactly is master data and how should it be managed? This blog series dives into the world of master data management: what it is, why it is important, and how it should be executed. But before going too deep into technical detail, let's start by defining what could be considered master data. There are various definitions for the term, and none of them is absolute. One definition could be that master data is the data describing the most relevant business entities, on which the activities of an organization are based. But what are those entities?

Each industry has its own master data requirements and characteristics, but a total of five common core master data domains can be identified:

  1. Product (What): All master data related to e.g. products, services, or assets offered or owned by the organization. This domain can vary a lot depending on the industry.
  2. Party (Who): Domain consisting of all business partner related master data, such as customers, suppliers, distributors, employees or citizens. A party can be either a person or an organization, and contains e.g. name, party identifier and contact details.
  3. Location (Where): Master data regarding places, sites and regions. Sales territories, cities, offices and production facilities are examples of locations.
  4. Account (How): Explains how parties are related to each other and to the things offered by organizations. It contains data such as accounts, financials, agreements, contracts and other relationships between entities.
  5. Calendar (When): Time domain of master data containing e.g. validity periods and schedules associated with e.g. creation, marketing, shipping, and obsolescence of products. This is the most ambiguous domain since the time dimension can also be seen as part of the life-cycle management of other domains.

Master data also has some special characteristics compared to other types of data:

  • Master data always describes the basic characteristics of entities in the real world.
  • Master data has a relatively long lifecycle.
  • Transaction volumes do not directly affect the amount of master data.
  • Master data can exist without transaction data, but not vice versa.

Now that there is a clear picture of what master data is, we can define what master data management is. There are plenty of different definitions for the term master data management (MDM). I would define MDM as a method of enabling an organization to create a unified view of its core data. The focus of MDM is to create an integrated, accurate, timely, and complete set of data needed to manage and grow the business. The goal of master data management can be summarized into these three objectives [1]:

  1. Provide a consistent understanding and trust of master data entities
  2. Provide a mechanism for consistent use of master data across the organization
  3. Accommodate and manage change

The Zachman Framework for Enterprise Architecture [2] defines six categories of enterprise architecture components. Interestingly, the core master data domains share the similar categories of Who, What, Where, How and When. The remaining category is "Why", and it should definitely be taken into consideration when thinking about master data management. The "Why" dimension can be seen as the reasoning behind the whole master data management initiative. This is where most MDM initiatives fail when the business is not involved. That being said, the next part of the MDM blog series will explore the "Why" of master data management further.

“Without a systematic way to start and keep data clean, bad data will happen.” — Donato Diorio

REFERENCES:

[1] Dreibelbis, A., Hechler, E., Milman, I., Oberhofer, M., van Run, P., & Wolfson, D. (2008). Enterprise master data management: an SOA approach to managing core information. Pearson Education.

[2] Zachman, J. (2006). The Zachman Framework for Enterprise Architecture (pp. 1-15). Virginia: Zachman Framework Associates.

Finnish stemming and lemmatization in python

Finnish stemming and lemmatization in python for text analytics.

There are plenty of options for natural language processing in English. For small languages like Finnish it is a different story. Not all solutions are easy to find.

In this blog I deal with stemming and lemmatization in the Finnish language. The examples are written in python 3.6.

Difference between stemming and lemmatization

Transforming a word to a generalized format is helpful in many applications of text analysis. This is because words like cat and cats mean almost the same thing.

Lemmatization can be defined as converting words to their base forms. After the conversion, the different “versions” of a word such as cat, cats, cat’s or cats’ would all be simply cat.

Stemming is the other option for converting words to a general format. Stemming is not exactly the same operation as base form conversion: instead of producing real base forms, a stemmer cuts words down to their stems using rules derived from the structure of the language. More about stemming on Wikipedia.

Here is a simple example about the difference between lemmatization and stemming.

Original word   Lemmatized word   Stemmed word
Study           Study             Study
Studies         Study             Studi
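
If you want to try the English example above yourself, here is a minimal sketch with nltk. Note that the WordNet lemmatizer requires downloading the wordnet data first, and exact outputs can vary between stemmer implementations:

#Demonstrate the difference between lemmatization and stemming in English
#Requires nltk and its wordnet data: nltk.download('wordnet')
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

#"studies" should lemmatize to "study" and stem to "studi"
print(lemmatizer.lemmatize("studies"))
print(stemmer.stem("studies"))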

More focus is put on lemmatization in this article. This is because Finnish lemmatization libraries were more difficult to find.

Finnish lemmatization with voikko python library

In the GitHub page Voikko describes the use cases for the library:

“Libvoikko provides spell checking, hyphenation, grammar checking and morphological analysis for Finnish language.”

It took some trial and error to find proper installation instructions for python. Instead of using python's pip package installer, the following line worked for me on Linux. For Windows users I recommend installing the Ubuntu subsystem for Windows.

sudo apt install -y voikko-fi python-libvoikko

After installation the libvoikko library can be imported to python scripts as usual. Here is an example how to lemmatize a single Finnish word to its base form with python.

#Import the Voikko library
import libvoikko

#Define a Voikko class for Finnish
v = libvoikko.Voikko(u"fi")

#A word that might or might not be in base form
#Finnish word "kissoja" means "cats" in English
word = "kissoja"

#Analyze the word
voikko_dict = v.analyze(word)

#Extract the base form, as the
#analyze() function returns various info about the word
word_baseform = voikko_dict[0]['BASEFORM']

#Print the base form of the word
#This should print "kissa", which is "cat" in English
print(word_baseform)

Finnish sentence lemmatization in python

Often you would like to perform the base form conversion for a block of text or for a sentence. To achieve this, you should first split the long text into a list of words. Then you can apply Voikko's analyze() function to each of them. Word splitting is called word tokenization.

There are different ways of doing tokenization depending on your objective. Sometimes commas, dashes and upper case letters matter, sometimes not.

The Python package nltk provides a tokenization module built for English, which works for Finnish in most cases. Instead of using it, I wrote my own simple tokenization script to demonstrate base form conversion for multiple sentences.

#Import the Voikko library
import libvoikko

#Define a Voikko class for Finnish
v = libvoikko.Voikko(u"fi")

#Some Finnish text
txt = "Tähän jotain suomenkielistä tekstiä. Väärinkirjoitettu yhdys-sana, pahus."

#Pre-process the text
txt = txt.lower().replace(".", "").replace(",", "")

#Split to list by space character
word_list = txt.split(" ")

#Initialize a list for base form words
bf_list = []

#Loop all words in the list
for w in word_list:
  
  #Analyze the word with voikko
  voikko_dict = v.analyze(w)
  
  #Extract the base form, if the word is recognized
  if voikko_dict:
    bf_word = voikko_dict[0]['BASEFORM']
  #If word is not recognized, add the original word
  else:
    bf_word = w
  
  #Append to the list
  bf_list.append(bf_word)
  
#Print results
print("Original:")
print(word_list)
print("Lemmatized:")
print(bf_list)

Finnish stemming with python

The nltk package provides stemming for the Finnish language here.

And here are some Finnish stemming examples.

#Import nltk Snowball stemmer
from nltk.stem.snowball import SnowballStemmer

#Create a Finnish instance
stemmer = SnowballStemmer("finnish")

#Print the stemmed version of some Finnish word
print(stemmer.stem("koiriemme"))

As you can see, the nltk stemmer is extremely easy to use. Another advantage is that with very little code you can reuse the same script for other languages.

Summary – Lemmatization and stemming in Finnish

This blog offered you simple and concrete examples to lemmatize and stem Finnish words in python. Hopefully this gets you started with your text mining project.

There is no absolute truth about whether you should use stemming or lemmatization. One rule of thumb is that lemmatization preserves more semantics and its results are easier to understand, while stemming is simpler, faster and generalizes to other languages with little effort.

Now harness your creativity and try it yourself!

Photo by Luca Bravo on Unsplash

Greetings from the Bay Area – IBM Think 2019 Part 2

IBM Think 2019 contained several very interesting sessions, as described earlier. Several product and feature launches aroused great interest in event participants, and we share the best bits with you!

To start with an important side note: IBM showed a disclaimer in every presentation stating that the new features may or may not be implemented in upcoming releases, and they did not want to promise any hard dates for them (we need to say the same in this blog post because we do not want to make promises that are too big…).

IBM presented bringing the full power of AI to business intelligence. What does that mean in practice? Cognos Analytics storytelling, recommendations, automated visualizations, natural language questions, reusable content and advanced analytics. IBM is putting a lot of effort into making analytics, data science and machine learning more accessible and easier to use for its customers.

We had a chance to test the latest version of Cognos and its Exploration tool. Users can get advanced analytics insights and contextual recommendations about their data in a very easy way (picture below). Exploration works with relational data sets, and you can test it easily with existing data or make a new data set with the data module tool (a data modelling and preparation tool) inside Cognos Analytics.

View of Cognos 11.1 Explorer – natural language question on the right and its recommendations and analytical insights on the left.

There are a lot of new features and improvements in the new Cognos Analytics versions 11.0.13 and 11.1. These include, for example, several dashboarding improvements, the data module supporting relative dates, aggregation roll-ups, security, reporting navigation and usability improvements, as well as copy & paste (which actually works), enabling visualizations and other objects to be copied from Exploration views or dashboards to reports, or the other way around.

How about news on Planning Analytics? To start with, IBM Planning Analytics is a planning, budgeting and forecasting solution which runs on IBM Cloud as a service or can be installed on an organization's own on-premises server. IBM stated that they are now investing heavily in Planning Analytics Cloud. The development activities focus on end-user tools such as Planning Analytics for Excel and Planning Analytics Workspace and on the interoperability between these two components. Concretely, this means that IBM is publishing a new version of Workspace almost every month, containing new features and nitty-gritty fixes. Although IBM is talking a lot about the new components of Planning Analytics, it was nice to notice that the good old developer tools such as Architect will remain under the hood of Planning Analytics. There is no short-term plan to remove Architect, but our assumption is that eventually all the modelling will be done in PAW.

One of the biggest new features in Planning Analytics for Excel (PAx) and Planning Analytics Workspace (PAW) is alternative hierarchies. The alternative hierarchies feature was already introduced in Workspace earlier, but during the conference we learned that it will be supported in Exploration views in Planning Analytics for Excel as well. In addition, as mentioned earlier, the interoperability between Planning Analytics for Excel and Workspace will be enhanced. For example, PAx has always been able to open and save views from the TM1 database, and in the future views can also be opened from the PA content store inside Planning Analytics Workspace. Users can, for example, create their own calculations in PAx and publish them in Workspace. Moreover, PAW will support action buttons with which users can trigger parameterized TI scripts. Finally, on the TM1 server side, the biggest improvement is integration with Git, which enables better version control.

Photo by Luca Bravo on Unsplash

We went through a lot of technical deep-dive sessions, but there were also interesting customer case introductions such as Deutsche Bahn, Uber, Montage International and Theravance Biopharma. Maybe the most exciting one was related to Deutsche Bahn's TM1 model, which may be one of the world's largest TM1 implementations. For example, they have over 6 000 users and 8 servers boosted with 72 cores and 2 TB of RAM per machine, and they receive over 50 million records monthly. It was pretty interesting to hear how they have managed to build such a solution and use it as a company-wide performance management solution.

Interested to continue the discussion? Come talk to us at the Chief of Finance event, call us at +358 29 170 3300 or mail us at contact(at)solita.fi!

Building machine learning models with AWS SageMaker

A small group of Solita employees visited the AWS London office last November and participated in a workshop. There we got to know the AWS service called SageMaker. SageMaker turned out to be easy to learn and use, and in this blog post I'm going to tell you more about it and demonstrate how it works with short code snippets.

AWS SageMaker

SageMaker is an Amazon service that was designed to build, train and deploy machine learning models easily. For each step there are tools and functions that make the development process faster. All the work can be done in Jupyter Notebook, which has pre-installed packages and libraries such as Tensorflow and pandas. One can easily access data in their S3 buckets from SageMaker notebooks, too. SageMaker provides multiple example notebooks so that getting started is very easy. I introduce more information about different parts of SageMaker in this blog post and the picture below summarises how they work together with different AWS services.

Picture of how SageMaker interacts with other AWS services during the build, train and deploy phases.

Dataset

In the example snippets I use the sign language MNIST dataset, which contains labeled pictures of alphabet letters in sign language. They are 28×28 grey-scale pictures, which means each pixel is represented as an integer value between 0 and 255. The training data contains 27 455 pictures and the test data 7 127 pictures, and they're stored in S3.

For importing and exploring the dataset I simply use the pandas library. Pandas is able to read data straight from an S3 bucket:

import pandas as pd

bucket = ''
file_name = 'data-file.csv'

data_location = 's3://{}/{}'.format(bucket, file_name)

df = pd.read_csv(data_location)

From the dataset I can see that the first column is a label for the picture, and the remaining 784 columns are pixels. By reshaping the first row I can plot the first image:

from matplotlib import pyplot as plt
pic=df.head(1).values[0][1:].reshape((28,28))

plt.imshow(pic, cmap='gray')

plt.show()

Image of the letter D in sign language.

Build

The build phase in AWS SageMaker means exploring and cleaning the data. Keeping it in CSV format would require some changes to the data if we'd like to use SageMaker's built-in algorithms. Instead, we'll convert the data into RecordIO protobuf format, which makes the built-in algorithms more efficient and simpler to train the model with. This can be done with the following code, and it should be done for both the training and the test data:

from sagemaker.amazon.common import write_numpy_to_dense_tensor
import io
import boto3

def convert_and_upload(pixs, labels, bucket_name, data_file):
	buf = io.BytesIO()
	write_numpy_to_dense_tensor(buf, pixs, labels)
	buf.seek(0)

	boto3.resource('s3').Bucket(bucket_name).Object(data_file).upload_fileobj(buf)

pixels_train=df.drop('label', axis=1).values
labels_train=df['label'].values

convert_and_upload(pixels_train, labels_train, bucket, 'sign_mnist_train_rec')

Of course, in this case the data is already very clean, and usually a lot more work is needed to explore and clean it properly before it can be used to train a model. The data can also be uploaded back to S3 after the cleaning phase, for example if cleaning and training are kept in separate notebooks. Unfortunately, SageMaker doesn't provide tools for exploring and cleaning data, but pandas is very useful for that.

Train

Now that the data is cleaned, we can either use SageMaker's built-in algorithms or bring our own, provided by for example sklearn. When using something other than the built-in algorithms, you would have to provide a Docker container for the training and validation tasks. More information about this can be found in the SageMaker documentation. In this case, as we want to recognise letters from the pictures, we use the k-Nearest Neighbors algorithm, which is a simple and fast algorithm for classification tasks. It is one of the built-in algorithms in SageMaker and can be used with very few lines of code:

#Imports for the SageMaker Python SDK helpers used below
#(assuming the v1 SDK, whose syntax this post follows)
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

knn=sagemaker.estimator.Estimator(get_image_uri(
	boto3.Session().region_name, "knn"),
	get_execution_role(),
	train_instance_count=1,
	train_instance_type='ml.m4.xlarge',
	output_path='s3://{}/output'.format(bucket),
	sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(**{
	'k': 10,
	'predictor_type': 'classifier',
	'feature_dim': 784,
	'sample_size': 27455
})

in_config_test = sagemaker.s3_input(
	   s3_data='s3://{}/{}'.format(bucket,'sign_mnist_test_rec'))

in_config_train = sagemaker.s3_input(
	   s3_data='s3://{}/{}'.format(bucket,'sign_mnist_train_rec'))

knn.fit({'train':in_config_train, 'test': in_config_test})

So let’s get into what happens there. Estimator is an interface for creating training tasks in SageMaker. We simply tell it which algorithm we want to use, how many ML instances we want for training, which type of instances they should be and where the trained model should be stored.

Next we define the hyperparameters for the algorithm, in this case the k-Nearest Neighbors classifier. Instead of the classifier we could have a regressor for some other type of machine learning task. The four parameters shown in the snippet are mandatory, and the training job will fail without them. By tuning the hyperparameters the accuracy of the model can be improved. SageMaker also provides automated hyperparameter tuning, but we won't be using it in this example.

Finally we need to define the path to the training data. We do it by using Channels which are just named input sources for training algorithms. In this case as our data is in S3, we use s3_input class. Only the train channel is required, but if a test channel is given, too, the training job also measures the accuracy of the resulting model. In this case I provided both.

For the kNN algorithm the only allowed data types are the RecordIO protobuf and CSV formats. If we were to use the CSV format, we would need to declare it in the channel configuration by setting the named parameter content_type to 'text/csv;label_size=0'. As we use the RecordIO protobuf type, only the s3_data parameter is mandatory. There are also optional parameters, for example for shuffling the data and for defining whether the whole dataset should be replicated in every instance. When the fit function is called, SageMaker creates a new training job and logs the training progress and duration into the notebook. Past training jobs with their details can be found by selecting 'Training jobs' in the SageMaker side panel. There you can see the given training/test data locations and find information about model accuracy and the logs of the training job.
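
As a concrete illustration, a CSV train channel could look roughly like this. This is a sketch assuming the same v1 SDK syntax as above; the file name is hypothetical and the content_type value is the one mentioned in the text:

#Sketch: a train channel pointing at CSV data instead of RecordIO protobuf.
#The file name is an assumption.
in_config_train_csv = sagemaker.s3_input(
	   s3_data='s3://{}/{}'.format(bucket, 'sign_mnist_train.csv'),
	   content_type='text/csv;label_size=0')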

Deploy

The last step on our way to getting predictions from the trained model is to set up an endpoint for it. This means that we automatically set up an endpoint for real-time predictions and deploy the trained model to it. This will create a new EC2 instance which takes data as input and provides predictions as a response. The following code is all that is needed for creating an endpoint and deploying the model to it:

import time

def get_predictor(knn_estimator, estimator_name, instance_type, endpoint_name=None): 
    knn_predictor = knn_estimator.deploy(initial_instance_count=1, instance_type=instance_type,
                                        endpoint_name=endpoint_name)
    knn_predictor.content_type = 'text/csv'
    return knn_predictor


instance_type = 'ml.m5.xlarge'
model_name = 'knn_%s'% instance_type
endpoint_name = 'knn-ml-%s'% (str(time.time()).replace('.','-'))
predictor = get_predictor(knn, model_name, instance_type, endpoint_name=endpoint_name)

and it can be called for example in the following way:

file = open("path_to_test_file.csv","rb")

predictor.predict(file)

which would return the following response:

b'{"predictions": [{"predicted_label": 6.0}, {"predicted_label": 3.0}, {"predicted_label": 21.0}, {"predicted_label": 0.0}, {"predicted_label": 3.0}]}'

In that case we got five predictions, because the input file contains five pictures. In a real-life case we could use API Gateway and a Lambda function to provide an interface for real-time predictions. The Lambda function can use the boto3 library to connect to the created endpoint and fetch a prediction. In API Gateway we can set up an API that calls the Lambda function once it gets a POST request, and returns the prediction in the response.
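
As a rough sketch of that setup, the Lambda handler could look something like the following. The endpoint name here is a placeholder, and the event format assumes API Gateway's proxy integration:

import boto3

#Client for invoking SageMaker endpoints from Lambda
runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    #API Gateway proxy integration delivers the POST body (CSV rows) in event['body']
    response = runtime.invoke_endpoint(
        EndpointName='knn-ml-example',  #placeholder; use your own endpoint name
        ContentType='text/csv',
        Body=event['body'])
    #Pass the endpoint's prediction JSON back to API Gateway
    return {'statusCode': 200, 'body': response['Body'].read().decode('utf-8')}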

Conclusions

AWS SageMaker is a very promising service that allows reading data, training a model and deploying an endpoint with less than a hundred lines of code. It provides many good functions for training but also allows using Docker for custom training jobs. Jupyter Notebook is a familiar tool to data scientists, so it's very nice that it is used in SageMaker. SageMaker also integrates easily with other AWS services, and allocating resources for training and endpoints is simple. The machine learning algorithms are optimised for AWS, so their performance is very high.

The amount of code needed for training a model is not the biggest challenge in a data scientist's everyday job, though. There are already very good libraries for that purpose, and one of the most time-consuming parts is usually cleaning and altering the data so that it can be used for training. SageMaker doesn't provide much help for that.

All in all, optimised algorithms, automated hyperparameter tuning, easy integration and interaction with other AWS services save a lot of time and trouble for data scientists. Trying out SageMaker is definitely worthwhile.
Greetings from the Bay Area – IBM Think 2019 Part 1

Our Solita crew participated in IBM Think held in San Francisco in February. IBM Think is an annual technology and business conference, where the latest technology trends and new product releases from IBM are introduced.

IBM Think in San Francisco was a huge technology event with approximately 27 000 attendees and thousands of different sessions, presentations and keynotes held in different venues around the city.

Due to the size of the conference we wanted to focus on certain key areas: AI, machine learning and analytics. There were about 500 data and analytics presentations to choose from. Topics covered areas such as data science, AI, business and planning analytics, hybrid data management, governance and integration. IBM Cloud Private for Data alone had 18 sessions where this new product was presented.

Solita has strong expertise in the area of analytics

Solita has strong expertise in the area of analytics (Cognos Analytics & Planning Analytics), and we wanted to strengthen our competence and learn about upcoming releases of those products. We had a chance to meet IBM's offering management, discuss new features and give feedback. There were also several hands-on labs where one could test upcoming product features.

Although Planning Analytics (PA) was a bit of a sidekick compared to buzzwords like AI and blockchain, the PA sessions provided good information about new features and ongoing development. In addition, there were several client presentations providing insights into their CPM solutions. Interestingly, many of those presentations were still focusing on TM1 technology and not on Planning Analytics, even though TM1 support will end on the 30th of September 2019.

AI and data science were strongly present on the IBM Think agenda. Success stories of AI implementations were told for example by Carrefour (a retail chain that wanted to optimize existing and new supermarket investment decisions), Nedbank (a bank that used predictive maintenance to optimize ATM services), Red Eléctrica de España (an electrical company that wanted to predict generation and optimize production) and Daimler (a truck manufacturer using AI to comprehend the complexity of product configurations).

Also AI project best practices were shared in many of the sessions.

Also AI project best practices were shared in many of the sessions. Best practices included starting with a quick-win use case to gain buy-in from management and business, having a business sponsor for the project, measuring clear KPIs and business impact, ensuring good quality data, creating effective teams, choosing the right tools, etc. These are all principles we definitely agree on and that are already implemented in Solita data projects.

What else did we learn in IBM Think 2019? Deep dive into learnings coming up!

A data scientist’s abc to AI ethics, part 2 – popular opinions about AI

In this series of posts I’ll try to paint the borderline between AI and ethics from a bit more analytical and technically oriented perspective. Here I start to examine how AI is perceived, and how we may start to analyze ethical agency.

Multiple images

In everyday use, AI can mean almost anything from 3D apps to evil sci-fi characters. It's a bit of a burden that AI is associated with the Terminator, for instance. Or that the words deep learning might receive god-like overtones in marketing materials.

Let’s go on with some AI related examples. On a PowerPoint slide, AI might be viewed as an economic force. For yet another example, we could look at AI regulation.

Say a society wants to regulate corporate action, or set limits to war damage with weapon treaties. Likewise, core AI activities might need legal limits and best practices. Like, how to make automatic decisions fair. My colleague Lassi wrote a nice recap about this also from an AI ethics perspective.

Now in my view, new technology won’t relieve humans from ethics or moral responsibility. Public attention will still be needed. Like Thomas Carlyle suggested, publicity has some corrective potential. It forces institutions to tackle their latent issues and ethical blind spots. Just like public reporting helps to keep corporate and government actions in check.

Then one very interesting phenomenon, at least from an analytical perspective, are people’s attitudes towards machines.

Especially in connection to ethics, it is relevant how we tend to personify things. Even while we consciously view a machine as dumb, we might transfer some ethical and moral agency to it.

A good example is my eight-year-old son, who anticipated a new friend in the Lego Boost robot. Even I harbor a level of hate towards Samsung's Bixby™ assistant. Mine is a moral feeling too.

These attitudes can be measured to a certain extent, in order to improve some models. I'll touch on this a bit later.

Perceived moral agency

There is a new analytical concept that describes machines and us, us with machines. This concept of perceived moral agency describes how different actors are viewed.

Let’s say we see a bot make a decision. We may view it as beneficial or harmful, as ethical or unethical. We might harbor a simple question whether the bot has morals or not. A researcher may also ask how much morals the bot was perceived to have.

Here we have two levels of viewing the same thing: a question about how much a machine resembles humans, and then a less immediate one about how it is perceived in society.

I think that in the bigger picture we make chains of moral attribution, like in my Bixby case. My moral emotion is conveyed towards Samsung the company, even if my immediate feelings were triggered by Bixby the product. I attribute moral responsibility to a company, seeing a kind of secondary cause for my immediate reactions. The same kind of thing occurs when we say that the government is responsible for air pollution, for instance.

What’s more to the point, these attributive chains apply to human professionals too. An IT manager or a doctor is bound by professional ethics. Their profession in turn is bound by the consensus within that group. If a doctor’s actions are perceived as standard protocol, it is hard to see them as personal ethics or lack of it.

Design and social engineering

Medical decision assistants and other end products are the result of dozens of design choices. And sometimes design choices, if not downright misleading, deliberately support illusions.

For instance, an emotional reaction from a chat bot. It might create an illusion that the bot “decides” to do something. We may see a bot as willing or not willing to help. This choice may even be real in some sense. A bot was given a few alternative paths of action, and it did something.

Now what is not immediately clear are a bot's underlying restrictions. We might see a face with human-like emotions, and then assume human emotional complexity behind the facade.

Chat bots and alike illustrate the idea of social engineering. What it means is that a technical solution is designed to be easy to assimilate. If a machine exploits cultural stereotypes and roles in a smart way, it might get very far with relatively little intelligence.

A classic example is the therapist bot ELIZA from the 1960s. Users would interact with it via a text prompt, and ELIZA would respond quite promptly to their comments. Maybe it asked its "patient" to tell a bit more about their mother. It didn't actually understand any sentence meanings, but it was designed to react in a grammatically correct way. As the reports go, some of the users even formed an addictive relationship with it.

The central piece of social engineering was to model ELIZA as a psychotherapist. This role aided ELIZA in directing user attention, and it might also have kept users from sizing up ELIZA and its limitations. To read more about ELIZA, you may start from its Wikipedia page.
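
As a toy illustration of the idea (this is nothing like ELIZA's actual script, just a sketch of keyword-based responding), a few lines of Python are enough to produce grammatically plausible replies without any understanding of meaning:

import re

#Keyword patterns paired with response templates; {0} echoes a captured phrase
rules = [
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your mother."),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
]

def respond(text):
    for pattern, template in rules:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return "Please, go on."

print(respond("I am anxious about work"))
#Prints: Why do you say you are anxious about work?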

Engagement and management

ELIZA of course was quite harmless. For toys, it is even desirable to entice the imagination. A human facade can create positive commitment in the user. This type of thing is called engagement in web marketing.

On the other hand, social engineering is hard work and not always rewarding. An interesting related tweet came from a game scriptwriter.

The scriptwriter wished players to immerse themselves in her games and have profound emotional experiences. In her day-to-day work she had noticed a constant toil with her characters. Was this need for detail even bigger than, say, in a novel? Yes, she suggested.

The scriptwriter also analyzed this a bit. She noticed that repetitive out-of-context action is likely to distance a user. What’s more, it is also very likely to occur when prolonged interaction is available.

I’m tempted to think that these are the two sides of engaging a user. The catch and the aftermath.

As far as modeling and computational perspective go, another significant theme is the nature of automatic decisions.

The most relevant questions are these: How is the world modeled from the decision-making agent's perspective? What kind of background work does it require? How about management? What kind of data does the agent consume? How do we control data quality?

These will get a bit more detail in my next post. Stay tuned, and thanks for reading!

This is the second post about AI and ethics, in a series of four.

A Data Catalog can be the foundation for your data democracy – if you think of it being more than just a catalog

The hype around data catalog software is justified, but the term "data catalog" is misleading and often misunderstood. We should talk about data libraries, i.e. combinations of software, new ways of working and user experience, all aiming to drive data democratization.

I love libraries.

In my younger days, I spent a lot of time in libraries. Many associate libraries with books, but for me libraries are also about music. In libraries, I explored music that would greatly shape my personality and my own approach to music as my personal passion. This was all before music was digitized, so I had to acquire music in physical form.

For me, libraries were a perfectly designed service for music discovery:

  • All products were indexed and easy to find
  • Music magazines were adjacently available for further context of the product
  • You could preview (listen to) the products right there and then
  • The product offering was actively curated by knowledgeable professionals
  • The service was public and available to anyone who had an interest.

I was very curious about non-mainstream music, and libraries helped me find artists like Tom Waits, Nick Cave and Pavement, who have all since meant much to me. Had I relied only on my own social tribe (friends and schoolmates), I would never have discovered these artists. This is why I will always be thankful for the existence of libraries.

The Hype Around Data Catalog Software

Much has been said and written about data catalog software in the last year. Data catalogs have been called “the new black”[1] and “the most important data management breakthrough to have emerged in the last decade”[2]. While I agree with most of what Forrester, Gartner and others say, they have all gotten one important aspect wrong: The term “data catalog” is misleading and too narrow. These new software solutions are to data-driven organizations what libraries were to me; a safe place for discovery, learning and transformation. Think of this software as an open-to-everyone service, where:

  • Content is cataloged and curated by experts
  • Contextual information is easily available alongside the core product
  • It is safe and easy to explore the content.

As a bonus, while traditional libraries encourage silence, modern data catalog solutions encourage conversations, thought exchange and collaboration. They foster data literacy, which is essential for data-driven organizations. Libraries, as we know them, are a foundation for our democracy and even civilization. A data catalog, or a Data Library as I would call it, should be the foundation for your organization's data/analytic democracy.

I should also add that, unlike with books in libraries, data catalog software does not require data to be physically loaded into the software. The software connects to the data at its original location.

Building the Case for a Data Library

Each organization has to identify its own drivers and build its own business case for a data library. I have worked on and studied data catalog initiatives driven by very different needs:

  • A desire to capture more contextual metadata
  • An aim to foster collaboration in a data & analytic community
  • A need to identify certain data assets for compliance purposes (e.g. PII for GDPR)
  • An ambition to collect information about a multitude of data platforms into one location.

Below are some examples of how common pain points (or "opportunities" if you prefer) can be expressed as benefits for a catalog business case, and how these can then drive the focus of the solution.

Some of you might now be wondering what happened to metadata management, ETL, lineage and similar functionalities often associated with data catalogs. While these are important features, they narrow the solution too much and you end up with a technical data catalog, not a data library.

Time & Productivity. You are likely to achieve greater buy-in and success when focusing on business benefits like shorter time to insight or a productivity boost. These benefits are achieved by reducing the time data workers spend finding, accessing and learning how to use data.

Better Decisions. You can also focus on the benefits of faster and more accurate decision making. Faster analytics leads to faster decisions, which leads to seizing business opportunities faster. Similarly, a better understanding of available data assets leads to richer and more precise analytic outputs, making business decisions more accurate.

Data Value Assessment. Finally, Chief Data Officers in particular should actively be looking at the value of their data assets to understand investment and optimization opportunities. Features like data utilization metrics and user stories shared in the data library will help a CDO understand and assess the value of their data assets.

The highest prioritized benefits should then drive your approach to implementing a data library. Is it more important to capture rich metadata or usage data? Will you favor crowdsourcing or automation for creating the content? Is the library centrally curated or can anyone add/edit/remove contents? The answers to these questions should drive your technology selection and implementation plan.

It’s a Way of Working Aiming To Deliver a Great User Experience

While modern data catalog software products are already pretty great on their own, just like libraries, they won't provide value without active curation and stewardship. The key to success with a data library is to make it a way of working. This means shaping behaviors in your organization, i.e. what people do and how they do it.

Libraries in essence are just hollow buildings with empty shelves. What makes them work is their content, and the people curating and designing services around that content. You need to think of data libraries similarly. Identify the people most passionate about your data assets and promote them to become “Supreme Information Curators” or “Esteemed Data Council Members”. Avoid making this about policing and controlling, and instead focus on enabling and empowering users. Then, use insight from how your organization works to design a data library user experience that does to data workers what libraries did to me: Changed beliefs, thoughts and behaviors. Well-designed data libraries will make data workers more engaged, productive, collaborative and knowledgeable. By doing so, the libraries will also drive data workers to contribute to the common wisdom of your organization.

Data Catalogs Are Just Software But Data Libraries Are Foundational Capabilities

The data catalog software market is a dynamic and interesting one, and these products can solve a number of business problems. However, they miss the mark if they are deployed as catalogs only. Think of these products as the enabler of your organization’s data library and the foundation of your data and/or analytics democracy. When building the case for your data library, identify which business problems are most relevant to your organization and then look for solutions that address those specific problems. A data library is a way of working, not just a technology solution. Find the right people in your organization and give them the responsibility – and privilege – of curating your organization’s data wisdom. Think of your data library as a service that provides value to its users and changes behaviors. If you get this user experience right, your organization will generate more value from its data assets, and your data people will be smarter, more engaged and more productive than any of your competitors.

—–

At Solita we have explored the data catalog software market extensively and have also implemented data catalog/library solutions with our clients. We’d love to help you build a business case for a data library and implement a library service that drives data democracy in your organization.


Why and how to enable DataOps in an organization?

It can be a daunting task to drive a DataOps initiative forward in an organization. By understanding its implications, you will increase your odds of succeeding.

When talking with my colleague about the introductory post of this blog series, I was asked whether we could jump straight into the technological side of DataOps in the second part. Unfortunately not. Technology is an important part of the phenomenon, but the soft side is even more important.

Defining the undefined

There are still no official standards or frameworks for DataOps. So how can we even talk about enabling DataOps in an organization if we only understand it at a high level? To get things started, we have to break the concept apart.

The DataOps Manifesto [1] that DataKitchen has brewed together does a good job of describing the principles behind DataOps. However, it is somewhat limited to an analytics/machine learning point of view; modern data platforms can be much more than that. Gartner [2] has the same spirit in its definition, but it is not as tightly scoped to analytics. The focus is more on DataOps as a tool for organizational change in the data domain. The CAMS model, coined by Damon Edwards and John Willis [3] to describe DevOps, works fine for DataOps as well. CAMS stands for Culture, Automation, Measurement and Sharing. As you can see, automation is only one of the four elements. Today we will dive into the cultural aspect.

DataOps culture everywhere

How do you build a DataOps culture? One does not simply build culture. Culture is a puzzle that is put together piece by piece, and it will devour your beloved DataOps initiative for breakfast if you try to drive it forward as a purely technological project. You can’t directly change culture. But you can change behavior, and behavior becomes culture [3].

Let’s take an example of what the implications of these cultural aspects can be. I like the “Make it work, make it fast, make it last” mentality, which prioritizes delivering value fast and making things last once the business already benefits from the solution. The problem is that culture seldom supports the last of the three.

Once value has been delivered, no one prioritizes the work related to making solutions last, as it does not produce immediate business value.

By skipping the last part, you slowly accumulate technical debt, which means that a growing share of development time goes to unplanned work instead of producing new features for the business. The term death spiral (popularized in the book The Phoenix Project [4]) describes the phenomenon well.

Kaizen (改善)

An important part of DataOps is that the organization makes a collective commitment to high quality. If you compromise on this, the maintenance cost of your data platform slowly starts to rise and new development gets slower. Related to this, we also need some kaizen mentality. Kaizen (kai/改 = change, zen/善 = good) means continuous improvement that involves everyone. In the data development context this means that we continuously look for inefficiencies in our development processes and prioritize the work that removes that waste. But can you limit the effects to development alone? Not really. This can affect all stakeholders that are involved with your data, meaning you should understand your data value streams in order to control the change.

As Gartner [2] states, the “focus and benefit of DataOps is as a lever for organizational change, to steer behaviour and enable agility”. DataOps can thus be utilized as a tool for organizational transformation. Typical endgame goals for DataOps are a faster lead time from idea to business value, a reduced total cost of ownership, and empowered developers and users.

Faster time to value

This is usually the main driver for DataOps. The time from idea to business value is crucial for an organization to flourish. Lead time reduction comes from a faster development process and less waiting between different phases, but also from the fact that building and releasing in smaller increments makes it possible to take solutions into use gradually. Agile methodology and lean thinking play a big part in this, and the technology is there to support them.

If your data development cycle is too slow, it tends to lead to shadow IT, meaning each business unit builds its own solutions because it feels it has no other choice. Long development cycles also mean that you will build solutions no one uses. The faster you get feedback, the better you can steer your development and build the solution the customer actually needs instead of what was initially requested (let’s face it: at the beginning you usually have no clue about all the details needed to get the solution done).

All in all, faster time to value should be a fairly universal goal because of its positive effects on the business.

Reduced Total Cost of Ownership (TCO)

Reduced TCO is a consequence of many drivers. The hypothesis is that the quality of solutions is better, resulting in less error fixing, faster recovery times and less unplanned work in general.

Many cloud data solutions have started small and gradually grown larger and larger. By the time you realize you might need some sort of governance and shared practices, the environment can already be challenging to manage. By utilizing DataOps you can make the environment a lot more manageable, more secure and easier to develop on.

Empowered developers and users

One often overlooked factor is how DataOps affects the people responsible for building the solutions. Less manual work and more automation means that developers can focus more on business problems and less on routine grunt work, which can make the content of the work more meaningful. At the same time it can create skill gaps that are agonizing for the individual and also a challenge for the organization in how to organize the work. Development work can actually also feel more stressful, as there will be less waiting (for loads to complete, etc.) and more actual concentrated development.

Some definitions of DataOps [5] emphasize the collaboration and communication side of DataOps. Better collaboration builds trust in the data and between different stakeholders. Faster development cycles can also bring developers and users closer to each other and engage users in the development itself. This can raise enthusiasm among end users and break the disbelief that data development can’t support business processes fast enough.

Skills and know-how

One major reason DataOps is hard to approach is that, technically, it requires a remarkably different skill set than traditional ETL/data warehouse development. You still need to model your data. (There seems to be a common misconception that you just put all your data into a data lake and then utilise it. Believe me, database management systems were not invented by accident, and there really is a need for curated data models. But that is a different story.)

You also need to understand the business logic behind your data. This remains the trickiest part of data integration, as it requires a human to interpret and integrate the logic.

So on a higher level you are still doing the same things: integrating and modelling your data. But the technologies and development methods used are different.

Back in the day, as an ETL/DW developer you could do almost all of your work with one GUI-oriented tool, be it Informatica, SSIS or DataStage, for example. This changes in the DataOps world. As a developer you should know the cloud ecosystems and their components, be able to code (Python, NodeJS and C# are a good start), understand serverless and its implications, and be committed to continuous integration and the things it requires.

And the list goes on. It’s overwhelming! Well, it can be if you try to convert your developers to DataOps without help. There are ways to make the change easier, such as automation and prebuilt modular components, but I’ll still tease you a bit on this one and come back to the solutions later, as this is a big and important subject.
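
To give a concrete taste of the shift, here is a minimal, hypothetical sketch of a code-first transformation step. Unlike a GUI mapping, the logic is plain version-controlled code with a unit test that a continuous integration pipeline could run on every commit; all names and the data shape are assumptions made for illustration.

```python
# Hypothetical code-first transformation step; names are illustrative.
from collections import defaultdict
from datetime import date

def daily_revenue(orders):
    """Sum order amounts per day; 'orders' is a list of (date, amount) pairs."""
    totals = defaultdict(float)
    for order_date, amount in orders:
        totals[order_date] += amount
    return dict(totals)

def test_daily_revenue():
    # A unit test that CI can execute automatically before any deployment.
    orders = [
        (date(2020, 1, 1), 100.0),
        (date(2020, 1, 1), 50.0),
        (date(2020, 1, 2), 25.0),
    ]
    assert daily_revenue(orders) == {
        date(2020, 1, 1): 150.0,
        date(2020, 1, 2): 25.0,
    }
```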

Yesterday’s news

One could argue that data engineers and organizations have been doing this DataOps stuff for years, but at least from what I have seen, the emphasis has been on the data development side while the operations part has been an afterthought.

This has led to “data platforms” that are technologically state of the art, but when you show the environment to an experienced data professional, she is horrified. Development has been done to produce quick business value, but the perceived value is deceptive: it starts to fade as the data platform becomes a burden to maintain and the point solutions all live their own lives instead of building cumulative value on top of each other.

Success factors for enabling DataOps

In the end I would like to share a few pointers on how to improve your chances of succeeding if you dare to embark on your DataOps journey. Unfortunately, by blindly adopting “best practices” you can fall victim to cargo culting, meaning you try to adopt practices that do not work in your organization. Still, there are some universal things that can help you make DataOps work in your organization.

Start small and show results early

You need to build trust in what you are building. What has worked in organisations is vertical slicing (building a narrow end-to-end solution) and delivering value as soon as possible. This proves that the new ways of working bring the promised benefits, and you’ll get the mandate to go forward with your initiative.

Support from senior executives

Oh, the classic. Even if it sounds a bit clichéd, you still can’t get away from it: you will need a high-level sponsor to improve your odds of succeeding. Organizational changes are built bottom-up, but it will ease your way if you get support from someone high up. As DataOps is not only a technological initiative, you will need to break a few established processes along the way. People will question you, and having someone backing you up can help a great deal!

Build cross-functional teams

If you don’t have a team full of unicorns, you will be better off mixing different competences in your development team and not framing roles too tightly. Also mix your own people with consultants who have deep knowledge in certain fields. You need expertise both in data development and in operating the cloud. The key is to enable close collaboration so that people learn from each other and the flow of information is seamless.

But remember the most important thing, without which you will not succeed: actually start, and commit to the change. Like all initiatives that affect people’s daily routines, this one will meet resistance. Show the value, engage people and you will prevail!

[1] The DataOps Manifesto. http://dataopsmanifesto.org/dataops-manifesto.html
[2] Nick Heudecker, Ted Friedman, Alan Dayley. Innovation Insight for DataOps. 2018. https://www.gartner.com/document/3896766
[3] John Willis. DevOps Culture (Part 1). 2012. https://itrevolution.com/devops-culture-part-1/
[4] Gene Kim, Kevin Behr, George Spafford. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. 2013. IT Revolution Press.
[5] Andy Palmer. From DevOps to DataOps. 2015. https://www.tamr.com/from-devops-to-dataops-by-andy-palmer/
