Untapped potential – analyzing Finnish esports broadcasting channels using Twitch API data

There are more than 117000 unique Twitch users following at least one of four Finnish esports channels: ElisaViihdeSport, Pelaajatcom, TES_csgo, or yleeurheilu. In this series of blog posts, I'm studying these channels and their users using data from the Twitch API.

Twitch, also known as “Twitch.tv”, is a live streaming platform where anyone can stream their live content to a worldwide audience. It’s mostly known for different types of gaming content, although non-gaming content has also gained a whole lot of popularity during the last few years. This series of blog posts focuses on gaming content – more specifically esports content – by analyzing the four biggest Finnish esports broadcasting channels on the platform: ElisaViihdeSport, Pelaajatcom, TES_csgo and yleeurheilu.

“Oh geez, that’s a lot of stuff to go through, why should I bother reading all of it?”

In this series, I’m trying to make interesting points about the most popular Finnish esports channels and their follower base. I’ll show that all those channels have a lot of room to grow, whether they keep broadcasting for a Finnish audience or expand to an international audience, which is naturally a lot bigger.

Hopefully, you’ll learn something new about esports and the broadcasting side of it, whether you have prior knowledge or not. I can assure you that I found things that I didn’t expect, even though I’m a former professional in the field. In any case, you’ll get to watch some numbers and graphs, not just a wall of text. Enjoy!

As an important disclaimer, I’ve had the pleasure of working with many of these broadcasters during my few years as an esports journalist. These channels were chosen because they represent four different companies, they each have more than 20000 followers and – to be frank – I had to narrow it down somewhere, so I ended up with these four. I have no strings attached to these channels or companies anymore, and they have not influenced anything in this text, except by doing esports broadcasts for the Finnish audience.

Before jumping into Finnish esports, it’s important to understand some things about Twitch and esports in general. This first post of the series focuses on introducing these topics to someone who isn’t that familiar with them.

What is Twitch and why should I care about it?

Twitch is one of the most popular live streaming platforms. According to the company, seven million Twitch streamers go live every month, and there are users from over 230 countries. TwitchTracker data shows that the all-time maximum number of concurrent viewers on Twitch was more than 6.5 million in January 2021. It’s a huge platform globally, but it’s also really popular among Finnish viewers and for Finnish content: according to TwitchTracker, thousands of Finnish channels broadcast every month.

Metaphorically speaking, it has replaced television for younger generations. While browsing Twitch at any time of day, there’s always live content to watch. You can switch effortlessly between channels, but since the vast majority of content is completely free, you need to watch advertisements when changing channels and sometimes during broadcasts.

Just like on many other social media platforms, Twitch users can “follow” other users to get notified when their broadcast begins. At the moment, every Twitch user can follow 0-2000 channels. In addition to getting notifications, users can manually check which of their followed channels are live.

Screenshot of Twitch Android application. Live channels are sorted by concurrent viewers.

The more followers a channel has, the more people see that it is live, hopefully luring them in to watch. The more concurrent viewers, the better for the broadcaster, because the channel appears higher in its followers’ list of live channels. The competition for viewers is fierce, and bigger channels tend to gain more views.

As expected, most Twitch users are just watching and chatting instead of streaming on their own channel. There are three types of accounts on Twitch:

  • Regular: Every account is a regular account at first.
  • Affiliate: First upgrade from a regular account after reaching specific milestones of streamed content and viewing numbers. Can receive money from monthly subscriptions and donations made using Twitch’s own “Bits” currency.
  • Partner: The most coveted account status. An upgrade from Affiliate status, as the user gets “Verified” status and more perks than an affiliate.

Analyzing user types and view counts of approximately 117000 Twitch users who follow one to four of the channels ElisaViihdeSport, Pelaajatcom, TES_csgo and yleeurheilu.

A quick glance at the 117000 Twitch users following ElisaViihdeSport, Pelaajatcom, TES_csgo or yleeurheilu shows that more than 95 % are regular users, about 4 % are affiliates and less than 0.5 % are partners. Nevertheless, if we sum the view counts of all their channels, more than 80 % of the views belong to partners, 14.5 % to affiliates, and less than 4 % to regular accounts.

View count isn’t the best or most accurate metric for measuring a channel’s success, but it’s the only view-specific metric you’ll find in the Twitch API. Luckily, it’s good enough to show that views are heavily concentrated on partner accounts on Twitch, at least for this sample of users.

And why do we concentrate on Twitch instead of YouTube or other live streaming platforms? Because Twitch is so popular for esports content that it practically holds a monopoly as “the esports broadcasting platform”.

What is esports and why should I care about it?

Finnish esports scene has grown a lot during the last few years. Photo: Tubecon 2019, SEUL ry / Arttu Kokkonen

As a short introduction, esports means playing video games competitively. In more detail, esports covers video games of all types, but the most important ones are competitive multiplayer games played on personal computers, gaming consoles and mobile devices.

Just like in traditional sports, some games are played individually (one versus one) and some as teams of different sizes. The most popular esports titles, like League of Legends and Counter-Strike: Global Offensive, are team sports played five versus five, similarly to many traditional ball games. Esports tournaments and leagues are broadcast like traditional sports, as they share many competitive aspects.

As a spectator sport, esports has huge business potential. According to Sponsor Insight, in spring 2019 esports was “the most interesting sport” among Finnish men between the ages of 18 and 29. In my opinion, the result might be a bit too flattering, because all esports titles were combined and compared to single sports like football and ice hockey, but it’s still a sign of growing interest. Esports has a lot of potential, but there is still one huge puzzle to solve: monetization.

In traditional sports – at least the most popular ones – viewers usually have some kind of monthly subscription to be able to watch live broadcasts. They are used to paying tens of euros per month to watch their favorite sport. As it’s a monthly subscription, they watch when they have time for it.

Even if they can’t watch everything, they are still paying customers. Viewing statistics aren’t neglected by broadcasting companies, but at the end of the day, the number of paying customers is more important than the highest number of concurrent viewers. Of course, traditional sports need advertising money too, but compared to esports, it’s a whole different ball game.

Esports section of Twitch.

For esports broadcasts, viewer numbers are everything, because all broadcasts are free to watch, and there are no signs of that changing in the future. As mentioned in the previous chapter, the more followers, the more concurrent viewers. The more concurrent viewers, the more money to be made from advertisements.

Some channels accept different types of optional donations (Twitch terms: “Subscriptions” and “Bits”), but at the moment they are not used as mandatory payments. Therefore, broadcasts are usually funded by partner companies who want attention for their products and services. There’s a long way to go to monetize esports broadcasts, but this blog series isn’t about that, so let’s not dig too deep into it.

One more thing to note about the competition between esports broadcasters: one could think that broadcasting companies of the same nationality would be each other’s biggest competitors. However, in the esports context, national channels battle most with the main broadcast of a tournament. As opposed to traditional sports, viewers can choose between multiple broadcasts and languages for the same tournament, even if a broadcasting company has bought exclusive rights for a country or language.

Main channels – ESL_CSGO, BLASTPremier and DreamHackCS to name a few for the video game Counter-Strike: Global Offensive, or CS:GO – have around one million followers, and they tend to be highlighted on Twitch during their tournaments. None of the Finnish esports channels has reached even the milestone of 100,000 followers, meaning that many potential Finnish Twitch users might select English broadcasts purely because they are recommended, while Finnish channels need to be searched for manually.

This blog series analyzes data about followers of the four biggest broadcasters of Finnish esports: Elisa, Pelaajatcom, Telia and YLE. Only one Twitch channel was analyzed for each of them, even though some of them have more than 10000 followers for their secondary channels too.

Honorable mentions that weren’t in the mix, but who would deserve some analyzing: Finnish Esports League, PUBG Finland and Kanaliiga.

The major players of Finnish esports broadcasting

First of all, these four are not the only esports channels with remarkable Twitch numbers in Finland, but in my opinion they are the most interesting ones. All of them have built most of their follower base on the success and popularity of Finnish CS:GO teams, especially ENCE during 2019-2020.

Three of the four channels – ElisaViihdeSport, Pelaajatcom and TES_csgo – are owned by commercial companies, while yleeurheilu is run by Finland’s national public service media company Yleisradio Oy.

Pelaajatcom is owned by a consulting company called North Empire Oy. ElisaViihdeSport and TES_csgo are in another league, as their owners are the telecommunications giants Elisa Oyj and Telia Finland Oyj.

Channel Owner Created Followers
ElisaViihdeSport Elisa Oyj 5.4.2018 69341
Pelaajatcom North Empire Oy 17.9.2018 67819
TES_csgo Telia Finland Oyj 29.3.2019 32684
yleeurheilu Yleisradio 16.1.2019 20377
Data for all graphs in this post was fetched from the Twitch API on 22.6.2021.

To make the text more readable, I’ll set the official channel names aside and refer to ElisaViihdeSport as Elisa, Pelaajatcom as Pelaajat, TES_csgo as Telia and yleeurheilu as YLE.

All the numbers in this series have been fetched from the Twitch API unless stated otherwise. It’s a public API that allows registered users to make 800 requests per minute to its endpoints. It’s important to note that a “follow” is wiped from the Twitch database – or at least from the API – if a user unfollows a channel. This means that historical Twitch API data might be missing some follows here or there, because some followers have since unfollowed the channel.

Twitch API data is not the absolute truth about follower statistics, but it provides good insight into the content that makes Finnish Twitch users press that purple Follow button.
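For the curious, here’s a rough sketch of how this kind of follower data can be collected with Python. The client ID, token and channel ID are placeholders, and the sketch uses the Helix “users/follows” endpoint as documented at the time of writing – treat it as an illustration rather than the exact script used for this series:

import requests

# Placeholders – register an application on the Twitch developer portal to get these.
CLIENT_ID = "your-client-id"
APP_ACCESS_TOKEN = "your-app-access-token"
HEADERS = {"Client-ID": CLIENT_ID, "Authorization": f"Bearer {APP_ACCESS_TOKEN}"}

def fetch_followers(channel_id: str) -> list:
    """Page through all follow relations pointing to the given channel."""
    follows, cursor = [], None
    while True:
        params = {"to_id": channel_id, "first": 100}   # 100 follows per request
        if cursor:
            params["after"] = cursor                    # continue from the previous page
        response = requests.get(
            "https://api.twitch.tv/helix/users/follows",
            headers=HEADERS, params=params, timeout=10,
        )
        response.raise_for_status()
        payload = response.json()
        follows.extend(payload["data"])                 # from_id, from_name, followed_at, ...
        cursor = payload.get("pagination", {}).get("cursor")
        if not cursor:                                  # no cursor means this was the last page
            return follows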


Glossary

Quick little glossary to clear up some of these terms:

  • User: A Twitch account made for a person or a company.
    Can stream on their own channel, follow other users and chat during Twitch broadcasts. In this post, users aren’t treated as broadcasters or channels, even if they technically are those too.
  • Channel: The Twitch account of a broadcasting company.
    Technically every user is also a channel and vice versa. In this text, I’ll present the four broadcaster accounts as channels, while the users following them are presented as followers.
  • Follower: In the context of this post, a Twitch user that follows one to four of the esports channels being studied. Not all users are followers, but all followers are users. One user can follow up to 2000 channels on Twitch but usually follows fewer than 50.
  • Dedicated follower: A term I made up myself. It represents a user who follows only one of the four channels studied in this blog post. In other words, a dedicated follower might follow up to 2000 channels on Twitch, but only one of them is Elisa, Pelaajat, Telia or YLE. No actual “dedication” is needed; it might be pure coincidence.

Digging into user data – looking for new followers

Firstly, an important note to keep in mind with all these numbers flying around: this chapter studies the follow counts of Twitch users who follow one to four of the Finnish esports channels. All other Twitch users are excluded, and all four channels are studied together, not just the followers of one or two of them.

There are 117049 Twitch users that have followed at least one of these four channels. And most of them have followed only one. On average, these users follow 1.62 channels out of four.

In total, these channels had 117049 unique Twitch users as their followers when the data was fetched from the Twitch API on 22.6.2021. Fewer than 9000 users (7.6 %) followed all four channels. Almost two thirds, 74783 users (63.9 %), followed only one of these four channels.
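Once each channel’s follower list has been fetched, the overlap figures are straightforward to compute. Below is a minimal sketch of the idea, with toy data standing in for the real follower IDs:

from collections import Counter

# Toy data for illustration – in reality each set holds tens of thousands of user ids.
follower_ids = {
    "ElisaViihdeSport": {"u1", "u2", "u3"},
    "Pelaajatcom": {"u2", "u3", "u4"},
    "TES_csgo": {"u3"},
    "yleeurheilu": {"u3", "u5"},
}

# Count how many of the four channels each unique user follows.
channels_per_user = Counter()
for channel, ids in follower_ids.items():
    for user_id in ids:
        channels_per_user[user_id] += 1

unique_users = len(channels_per_user)
average_channels = sum(channels_per_user.values()) / unique_users
dedicated_users = sum(1 for count in channels_per_user.values() if count == 1)

print(f"{unique_users} unique followers")
print(f"{average_channels:.2f} channels followed on average")
print(f"{dedicated_users} dedicated followers ({dedicated_users / unique_users:.1%})")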

These numbers surprised me a lot. After all, at least from my perspective, all these channels have broadcast similar esports content. In other words, they have all broadcast both Finnish and international CS:GO tournaments featuring Finnish top teams.

Counter-Strike: Global Offensive is one of the most popular first-person shooter video games in the world. Teams of five players try to win rounds by either eliminating the enemy team or finishing their objective. Photo: Seul ry / Pekka Nummela, Vectorama 2019

If we take a closer look at this huge chunk of users following only one of the four channels, we’ll find that they mostly follow either Elisa or Pelaajat. For the sake of clarity, I’ll call these users “dedicated followers”, since they aren’t following any of the other three channels under inspection.

Distribution of “dedicated” followers and users who follow other studied channels as well. For example: 36000 followers of Elisa are also following Pelaajat, Telia or YLE. However, more than 33000 Elisa followers aren’t following any of those three.

Not only Finnish CS:GO content draws followers

Not all growth has come from the success of Finnish CS:GO teams. For Elisa, part of their dedicated followers seems to have been gained through their broadcasts of PUBG esports tournaments.

Playerunknown’s Battlegrounds, PUBG, is a first-person shooter battle royale game. In esports tournaments, it’s played with 16 teams fielding four players. Photo: PUBG

Almost none of these followers have followed Pelaajat, Telia or YLE in addition to Elisa, which is most likely influenced by these two things:

  • They followed during a PUBG tournament. The other channels don’t broadcast this video game that much.
  • They followed the channel when the broadcast was in English instead of Finnish. The other broadcasters produce their content only in Finnish.

Top 10 days for gaining dedicated followers for Elisa. The two days with significantly more non-dedicated followers (30.4.2020 and 11.5.2019) were CS:GO broadcasts that featured ENCE. All other days were PUBG broadcasts with English-speaking production.
Day New dedicated followers New other followers New followers (Total) Share of dedicated (%)
27/04/20 1361 20 1381 98.6
14/05/20 703 16 719 97.8
12/05/20 703 18 721 97.5
30/04/20 700 1237 1937 36.1
19/05/20 698 16 714 97.8
11/05/19 614 987 1601 38.4
20/05/20 576 13 589 97.8
21/05/20 476 14 490 97.1
20/04/20 410 31 441 93.0
20/09/20 390 8 398 98.0

Of course, the definition of “English-speaking esports broadcasts” is not that simple when we look at the history of Finnish esports broadcasting. If you’ve been following the scene for a while, you’re probably already asking: “But what about the 2019 Berlin Major rally English broadcasts by Pelaajat.com?”

And that is definitely something we should inspect more closely.

Pelaajat couldn’t do their broadcasts in Finnish, because Telia owned exclusive rights to the Finnish production of the Berlin Major 2019 tournament. Pelaajat found a loophole around the exclusive broadcasting rights by doing their broadcasts in broken English, as English broadcasts were not controlled as strictly as Finnish ones. They became a viral hit during the tournament, and “rally English” made them known outside of Finland too.

Yes, Pelaajat gained thousands of followers during the “rally English” broadcasts, but not just dedicated followers. Almost 5000 of the followers gained during those 12 days have stayed dedicated, but in addition to them, more than 4000 new followers have gone on to follow other Finnish esports channels as well.

Pelaajat grew its follower count massively during the Berlin Major 2019 tournament. About half of these “rally English” followers have followed other Finnish esports channels as well. The off-days of the tournament (2.-4.9.2019) were left out of this graph, but Pelaajat gained more than 200 new followers during those days too, thanks to social media hype.
Day New dedicated followers New other followers New followers (Total) Share of dedicated (%)
28/08/19 534 628 1162 46.0
29/08/19 1200 1162 2362 50.8
30/08/19 888 747 1635 54.3
31/08/19 691 448 1139 60.7
01/09/19 476 348 824 57.8
02/09/19 45 36 81 55.6
03/09/19 63 73 136 46.3
04/09/19 17 20 37 45.9
05/09/19 744 681 1425 52.2
06/09/19 74 62 136 54.4
07/09/19 86 59 145 59.3
08/09/19 92 61 153 60.1
Total 4910 4325 9235 53.2

For comparison, here are the all-time top 10 days for gaining new dedicated followers for Pelaajat. Rally English accounts for four of the top 10 broadcasts, but most of their top follower-gaining broadcasts were made in Finnish, and approximately half of those new followers have found other Finnish esports channels as well.

Top 10 new dedicated followers per day for Pelaajat. The battle between CS:GO teams ENCE and HAVU on 12.3.2021 broke all kinds of records for Finnish esports broadcasting, new followers per day being one of them.
Day New dedicated followers New other followers New followers (Total) Share of dedicated (%)
12/03/21 1835 1411 3246 56.5
29/08/19 1200 1162 2362 50.8
01/02/20 1119 1597 2716 41.2
03/03/19 1077 600 1677 64.2
28/02/19 911 812 1723 52.9
30/08/19 888 747 1635 54.3
02/03/19 873 654 1527 57.2
05/09/19 744 681 1425 52.2
31/08/19 691 448 1139 60.7
23/02/19 615 792 1407 43.7

Certainly, some dedicated users won’t be following other Finnish esports channels in the future. If someone followed a channel for English-language PUBG content, no channel other than Elisa is going to offer that. However, a lot of potential followers can be found among the dedicated users who followed because of “traditional” Finnish CS:GO broadcasts.

Plenty of space for follows

Twitch users can follow a maximum of 2000 channels with their accounts. If we look at how many channels these dedicated users are currently following, we’ll find that there is a lot of room left: half of them follow fewer than 40 Twitch channels. Not following more than one of the four Finnish esports channels is not about the limits of Twitch; it’s about something else.

When we widen our sights to all 117049 users following one to four Finnish esports channels, the median is 46 followed channels. In other words, half of the users (50.1 %, or 58688 users to be precise) follow 46 channels or fewer, while the other half follows 47 to 2000 channels. More than one third of all users (39286 users, 34 %) follow fewer than 25 channels.

Followed channels per user, grouped. For most of the users studied here, there’s plenty of room to follow more channels, as 2000 is the limit. The number of users following fewer than 50 channels is bigger than all the other groups combined.

The average number of followed channels is 87.3, but that is heavily influenced by users following more than 1000 channels. There are only 386 users (0.3 %) with that many followed channels. One could (and probably should) question whether these are legitimate users at all, but with this sample size, their impact is negligible.

Fun fact: Six users have somehow managed to break the barrier of 2000 followed channels by getting to 2001, 2002 or even 2003.

The number of followed channels could be a study of its own. I’m part of this group of 117000 Twitch users, and I’m currently following 231 channels on Twitch. That sounds like a big number after studying these figures, but somehow it’s still possible that the only live channel on my Twitch is a rerun of some random ESL tournament.

Conclusions

This was the first blog post about Twitch API data of Finnish esports broadcasters and I hope you learned something new. In the next part, I’ll dig deeper into the follower data of these four channels.

If you somehow managed to skip all the content and stop right here, I’ll try to summarize the key points and findings of this first part:

  • Twitch is one of the most popular platforms for live streaming, especially for esports tournaments.
  • Esports is a growing industry that needs to figure out its monetization.
  • Esports broadcasters need good follower numbers on Twitch to even have a chance to succeed financially.
  • These four studied Twitch channels (ElisaViihdeSport, Pelaajatcom, TES_csgo and yleeurheilu) have about 117000 unique followers in total.
    • Almost two thirds of these Twitch users follow only one of the four channels studied.
      • They are (usually) potential followers for the other channels.
      • Most of them follow either Elisa or Pelaajat.
    • Half of these users follow 1-46 channels on Twitch, so the limit of 2000 followed channels per user is not what’s holding them back.

The second part will be published after the summer holidays. If you have any questions about anything in this post, you can drop me a DM on Twitter or LinkedIn!

(Cover photo: Unsplash)

Are you building data services for the Mastermind or the Wannabe?

As in all development projects, you need to know your audience to meet their needs. Data projects are no different.

You are looking at your data platform’s user adoption rate, and it is not what you expected. Are you puzzled why people aren’t jumping on board the new, more efficient data tool, ready to leave the legacy system behind?

To know where you are at, ask yourself these questions:

  • When this project was started, did I spend time on getting to know my users?
  • Am I able to tell what their skill level is and how it matches the platform’s capabilities?
  • Do I know what problems the users are trying to solve with their data tools?
  • Do I understand what motivates them?

If you don’t know the answers or have a clear picture of your potential crowd, there might be something you have left undone. No worries! Read on, as we present one tool to help you drill into these questions.

Data development is too often based on presumptions

Back in the day, web services were often designed by developers, the people writing the actual code. End-user needs were typically gathered from product specialists, sales or customer service. These people brought their presumptions about the end-users to the design. User-insight-driven UI design was only just emerging.

Today, the situation is somewhat different. User insight and service design have been internalized quite well in digital service development. When designing a customer-facing web service, no designer wants to lean on best guesses about the end-user’s needs. Designers want to understand and experience for themselves what the motivations and underlying needs of the user are.

Data projects shouldn’t really differ from digital development, but in their approach to users they are at the level where web design was more than ten years ago.

In data projects, what is under the hood often counts more than the surface. The projects tend to rely on assumptions about the users and on some very old organizational hearsay, instead of taking a systematic approach to user insight and service design.

The reasons for this vary. First, modern data platform development projects are a relatively new phenomenon; data used to be in the hands of a smaller crowd (typically skilled specialists from finance). Now that self-service analytics has become popular, the users are also becoming more diverse. Second, data end-users are generally internal users, and the same emphasis is not put on their user experience as on services facing an external customer. This should of course not be the case, since you are also trying to convert them – to change their behavior towards being data-driven in their decision-making.

This is where service design tools could really help you out. With some quite basic user insight you can make some very relevant findings and change the priorities of your project to ensure maximum adoption by the end users.

How can your data project benefit from a service design approach?

To give you some food for thought on how to benefit from service design, we have drafted you something based on our work on various data projects. This is only one example of what service design tools can do for you.

Our experience and discussions with different organizations have led us to believe that there are universal data personas that apply to most data organizations. They differ in how motivated they are to change their ways of working with data and how well they master modern data development skills. These personas are meant to inspire you to turn your eyes upon your users.

User personas are a tool widely used in service design to make the users come alive and facilitate discussion about their needs.

So, we proudly present: The Wannabe, The Enabler, The Mastermind and The Skeptic.


THE WANNABE

The Wannabe is very excited about data and craves to learn more. A visionary, but doesn’t quite know how to get there. Has basic data skills but is eager to learn more and be more data-driven at work.

Give this user support and in return, make them your data community builder. Let them spread their enthusiasm!


THE ENABLER

The Enabler is both capable and motivated. This user is well connected in the data community and sees the advantage that collaboration and shared ways of working bring.

Keep this user close. They are the change makers due to their position in the organization as well as their attitude. Allow them to help others to grow.


THE MASTERMIND

The Mastermind is the user who doesn’t reply to your emails or show up at your workshops. This user doesn’t really need anyone’s help with data, as they already have the tools and the skills. They are not very motivated to share their expertise or to get connected, either.

Make them need you by providing help with routine work. This user requires a lot of effort but can return significant value when you manage to channel their exceptional skills to serve your vision. To get to the Mastermind, use the Enabler.


THE SKEPTIC

The Skeptic knows all the things that have been tried out in the past and also how many of these have failed. Is an organizational expert and has a long history with data but feels left out.

Use some empathy, spend time with this user and listen closely. If you tackle their problem, you have a strong spokesperson and an ally, because they know everyone and everything in the organization.


As said, sketching these user personas is just one example of using design tools to change your approach. There might be people who don’t fit these descriptions and people who act as different personas depending on the project in question. The idea is to make generalizations in order to make it possible to “jump into the users’ shoes” and make different points of view more concrete.

Make the data personas work for you

So, besides changing your approach to get to different data personas, what else can you do? How can you utilize the information the personas provide you?

First of all, you can start by mapping your services and tools to the needs, skills and expectations of the users. You might be surprised. How many of your users are skilled enough to use the services you provide? Five? Is this enough? What is the relevance of their work? Does their work serve a wider group of people?

Try out this matrix below for your data services. How does the potential for your services look now?

When building data platforms, you’re actually building services

It’s about time we twist our minds into thinking people first in data projects: what is the change you wish to see in people’s behavior? To create value, your data platform needs its users as badly as your sales need customers buying their products.

So, stop thinking about data development projects technology first. Start thinking about them as a collection of services. Find out about the users and their needs and prioritize your development accordingly. At the end of the day, people performing better by making better decisions is what you’re after, not a shiny data platform that sees no usage.

The data personas are a starting point for discussion. Link this post to your colleague and ask if they identify themselves. And how about you, are you the Wannabe, the Enabler, the Mastermind or the Skeptic?


The authors of this blog post are two colleagues who share the same interest in looking at data services from the user perspective. At least once, they have laughed out loud at the office at such volume that it almost disturbed co-workers in the common space. They take business seriously and life lightly.

You can contact Tuuli Tyrväinen and Kirsi-Marja Kaurala via e-mail (firstname.lastname@solita.fi) for further discussion.

A curated list of new Snowflake features released at Snowflake Summit 2020

Snowflake and Salesforce going live with their strategic partnership, Salesforce connectors coming to Snowflake, external functions, data masking support, new web UIs going live – the list goes on. Read the following article for a curated list of new features released at Snowflake Summit 2020, a.k.a. the virtual "Say Hello To The Data Cloud" event.

Trying to go through all the announcements for a product can sometimes be overwhelming, as you need to read every individual press announcement. To ease the pain, I’ve gathered a curated list of new Snowflake features released at the second annual Snowflake Summit, branded this year as the “Say Hello To The Data Cloud” event because it was held virtually. Snowflake Summit 2021 will be held at Caesars Forum, Las Vegas on June 7-10th.

So let’s begin.

Snowflake released new features which can be split into the following categories: Salesforce partnership, core database features and UI features. I’ll list the features by category and give a more precise description of each feature where possible.

Salesforce strategic partnership

The most significant news this year was the announced Salesforce partnership. Snowflake CEO Frank Slootman announced the partnership, which has been in the works for a while, together with Salesforce COO and President Bret Taylor. The first results of this partnership are better integrations between Salesforce and Snowflake.

Salesforce connectors

The first two visible features from the Salesforce partnership are the now-released Einstein Analytics connectors: one that enables you to use Snowflake data directly in Einstein Analytics, and the Einstein Analytics Outbound Connector for Snowflake for loading data into Snowflake.

Core Database functions

External functions

This feature is going to unlock a lot of potential when used with DataOps tools (for example Agile Data Engine). At first sight, external functions might sound like they just add something to the existing function functionality. In reality, external functions enable you to trigger anything in Azure, AWS or Google Cloud, as long as it’s reachable through their API gateways. To be more precise, you can, for example, create a function in Snowflake which triggers a Power BI dataset refresh through an Azure API gateway. This means that you can create a data pipeline which refreshes your Power BI reports right after your publish tables have been refreshed, using only SQL and a DataOps tool – or, in Snowflake, streams and tasks.

More information: https://docs.snowflake.com/en/sql-reference/external-functions.html
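To make the idea more concrete, here is a rough sketch of what creating and calling an external function could look like using the Python connector. The connection details, API integration, endpoint URL and function name are placeholders, and the integration itself has to be set up against your own API gateway first:

import snowflake.connector

# Connection details are placeholders.
connection = snowflake.connector.connect(
    account="xy12345.west-europe.azure", user="me", password="***",
    warehouse="DEMO_WH", database="DEMO_DB", schema="PUBLIC",
)
cursor = connection.cursor()

# The external function maps a Snowflake function to an HTTPS endpoint
# behind an API integration created beforehand.
cursor.execute("""
    CREATE OR REPLACE EXTERNAL FUNCTION refresh_powerbi_dataset(dataset_id VARCHAR)
        RETURNS VARIANT
        API_INTEGRATION = azure_api_integration
        AS 'https://my-api-gateway.azure-api.net/powerbi/refresh'
""")

# Calling the function from SQL makes Snowflake call the endpoint, so a report
# refresh can be triggered as the last step of a load pipeline.
cursor.execute("SELECT refresh_powerbi_dataset('sales_dashboard')")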

Search optimization service

You could call the search optimization service Snowflake’s answer to indexes. Basically, you can define on a table-by-table basis when you want Snowflake to pre-compute information about the table, enabling faster queries on small subsets of data with smaller warehouses. More information: https://docs.snowflake.com/en/user-guide/search-optimization-service.html

Support for SQL in stored procedures

As the title says, support for SQL procedures is finally coming. JavaScript has been the chosen language for Snowflake stored procedures, but honestly, we’ve been missing SQL.

Geospatial support

As the title says, Geospatial support is now coming to Snowflake. More information: https://docs.snowflake.com/en/sql-reference/data-types-geospatial.html

Richer resource monitoring alerts to Slack and Teams

Resource monitoring is getting richer and more useful: you can create more detailed resource monitoring rules, and you can finally direct the alerts to Slack and Teams natively.

Programming language extensibility

In the future, you can use your language of choice in user-defined functions. The first language to be supported is Java, and Python support is coming. Snowflake is also going to support popular coding paradigms; in this case, Snowflake is going to support data frames with Python.

Dynamic data masking support

Data masking is nothing unusual, but Snowflake finally has built-in support for data masking internally, and with tokenization through external functions. You can now mask columns which include, for example, social security numbers, and trust that those numbers are hidden from admins as well. In the age of GDPR, the possibility to reduce the footprint of personal data within your databases has become a vital asset in your data toolset.
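As a rough sketch of what this looks like in practice – the connection details, role, table and column names below are placeholders, and the policy logic is just an example:

import snowflake.connector

# Connection details are placeholders.
connection = snowflake.connector.connect(account="xy12345", user="me", password="***")
cursor = connection.cursor()

# Only the allowed role sees the real value; everyone else gets a masked string.
cursor.execute("""
    CREATE OR REPLACE MASKING POLICY ssn_mask AS (val STRING) RETURNS STRING ->
        CASE
            WHEN CURRENT_ROLE() IN ('HR_ADMIN') THEN val
            ELSE '***MASKED***'
        END
""")

# Attach the policy to a column; queries on employees.ssn are masked from now on.
cursor.execute("ALTER TABLE employees MODIFY COLUMN ssn SET MASKING POLICY ssn_mask")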

UI features

New UIs going live, and also a new admin view

The Numeracy acquisition was announced in March 2019 (https://www.snowflake.com/blog/numeracy-investing-in-our-query-ui/), and at last year’s Summit we got a glimpse of the new UI; it has now been rolled out to Snowflake customers. The new query UI will enhance the user experience with predictive typing and live data charts. The admin UI will, for example, give more insight into Snowflake credit costs. Sadly, we still don’t get the possibility to see how much of the pre-bought credits have been used.

Data Marketplace
The Snowflake Data Marketplace, which was announced last year, is now live. You can now make your data available for others to consume, and data consumers can use that data as part of their data pipelines just as they would use normal SQL tables. More information: https://www.snowflake.com/data-marketplace/

Hopefully this list helps you get a grasp on things to come in the Snowflake landscape.

More information by Christian Kleinerman:
https://www.snowflake.com/blog/snowflakes-product-innovations-for-2020/


MLOps: from data scientist’s computer to production

MLOps refers to the concept of automating the lifecycle of machine learning models, from data preparation and model building to production deployment and maintenance. MLOps is not just another machine learning platform or technology; instead, it requires a change in the whole mindset of developing machine learning models, towards the best practices of software development. In this blog post we introduce this concept and its benefits for anyone who has, or plans to have, machine learning models running in production.

Operationalizing data platforms, DataOps, has been among the hottest topics during the past few years. Recently, MLOps has also become one of the hottest topics in the field of data science and machine learning. Building operational data platforms has made data available for analytics purposes and enabled the development of machine learning models on a completely new scale. While the development of machine learning models has expanded, the processes for maintaining and managing the models have not kept the same pace. This is where the concept of MLOps becomes relevant.

What is MLOps?

Machine learning operations, or MLOps, is a similar concept to DevOps (or DataOps), but specifically tailored to the needs of data science and, more specifically, machine learning. DevOps was introduced to software development over a decade ago. DevOps practices aim to improve application delivery by combining the entire life cycle of the application – development, testing and delivery – into one process, instead of having a separate development team hand over the developed solution for the operations team to deploy. The definite benefits of DevOps are shorter development cycles, increased deployment velocity, and dependable releases.

Just as DevOps aims to improve application delivery, MLOps aims to productionalize machine learning models in a simple and automated way.

As for any software service running in production, automating the build and deployment of ML models is equally important. Additionally, machine learning models benefit from versioning and monitoring, and the ability to retrain and deploy new versions of the model, not only to be more reliable when data is updated but also from the transparency and AI ethics perspective.

Why do you need MLOps?

Data scientists’ work is research and development, and it essentially requires skills in statistics and mathematics, as well as programming. It is iterative work of building and training various models. Many teams have data scientists who can build state-of-the-art models, but their process for building and deploying those models can be entirely manual. It might happen locally, on a personal laptop with copies of the data, and the end product might be a CSV file or PowerPoint slides. These types of experiments don’t usually create much business value if they never go live to production. And that’s where data scientists in many cases struggle the most, since engineering and operations skills are often not data scientists’ core competences.

In the best case, this type of development gets the model to production when the data scientist hands over the trained model artifacts to the ops team to deploy, but the ops team might lack knowledge on how to best integrate machine learning models into their existing systems. After deployment, the model’s predictions and actions might not be tracked, and model performance degradation and other behavioral drifts cannot be detected. At best, your data scientist monitors model performance manually and manually retrains the model with new data, with another manual handover at deployment every time.

The described process might work for a short time when you only have a few models and a few data scientists, but it is not scalable in the long term. The disconnection between development and operations is what DevOps was originally developed to solve, and the lack of monitoring and re-deployment is where MLOps comes in.

ML model development lifecycle. The process consists of development, training, packaging and deploying, automating and managing and monitoring.

 

How can MLOps help?

Instead of going back-and-forth between the data scientists and operations team, by integrating MLOps into the development process one could enable quicker cycles of deployment and optimization of algorithms, without always requiring a huge effort when adding new algorithms to production or updating existing ones.

MLOps can be divided into multiple practices: automated infrastructure building, versioning important parts of data science experiments and models, deployments (packaging, continuous integration and continuous delivery), security and monitoring.

Versioning

In software development projects it is typical that source code, its configurations and also infrastructure code are versioned. Tracking and controlling changes to the code enables roll-backs to previous versions in case of failures and helps developers to understand the evolution of the solution. In data science projects source code and infrastructure are important to version as well, but in addition to them, there are other parts that need to be versioned, too.

Typically a data scientist runs training jobs multiple times with different setups. For example, the hyperparameters and the features used may vary between runs, and they affect the accuracy of the model. If the information about the training data, hyperparameters, the model itself and the model’s accuracy for the different combinations is not saved anywhere, it might be hard to compare the models and choose the best one to deploy to production.
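One way to handle this is an experiment tracking tool. The sketch below uses MLflow as one possible example; the data, parameters and tags are purely illustrative:

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the real training set.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 5}

with mlflow.start_run():
    mlflow.log_params(params)                                   # hyperparameters of this run
    mlflow.set_tag("training_data", "s3://bucket/training/2021-06-01")  # data version, placeholder path

    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_metric("accuracy", accuracy)                     # metric used to compare runs
    mlflow.sklearn.log_model(model, "model")                    # the trained model artifact itself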

Templates and shared libraries

Data scientists might lack knowledge of infrastructure development or networking, but if there is a ready template and framework, they only need to adapt the steps of the process. Templating and using shared libraries frees up data scientists’ time so they can focus on their core expertise.

Existing templates and shared libraries that abstract underlying infrastructure, platforms and databases, will speed up building new machine learning models but will also help in on-boarding any new data scientists.

Project templates can automate the creation of the infrastructure that is needed for running the preprocessing or training code. When, for example, building the infrastructure is automated with Infrastructure as Code, it is easier to build different environments and be sure they’re similar. This usually also means that infrastructure security practices are automated and don’t vary from project to project.

Templates can also have scripts for packaging and deploying code. When the libraries used are mostly the same in different projects, those scripts very rarely need to be changed and data scientists don’t have to write them separately for every project.

Shared libraries mean less duplicate code and a smaller chance of bugs in repetitive tasks. They can also hide details about the database and platform from data scientists, who can use ready-made functions for, for instance, reading from and writing to a database or saving the model. Versioning can be built into the shared libraries and functions as well, which means it’s not up to the data scientist to remember which things need to be versioned.

Deployment pipeline

When deploying either a more traditional software solution or ML solution, the steps in the process are highly repetitive, but also error-prone. An automated deployment pipeline in CI/CD service can take care of packaging the code, running automated tests and deployment of the package to a selected environment. This will not only reduce the risk of errors in deployment but also free time from the deployment tasks to actual development work.

Tests are needed in the deployment of machine learning models just as in any software, including the typical unit and integration tests of the system. In addition to those, you need to validate the data and the model, and evaluate the quality of the trained model. Adding the necessary validation creates a bit more complexity and requires automating steps that data scientists would otherwise do manually before deployment to train and validate new models. Depending on your solution, you might need to deploy a multi-step pipeline to automatically retrain and deploy models.
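As a small illustration, a deployment pipeline could include a validation step along these lines; the metrics file, threshold and exit-code convention are placeholders for whatever your CI/CD service expects:

import json
import sys

ACCURACY_THRESHOLD = 0.85  # illustrative quality gate

def validate(metrics_path: str) -> int:
    """Fail the pipeline step if the newly trained model is not good enough."""
    with open(metrics_path) as f:
        metrics = json.load(f)                   # e.g. written by the training job
    if metrics["accuracy"] < ACCURACY_THRESHOLD:
        print(f"Model accuracy {metrics['accuracy']:.3f} is below the threshold, stopping deployment.")
        return 1                                 # non-zero exit code fails the CI/CD step
    print("Model passed validation, continuing to deployment.")
    return 0

if __name__ == "__main__":
    sys.exit(validate(sys.argv[1]))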

Monitoring

After the model is deployed to production, some might assume it simply remains functional, or at worst decays like any traditional software system. In fact, machine learning models can decay in more ways than traditional software systems. In addition to monitoring the performance of the system, the performance of the models themselves needs to be monitored as well. Because machine learning models make assumptions about the real world based on the data used for training them, the accuracy of the model may decrease when the surrounding world changes. This is especially true for models that try to model human behavior. Decreasing model accuracy means that the model needs to be retrained to better reflect the surrounding world, and with monitoring, retraining is done neither too seldom nor too often. By tracking summary statistics of your data and monitoring the performance of your model, you can send notifications or roll back when values deviate from the expectations set at the time of the last model training.
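A very simple version of such a check could look like the sketch below, where summary statistics saved at training time are compared against a batch of production data; the statistics, threshold and alerting mechanism are placeholders:

import numpy as np

# Saved when the model was trained (placeholder values).
TRAINING_STATS = {"age": {"mean": 41.2, "std": 12.5}}

def has_drifted(feature: str, incoming: np.ndarray, max_z: float = 3.0) -> bool:
    """Flag the feature if the incoming mean deviates too far from the training-time mean."""
    stats = TRAINING_STATS[feature]
    z_score = abs(incoming.mean() - stats["mean"]) / (stats["std"] / np.sqrt(len(incoming)))
    return z_score > max_z

incoming_batch = np.random.normal(loc=48.0, scale=12.0, size=500)  # pretend production data
if has_drifted("age", incoming_batch):
    print("Feature 'age' has drifted – consider retraining or rolling back the model.")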

Applying MLOps

Bringing MLOps thinking to machine learning model development enables you to actually get your models to production if you are not there yet, makes your deployment cycles faster and more reliable, reduces manual effort and errors, and frees your data scientists’ time from tasks that are not their core competence for actual model development work. Cloud providers (such as AWS, Azure or GCP) are especially good places to start implementing MLOps in small steps, with ready-made software components you can use. Moreover, all the CPU/GPU capacity needed for model training is available with a pay-as-you-go model.

If your AI journey is still in an early phase (PoCs don’t need heavy processes like this), a robust development framework and pipeline infrastructure might not be the highest priority. However, any effort invested in automating the development process from the early phases will pay back later and reduce machine learning technical debt in the long run. Start small and change the way you develop ML models towards MLOps by at least putting the development work under version control and automating the steps for retraining and deployment.

DevOps was born as a reaction to the need for systematic organization around rapidly expanding software development, and now the same problems are being faced in the field of machine learning. Take the needed steps towards MLOps, as has been done successfully with DevOps before.


Performance of semi-structured data types in Snowflake

Snowflake extends the idea of a traditional relational database with the possibility to store and handle semi-structured data types. As shown in this post, semi-structured data types affect query performance and the results of numerical operations.

On top of traditional relational data, Snowflake also supports semi-structured data with a flexible schema. This can be utilized with, for example, the following data formats: JSON and XML. Inside Snowflake, these are stored as either variant, array or object data types. Let us take a closer look at what these mean.

Variant is a tagged universal type that can hold up to 16 MB of any data type supported by Snowflake. Variants are stored as columns in relational tables. Array is a list-like indexed data type that consists of variant values. Object, on the other hand, is a data type that consists of key-value pairs, where the key is a non-null string and the value is variant-type data.

Snowflake provides guidelines for handling semi-structured data in their documentation. As a baseline, they recommend storing semi-structured data as the variant data type if the usage of the data is unclear. However, when the usage is known, the recommendation is to perform test cases to find the solution with the best performance. Let us take a look at what kind of differences might occur.

Test setup

For testing purposes, we create three different tables:

  1.   Table with 500 million rows
  2.   Table with variant column having 500 million values
  3.   Table with array column having 500 million values

The traditional table includes a user identifier, which is a random value between 1 and 5 million, and an amount field, which is an integer value between 1 and 500 million. The variant and array tables are both grouped tables, created with the traditional table as their source. Their schema has a user identifier and a variant- or array-typed value field, which holds the aggregated list of amounts for a certain user identifier. The tables are created with the following queries:

CREATE TABLE T_500M (
   user_id   INT,
   amount    INT); 

INSERT INTO T_500M
   SELECT UNIFORM (1, 5000000, random())
        , UNIFORM (1, 500000000, random())
   FROM TABLE ( GENERATOR ( ROWCOUNT => 500000000 ));
 
CREATE TABLE T_500M_VARIANT AS 
   SELECT user_id
        , CAST(ARRAY_AGG(amount) AS VARIANT) AS variant_field
   FROM T_500M
   GROUP BY user_id;

CREATE TABLE T_500M_ARRAY AS 
   SELECT user_id
       ,  ARRAY_AGG(amount) AS array_field
   FROM T_500M
   GROUP BY user_id;

Evaluating performance

Storing values as variant or array might seem like a good idea if we want to aggregate sums of the amount field for every user identifier. As the query result, we want to show the user identifier, the number of occurrences for that user, and the aggregated sum of the amount field. We can achieve this for each table with the following queries:

SELECT user_id
     , COUNT(*) AS value_count
     , SUM(amount) AS sum
FROM T_500M
GROUP BY user_id;

SELECT user_id
     , COUNT(flat_variant.value) AS value_count
     , SUM(flat_variant.value::INTEGER) AS sum
FROM T_500M_VARIANT
    , lateral flatten(INPUT => variant_field) flat_variant
GROUP BY user_id;

SELECT user_id
     , COUNT(flat_array.value) AS value_count
     , SUM(flat_array.value::INTEGER) AS sum
FROM T_500M_ARRAY
     , lateral flatten(INPUT => array_field) flat_array
GROUP BY user_id;

The select statement takes 15.6 seconds for the traditional relational table, 22.1 seconds for the variant table and 21.9 seconds for the array table. The difference is significant, with the queries being over 40 % slower for the semi-structured tables.

Another thing to consider with semi-structured formats is that queries on semi-structured data will not use the result cache. We can see this by running the queries again. The traditional table query takes only 0.2 seconds thanks to Snowflake’s persisted query results, but the queries against the other tables take the same roughly 22 seconds as before.

The measured difference in query time between the structured table and the tables with semi-structured data types is real, but it may still be acceptable in cases where loading the data into variant or array columns is a lot easier. However, it needs to be noted that, as stated in the Snowflake documentation, query performance is even worse for data types that are not native to JSON when they are stored as variant or array. The test shown here included only native JSON data types, but including, for example, datetime values as variants would make the difference even bigger.

Explicit data type conversion

It is important to pay attention to data types when accessing array or variant data. Consider the following example, where we query the total sum of the amount field with the following select statement on the variant table:

SELECT COUNT (*) AS row_count
     , SUM(flat_variant.value) AS sum_without_cast
     , SUM(flat_variant.value)::INTEGER AS cast_after_sum
     , SUM(flat_variant.value::INTEGER) AS cast_before_sum
FROM T_500M_VARIANT
     , lateral flatten(INPUT => variant_field) flat_variant;

For query results, we get three different sums:
   No specified casting:           1.24998423949572e+17
   Casting before sum:            124998423949572384
   Casting after sum:               124998423949572368

The same sums are received when running the above query on the array table. The difference comes from Snowflake’s internal calculations, where variant and array values are handled using JavaScript. JavaScript uses float as the data type for numeric values. As shown in an earlier blog post, using floating-point numeric data types may lead to imprecise values. Even though the table only has fixed-point numeric values, using variant or array converts them to floating point unless the type is cast explicitly when querying the data.

Conclusion

The possibility to store semi-structured data in relational tables in Snowflake comes in handy for many business needs where the source data is not traditionally relational. It enables loading semi-structured data straight into Snowflake and parsing it onwards from there. Even though this is possible, it should be tested per use case whether it is the best solution.

When evaluating query performance, we noticed that for tables with semi-structured data types the select statements were about 40 % slower compared to a similar table with structured data types. This was with JSON-native data types; non-native types would result in an even bigger difference in execution time. Queries on semi-structured data types can’t utilize the result cache, so re-running them takes a similar time as the initial run.

The best practice is to convert data types explicitly when accessing variant or array values. The Snowflake engine uses JavaScript to handle these data types, so, as was shown, numeric values may suffer from rounding.


Lessons learned from combining SQS and Lambda in a data project

In June 2018, AWS Lambda added Amazon Simple Queue Service (SQS) to its supported event sources, removing a lot of the heavy lifting of running a polling service or creating extra SQS-to-SNS mappings. In a recent project we utilized this functionality and configured our data pipelines to use AWS Lambda functions for processing the incoming data items and SQS queues for buffering them. The built-in functionality of SQS and Lambda provided us a serverless, scalable and fault-tolerant basis, but while running the solution we also learned some important lessons. In this blog post I will discuss the issue of valid messages ending up in dead-letter queues (DLQ) and correctly configuring your DLQ to catch only erroneous messages from your source SQS queue.

What are Amazon SQS and Lambda?

In brief, Amazon SQS is a lightweight, fully managed message queueing service that enables decoupling and scaling of microservices, distributed systems and serverless applications. With SQS, it is easy to send, store, and receive messages between software components without losing messages.

AWS Lambda is a fully managed, automatically scaling service that lets you run code in multiple different languages in a serverless manner, without having to provision any servers. You can configure a Lambda function to execute in response to various events or orchestrate the invocations yourself. Your code runs in parallel and processes each invocation individually, scaling with the size of the workload.

When you configure an SQS queue as an event source for your Lambda, Lambda functions are automatically triggered when messages arrive in the SQS queue. The Lambda service automatically scales up and down based on the number of in-flight messages in the queue. The polling, reading and removing of messages from the queue is thus handled automatically by the built-in functionality. Successfully processed messages are removed and the failed ones are returned to the queue or forwarded to the DLQ, without you needing to explicitly configure these steps inside your Lambda function code.
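On the Lambda side, the processing code itself can stay very simple. The sketch below shows the general shape of an SQS-triggered handler; the message content and the process step are placeholders for the actual pipeline logic:

import json

def process(metadata: dict) -> None:
    # Placeholder for the real work: pass the file metadata to the next pipeline step.
    print(f"Processing {metadata}")

def handler(event, context):
    """Entry point for an SQS-triggered Lambda function."""
    for record in event["Records"]:              # one event contains a batch of SQS messages
        metadata = json.loads(record["body"])
        process(metadata)
    # Returning normally lets the Lambda service delete the whole batch from the queue.
    # Raising an exception returns the messages to the queue instead, increasing their ReceiveCount.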

Problems with valid messages ending up in DLQ

In the recent project we needed to process data coming in daily, as well as in larger batches of historical data loadings, through the same data processing pipeline. In order to handle the changing data loads, SQS decouples the source system from processing and balances the load for both use cases. We used SQS for queueing the metadata of new data files and a Lambda function for processing the messages and passing the metadata on to the next steps in the pipeline. When testing our solution by pushing thousands of messages rapidly to the queue, we observed many of the messages ending up in the dead-letter queue, even though they were not erroneous.

From the CloudWatch metrics, we found no execution errors during the given period, but instead there was a peak in the Lambda throttling metric. We had configured a DLQ to catch erroneous messages, but ended up having completely valid and unprocessed messages in the DLQ. How does this happen? To understand this, let’s dive deeper into how Lambda functions are triggered and scaled when they have SQS configured as the event source.

Lambda scales automatically with the number of messages arriving to SQS – up to a limit

Let’s first introduce briefly the parameters of SQS and Lambda that are relevant to this problem.

SQS

ReceiveMessageWaitTimeSeconds: The time that the poller waits for new messages before returning a response. Your messages might be arriving in the SQS queue unevenly, sometimes in bursts and sometimes not at all. The default value is 0, which means constant polling of messages. If the queue is empty and your solution allows some lag time, it might be beneficial not to poll the queue constantly and return empty responses. Instead of polling for messages constantly, you can specify a wait time between 1 and 20 seconds.

VisibilityTimeout: The length of time during which a message is invisible to consumers after it has been read from the queue. When a poller reads a message from the SQS queue, that message still stays in the queue but becomes invisible for the period of the VisibilityTimeout. During this time the message is unavailable to any other poller trying to read it, which gives the initial consumer time to process the message and delete it from the queue.

maxReceiveCount: Defines how many times a message can become visible again in the source queue before it is moved to the DLQ. If the processing of the message is successful, the consumer deletes it from the queue. Whenever an error occurs in processing a message and it cannot be deleted from the queue, the message becomes visible again in the queue with an increased ReceiveCount. When the ReceiveCount of a message exceeds the maxReceiveCount of the queue, the message is moved to the dead-letter queue.

Lambda

Reserved concurrency limit: The number of executions of the Lambda function that can run simultaneously. There is an account-specific limit on how many Lambda executions can run simultaneously (by default 1,000), and it is shared between all your Lambda functions. By reserving part of it for one function, other functions running at the same time can’t prevent your function from scaling.

BatchSize: The maximum number of messages that Lambda retrieves from the SQS queue in a single batch. BatchSize is related to the Lambda event source mapping, which defines what triggers the Lambda function. In this case, the trigger is SQS.

Figure 1 below illustrates how Lambda actually scales when messages arrive in bursts to the SQS queue. Lambda uses long polling to poll messages from the queue, which means that it does not poll the queue constantly but instead waits between 1 and 20 seconds, depending on what you have configured as your queue’s ReceiveMessageWaitTimeSeconds. The Lambda service’s internal poller reads messages from the queue as batches and invokes the Lambda function synchronously with an event that contains a batch of messages. The number of messages in a batch is defined by the BatchSize configured in the Lambda event source mapping.

When messages start arriving in the queue, Lambda first reads up to 5 batches and invokes a function for each. If messages are still available, the number of processes reading the batches is increased by up to 60 more instances per minute (Figure 2), until it reaches either 1) the reserved concurrency limit configured for the Lambda function in question or 2) the account’s limit of total concurrent Lambda executions (by default 1,000), whichever is lower (N in the figure).

By setting a reserved concurrency limit for your Lambda function, you guarantee that it will get part of the account’s Lambda resources at any time, but you also prevent your function from scaling over that limit, even if there were Lambda resources available for your account to use. When that limit is reached and there are still messages available in the queue, one might assume that those messages will stay visible in the queue, waiting until there is free Lambda capacity available to handle them. Instead, the internal poller still tries to invoke new Lambda functions for all the new messages and therefore causes throttling of the Lambda invocations (Figure 2). Why are some messages ending up in the DLQ then?

Let’s look at how the workflow goes for an individual message batch when it succeeds or fails (Figure 3). First, the Lambda internal poller reads a message batch from the queue; those messages stay in the queue but become invisible for the length of the configured VisibilityTimeout. Then it invokes a function synchronously, meaning that it waits for a response that indicates either successful processing or an error, which can be caused by, for example, a function code error, a function timeout or throttling. In the case of successful processing, the message batch is deleted from the queue. In the case of failure, however, the messages become visible again.

The SQS queue is not aware of what happens beyond the event source mapping, whether the invocations fail or are throttled. It keeps the messages in the queue invisible until they either get deleted or are turned back to visible after the VisibilityTimeout has passed. Effectively this means that throttled messages are treated like any other failed messages, so their ReceiveCount is increased every time they are made visible in the queue. If there is a huge burst of messages coming in, some of the messages might get throttled, retried, throttled again and retried again, until they reach the maxReceiveCount limit and are then moved to the DLQ.

The automatic scaling and concurrency achieved with SQS and Lambda sound promising, but unfortunately, like all AWS services, this combination has its limits as well. Throttling of valid messages can be avoided with the following considerations:

Be careful when configuring reserved concurrency for your Lambda function: the smaller the concurrency, the greater the chance that messages get throttled. AWS suggests setting the reserved concurrency of a Lambda function to 5 or greater.

Set the maxReceiveCount of your SQS queue big enough, so that throttled messages will eventually get processed after the burst of messages. AWS suggests setting it to 5 or greater.

By increasing the VisibilityTimeout of the source queue, you give your Lambda more time to retry the messages in the case of message bursts. AWS suggests setting it to at least 6 times the timeout you configure for your Lambda function.

Of course, tuning these parameters is a balancing act that depends on what best suits the purpose of your application.
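As a rough sketch of how these recommendations translate into configuration, the boto3 snippet below assumes a hypothetical source queue, DLQ and a Lambda function with a 120-second timeout; the exact values should be tuned for your application.

import json
import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-source-queue"  # hypothetical
dlq_arn = "arn:aws:sqs:eu-west-1:123456789012:my-source-queue-dlq"              # hypothetical

# Lambda timeout of 120 s -> VisibilityTimeout of at least 6 x 120 = 720 s,
# and move a message to the DLQ only after 5 failed receives.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "VisibilityTimeout": "720",
        "RedrivePolicy": json.dumps({"deadLetterTargetArn": dlq_arn, "maxReceiveCount": 5}),
    },
)

# Reserve at least 5 concurrent executions for the processing function.
lambda_client.put_function_concurrency(
    FunctionName="my-processing-function",
    ReservedConcurrentExecutions=5,
)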

Configuring DLQ to your SQS and Lambda combination

If you don’t configure a DLQ, you will lose all the erroneous (or valid but throttled) messages. If you are familiar with the topic this seems obvious, but it’s worth stating since it is quite important. What is confusing in this combination is that you can configure a dead-letter queue for both SQS and Lambda. The AWS documentation states:

Make sure that you configure the dead-letter queue on the source queue, not on the Lambda function. The dead-letter queue that you configure on a function is used for the function’s asynchronous invocation queue, not for event source queues.

To understand this, one needs to dive into the difference between synchronous and asynchronous invocation of Lambda functions.

When you invoke a function synchronously, Lambda runs the function and waits for a response from it. The function code returns the response, and Lambda returns you this response with some additional data, including e.g. the function version. In the case of asynchronous invocation, however, Lambda sends the invocation event to an internal queue. If the event is successfully sent to the internal queue, Lambda returns a success response without waiting for any response from the function execution, unlike in synchronous invocation. Lambda manages the internal queue and attempts to retry failed events automatically with its own logic. If the function execution still fails after the retries, the event is sent to the DLQ configured for the Lambda function. With an event source mapping to SQS, Lambda is invoked synchronously; therefore there are no retries like in asynchronous invocation, and the DLQ configured on the Lambda function is never used.
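The difference is easy to see when invoking a function manually with boto3: the InvocationType parameter selects between synchronous and asynchronous invocation (the function name below is hypothetical).

import json
import boto3

lambda_client = boto3.client("lambda")
payload = json.dumps({"hello": "world"}).encode()

# Synchronous invocation: waits for the function to finish and returns its response.
sync_response = lambda_client.invoke(
    FunctionName="my-processing-function",
    InvocationType="RequestResponse",
    Payload=payload,
)
print(sync_response["Payload"].read())

# Asynchronous invocation: returns as soon as the event is queued (status code 202).
# Failures are retried by Lambda and can end up in the function's DLQ or Destination.
async_response = lambda_client.invoke(
    FunctionName="my-processing-function",
    InvocationType="Event",
    Payload=payload,
)
print(async_response["StatusCode"])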

Recently, AWS launched Lambda Destinations, which makes it possible to route asynchronous function results to a destination resource that can be SQS, SNS, another Lambda function or EventBridge. With DLQs you can handle asynchronous failure situations and catch the failing events, but with Destinations you actually get more detailed information on function execution in both success and failure cases, such as code exception stack traces. Destinations and DLQs can be used together at the same time, but AWS suggests considering Destinations the preferred solution.

Conclusions

The described problems are all stated in and deducible from the AWS documentation, but they are still not completely obvious. By carefully tuning the parameters of our SQS queue, mainly by increasing maxReceiveCount and VisibilityTimeout, we were able to overcome the problems caused by Lambda throttling. By configuring the DLQ on the source SQS queue instead of on the Lambda function, we were able to catch erroneous or throttled messages. Adding a DLQ to your source SQS queue does not solve anything by itself, however; you also need to handle the failing messages in some way. We configured a Lambda function on the DLQ to write the failed messages to DynamoDB. This way we have a log of the unhandled messages in DynamoDB, and the messages can be resubmitted or investigated further from there.
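As an illustration of that last step, a DLQ handler can be a very small function. The sketch below assumes a hypothetical DynamoDB table named failed-messages with messageId as its partition key; the record fields follow the standard SQS event structure.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("failed-messages")  # hypothetical table, partition key "messageId"


def handler(event, context):
    # Triggered by the DLQ: persist each failed message for later inspection or resubmission.
    for record in event["Records"]:
        table.put_item(
            Item={
                "messageId": record["messageId"],
                "body": record["body"],
                "sentTimestamp": record["attributes"]["SentTimestamp"],
            }
        )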

Of course, there are always several architectural options to solve these kinds of problems in an AWS environment. Amazon Kinesis, for example, is a real-time stream processing service, but it is designed to ingest large volumes of continuous streaming data. Our data, however, comes in uneven bursts, and SQS works better in that scenario as a message broker and decoupling mechanism. One just needs to be careful with setting up the parameters correctly, as well as making sure that the actions intended for the Lambda function execute within the Lambda limits (including the 15-minute timeout and 3,008 MB maximum memory allocation). The built-in logic of Lambda and SQS gave us minimal infrastructure to manage and monitor, as well as high concurrency capabilities within the given limits.

To float, or not to float – Choosing correct numeric data type in Snowflake

Fixed-point numeric data types should be the default choice when designing Snowflake relational tables. Using floating-point data types has multiple downsides, which are important to understand. The effect of choosing a different numeric data type can be easily tested.

Snowflake numeric data types can be split into two main categories: fixed-point numbers and floating-point numbers. In this blog post we look at what these mean from a Snowflake database design point of view, and especially whether you should use a floating-point type when dealing with numeric data.

Fixed-point numbers are exact numeric values, which include data types such as number, integer and decimal. For these data types, the developer can determine precision (the allowed number of digits in a column) and scale (the number of digits right of the decimal point). In Snowflake, all fixed-point numeric data types are actually type decimal with precision 38 and scale 0, if not specified differently. Typical use cases for fixed-point data types are natural numbers and exact decimal values, such as monetary figures, which need to be stored precisely.

On the other side of the spectrum are floating-point numbers, which are approximate representations of numeric values. In Snowflake, floating-point columns can be created with the keywords float, double, double precision or real. However, as the Snowflake documentation states, all of these data types are actually stored as double; they do not differ in precision but are displayed as floats. Floating-point data types are mainly used in mathematics and science to simplify calculations with scientific notation. Storing numbers with major differences in magnitude is their advantage in databases too, because zeros before or after the decimal point do not consume storage the way they do in the decimal format. In traditional computing, floats are considered faster for computation, but is that really the case in modern database design?

Floating-point precision

First, let us explore inserting data in Snowflake into a table with float as the numeric data type. We create a table and insert 500 million rows of generated dummy data into it with the following queries:

CREATE TABLE T_FLOAT_500M (
   id            INT,
   float_field   FLOAT );

 
INSERT INTO T_FLOAT_500M
  SELECT SEQ8()
       , UNIFORM(1, 500000000, RANDOM())::FLOAT
  FROM TABLE ( GENERATOR ( ROWCOUNT => 500000000 ) );

To see the effect of using float as the data type for big numeric values, we can run the following query:

SELECT SUM(float_field) AS NO_CONVERSION
     , SUM(float_field::INT) AS CONVERT_VALUES_SEPARATELY
     , SUM(float_field)::INT AS CONVERT_SUM_VALUE
FROM T_FLOAT_500M;

Sum without the conversion produces us a rounded number with scientific notation: 1.24997318579082e+17
Separately converted values sum produces result: 124997318579081654
Conversion made after the calculation produces sum value: 124997318579081664

From the sum results, we notice the accuracy problem related to storing numeric values as floats, as the three sums differ from each other. When dealing with large or extremely accurate numeric values, floats may cause discrepancies in results due to their nature as approximate representations. The same effect can be seen with WHERE clauses, as the approximate representations may not behave as expected in conditions that compare against exact numeric values.

Storage size of float

Next, we create a table similar to the earlier one, but with the second field of type integer, and populate it without converting the random figures to floats.

CREATE TABLE T_INT_500M (
   id          INT,
   int_field   INT);

INSERT INTO T_INT_500M
  SELECT SEQ8()
       , UNIFORM(1, 500000000, RANDOM())::INT
  FROM TABLE ( GENERATOR ( ROWCOUNT => 500000000 ) );

SHOW TABLES LIKE '%_500M';

Looking at the Snowflake table statistics, we notice that the integer table is smaller (3.37 GB) compared to the float table (5.50 GB). The difference in table sizes is significant, with the float table being 63 % bigger. This can be explained by Snowflake reserving 64 bits for every float value. Integer values, on the other hand, are stored in a compressed format and take only the necessary amount of storage.

This difference also shows in SELECT queries: querying all rows with an X-Small warehouse takes only 85 seconds for the integer table, compared to 160 seconds for the float table. The difference is once again major.

Summary

Floats still have their use cases with numbers that have majorly different magnitudes. But from the presented test cases, we can conclude that using floating-point data types leads to bigger storage sizes and longer query times, which result in increased data warehousing costs. Another thing to consider is the possibility of imprecise values when dealing with extremely accurate data. Whenever possible, it is recommended not to use float as the type for numeric data without a specific reason. Precise and whole numbers in particular are not meant to be stored as floats and should be given appropriate fixed-point data types.

AWS launches major new features for Amazon SageMaker to simplify development of machine learning models

Machine learning continues to grow on AWS, and they are putting serious effort into paving the way for customers’ machine learning development journey on the AWS cloud. The Andy Jassy keynote at AWS re:Invent was a fiesta for data scientists with the newly launched Amazon SageMaker features, including Experiments, Debugger, Model Monitor, AutoPilot and Studio.

AWS really aims to make the whole development life cycle of machine learning models as simple as possible for data scientists. With the newly launched features, they are addressing common, effort-demanding problems: monitoring your data validity from your model’s perspective and monitoring your model performance (Model Monitor), experimenting with multiple machine learning approaches in parallel for your problem (Experiments), enabling cost efficiency of heavy model training with automatic rules (Debugger) and following these processes in a visual interface (Studio). These processes can even be orchestrated for you with AutoPilot, which, unlike many services, is not a black-box machine learning solution but provides all the generated code for you. The announced features also included an SSO-integrated login to SageMaker Studio and SageMaker Notebooks, with the possibility to share notebooks with other data scientists with one click, including the needed runtime dependencies and libraries (preview).

Compare and try out different models with SageMaker Experiments

Building a model is an iterative process of trying out different hyperparameters and seeing how they affect the performance of the model. SageMaker Experiments aims to simplify this process. With Experiments, one can create trial runs with different parameters and compare them. It provides information about the hyperparameters and performance of each trial run, regardless of whether a data scientist has run the training multiple times, used automated hyperparameter tuning or used AutoPilot. It is especially helpful when automating some steps or the whole process, because the number of training jobs run is typically much higher than with a traditional approach.

Experiments makes it easy to compare trials, see which hyperparameters were used and monitor the performance of the models, without having to set up the versioning manually. It makes it easy to choose and deploy the best model to production, but you can also always come back and look at the artifacts of your model when facing problems in production. It also provides more transparency, for example into automated hyperparameter tuning and the new SageMaker AutoPilot. Additionally, SageMaker Experiments has an SDK, so it is possible to call the API with Python to get the best model programmatically and deploy an endpoint for it.
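As a rough idea of the developer experience, the sketch below assumes the sagemaker-experiments Python package and an existing SageMaker estimator; all names are made up, and the exact API may differ between SDK versions.

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Create an experiment and a trial for one training run (names are hypothetical).
experiment = Experiment.create(
    experiment_name="churn-prediction",
    description="Comparing hyperparameter settings",
)
trial = Trial.create(trial_name="xgboost-max-depth-5", experiment_name=experiment.experiment_name)

# Associate an existing estimator's training job with the trial.
estimator.fit(
    inputs={"train": "s3://my-bucket/train/"},
    experiment_config={
        "ExperimentName": experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
)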

Track issues in model training with SageMaker Debugger

During the training process of your model, many issues may occur that might prevent your model from finishing correctly or from learning patterns. You might have, for example, initialized parameters inappropriately or used inefficient hyperparameters during the training process. SageMaker Debugger aims to help track issues related to your model training (despite what the name indicates, SageMaker Debugger does not debug your code semantics).

When you enable Debugger in your training job, it starts to save the internal model state into an S3 bucket during the training process. A state consists of, for example, neural network weights, accuracies and losses, the output of each layer and optimization parameters. These debug outputs are analyzed against a collection of rules while the training job is executed. When you enable Debugger while running your training job in SageMaker, it will start two jobs: a training job and a debug processing job (powered by Amazon SageMaker Processing Jobs), which run in parallel and analyze the state data to check whether the rule conditions are met. If you have, for example, an expensive and time-consuming training job, you can set up a Debugger rule and configure a CloudWatch alarm for it that stops the job once your rule triggers, for example when the loss has converged enough.

For now, the debugging framework for saving internal model states supports only TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost. You can configure your own rules that analyze model states during training, or use preconfigured ones such as loss not changing or exploding/vanishing gradients. You can use the debugging features visually through SageMaker Studio, or alternatively configure everything yourself through the SDK.
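A minimal sketch of enabling a built-in rule through the SageMaker Python SDK could look like the following; the training script, instance type and framework version are assumptions, and parameter names vary slightly between SDK versions.

from sagemaker.debugger import Rule, rule_configs
from sagemaker.tensorflow import TensorFlow

# Attach a built-in Debugger rule that fires when the loss stops decreasing.
rules = [Rule.sagemaker(rule_configs.loss_not_decreasing())]

estimator = TensorFlow(
    entry_point="train.py",        # hypothetical training script
    role=role,                     # an existing SageMaker execution role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py3",
    rules=rules,                   # Debugger runs a rule processing job alongside training
)
estimator.fit("s3://my-bucket/train/")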

Keep your model up-to-date with SageMaker Model Monitor

Drifts in data might have a big impact on your model performance in production, if your training data and the incoming data start to have different statistical properties. Detecting those drifts requires effort, such as setting up jobs that calculate statistical properties of your data and keeping those up to date, so that your model does not become outdated. SageMaker Model Monitor aims to solve this problem by tracking the statistics of incoming data, aiming to ensure that machine learning models age well.

Model Monitor forms a baseline from the training data used for creating a model. The baseline information includes statistics of the data and basic information such as the name and data type of the features in the data. The baseline is formed automatically, but it can be changed if needed. Model Monitor then continuously collects input data from the deployed endpoint and puts it into an S3 bucket. Data scientists can then create their own rules or use ready-made validations for the data and schedule validation jobs. They can also configure alarms for deviations from the baseline. These alarms and validations can indicate that the deployed model is actually outdated and should be re-trained.

SageMaker Model Monitor makes monitoring the model quality very easy, while data scientists keep control and can customize the rules, scheduling and alarms. The monitoring is attached to an endpoint deployed with SageMaker, so if inference is implemented in some other way, Model Monitor cannot be used. SageMaker endpoints are always on, so they can be an expensive solution for cases where predictions are not needed continuously.
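A rough sketch of the pieces involved, using the SageMaker Python SDK, might look like the following; the bucket names and instance types are hypothetical, and creating the actual monitoring schedule is omitted for brevity.

from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Capture the endpoint's request and response data to S3 when deploying the model.
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/monitoring/captured",
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    data_capture_config=capture_config,
)

# 2. Build a baseline (statistics and constraints) from the training data.
monitor = DefaultModelMonitor(role=role, instance_count=1, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitoring/baseline",
)
# A monitoring schedule comparing captured data against this baseline would be created next.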

Start from scratch with SageMaker AutoPilot

SageMaker AutoPilot is an AutoML solution for SageMaker. SageMaker has had automatic hyperparameter tuning for a while already, but in addition to that, AutoPilot takes care of preprocessing the data and selecting an appropriate algorithm for the problem. This saves a lot of data preprocessing time and enables building models even if you’re not sure which algorithm to use. AutoPilot supports linear learner, factorization machines, KNN and XGBoost at first, but other algorithms will be added later.

Running an AutoPilot job is as easy as specifying a CSV file and the response variable present in the file. AWS considers models trained by SageMaker AutoPilot to be white-box models instead of black-box ones, because it provides the generated source code for training the model, and with Experiments it is easy to view the trials AutoPilot has run.

SageMaker AutoPilot automates machine learning model development completely. It is yet to be seen whether it improves the models, but it is a good sign that it provides information about the process. Unfortunately, the description of the process can only be viewed in SageMaker Studio (only available in Ohio at the moment). The supported algorithms are currently quite limited as well, so AutoPilot might not provide the best possible performance for some problems. In practice, running AutoPilot jobs takes a long time, so the costs of using AutoPilot might be quite high. That time is, of course, saved from the data scientist’s working time. One possibility is, for example, when approaching a completely new data set and problem, to start by launching AutoPilot and get a few models and all the generated code as a template. That could serve as a kick start for iterating on your problem: you start by tuning the generated code and continue development from there, saving time on the initial setup.
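Starting an AutoPilot job is, in essence, a single API call. The boto3 sketch below assumes a CSV file in S3 with a target column named label and an existing execution role; everything else is left for AutoPilot to figure out.

import boto3

sm = boto3.client("sagemaker")

# Launch an AutoPilot job on a CSV dataset (names, paths and the role ARN are hypothetical).
sm.create_auto_ml_job(
    AutoMLJobName="churn-autopilot-1",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/autopilot/train.csv"}},
        "TargetAttributeName": "label",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot/output/"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)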

SageMaker Studio – IDE for data science development

The launched SageMaker Studio (available in Ohio) is a fully integrated development environment (IDE) for ML, built on top of JupyterLab. It pulls the ML workflow steps together in a visual interface, with the goal of simplifying the iterative nature of ML development. In Studio one can move between steps, compare results and adjust inputs and parameters. It also simplifies comparing models and viewing runs side by side in one visual interface.

Studio seems to nicely tie the newly launched features (Experiments, Debugger, Model Monitor and Autopilot) into a single web page user interface. While these new features are all usable through SDKs, using them through the visual interface will be more insightful for a data scientist.

Conclusion

These new features enable more organized development of machine learning models, moving from notebooks to controlled monitoring and deployment and transparent workflows. Of course, several actions enabled by these features could be implemented elsewhere (e.g. training job debugging, or data quality control with scheduled smoke tests), but that again requires more coding and setting up infrastructure. The whole public cloud journey of AWS has been about simplifying development and taking load away by providing reusable components and libraries, and these launches fit well with that agenda.

AWS Redshift breaks bond between compute and storage

AWS Redshift took a huge leap forward with the new releases. Decoupling storage and compute is the first step towards a modern cloud data warehouse.

AWS Redshift is the world’s most popular data warehouse, but it has faced some tough competition from the market. Redshift has compute and storage coupled, meaning that with a specific instance type you get a fixed set of storage, which can sometimes be limiting. At the Andy Jassy keynote, AWS released a new managed storage model for Redshift that allows you to scale compute decoupled from storage.

The storage model uses SSDs and S3 for the storage behind the scenes and utilizes architectural improvements of the infrastructure. This allows users to keep hot data on SSDs and also query historical data stored in S3 seamlessly from Redshift. On top of this, you only pay for the SSD storage you use. It also comes with new Nitro-based compute instances. In Ireland, an RA3 instance is priced at $15.578 per node per hour, but you get 48 vCPUs, 384 GB of memory and up to 64 TB of storage, and you can cluster up to 128 instances. AWS promises 3x the performance of any other cloud data warehouse service, and Redshift Dense Storage (DS2) users are promised twice the performance and twice the storage at the same cost. RA3 is available now in Europe in EU (Frankfurt), EU (Ireland) and EU (London).

Related to the decoupling of compute and storage, AWS released AWS AQUA. The Advanced Query Accelerator promises 10 times better query performance. AQUA sits on top of S3 and is Redshift compatible. For this feature, we have to wait until mid-2020 to get hands-on.

While AWS Redshift is the world’s most popular data warehouse, it is not practical to load all kinds of data there. Sometimes data lakes are more suitable places for data, especially for unstructured data. Amazon S3 is the most popular choice for cloud data lakes. The new Redshift features help to tie structured and unstructured data together to enable even better and more comprehensive insight.

With the Federated Query feature (preview), it is now possible to query data in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL from a Redshift cluster. The queried data can then be combined with data in the Redshift cluster and in Amazon S3. Federated queries allow data ingestion into Redshift without any other ETL tool, by extracting data from the above-mentioned data sources, transforming it on the fly and loading it into Redshift. Data can also be exported from Redshift to S3 in Apache Parquet format using the Data Lake Export feature. With this feature you are able to build some nice lifecycle functionality into your design.

“One should use the best tool for the job”, reminded Andy Jassy at the keynote. With the long-awaited decoupling of storage and compute and big improvements to the core, Redshift took a huge leap forward. It is extremely interesting to start designing new solutions with these features.

Introduction to AWS Glue for big data ETL

AWS Glue works well for big data processing. This is a brief introduction to Glue including use cases, pricing and a detailed example.

AWS Glue is a serverless ETL tool in the cloud. In brief, ETL means extracting data from a source system, transforming it for analysis and other applications, and then loading it back into a data warehouse, for example.

In this blog post I will introduce the basic idea behind AWS Glue and present potential use cases.

The emphasis is on big data processing. You can read more about Glue catalogs here and data catalogs in general here.

Why use AWS Glue?

Replacing Hadoop. Hadoop can be expensive and a pain to configure. AWS Glue is simple. Some say that Glue is expensive, but it depends on what you compare it to. Because of on-demand pricing, you only pay for what you use. This fact might make AWS Glue significantly cheaper than a fixed-size on-premise Hadoop cluster.

AWS Lambda cannot be used. A wise man once said: use Lambda functions in AWS whenever possible. Lambdas are simple, scalable and cost efficient. They can also be triggered by events. For big data, Lambda functions are not suitable because of the 3 GB memory limit and 15-minute timeout. AWS Glue is specifically built to process large datasets.

Apply DataOps practices. Drag-and-drop ETL tools are easy for users, but from the DataOps perspective, code-based development is a superior approach. With AWS Glue, both code and configuration can be stored in version control. Data development becomes similar to any other software development. For example, data transformation scripts written in Scala or Python are not limited to the AWS cloud. Environment setup is easy to automate and parameterize when the code is scripted.

An example use case for AWS Glue

Now for a practical example of how AWS Glue would work.

A production machine in a factory produces multiple data files daily. Each file is 10 GB in size. The server in the factory pushes the files to AWS S3 once a day.

The factory data is needed to predict machine breakdowns. For that, the raw data should be pre-processed for the data science team.

Lambda is not an option for the pre-processing because of the memory and timeout limitations. Glue seems to be a reasonable option when work hours and costs are compared to alternative tools.

The simplest way to get started with the ETL process is to create a new Glue job and write code in the editor. The script can be written in either the Scala or Python programming language.

Extract. The script first reads all the files from the specified S3 bucket into a single data frame. You can think of a data frame as a table in Excel. The reading can be just a one-liner.

Transform. This is where most of the code goes. Let’s say that the original data has 100 records per second. The data science team wants the data aggregated into one-minute intervals with specific logic. This could be just tens of lines of code if the logic is simple.

Load. Write data back to another S3 bucket for the data science team. It’s possible that a single line of code will do.

The code runs on top of the Spark framework, which is configured automatically in Glue. Thanks to Spark, the data is divided into small chunks and processed in parallel on multiple machines simultaneously.
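A minimal sketch of such a Glue job, written in Python on top of Spark, could look like the following. The bucket names, column names and the one-minute averaging logic are assumptions made for this example, not the actual project code.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read all raw files from the landing bucket into a single data frame.
raw = spark.read.csv("s3://factory-raw-data/", header=True, inferSchema=True)

# Transform: aggregate the per-second measurements into one-minute averages.
aggregated = (
    raw.withColumn("minute", F.date_trunc("minute", F.col("timestamp")))
       .groupBy("machine_id", "minute")
       .agg(F.avg("sensor_value").alias("avg_sensor_value"))
)

# Load: write the pre-processed data to the bucket used by the data science team.
aggregated.write.mode("overwrite").parquet("s3://factory-preprocessed-data/")

job.commit()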

What makes AWS Glue serverless?

Serverless means you don’t have machines to configure. AWS provisions and allocates the resources automatically.

The processing power is adjusted by the number of data processing units (DPU). You can do additional configuration, but it’s likely that a proof of concept works out of the box.

In an on-premise environment you would have to make a decision about the computation cluster size. A big cluster is expensive but fast. A small cluster would be cheaper but slow to run.

With AWS Glue, your bill is the result of the following equation:

[ETL job price] = [Processing time] * [Number of DPUs]

 

The on-demand pricing means that increasing the processing power does not, by itself, increase the price of the ETL job. At least in theory, as too many DPUs might cause overhead in processing time.
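As a worked example: a hypothetical job running on 10 DPUs for 30 minutes would cost 10 DPUs × 0.5 hours × $0.44 per DPU hour ≈ $2.20 per run, and the same job on 20 DPUs would ideally finish in 15 minutes and still cost about $2.20.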

When is AWS Glue a wrong choice?

This is not an advertisement, so let’s give some critique for Glue as well.

Lots of small ETL jobs. Glue has a minimum billing of 10 minutes and 2 DPUs. With the price of $0.44 per DPU hour, the minimum cost for a run is around $0.15. The starting price makes Glue an unappealing alternative for processing small amounts of data often.

Specific networking requirements. If you spin up a standard EC2 virtual machine, an IP address will be attached to it. The serverless nature of Glue means you have to put more effort into network planning in some cases. One such scenario would be whitelisting a Glue job in a firewall to extract data from an external system.

Summary about AWS Glue

The most common argument against Glue is “it’s expensive”. True, in the sense that the first few test runs can already cost a few dollars. In a nutshell, Glue is cost efficient for infrequent big data workloads.

In the big picture AWS Glue saves a lot of time and unnecessary hardware engineering. The costs should be compared against alternative options such as on-premise Hadoop cluster or development hours required for a custom solution.

As the Glue pricing model is predictable, the business cases are straightforward to calculate. It might be enough to test just the critical parts of the ETL pipeline to become confident about the performance and costs.

I feel that optimizing the code for distributed computing has been more of a challenge than the Glue service itself. The next blog post will focus on how data developers get started with Glue using Python and Spark.

A curated list of new Snowflake features released at Snowflake Summit 2019

Snowflake coming to Google Cloud, Database Replication, Snowflake Organizations, external tables, Data Pipelines, Data Exchange. The list goes on. Read the following article for a curated list of new features announced in the Snowflake Summit keynote in San Francisco.

Going through all the announcements for a product can sometimes be overwhelming, as you need to read each individual press release. To ease the pain, I’ve gathered a curated list of new Snowflake features released at the first annual Snowflake Summit, held in sunny California at the Hilton hotel.

So let’s begin.

Snowflake released new features which can be broken down into the following categories: Core Data Warehouse, Data Pipelines, Global Snowflake and Secure Data Sharing. I’ll list the features and give a more precise description of each (where possible).

Core Data Warehouse

Core Data Warehouse features are the bread and butter of Snowflake. Everybody is already familiar with Snowflake features such as virtual warehouses and the separation of storage and compute, so Snowflake didn’t make any announcements about those. They did say, though, that they are working on decreasing concurrency latency and making large ad-hoc queries work even better. On a larger scale this means that the boundaries between operational databases and operational data platforms are narrowing.

On the security side, Snowflake said that they are working with multiple vendors on enabling identity passing from BI tools to Snowflake. Basically, this means that the user ID you use to log on to, for example, Tableau is passed straight to Snowflake. This will enable even better row-level security and secure view possibilities. Snowflake also reminded the audience that it already supports OAuth 2.0 as an authentication method.

Under the hood, JavaScript stored procedure support was highlighted, and a geospatial capability was hinted at (nothing released, rather teased).

The folks at Snowflake are also keen on providing a richer experience for end users. They said they had gathered feedback from end users to provide better worksheets, Worksheet 3.0 so to speak. During that investigation phase they encountered a company called Numeracy, which was working on an even better UI for Snowflake, and eventually decided to make it their first acquisition. Now the features of the Numeracy UI are being ported into the Snowflake UI. The Numeracy UI provides better editor suggestions, visualizations and worksheet sharing. In text this doesn’t sound like anything new, but my colleague took a video of the new UI and it is awesome.

Data Pipelines

Data Pipelines is a new feature coming to Snowflake. Basically, Snowflake now has the capability to auto-ingest data arriving in the cloud storage layer and to transform that data based on user-defined tasks. This means that you no longer need an external scheduler to trigger small ELT or scheduled jobs.

This is possible due to the following new features:
– Auto-Ingest
– Streams and Tasks

During the Data Pipelines presentation, a Snowflake Connector for Kafka was also announced.

Under the Data Pipelines headline, the concept of Snowflake as a data lake was introduced in more detail, because Snowflake will now support external tables and SQL over external tables, which means that you don’t need to load data into Snowflake to get insight into the data (and its structure). To make things more interesting, Snowflake will also support materialized views on external tables.

As a nice surprise, all the features listed above are available today in public preview.

Global Snowflake

Under the Global Snowflake title, the new regions were introduced. AWS got Canada Central, US-Central (Ohio) and Japan. For Azure, new regions will open in US-West-2, Canada Central, Government and Singapore.

The most anticipated announcement was the Google Cloud version of Snowflake, which will be in preview in Q3 2019. Google Cloud stages are, though, already possible, as noted earlier.

Snowflake Organizations was introduced as a new way to control your Snowflake instances within a large corporation. You now have the possibility to deploy Snowflake instances in your chosen cloud and region through the Snowflake UI and act as an organization admin. Organizations will provide a dashboard of the warehouse and storage costs of all your accounts. So, if your company has multiple Snowflake accounts, this is the new way to go.

Finally, Database Replication was announced. Database replication offers the possibility to replicate your data, at the database level, to another Snowflake account, regardless of whether the account is on Azure, AWS or Google Cloud. It doesn’t even matter which region you’re planning to replicate your data to, as database replication supports cross-region replication.

Database replication is a point-and-click version of the traditional Oracle Data Guard or Microsoft SQL Server AlwaysOn cluster, where you define a database that you want to replicate to a different location for business continuity purposes, or just to provide read-only data nearer to the end users. The difference from the Oracle and Microsoft versions is that the Snowflake implementation works out of the box without any hassle. Database replication also supports failover and failback from the client application’s point of view, meaning that if you have Tableau Server connected to Snowflake and you fail over the database, Tableau Server will reconnect to the new location.

Secure Data Sharing

In the Secure Data Sharing field, Snowflake introduced their concept of a data marketplace, called Data Exchange. Basically, Data Exchange is a marketplace for data that can be used through Snowflake Data Sharing. You can buy data shares (for example consumer travel data, weather data etc.) through Data Exchange and use them straight away in your SQL queries.

There are two options for this: Standard and Personalized. Standard data is a normal data set, available to everybody. Personalized data requires authentication between you and the data provider, but it grants you the possibility to ingest your own data (where possible) straight from the provider using the Data Sharing functionality.
As if that weren’t enough, Snowflake even advertised, in mid-sentence, a simpler way to deploy Tableau into AWS and configure it to use Snowflake, now in 30 minutes.

Hopefully, this list helps you get a grasp of things to come in the Snowflake landscape. I will update the list if new features are launched tomorrow or on Thursday.

Link list in raw format.
https://www.snowflake.com/blog/snowflake-data-pipelines/
https://www.snowflake.com/blog/numeracy-investing-in-our-query-ui/
https://www.snowflake.com/blog/using-oauth-2-0-with-snowflake/
https://www.snowflake.com/blog/snowflake-introduces-javascript-stored-procedures/
https://www.snowflake.com/news/snowflake-and-google-cloud-announce-strategic-partnership/
https://www.snowflake.com/blog/snowflake-launches-database-replication-and-failover/
https://www.snowflake.com/news/snowflake-announces-data-exchange-to-break-down-data-barriers/
https://aws.amazon.com/about-aws/whats-new/2019/06/new-quick-start-deploys-tableau-and-snowflake-on-the-aws-cloud/