Performance of semi-structured data types in Snowflake
On top of the traditional relational data, Snowflake also has support for semi-structured data with flexible schema. This can be utilized with for example following data formats: JSON and XML. Inside Snowflake, these are stored as either variant, array or object data types. Let us take a closer look what these mean.
Variant is a tagged universal type that can hold up to 16 MB of any data type supported by Snowflake. Variants are stored as columns in relational tables. Array is a list-like indexed data type that consists of variant values. Object, on the other hand, is a data type that consist of key-value pairs, where key is a not-null string and value is variant type data.
Snowflake provides guidelines on handling semi-structured data on their documentation. As a baseline, they recommend to store semi-structured data as variant data type, if usage for data is unsure. However, it is stated that when the usage is known, recommendation is to perform test cases to find solution with the best performance. Let us take a look what kind of differences might occur.
For testing purposes, we create three different tables:
- Table with 500 million rows
- Table with variant column having 500 million values
- Table with array column having 500 million values
The traditional table includes user identifier, which is a random field between values 1 and 5 million and amount field, which is integer value between values 1 and 500 million. Variant and array tables are both grouped tables. These are created with the traditional table as their source. Schema has user identifier and variant or array typed value field, which has aggregated list of the values for certain user identity. The tables are created with following queries:
CREATE TABLE T_500M ( user_id INT, amount INT); INSERT INTO T_500M SELECT UNIFORM (1, 5000000, random()) , UNIFORM (1, 500000000, random()) FROM TABLE ( GENERATOR ( ROWCOUNT => 50000000 )); CREATE TABLE T_500M_VARIANT AS SELECT user_id, , CAST(ARRAY_AGG(amount) AS VARIANT) AS variant_field FROM T_500M GROUP BY user_id; CREATE TABLE T_500M_ARRAY AS SELECT user_id , ARRAY_AGG(amount) AS array_field FROM T_500M GROUP BY user_id;
Storing values as variant or array might seem like a good idea, if we want to aggregate sums on amount field for every user identity. As a query result, we want to show user identifier, count number of occurrences for that user and aggregated sum for the amount field. We can achieve it for each table with following queries:
SELECT user_id , COUNT(*) AS value_count , SUM(amount) AS sum FROM T_500M GROUP BY user_id; SELECT user_id , COUNT(flat_variant.value) AS value_count , SUM(flat_variant.value::INTEGER) AS sum FROM T_500M_VARIANT , lateral flatten(INPUT => variant_field) flat_variant GROUP BY user_id; SELECT user_id , COUNT(flat_array.value) AS value_count , SUM(flat_array.value::INTEGER) AS sum FROM T_500M_ARRAY , lateral flatten(INPUT => array_field) flat_array GROUP BY user_id;
Select-clause takes 15.6 seconds for the traditional relational table, 22.1 seconds with variant table and 21.9 seconds with array table. The difference is significant with the queries being over 40 % slower for semi-structured tables.
Another thing to consider with semi-structured formats is that queries on semi-structured data will not use result cache. We can notice this by running the queries again. The traditional table query takes only 0.2 seconds thanks to Snowflake’s persisted query results, but the queries to other tables take the same circa 22 seconds as earlier to complete.
The tested difference in our query time between structured data table and table with semi-structured data type exists, but it is still acceptable in some cases, where loading semi-structured data is a lot easier to variant or array columns. However, it needs to be noted, as stated in Snowflake documentation, query performance for data types that are not native for JSON are even worse for tables using variant or array. Shown test included only native JSON data types, but including for example datetime as variant would make the difference even bigger.
Explicit data type conversion
It is important to pay attention to the data types when accessing array or variant data. Consider the following example, where we query total sum of the amount-field with following select-statement on the variant table:
SELECT COUNT (*) AS row_count , SUM(flat_variant.value) AS sum_without_cast , SUM(flat_variant.value)::INTEGER AS cast_after_sum , SUM(flat_variant.value::INTEGER) AS cast_before_sum FROM T_500M_VARIANT , lateral flatten(INPUT => variant_field) flat_variant;
For query results, we get three different sums:
No specified casting: 1.24998423949572e+17
Casting before sum: 124998423949572384
Casting after sum: 124998423949572368
Possibility to store semi-structured data in relational tables on Snowflake comes in handy for many business needs that do not have traditional relational source data. It enables loading semi-structured data straight to Snowflake and parsing data onwards from there. Even though this is possible, it should be tested per use case whether it is the best solution.