AWS Redshift breaks bond between compute and storage

AWS Redshift took a huge leap forward with its new releases. Decoupling storage and compute is the first step towards a modern cloud data warehouse.

December 5, 2019

Antti Loukiala

Johanna Vainio

AWS Redshift is the world’s most popular data warehouse, but it has faced tough competition in the market. Until now, Redshift has coupled compute and storage: a given number of instances comes with a fixed amount of storage, which can sometimes be limiting. At the Andy Jassy keynote, AWS released a new managed storage model for Redshift that lets you scale compute independently of storage.

The storage model uses SSDs and S3 behind the scenes and takes advantage of architectural improvements in the underlying infrastructure. This allows users to keep hot data on SSD while seamlessly querying historical data stored in S3 from Redshift. On top of this, you only pay for the SSD you use. The model also comes with new Nitro-based compute instances. In Ireland, an RA3 instance is priced at $15.578 per node per hour, and for that you get 48 vCPUs, 384 GB of memory and up to 64 TB of storage. You can cluster up to 128 of these instances. AWS promises 3x the performance of any other cloud data warehouse service, and Redshift Dense Storage (DS2) users are promised twice the performance and twice the storage at the same cost. In Europe, RA3 is available now in EU (Frankfurt), EU (Ireland), and EU (London).
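To put the per-node figures into perspective, here is a minimal sizing sketch in Python. It only multiplies the numbers quoted above by a node count. The names `NodeSpec`, `RA3_IRELAND` and `cluster_totals` are made up for this illustration, and the cost covers the node price only, not any storage charges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node figures for one instance type (hypothetical helper type)."""
    price_per_hour_usd: float
    vcpus: int
    memory_gb: int
    max_storage_tb: int


# Figures quoted in the article for an RA3 node in Ireland.
RA3_IRELAND = NodeSpec(price_per_hour_usd=15.578, vcpus=48,
                       memory_gb=384, max_storage_tb=64)
MAX_NODES = 128  # cluster size limit quoted in the article


def cluster_totals(nodes: int, spec: NodeSpec = RA3_IRELAND) -> dict:
    """Scale the per-node figures to a cluster of `nodes` nodes."""
    if not 1 <= nodes <= MAX_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_NODES}")
    return {
        "hourly_cost_usd": round(nodes * spec.price_per_hour_usd, 3),
        "vcpus": nodes * spec.vcpus,
        "memory_gb": nodes * spec.memory_gb,
        # Upper bound on managed storage, not the amount you are billed for.
        "max_storage_tb": nodes * spec.max_storage_tb,
    }


if __name__ == "__main__":
    for n in (2, 16, MAX_NODES):
        print(n, cluster_totals(n))
```

The storage number is a ceiling rather than a commitment, which is the point of the new model: compute is sized by node count, while storage grows with actual use.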

Related to the decoupling of compute and storage, AWS also announced AQUA, the Advanced Query Accelerator, which promises 10 times better query performance. AQUA sits on top of S3 and is Redshift compatible. We will have to wait until mid-2020 to get our hands on this feature.

While AWS Redshift is the world’s most popular data warehouse, it is not practical to load all kinds of data into it. Sometimes a data lake is a more suitable place, especially for unstructured data, and Amazon S3 is the most popular choice for cloud data lakes. The new Redshift features help tie structured and unstructured data together to enable better and more comprehensive insight.

With the Federated Query feature (preview), it is now possible to query data in Amazon RDS PostgreSQL and Amazon Aurora PostgreSQL from a Redshift cluster. The queried data can then be combined with data in the Redshift cluster and in Amazon S3. Federated queries also allow data ingestion into Redshift without any separate ETL tool: data is extracted from the above-mentioned sources, transformed on the fly, and loaded into Redshift. In the other direction, data can be exported from Redshift to S3 in Apache Parquet format using the Data Lake Export feature. Together, these features let you build data lifecycle management into your design.
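The flow described above can be sketched as three SQL statements: register the PostgreSQL source as an external schema, ingest from it with a plain `INSERT ... SELECT`, and export older rows to S3 as Parquet with `UNLOAD`. The Python below only builds the statement text and does not connect to anything. Every identifier in it (schema, table and column names, host, ARNs, bucket) is a hypothetical placeholder, and since Federated Query is still in preview, the exact syntax should be checked against the current Redshift documentation.

```python
def federated_schema_sql(schema, database, host, iam_role, secret_arn, port=5432):
    """CREATE EXTERNAL SCHEMA statement that points Redshift at a PostgreSQL source."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM POSTGRES\n"
        f"DATABASE '{database}' SCHEMA 'public'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


def ingest_sql(target_table, source_schema, source_table, where):
    """Ingest without a separate ETL tool: select from the federated source into a local table."""
    return (
        f"INSERT INTO {target_table}\n"
        f"SELECT * FROM {source_schema}.{source_table}\n"
        f"WHERE {where};"
    )


def parquet_export_sql(query, s3_prefix, iam_role):
    """UNLOAD statement that writes a query result to S3 in Parquet format."""
    escaped = query.replace("'", "''")  # single quotes inside UNLOAD ('...') are doubled
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    role = "arn:aws:iam::123456789012:role/example-redshift-role"  # placeholder
    secret = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:example"  # placeholder
    print(federated_schema_sql("pg_sales", "salesdb", "example-host.rds.amazonaws.com", role, secret))
    print(ingest_sql("orders", "pg_sales", "orders", "order_date >= '2019-12-01'"))
    print(parquet_export_sql("SELECT * FROM orders WHERE order_date < '2019-01-01'",
                             "s3://example-bucket/orders/archive/", role))
```

Running the export on a schedule and deleting the exported rows afterwards would be one way to keep only hot data in the cluster while the history lives in the data lake.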

“One should use the best tool for the job,” Andy Jassy reminded us at the keynote. With the long-awaited decoupling of storage and compute and big improvements to the core, Redshift took a huge leap forward. It will be extremely interesting to start designing new solutions with these features.