Cover image from Azure Synapse

Azure Synapse Analytics – Unifying your data pipeline toolbox

This blog post will serve as an overview of the capabilities to give you insight on how to position Synapse Analytics to your overall data platform architecture. We won’t do a deep dive to technical implementation in this post.

Introduction

Azure SQL capabilities have evolved a lot over the years. Azure offers everything from VMs for running your own SQL Server setup, to SQL DB Hyperscale where you are getting your extremely scalable, but still traditional, DB as a service. But what about massively parallel processing data warehousing? Data integrations? Data science? What about limitless analytics? For that, there’s Azure Synapse Analytics.

In November 2019 Microsoft announced Azure Synapse as limitless analytics service and the next evolution of SQL Data Warehouse. Only thing released in GA was re-naming of SQL DW. Other synapse features were only available in limited private preview and for most setting up Synapse Analytics meant setting up former SQL DW. Nowadays Azure Synapse Analytics is a name for the whole analytics service with former SQL DW being part of it. While writing this blog the former SQL DW (also known as Synapse SQL provisioned) is still the only thing in GA, rest of the features are in public preview and anyone can set up Synapse Analytics workspace. Yes, the naming is confusing but hang on, we will try to clear that for you.

This blog post will serve as an overview of the capabilities to give you insight on how to position Synapse Analytics to your overall data platform architecture. We won’t do a deep dive to technical implementation in this post.

Synapse Analytics Architecture

Note: At the moment of writing this blog, Synapse Analytics can refer to both “Synapse Analytics (formerly SQL DW)” and “Synapse Analytics (workspace preview)” in Azure documentation. Here we are talking about unified experience in the workspace preview.

Azure Synapse Analytics is the common naming for integrated tooling providing everything from source system integration to relational databases to data science tooling to reporting data sets.  Synapse Analytics contains the following

"<yoastmark

  • Synapse SQL
    • This is the data warehouse part formerly known as Azure SQL DW.
    • Provides both serverless (SQL on-demand) and pre-allocated (SQL Pool) resources
    • Shares a common metastore with the Spark engine for seamless integration
  • Apache Spark
    • Seamlessly integrated big data engine.
    • Shares a common metastore with Synapse SQL
  • Data Flow and Integrations
    • Shares codebase with Azure Data Factory, so has everything you expect and more
    • Integrate to nearly a hundred data sources to ingest data
    • Orchestrate SQL Procedures and Spark Notebooks
  • Management and Monitoring
    • Familiar management and monitoring tools from Azure Data Factory are available.
  • Synapse Studio
    • Single web UI where you can create everything:
      • Integrate to source systems
      • Land data to Azure Data Lake
      • Explore the data using Spark notebooks
      • Load data to Synapse SQL using T-SQL scripts
      • Predict what needs predicting using Python, Scala C# or SQL in Spark
      • Publish data sets to Power BI
      • Manage and orchestrate everything with pipelines

In conclusion, Synapse Analytics refers to all of the capabilities available to you and not a single included tool. Synapse SQL, as the name suggests, is perhaps the most recognizable part with the SQL DW and T-SQL. However, Synapse Analytics also contains a serverless SQL form factor named SQL On-demand. So using T-SQL no longer requires a provisioned SQL Pool.

Synapse Analytics Unique value proposal

By unifying all of the tools mentioned above, Synapse Analytics really brings something new to the table. Using Synapse Analytics you can, without ever leaving the Synapse Studio, connect to a new on-premises data source, extract, load and transform that data to Data Lake and Synapse SQL, enrich it further with ML models, and provide it for reporting usage.

Synapse Analytics provides capabilities for each of the steps in a data pipeline: ingestion, preparing data, storing, exploring, transforming and serving the data:

Azure Synapse Analytics fit

Ingest

If you have previously used Azure Data Factory, you will be right at home using the data integration tools in Synapse Analytics. Synapse Analytics even shares the same codebase with Data Factory, so everything you have grown accustomed to is already there (almost everything, you can check the complete differences from https://docs.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences). You can use an Integration Runtime running inside your on-premises network to access all your data sources inside your network, or an Azure hosted one for massive scale. The list of natively supported systems is constantly growing, and you can find up-to-date information from https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-overview#supported-data-stores-and-formats. And if your source system is not on the list, you can always use a self hosted integration runtime with ODBC or JDBC drivers.

Despite their similarities, there’s currently no tooling to migrate ADF pipelines to Synapse Analytics Pipelines, and you can’t use the same Integration Runtime for both ADF and Synapse Analytics.

Another interesting preview feature is Azure Synapse Link. It allows you to run analytical workloads on top of operational data in Azure Cosmos DB near real time without affecting the operational usage. This means that application data in Cosmos DB is available for Synapse SQL and Spark using cloud native HTAP without building ETL workflows. https://docs.microsoft.com/en-us/azure/cosmos-db/configure-synapse-link

Prepare

After connecting to your data source, you can extract your data to Azure. There are numerous ways to organize the data in the data lake. Generally, you’ll probably want to have the original untouched data in a “raw” area, and processed and more refined data in another area.

With the new COPY INTO T-SQL command, you can use gzipped csv format to reduce the storage footprint. If using csvs, it is advisable to split large files into smaller ones, depending on your DWU capacity. And finally, you don’t have to worry about the 1 MB row limit and hard coded separator values as with PolyBase.

To really leverage all the possibilities of Synapse Analytics, you should use the parquet format in your data lake. In addition to being a compressed format, using parquet supports predicate pushdown in Synapse SQL and Spark. This greatly speeds up your exploratory queries to data lake from Synapse SQL or Spark, as you don’t have to read all the files from a given folder structure to get the rows you want. With parquet, the COPY INTO command automatically splits your files to speed up the processing.

Using Data Flows to clean and unify your data

Data Flows provide a no-code approach to transforming your data. As integrations, Data Flows are also already familiar from Azure Data Factory. With Data Flows, you could for example combine a few columns, delete a redundant one, calculate a working unique identifier and join the data with another flow before storing back to Data Lake as a processed file and also persisting it to SQL Pool table.

Explore data

Synapse Studio gives you multiple options to explore your data. We can graphically view the data lake structure, and easily get a SQL script or Spark Notebook to view the file contents just by right clicking a file.

Select from data lake file

With SQL On-demand, you can explore the contents of files without moving or importing them anywhere, and generate simple graphs in Synapse Studio UI to give you an idea of what you are looking at.

Graph T-SQL results in Azure Synapse Analytics

Synapse SQL, both on-demand and provisioned can be connected outside Synapse Studio with different clients using application protocols like ODBC, JDBC and ADO.NET.

In Apache Spark (example in Scala), using data from the SQL Pool is as easy as:

%%spark
val df = spark.read.sqlanalytics("SQLPool.schema.table")

This doesn’t require any configurations on Spark’s side, as the Spark and SQL are just two different runtimes operating on the same metadata and data sources.

Transform and enrich data

For transforming and enrichment, Synapse Analytics offers Spark Notebooks in addition to T-SQL. However, it has already been pretty easy to add Databricks notebooks as a part of your Azure Data Factory pipelines. With Synapse Analytics, again this integration is a bit more ready-made and easier.

One interesting possibility is SQL On-demand and it’s external tables. SQL On-demand doesn’t get access to SQL Pool’s tables (as you don’t even need to provision a SQL Pool or Synapse SQL to use SQL On-demand), but you can create external tables using T-SQL. External tables are stored as parquet backed files in Azure Data Lake Storage Gen 2. Compared to the current Azure General Availability offering, moving towards SQL On-demand and external tables in your transformations, offers serverless and fully scalable architecture. What’s the performance of external tables and SQL On-demand is going to be like, remains to be seen.

Serve

Synapse Analytics offers a few new interesting features for serving enriched data: SQL On-demand and Power BI Service connection.

With SQL On-demand, it is possible to create a highly scalable data pipeline all the way from source systems to transforming and storing data to serving it to Power BI, without any predetermined service level or capacity choices:

  • Ingest and Transform data using Data Pipelines and Data Flow. Use SQL On-demand or Data Flow as a part of your pipeline where needed
  • Store the results with “CETAS”, a T-SQL command CREATE EXTERNAL TABLE AS SELECT to store the results of your final SELECT statement to Azure Data Lake Storage Gen 2
  • Create a data set in Power BI. You can either use the files directly or use Direct Query via SQL On-demand.

Lastly, Synapse Analytics has the capability to link to Power BI Service! You can create linked services with Power BI Workspaces, view datasets and also create Power BI reports without ever leaving Synapse Studio. This greatly simplifies the process of creating reports from new data.

Connect Synapse Analytics with Power BI workspace

Do note that at the moment, you can only link to ONE Power BI Workspace. After connecting to a Power BI Workspace, you can create new Power BI reports based on published data sets without leaving Synapse Studio.

Create Power BI Reports straight from Synapse Studio

Manage and orchestrate

For orchestrating your pipelines, Synapse Analytics offers pretty much the same tooling as Azure Data Factory. You can easily combine your Data Flows, Spark Notebooks, T-SQL queries and everything to form pipelines as in Azure Data Factory. The familiar triggers are also there.

Management views also share much of the same as Azure Data Factory.

Synapse Analytics excels

Synapse Analytics strengths lie in it’s unified and yet versatile tooling.

  • Starting a new data platform project
    • Synapse Analytics offers an unified experience creating ingestion, preparation, transformations and serving your data in one place
  • Architecture with Data Lake
    • Synapse Analytics’s unified tooling makes it easier to work with and explore Data Lake
  • Architecture with Cosmos DB (or other future possible Synapse Link sources)
    • Near-real-time analytics based on operative data sources without any manual ETL processes
  • Explorative work on unknown data
    • Many tools and languages in one place
  • Security management in Synapse Workspace
    • Instead of setting up and configuring multiple separate tools with authentications and networks between them, you have only the Synapse Analytics workspace to setup
  • Possibility for both no-code and code approaches
    • A suitable approach probably exists for many developers to get started
  • Synapse SQL Pool as an MPP database
    • Synapse SQL is a powerful database in it’s own right

What we wish to see in future updates

  • While Synapse SQL Pool does pack a punch, it does require dba work and manual maintenance. Maybe more automation regarding this in the future?
    • Different indexes, partitions and distributions will have an impact on performance and storage costs
    • Workload management: classification, importance, isolation all need to be planned for and addressed for full scale operations
  • More dynamic resource scaling
    • Scaling Synapse SQL is an offline operation
    • While SQL On-demand offers some exciting possibilities, it fails to deliver any database functionalities: no access to relational tables for example as it doesn’t require a SQL Pool. There are parquet backed external tables, but their performance remains to be seen
  • Development in the Spark-sector as Synapse’s Spark is not on par with Databricks
    • For example, one notebook locks one Spark pool
    • Different runtime versions compared to Databricks
  • Implementation of version control
    • There’s no GIT or any other version control system support. The only way to save work is to publish, which also makes it visible to all other developers.
    • Managing different environments without version control system requires the use of SDK or APIs, which means a lot of work to get production ready
  • Getting from preview to GA
    • There still are instabilities and UI issues. The usual Preview shenanigans which are hopefully taken care of before moving to general availability.

Conclusion

In conclusion, Synapse Analytics’s vision is a great one: unified tool and experience to create almost everything you require in Azure to get your data from a data source to a published report. At the moment, there are some drawbacks as the whole workspace experience is still in preview, of which the lack of version control is definitely not the smallest.

At this point, we haven’t made any performance analysis and it is too early to say, if Synapse Analytics can really be a silver bullet for combining data pipelines, warehousing and analytics. In addition to performance, some key unique features are somewhat handicapped at this point: for example you can only connect to one Power BI Workspace. The vision is definitely there, and we will be anxiously waiting for future updates.

Getting started with your Azure data pipeline

Majority of data project limbo around text files. Companies have 10s of different software that they use. Integrating all of the can be sometimes impossible and not so business oriented. That’s why exporting text files from different systems is always easier, than building a datahub.

Self-service solutions make it very easy to import text files and visualise them. This can at one point mean that an individual in an organisation has X amount of excels and csv files that take a lot of space and became a problem. What was the file and is it up to date?

Data in Azure

Storaging, analysing and loading them into Azure is a good option. Azure has several different storage related services available, choosing the right one should always be done case by case. Most convenient one is Azure Data Lake (ADL). ADL is the dream storage place for developer, data scientists and analysts, that need scalable data storage, with easy access to big data. It has all the capabiltities an enterpise needs, security, manageability, scalability, reliability and availability to serve demanding storage requierments. Data projects start with gathering these text files into ADL and then copying them into Azure data warehouse(ADW)/SQL server for reporting. Finding suitable way for doing that can be challenging and time consuming.

Azure provides options like Azure Data Factor(ADF) and PolyBase. Azure Data factor(ADF) is a data processing tool, for managing data pipelines. It is a fully managed ETL service in cloud. ADF can orchestrate data flows from on-premise and cloud sources, which makes it a very flexible and easy to use tool for moving data to and from ADL. It is not just for copying data into databases, you can schedule, manage, analyse, processes and monitor your data pipeline with it. As of most of the systems change, schemas and data models do that as well. ADF works well when nothing isn’t changed in the table side.

On average text files that are > 1 GB that need to be load into ADW/SQL server, would be suggestively done with Polybase. PolyBase is a technology that connects external/internal data with database via t-sql language.

Polybase and ADF loading time are different, for a 5 GB text file it varies from 20 -30 minutes. ADF has a “warming up time”, which means that the system needs some time to be fully available. With PolyBase you can make an insert and it will take around 3-5 minutes for the text file to be in ADW.

Practicalities with PolyBase

Both ADF and Polybase are very sensitive with the data. Key thing to make sure before you start querying data:

  • Identical schemas and data types
  • Most errors are symbols inside text rows
  • Header is not specified
  • Source file in ADL/Blob has to have permission (Read, Write, Execute)
  • Reject_value for column names

For making PolyBase data pipeline you need the following: Database scope credentials, External data source, External file Format and External Table. For querying data, I suggest to insert data from external table into regular table.

How it works in practice?

Before you can create Scope credentials you need your client_id and Token_EndPoint, which can be found from azure portal, under Azure Active Directory. After scope credentials have been created, use the same credentials name in the external data source credential phase. Location of the data source, is the place where you have the file.

Creating a file format, define the format based on the text file you are trying to insert. Here is where you can define what type of a text file you have.

CREATE DATABASE SCOPED CREDENTIAL
WITH
IDENTITY = '@',
SECRET = ''
;

CREATE EXTERNAL DATA SOURCE
WITH (
TYPE = HADOOP,
LOCATION = 'adl://.azuredatalakestore.net,
CREDENTIAL =
);

CREATE EXTERNAL FILE FORMAT TextFile
WITH (
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR = '|'
,STRING_DELIMITER = '' ),

);

After that create external table add the location in the Data Lake.

CREATE EXTERNAL TABLE [TABLE.NAME]
(
COLUMN INT,
COLUMN2 VARCHAR(10)
)
WITH (DATA_SOURCE = [Data Lakename ], LOCATION = N'/foldername/', FILE_FORMAT = [TextFiletype], REJECT_TYPE = VALUE, REJECT_VALUE = 1)

Credentials are connected to the folder inside the Data Lake, so you can’t create the external table before the folder has credentials. According to Microsoft this is the best practice to insert data into ADW.

Summary:

In case where schemas won’t change or have little changes external tables are a good way of managing regular data flows. Automating loads between ADL and ADW, Microsoft suggest SSIS usage. There is also an open source option Airflow, which is a platform to programmatically author, schedule and monitor workflows.

CREATE EXTERNAL TABLE (Transact-SQL.
Access control in Azure Data Lake Store.