
AWS Glue tutorial with Spark and Python for data developers

This AWS Glue tutorial is a hands-on introduction to creating a data transformation script with Spark and Python. Basic Glue concepts such as database, table, crawler and job will be introduced.

In this tutorial you will create an AWS Glue job using Python and Spark. You can read the previous article for a high level Glue introduction.

In the context of this tutorial Glue could be defined as “A managed service to run Spark scripts”.

In some parts of the tutorial I refer to this GitHub code repository.

Create a data source for AWS Glue

Glue can read data either from a database or an S3 bucket. For this tutorial I created an S3 bucket called glue-blog-tutorial-bucket. You have to come up with a different name in your own AWS account.

Create two folders called read and write from the S3 console.

The S3 bucket has two folders. In AWS a folder is actually just a prefix for the file name.

 

Upload this movie dataset to the read folder of the S3 bucket.

The data for this Python and Spark tutorial in Glue contains just 10 rows of data. Source: IMDB.

Crawl the data source to the data catalog

Glue has a concept of a crawler. A crawler sniffs metadata from the data source, such as the file format, column names, column data types and row count. The metadata makes it easy for others to find the needed datasets. The Glue catalog enables easy access to the data sources from the data transformation scripts.

The crawler will catalog all files in the specified S3 bucket and prefix. All the files should have the same schema.

In Glue crawler terminology the file format is known as a classifier. The crawler identifies the most common classifiers automatically, including CSV, JSON and Parquet. It is also possible to create custom classifiers where the schema is defined with grok patterns, which are close relatives of regular expressions.

Our sample file is in the CSV format and will be recognized automatically.

Instructions to create a Glue crawler:

  1. In the left panel of the Glue management console click Crawlers.
  2. Click the blue Add crawler button.
  3. Give the crawler a name such as glue-blog-tutorial-crawler.
  4. In the Add a data store menu choose S3 and select the bucket you created. Drill down to select the read folder.
  5. In Choose an IAM role create a new one. Name the role, for example, glue-blog-tutorial-iam-role.
  6. In Configure the crawler’s output add a database called glue-blog-tutorial-db.

 

Summary of the AWS Glue crawler configuration.

 

When you are back in the list of all crawlers, tick the crawler that you created. Click Run crawler.

Note: If your CSV data needs to be quoted, read this.
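If you prefer to script the crawler setup instead of clicking through the console, a minimal boto3 sketch is shown below. It reuses the bucket, role and database names from the steps above and assumes the IAM role already exists with access to the bucket.

import boto3

# Create the same crawler programmatically. Names match the console steps above.
glue = boto3.client("glue")

glue.create_crawler(
    Name="glue-blog-tutorial-crawler",
    Role="glue-blog-tutorial-iam-role",
    DatabaseName="glue-blog-tutorial-db",
    Targets={"S3Targets": [{"Path": "s3://glue-blog-tutorial-bucket/read/"}]},
)

# Start the crawl, which is the same as clicking Run crawler in the console.
glue.start_crawler(Name="glue-blog-tutorial-crawler")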

The crawled metadata in Glue tables

Once the data has been crawled, the crawler creates a metadata table from it. You find the results from the Tables section of the Glue console. The database that you created during the crawler setup is just an arbitrary way of grouping the tables.

Metadata for the Glue table. You can see properties as well as column names and data types from this view.

 

Glue tables don’t contain the data itself, only the instructions on how to access it.

Note: For large CSV datasets the row count seems to be just an estimation.

AWS Glue jobs for data transformations

From the left panel of the Glue console go to Jobs and click the blue Add job button.

Follow these instructions to create the Glue job:

  1. Name the job as glue-blog-tutorial-job.
  2. Choose the same IAM role that you created for the crawler. It can read and write to the S3 bucket.
  3. Type: Spark.
  4. Glue version: Spark 2.4, Python 3.
  5. This job runs: A new script to be authored by you.
  6. Security configuration, script libraries, and job parameters
    1. Maximum capacity: 2. This is the minimum and costs about $0.15 per run.
    2. Job timeout: 10. Prevents the job from running longer than expected.
  7. Click Next and then Save job and edit the script.
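The same job definition can also be created with boto3. Below is a rough sketch with the settings from the list above; the script location is a placeholder, and GlueVersion "1.0" is the release that corresponds to Spark 2.4 with Python 3.

import boto3

glue = boto3.client("glue")

# Create the job programmatically with the same settings as in the console.
# The ScriptLocation is a placeholder: upload your script to S3 and point to it.
glue.create_job(
    Name="glue-blog-tutorial-job",
    Role="glue-blog-tutorial-iam-role",
    Command={
        "Name": "glueetl",  # a Spark ETL job
        "ScriptLocation": "s3://glue-blog-tutorial-bucket/scripts/glue-blog-tutorial-job.py",
        "PythonVersion": "3",
    },
    GlueVersion="1.0",  # Glue 1.0 = Spark 2.4 and Python 3
    MaxCapacity=2.0,    # 2 DPUs, the minimum for a Spark job
    Timeout=10,         # minutes
)

# Start a run, which is the same as clicking Run job in the console.
glue.start_job_run(JobName="glue-blog-tutorial-job")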

Editing the Glue script to transform the data with Python and Spark

Copy this code from GitHub to the Glue script editor.

Remember to change the bucket name for the s3_write_path variable.

Save the code in the editor and click Run job.

The Glue editor to modify the Python-flavored Spark code.

 

The detailed explanations are in the code comments. Here is the high-level description:

  1. Read the movie data from S3
  2. Get movie count and rating average for each decade
  3. Write aggregated data back to S3
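The full script is in the GitHub repository linked above. Purely as an illustration, here is a minimal sketch of what those three steps can look like in a Glue Python script; the catalog table name (read) and the movie column names (year, rating) are assumptions about the crawled data, so check them against your own table.

from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# 1. Read the crawled table from the Glue catalog and convert it to a
#    native PySpark DataFrame. The table name "read" is assumed to come
#    from the S3 prefix that the crawler scanned.
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database="glue-blog-tutorial-db",
    table_name="read",
)
df = dynamic_frame.toDF()

# 2. Aggregate movie count and average rating per decade.
#    Column names "year" and "rating" are assumptions about the movie CSV.
aggregated = (
    df.withColumn("decade", (F.col("year") / 10).cast("int") * 10)
      .groupBy("decade")
      .agg(
          F.count("*").alias("movie_count"),
          F.round(F.avg("rating"), 2).alias("rating_avg"),
      )
      .orderBy("decade")
)

# 3. Write the aggregated data back to S3 as a single CSV file.
s3_write_path = "s3://glue-blog-tutorial-bucket/write"  # change to your bucket
aggregated.repartition(1).write.mode("overwrite").option("header", "true").csv(s3_write_path)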

The execution time with 2 Data Processing Units (DPU) was around 40 seconds. The relatively long duration is explained by the start-up overhead.

The data transformation script creates summarized movie data. For example, the 1990s had 4 movies in the IMDB top 10, with an average score of 8.95.

 

You can download the result file from the write folder of your S3 bucket. Another way to investigate the job would be to take a look at the CloudWatch logs.

The data is stored back to S3 as a CSV in the “write” prefix. The number of partitions equals the number of output files.

Speeding up Spark development with Glue dev endpoint

Developing Glue transformation scripts is slow if you just run one job after another. Provisioning the computation cluster takes minutes, and you don’t want to wait after each change.

Glue has a dev endpoint functionality where you launch a temporary environment that is constantly available. For development and testing it’s both faster and cheaper.

The dev endpoint provides the processing power, but a notebook server is needed for writing your code. The easiest way to get started is to create a new SageMaker notebook by clicking Notebooks under the Dev endpoints section in the left panel.
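For completeness, a dev endpoint can also be launched with boto3; the sketch below is only indicative, and the endpoint name, role ARN and node count are placeholders. Remember that a dev endpoint keeps accruing cost until you delete it.

import boto3

glue = boto3.client("glue")

# Launch a dev endpoint programmatically. All names below are placeholders.
glue.create_dev_endpoint(
    EndpointName="glue-blog-tutorial-dev-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/glue-blog-tutorial-iam-role",
    NumberOfNodes=2,    # DPUs reserved for the endpoint
    GlueVersion="1.0",
)

# Delete the endpoint when you are done to stop the billing.
# glue.delete_dev_endpoint(EndpointName="glue-blog-tutorial-dev-endpoint")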

About Glue performance

In the code example we first read the data into Glue’s DynamicFrame and then converted it to a native PySpark DataFrame. This approach makes it possible to take advantage of the Glue catalog while still using native PySpark functions.

However, our team has noticed Glue performance to be extremely poor when converting from a DynamicFrame to a DataFrame. This applies especially when you have one large file instead of multiple smaller ones. If data reading becomes the bottleneck of the execution time, consider using the native PySpark read function to fetch the data from S3.
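As an illustration of that alternative, the snippet below reads the CSV files straight from S3 with native PySpark, skipping the DynamicFrame conversion entirely. The bucket path matches the tutorial bucket, and the header and schema-inference options are assumptions about the source files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the CSV files directly from S3 instead of going through a DynamicFrame.
df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")  # define an explicit schema for production use
         .csv("s3://glue-blog-tutorial-bucket/read/")
)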

Summary about the Glue tutorial with Python and Spark

Getting started with Glue jobs can take some time with all the menus and options. Hopefully this tutorial gave you some idea of the roles of the database, table, job and crawler.

The focus of this tutorial was a single script, but Glue also provides tools to manage larger groups of jobs. You can schedule jobs with triggers or orchestrate relationships between triggers, jobs and crawlers with workflows.

Learning the Glue console is one thing, but the actual logic lies in the Spark scripts. Tuning the code has a significant impact on the execution performance. That will be the topic of the next blog post.


Introduction to AWS Glue for big data ETL

AWS Glue works well for big data processing. This is a brief introduction to Glue including use cases, pricing and a detailed example.

AWS Glue is a serverless ETL tool in the cloud. In brief, ETL means extracting data from a source system, transforming it for analysis and other applications, and then loading it back to, for example, a data warehouse.

In this blog post I will introduce the basic idea behind AWS Glue and present potential use cases.

The emphasis is on big data processing. You can read more about Glue catalogs here and data catalogs in general here.

Why use AWS Glue?

Replacing Hadoop. Hadoop can be expensive and a pain to configure. AWS Glue is simple. Some say that Glue is expensive, but it depends on what you compare it with. Because of on-demand pricing you only pay for what you use. This can make AWS Glue significantly cheaper than a fixed-size on-premise Hadoop cluster.

AWS Lambda cannot be used. A wise man said: use Lambda functions in AWS whenever possible. Lambdas are simple, scalable and cost efficient. They can also be triggered by events. For big data, Lambda functions are not suitable because of the 3 GB memory limit and the 15-minute timeout. AWS Glue is specifically built to process large datasets.

Apply DataOps practices. Drag-and-drop ETL tools are easy for users, but from the DataOps perspective code-based development is a superior approach. With AWS Glue both code and configuration can be stored in version control. Data development becomes similar to any other software development. For example, the data transformation scripts written in Scala or Python are not limited to the AWS cloud. Environment setup is easy to automate and parameterize when the code is scripted.

An example use case for AWS Glue

Here is an example of how AWS Glue would work in practice.

A production machine in a factory produces multiple data files daily. Each file is 10 GB in size. The server in the factory pushes the files to AWS S3 once a day.

The factory data is needed to predict machine breakdowns. For that, the raw data should be pre-processed for the data science team.

Lambda is not an option for the pre-processing because of the memory and timeout limitations. Glue seems to be a reasonable option when work hours and costs are compared to alternative tools.

The simplest way to get started with the ETL process is to create a new Glue job and write the code in the editor. The script can be written in either the Scala or Python programming language.

Extract. The script first reads all the files from the specified S3 bucket into a single data frame. You can think of a data frame as a table in Excel. The reading can be just a one-liner.

Transform. This is where most of the code goes. Let’s say that the original data has 100 records per second. The data science team wants the data aggregated into 1-minute intervals with a specific logic. This could be just tens of lines of code if the logic is simple.

Load. Write data back to another S3 bucket for the data science team. It’s possible that a single line of code will do.
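As a rough PySpark sketch of these three steps (the bucket names, column names and aggregation logic are all invented for the example):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Extract: read all the daily files from the factory bucket into one data frame.
raw = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("s3://factory-raw-data/")  # hypothetical source bucket
         .withColumn("timestamp", F.col("timestamp").cast("timestamp"))
)

# Transform: aggregate the readings into 1-minute windows.
# The column names "timestamp" and "sensor_value" are assumptions about the data.
aggregated = (
    raw.groupBy(F.window("timestamp", "1 minute").alias("minute"))
       .agg(
           F.avg("sensor_value").alias("avg_value"),
           F.count("*").alias("reading_count"),
       )
)

# Load: write the pre-processed data to the data science team's bucket.
aggregated.write.mode("overwrite").parquet("s3://factory-preprocessed-data/")  # hypothetical target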

The code runs on top of the Spark framework, which is configured automatically in Glue. Thanks to Spark, the data is divided into small chunks and processed in parallel on multiple machines simultaneously.

What makes AWS Glue serverless?

Serverless means you don’t have machines to configure. AWS provisions and allocates the resources automatically.

The processing power is adjusted by the number of data processing units (DPU). You can do additional configuration, but it’s likely that a proof of concept works out of the box.

In an on-premise environment you would have to make a decision about the computation cluster size. A big cluster is expensive but fast. A small cluster would be cheaper but slow to run.

With AWS Glue your bill is the result of the following equation:

[ETL job price] = [Processing time] * [Number of DPUs]
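The equation above leaves out the constant DPU-hour rate. Plugging in the minimum-billing numbers quoted later in this post (2 DPUs, a 10-minute minimum and $0.44 per DPU-hour) gives the smallest possible bill:

# Worked example with the minimum-billing numbers quoted later in the post.
dpus = 2                    # minimum number of DPUs for a Spark job
hours = 10 / 60             # 10-minute minimum billing
price_per_dpu_hour = 0.44   # USD

job_cost = dpus * hours * price_per_dpu_hour
print(f"Minimum job cost: ${job_cost:.2f}")  # roughly $0.15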

 

The on-demand pricing means that increasing the processing power does not necessarily increase the price of the ETL job: more DPUs mean a shorter processing time. At least in theory, as too many DPUs might add overhead to the processing time.

When is AWS Glue a wrong choice?

This is not an advertisement, so let’s give some critique for Glue as well.

Lots of small ETL jobs. Glue has a minimum billing of 10 minutes and 2 DPUs. With a price of $0.44 per DPU-hour, the minimum cost of a run is around $0.15. This starting price makes Glue an unappealing alternative for processing small amounts of data often.

Specific networking requirements. If you spin up a standard EC2 virtual machine, an IP address will be attached to it. The serverless nature of Glue means you have to put more effort into network planning in some cases. One such scenario would be whitelisting a Glue job in a firewall to extract data from an external system.

Summary about AWS Glue

The most common argument against Glue is “It’s expensive”. True, in a sense that the first few test runs can already cost a few dollars. In a nutshell, Glue is cost efficient for infrequent big data workloads.

In the big picture AWS Glue saves a lot of time and unnecessary hardware engineering. The costs should be compared against alternative options such as an on-premise Hadoop cluster or the development hours required for a custom solution.

As Glue’s pricing model is predictable, the business cases are straightforward to calculate. It might be enough to test just the critical parts of the ETL pipeline to become confident about the performance and costs.

I feel that optimizing the code for distributed computing has been more of a challenge than the Glue service itself. The next blog post will focus on how data developers get started with Glue using Python and Spark.

A Data Catalog can be the foundation for your data democracy – if you think of it as more than just a catalog

The hype around data catalog software is justified, but the term "data catalog" is misleading and often misunderstood. We should talk about data libraries i.e. combinations of software, new ways of working and user experience; all aiming to drive data democratization.

I love libraries.

In my younger days, I spent a lot of time in libraries. Many associate libraries with books, but for me libraries are also about music. In libraries, I explored music that would greatly shape my personality and my own approach to music as my personal passion. This was all before music was digitized, so I had to acquire music in physical form.

For me, libraries were a perfectly designed service for music discovery:

  • All products were indexed and easy to find
  • Music magazines were adjacently available for further context of the product
  • You could preview (listen to) the products right there and then
  • The product offering was actively curated by knowledgeable professionals
  • The service was public and available to anyone who had an interest.

I was very curious about non-mainstream music, and libraries helped me find artists like Tom Waits, Nick Cave and Pavement, who have all since meant much to me. Had I relied only on my own social tribe (friends and schoolmates), I would never have discovered these artists. This is why I will always be thankful for the existence of libraries.

The Hype Around Data Catalog Software

Much has been said and written about data catalog software in the last year. Data catalogs have been called “the new black”[1] and “the most important data management breakthrough to have emerged in the last decade”[2]. While I agree with most of what Forrester, Gartner and others say, they have all gotten one important aspect wrong: The term “data catalog” is misleading and too narrow. These new software solutions are to data-driven organizations what libraries were to me; a safe place for discovery, learning and transformation. Think of this software as an open-to-everyone service, where:

  • Content is cataloged and curated by experts
  • Contextual information is easily available alongside the core product
  • It is safe and easy to explore the content.

As a bonus, while traditional libraries encourage silence, modern data catalog software encourages conversations, thought exchange and collaboration. It fosters data literacy, which is essential for data-driven organizations. Libraries, as we know them, are a foundation for our democracy and even civilization. A data catalog, or a Data Library as I would call it, should be the foundation for your organization’s data/analytic democracy.

I should also add that, unlike with books in libraries, data catalog software does not require data to be physically loaded into the software. The software connects to the data at its original location.

Building the Case for a Data Library

Each organization has to identify its own drivers and build its own business case for a data library. I have worked on and studied data catalog initiatives driven by very different needs:

  • A desire to capture more contextual metadata
  • An aim to foster collaboration in a data & analytic community
  • A need to identify certain data assets for compliance purposes (e.g. PII for GDPR)
  • An ambition to collect information about a multitude of data platforms into one location.

Below are some examples of how common pain points (or “opportunities” if you prefer) can be expressed as benefits for a catalog business case, and how these can then drive the focus of the solution.

 

Some of you might now be wondering what happened to metadata management, ETL, lineage and similar functionalities often associated with data catalogs. While these are important features, they narrow the solution too much and you end up with a technical data catalog, not a data library.

Time & Productivity. You are likely to achieve greater buy-in and success when focusing on business benefits like shorter time to insight or productivity boost. These benefits are achieved by reducing the time data workers spend in finding, accessing and learning how to use data.

Better Decisions. You can also focus on the benefits of faster and more accurate decision making. Faster analytics leads to faster decisions, which leads to seizing business opportunities faster. Similarly, a better understanding of available data assets leads to richer and more precise analytic outputs, making business decisions more accurate.

Data Value Assessment. Finally, Chief Data Officers in particular should actively be looking at the value of their data assets to understand investment and optimization opportunities. Features like data utilization metrics and user stories shared in the data library will help a CDO understand and assess the value of their data assets.

The highest prioritized benefits should then drive your approach to implementing a data library. Is it more important to capture rich metadata or usage data? Will you favor crowdsourcing or automation for creating the content? Is the library centrally curated or can anyone add/edit/remove contents? The answers to these questions should drive your technology selection and implementation plan.

It’s a Way of Working Aiming To Deliver a Great User Experience

While modern data catalog software products are already pretty great on their own, just like libraries, they won’t provide value without active curation and stewardship. The key to success with a data library is to make it a way of working. This means tampering with behaviors in your organization, i.e. what people do and how they do it.

Libraries in essence are just hollow buildings with empty shelves. What makes them work is their content, and the people curating and designing services around that content. You need to think of data libraries similarly. Identify the people most passionate about your data assets and promote them to become “Supreme Information Curators” or “Esteemed Data Council Members”. Avoid making this about policing and controlling, and instead focus on enabling and empowering users. Then, use insight from how your organization works to design a data library user experience that does to data workers what libraries did to me: Changed beliefs, thoughts and behaviors. Well-designed data libraries will make data workers more engaged, productive, collaborative and knowledgeable. By doing so, the libraries will also drive data workers to contribute to the common wisdom of your organization.

Data Catalogs Are Just Software But Data Libraries Are Foundational Capabilities

The data catalog software market is a dynamic and interesting market, and these products can solve a number of business problems. However, they miss the mark if they are deployed as catalogs only. Think of these products as the enabler of your organization’s data library and the foundation for your data and/or analytic democracy. When building the case for your data library, identify which business problem is most relevant to your organization and then look for solutions that address those specific problems. A data library is a way of working, not just a technology solution. Find the right people in your organization and give them the responsibility – and privilege – of curating your organization’s data wisdom. Think of your data library as a service that provides value to its users and changes behaviors. If you get this user experience right, your organization will generate more value from its data assets, and your data people will be smarter, more engaged and more productive than any of your competitors.

—–

At Solita we have explored the data catalog software market extensively and have also implemented data catalog/library solutions with our clients. We’d love to help you build a business case for a data library and implement a library service that drives data democracy in your organization.

[1] Gartner: “Data Catalogs Are the New Black in Data Management and Analytics” Analyst Report, Ehtisham Zaidi, December 13, 2017

[2] 451 Research: “From out of nowhere: the unstoppable rise of the data catalog”, Matt Aslett, October 10, 2018