We have a miniseries before Christmas coming where we talk S-Q-L, /ˈsiːkwəl/ “sequel”. Yes, the 47 years old domain-specific language used in programming and designed for managing data. It’s very nice to see how old faithful SQL is going stronger than ever for stream processing as well the original relational database management purposes.
What is data then and how that should be used ? Take a look on article written in Finnish “Data ei ole öljyä, se on lantaa”
We will show you how to query and manipulate data across different solutions using the same SQL programming language.
The Solita Developer survey has become a tradition here at Solita and please check out the latest survey. It’s easy to see how SQL is dominating in a pool of many cool programming languages. It might take an average learner about two to three weeks to master the basic concepts of SQL and this is exactly what we will do with you.
At Christmas time our business is around gifts and how to deliver those on time. Our model is a bit more simplified and will include operational technology systems such as assets (Santa’s workshop) and fleet (sleighs) operations. There might always be something broken so few maintenance needs are pushed to technicians (elfs). Distributed data platform is used for supply chain and logistics analytics to remove bottlenecks so business owners can be satisfied (Santa Claus and the team) and all gifts will be delivered to the right address just in time.
Case Santa’s workshop
We can later use OEE to calculate that workshop performance is between the limits in order to produce high quality nice gifts. Data is ingested real time and contextualized so once a while Santa and the team will check how we are doing.
In this specific case we know that using Athena we can find relevant production line data just querying the S3 bucket where all raw data is stored already.
Day 1 – creating a Santa’s table for time series data
Let’s create a very basic table to capture all data from Santa’s factory floor. You will notice there are different data types like bigint and string. You can even add comments to help others to later find what kind of data field should include. In this case raw data is Avro but you do not have to worry about that so let’s go.
CREATE EXTERNAL TABLE `raw`( `seriesid` string COMMENT 'from deserializer', `timeinseconds` bigint COMMENT 'from deserializer', `offsetinnanos` bigint COMMENT 'from deserializer', `quality` string COMMENT 'from deserializer', `doublevalue` double COMMENT 'from deserializer', `stringvalue` string COMMENT 'from deserializer', `integervalue` int COMMENT 'from deserializer', `booleanvalue` boolean COMMENT 'from deserializer', `jsonvalue` string COMMENT 'from deserializer', `recordversion` bigint COMMENT 'from deserializer' ) PARTITIONED BY ( `startyear` string, `startmonth` string, `startday` string, `seriesbucket` string )
Day 2 – query Santas’s data
Now we have a table and how to query that one ? That is easy with SELECT and taking all fields using asterix. It’s even possible to limit that to 10 rows which is always a good practice.
SELECT * FROM "sitewise_out"."raw" limit 10;
Day 3 – Creating a view from query
View is a virtual presentation of data that will help to organize assets more efficiently. One golden rule is still now to create many views on top of other views and keep the solution simple. You will notice that CREATE VIEW works nicely and now we have timeinseconds and actual factory floor value (doublevalue) captured. You can even drop the view using DROP command.
CREATE OR REPLACE VIEW "v_santa_data" AS SELECT timeinseconds, doublevalue FROM "sitewise_out"."raw" limit 10;
Day 4 – Using functions to format dates to Santa
You noticed that timeinseconds is in Epoch so let’s use functions to have more human readable output. So we add a small from_unixtime function and combine that with date_format to have formatted output like we want. Perfect, now we know from which data Santa Claus manufacturing data originated.
SELECT date_format(from_unixtime(timeinseconds),'%Y-%m-%dT%H:%i:%sZ') , doublevalue FROM "sitewise_out"."raw" limit 10;
Day 5 – CTAS creating a table
Using CTAS (CREATE TABLE AS SELECT) you can even create a new physical table easily. You will notice that Athena specific format has been added that you do not need on relational databases.
CREATE TABLE IF NOT EXISTS new_table_name WITH (format='Avro') AS SELECT timeinseconds , doublevalue FROM "sitewise_out"."raw" limit 10;
Day 6 – Limit the result sets
Now I want to limit the results to only those where the quality is Good.Adding a WHERE clause I can have only those rows printed to my output – that is cool!
SELECT * FROM "sitewise_out"."raw" where quality='GOOD' limit 10;
You might be interested Industrial equipment data at scale, let’s keep fresh mind and stay tuned for next episode where we might go from Factory to Fleet !