December is the most celebrated time of year in the Data Engineering calendar as we embrace the important holiday: change freeze season.
As we come back to the office to start our new projects, I wanted to make you aware of some quietly released features in the data engineering ecosystem in Databricks, in the hope they might benefit you. To try and spice up what is essentially a list of features, I’ve categorised them by release stage (GA vs Public Preview) as well as by the Holly Smith scale of feature usefulness.
If you’d like to see worked examples, check out this notebook here.
Not to be confused with SQL queries, a .sql file can now be run as a task in Workflows. What’s even better, the file can come from a repo, so in theory it could have been authored in a different tool. For some of you, this will open up a new way of architecting your workflows and make interoperability between tools easier.
Tables in Unity Catalog can now be created or updated with primary key and foreign key definitions, which are persisted in the information_schema and reflected in the UI.
To see the full syntax of how these tables are created, check out the notebook here.
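If you just want a quick flavour of that syntax, here’s a minimal sketch (the table and column names are made up, and the constraints are the informational kind described below):
%sql
-- Parent table with a primary key (PK columns must be NOT NULL)
CREATE TABLE ham_types (
  ham_id INT NOT NULL,
  ham_type STRING,
  CONSTRAINT ham_types_pk PRIMARY KEY (ham_id)
);

-- Child table with a foreign key back to the parent
CREATE TABLE ham_orders (
  order_id INT NOT NULL,
  ham_id INT,
  CONSTRAINT ham_orders_pk PRIMARY KEY (order_id),
  CONSTRAINT ham_orders_fk FOREIGN KEY (ham_id) REFERENCES ham_types (ham_id)
);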
For some of you this is a massive deal, however there is a snag; at the moment these are informational and not enforced. If I write a blog about them being enforced, you can guarantee that’s a 🐳 feature.
Side note: 10/10 for the AI trying to make sense of these ridiculous tables.
Up until now you had to define an empty table and its clustering keys upfront before streaming into it, which, yes, might be production best practice, but it was a nasty papercut for quickly testing or debugging something.
This is no longer the case with structured streaming writes and the following will now work:
(spark.readStream.format("rate")  # generate some random data
  .load()
  .writeStream
  .clusterBy("timestamp")  # <- this bit is new and now possible
  .option("checkpointLocation", "dbfs:/tmp/checkpoint/random_numbers")
  .trigger(once=True)  # let's not have this run forever
  .toTable("random_numbers")  # new table created as part of this syntax
)
Yes, I know this is a Python example in a SQL blog, but I can’t easily make a random number generator stream in SQL.
To validate this has been successful, you can run DESCRIBE EXTENDED against the table name and look for clustering information, or see it in the table properties in the last row of the results.
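For the table created above, that check is simply:
%sql
DESCRIBE EXTENDED random_numbers -- the clustering columns appear in the detailed table information at the bottom of the output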
Connect Databricks to Oracle in a one-time setup so users can query an Oracle table just like they would any table in Unity Catalog.
The exact setup steps aren’t something I’m going to replicate here as they are fully documented, but in short:
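you create a connection that stores the Oracle host and credentials, then a foreign catalog on top of it, and grant access as you would for anything else in Unity Catalog. As a rough sketch of the first two steps (the names, host and credentials are placeholders, and the exact Oracle option names are in the docs):
%sql
-- 1. Store the Oracle host and credentials as a Unity Catalog connection
CREATE CONNECTION oracle_conn TYPE oracle
OPTIONS (
  host '<oracle-host>',
  port '1521',
  user '<username>',
  password '<password>'
);

-- 2. Expose a database from that connection as a catalog users can query
CREATE FOREIGN CATALOG oracle_catalog
USING CONNECTION oracle_conn
OPTIONS (service_name '<service-name>');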
Keep in mind the data type mappings: if you don’t have time zones on your timestamps, you might quickly end up in timezone hell.
It won’t look particularly exciting in the catalog explorer UI either. To the uninitiated it’ll just show as another standard catalog.
For me, this is an enormous win for breaking down silos and moving data around easily. You can now start plugging this in with other Databricks features like Workflows and Genie to dramatically simplify your data landscape.
Finally! Programmatic access to warehouse data in system tables. This will allow you to see what’s been created and deleted over the last 30 days. Great for admins wanting to keep an eye on how many 10-cluster 4XL warehouses are being created. You can also use the data to find oversized auto-stop times or warehouses that haven’t been tagged.
To enable this, ask an account admin to run:
%sh
curl -v -X PUT \
  -H "Authorization: Bearer <Developer Access Token>" \
  "https://<your workspace url>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<metastore id found in UC>/systemschemas/compute"
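Once the schema is enabled, the warehouse data can be queried like any other system table. A hypothetical starting point (check the docs for the exact table and column names):
%sql
SELECT *
FROM system.compute.warehouses -- assuming the new table lands in the compute schema enabled above
LIMIT 10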
Materialised views (MVs) have had a few welcome updates recently, notably the ability to monitor runs with the query history UI, parity with tables on row filters and column masks, and an easier way to schedule refreshes that doesn’t require deciphering CRON syntax. All of these have working examples in this notebook here.
I’ll do anything to avoid the Quartz cron scheduler (it’s a good site, I just lack the ability to retain any understanding of it). The new schedule syntax is useful for standard intervals ranging from hours to weeks, but if you need something more intricate, it’s back to figuring out * vs ?.
Instead of
%sql
CREATE MATERIALIZED VIEW ham_feasibility
SCHEDULE CRON '0 0 * * * ? *' --old schedule syntax (note Quartz wants a ? in one of the day fields)
AS . . .
It’s now
%sql
CREATE MATERIALIZED VIEW ham_feasibility
SCHEDULE EVERY 1 HOUR --new schedule syntax
AS . . .
The next big update for MVs is the ability to see their performance in the Query History UI. Now it’s a lot easier to understand any drops in performance, especially if they’re sporadic and difficult to recreate. This can be accessed from the left-hand menu, filtering compute to ‘DLT’ and statement to ‘REFRESH’.
Another update to materialised views (and streaming tables) is new support for row filters and column masks.
For row filtering, create a function based on group membership like you would for any other table, and add the keywords WITH ROW FILTER to the creation statement.
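To make the example concrete, the function referenced below might look something like this (the group name and logic are entirely made up):
%sql
CREATE OR REPLACE FUNCTION veggie_filter(ham_type STRING)
RETURN is_account_group_member('omnivore') OR ham_type LIKE '%vegetarian%' -- omnivores see every row, everyone else only the vegetarian options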
%sql
CREATE MATERIALIZED VIEW ham_options (ham_type STRING, is_edible BOOLEAN)
WITH ROW FILTER veggie_filter ON (ham_type) --apply the function we just defined to the ham_type column
AS SELECT * FROM ham_types
--now this view will show results tailored to whether or not the user is part of the omnivore group stored in Unity Catalog
And for column masking, define the function, then apply it in the creation step after defining the column.
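As with the row filter, here’s one possible shape for the masking function used below (hypothetical logic; the function just has to return the same type as the column it masks):
%sql
CREATE OR REPLACE FUNCTION undefined_preference(is_edible BOOLEAN)
RETURN IF(is_account_group_member('omnivore'), is_edible, NULL) -- hide the value from anyone outside the group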
%sql
CREATE OR REPLACE MATERIALIZED VIEW ham_options
(ham_type STRING,
is_edible BOOLEAN MASK undefined_preference) -- applying the column mask to our materialised view
AS SELECT * FROM ham_types;
One final new update for materialised views is the ability to run the underlying compute on the preview channel, giving access to features before they reach the current channel. This is done with the channel setting at table creation.
%sql
CREATE MATERIALIZED VIEW preview_channel_example
TBLPROPERTIES(pipelines.channel = "PREVIEW") --set the compute to use the preview channel
AS SELECT day('2019-01-07') as result
Do you ever run a pipeline and know that you should run a vacuum, but decide against it because it’ll take too long (ultimately making the problem worse for yourself tomorrow)? LITE mode is a balance between the two, offering a faster option that still prevents the proliferation of stale files.
The syntax is dead easy: it’s just VACUUM table_name LITE;
It doesn’t replace the default full version: it only looks for stale files referenced in the Delta log and won’t do a full clean-out checking for empty directories, parquet files from failed transactions, or arbitrary files from connectors.
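Using the random_numbers table from earlier as a stand-in:
%sql
VACUUM random_numbers LITE; -- quick pass based on files tracked in the Delta log
VACUUM random_numbers;      -- the default full vacuum, which also scans the table directory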
This feature didn’t quite make it to 🦩-sized as the majority of people don’t have data big enough to be running into this problem. This feature is for you if you notice your VACUUM statements taking over 15 minutes.