holly, Databricks Employee

December is the most celebrated time of year in the Data Engineering calendar as we embrace the important holiday: change freeze season. 

As we come back to the office to start our new projects, I wanted to make you aware of some quietly released features in the data engineering ecosystem in Databricks in the hope they might benefit you. To try and spice up what is essentially a list of features, I’ve categorised them into release stage (General Availability vs Public Preview) as well as the Holly Smith scale of feature usefulness:

🐳 Extra Large - 🐘 Large  - 🦩 Medium - 🦔 Small

General Availability

🐘 New SQL functions:

  • try_url_decode() - instead of wrapping url_decode() in your own try/catch logic to handle error-prone URLs, try_url_decode() returns NULL when a string can’t be decoded rather than failing the query.
  • zeroifnull() and nullifzero() - do what they say; think of them as a more specific coalesce() or if-then-else logic.
  • day() - can now be used as a synonym for extract(DAY FROM valid_date) for cleaner code.
  • uniform() - generates random numbers with a uniform, i.e. even, distribution. The benefit of this over rand() is the option to specify upper and lower bounds without having to do the maths yourself.
  • randstr() - generates a random string of uppercase letters, lowercase letters and digits of a specified length.
  • More functions now have named parameter invocation. It’s syntactic sugar that’s good for code readability and brings them in line with the majority of functions that already have this feature, although given that all the updated functions have only one argument, this is a 🦔 for me. Updated functions: variant_explode, variant_explode_outer, inline, inline_outer, posexplode, posexplode_outer.

If you’d like to see worked examples, check out this notebook here.
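If you want a quick taste before opening the notebook, here’s a minimal sketch exercising a few of them in one SELECT (the literals and aliases are mine, not the notebook’s):

%sql
SELECT
  try_url_decode('http%3A%2F%2Fexample.com') AS decoded_url,       -- 'http://example.com'; NULL if the input can't be decoded
  zeroifnull(NULL)                           AS zero_not_null,     -- 0
  nullifzero(0)                              AS null_not_zero,     -- NULL
  day(DATE'2019-01-07')                      AS day_of_month,      -- 7
  uniform(10, 20)                            AS between_10_and_20, -- random number within the bounds you set
  randstr(8)                                 AS eight_random_chars -- random 8 character string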

🐘 New workflows task type - .sql files

Not to be confused with SQL queries, a .sql file can now be run in Workflows. What’s even better is that this can come from a repo, so the file could in theory have been authored in a different tool. For some of you, this will open up a new way of architecting your workflows and make interoperability between tools easier.

[Screenshot: the new SQL file task type in the Workflows UI]

🦩 Primary Key and Foreign Key relationship constraints

Tables in Unity Catalog can now be updated or created with definitions of primary keys and foreign key relationships that are persisted in the information_schema and reflected in the UI.

[Screenshot: primary and foreign key relationships shown in the Catalog Explorer UI]

To see the full syntax of how these tables are created, check out the notebook here.
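As a minimal sketch of the shape of it (the table and constraint names below are mine, not the notebook’s):

%sql
CREATE TABLE ham_types (
  ham_type STRING NOT NULL,
  is_edible BOOLEAN,
  CONSTRAINT ham_types_pk PRIMARY KEY (ham_type)
);

CREATE TABLE ham_orders (
  order_id BIGINT NOT NULL,
  ham_type STRING,
  CONSTRAINT ham_orders_pk PRIMARY KEY (order_id),
  CONSTRAINT ham_orders_fk FOREIGN KEY (ham_type) REFERENCES ham_types
);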

For some of you this is a massive deal; however, there is a snag: at the moment these constraints are informational and not enforced. If I write a blog about them being enforced, you can guarantee that’s a 🐳 feature.

Side note: 10/10 for the AI trying to make sense of these ridiculous tables.

[Screenshot: Catalog Explorer’s AI-generated description of the example tables]

🦩 Structured Streaming and liquid clustering compatibility improvements

Up until now you had to define an empty table and its clustering keys upfront before streaming into it, which, yes, might be production best practice, but it’s a nasty papercut when you’re quickly testing or debugging something.

This is no longer the case with structured streaming writes and the following will now work:

 

(spark.readStream.format("rate") # generate some random data
 .load()
 .writeStream
 .clusterBy("timestamp") # <- this bit is new and now possible
 .option("checkpointLocation", "dbfs:/tmp/checkpoint/random_numbers")
 .trigger(once=True)  # let's not have this run forever
 .toTable("random_numbers") # new table created as part of this syntax
)

 

Yes, I know this is a Python example in a SQL blog, but I can't easily make a random number generator stream in SQL.

To validate this has been successful, you can run DESCRIBE EXTENDED against the table name and look for clustering information, or see it in the table properties in the last row of the results. 
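In SQL that check is simply:

%sql
DESCRIBE EXTENDED random_numbers;
-- the clustering information and table properties appear towards the bottom of the results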

 

Public Previews

🐳 Oracle Lakehouse Federation

Connect Databricks to Oracle with a one-time setup so users can query a table just like they would any other table in Unity Catalog.

The exact setup steps aren’t something I’m going to replicate here as they are fully documented, but in short:

  • Check the requirements for UC, compute, networking and permissions
  • Create a Databricks connection for external data. You’ll need hostname, port, user and password details
  • Create a foreign catalog in UC to mirror an Oracle database (sketched below)
  • [Optional] create views that make the best use of supported performance pushdowns like filter and limit
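To make that concrete, here’s a rough sketch of the connection and foreign catalog steps in SQL. Every name and credential below is a placeholder, and the exact option names Oracle needs are in the documentation:

%sql
-- one-time setup, needs CREATE CONNECTION privileges
CREATE CONNECTION oracle_example TYPE oracle
OPTIONS (
  host 'oracle.example.com',
  port '1521',
  user 'reporting_user',
  password '<password>' -- store the real one in a secret rather than a notebook
);

-- mirror an Oracle database as a catalog in UC
CREATE FOREIGN CATALOG oracle_sales
USING CONNECTION oracle_example
OPTIONS (database 'SALESDB');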

Keep in mind the data type mappings; if you don’t have timezones on your timestamps, you might quickly end up in timezone hell.

It won’t look particularly exciting in the catalog explorer UI either. To the uninitiated it’ll just show as another standard catalog.

For me, this is an enormous win for breaking down silos and making it easy to move data around. You can now start plugging this into other Databricks features like Workflows and Genie to dramatically simplify your data landscape.

🐘 Warehouse system tables 

Finally! Programmatic access to warehouse data in system tables. This will allow you to see what’s been created and deleted over the last 30 days. Great for admins wanting to keep an eye on how many 10-cluster 4XL warehouses are being created. You can also use the data to find warehouses with oversized auto-stop times or ones that haven’t been tagged.

[Screenshot: warehouse data surfaced in system tables]

To enable this, ask an account admin to run:

 

%sh
curl -v -X PUT -H "Authorization: Bearer <Developer Access Token>" "https://<your workspace url>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<metastore id found in UC>/systemschemas/compute"
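Once enabled, a query along these lines will surface the offenders; the column names are my best guess at the system.compute.warehouses schema, so run DESCRIBE on the table to confirm them:

%sql
SELECT warehouse_name, warehouse_size, max_clusters, auto_stop_minutes
FROM system.compute.warehouses
WHERE delete_time IS NULL        -- warehouses that still exist
  AND auto_stop_minutes > 60     -- anything with a generous auto stop
ORDER BY auto_stop_minutes DESC;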

 

 

🐘 Materialised View updates

MVs have had a few welcome updates recently, notably the ability to monitor runs with the Query History UI, masking parity, and an easier way to schedule that doesn’t require deciphering CRON syntax. All of these have working examples in this notebook here.

I’ll do anything to avoid using the quartz cron scheduler (it’s a good site, I just lack the ability to retain any understanding of it). The new schedule is useful for standard intervals ranging from hours to weeks, but if you need something more intricate, it’s back to figuring out * vs ?

Instead of 

 

%sql
CREATE MATERIALIZED VIEW ham_feasibility
SCHEDULE CRON '0 0 * * * ? *' --old schedule syntax (Quartz insists one of the day fields is '?')
AS . . .

 

It’s now

 

%sql
CREATE MATERIALIZED VIEW ham_feasibility
SCHEDULE EVERY 1 HOUR --new schedule syntax
AS . . .

 

The next big update MVs have had is the ability to see their performance in the Query History UI. Now it’s a lot easier to understand any drops in performance, especially if they’re sporadic and difficult to recreate. This can be accessed from the left-hand menu by filtering compute to ‘DLT’ and statement to ‘REFRESH’.

[Screenshot: the Query History UI filtered to DLT compute and REFRESH statements]

Another update to materialised views (and streaming tables) is the new support for row filters and column masks.

For row filtering, create a function based on group membership like you would for any other table, then add the keywords ‘with row filter’ to the creation statement.
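For instance, a minimal group-membership function might look like this (the function body and group name are my assumptions to show the shape, not the notebook’s code):

%sql
CREATE OR REPLACE FUNCTION veggie_filter()
RETURN is_account_group_member('omnivore'); -- only members of the omnivore group see any rows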

 

%sql
CREATE MATERIALIZED VIEW ham_options (ham_type STRING, is_edible BOOLEAN) 
WITH ROW FILTER veggie_filter ON () --apply the function we just defined (it takes no column arguments)
AS SELECT * FROM ham_types

--now this view will show results tailored to whether or not the user is part of the omnivore group stored in unity catalog

 

And for column masking, define the function then apply it in the creation step after defining the column.
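The masking function itself follows the same pattern; here’s a sketch of what undefined_preference might look like (again, assumed names and logic), hiding the value from anyone outside the omnivore group:

%sql
CREATE OR REPLACE FUNCTION undefined_preference(is_edible BOOLEAN)
RETURN CASE WHEN is_account_group_member('omnivore') THEN is_edible ELSE NULL END;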

 

%sql
CREATE OR REPLACE MATERIALIZED VIEW ham_options
 (ham_type STRING,
 is_edible BOOLEAN MASK undefined_preference) -- applying the column mask to our materialised view
AS SELECT * FROM ham_types;

 

 

One final update for materialised views is the ability to set the compute to run on the preview channel, which gives access to upcoming features before they reach the current channel. This is done with the channel setting at table creation.

 

%sql
CREATE MATERIALIZED VIEW preview_channel_example
TBLPROPERTIES(pipelines.channel = "PREVIEW") --set the compute to use features in public preview
AS SELECT day('2019-01-07') as result

 

 

🦔 Lite mode VACUUM 

Do you ever run a pipeline and know that you should run a vacuum, but decide against it because it’ll take too long (ultimately making the problem worse for yourself tomorrow)? Lite mode strikes a balance between skipping it and running the full thing, offering a faster option that still prevents the proliferation of stale files.

The syntax is dead easy, it's just VACUUM table_name LITE;

It doesn’t replace the default full version: it only scans for stale files referenced in the Delta log, and won’t do a full clean-out checking for empty directories, Parquet files from failed transactions, or arbitrary files from connectors.
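Side by side, with a made-up table name:

%sql
VACUUM prod.sales.transactions LITE; -- quick pass over files tracked in the Delta log
VACUUM prod.sales.transactions;      -- the default full clean, unchanged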

This feature didn’t quite make it to 🦩 sized as the majority of people don’t have data big enough to be running into this problem. This feature is for you if you notice your VACUUM statements are taking over 15 mins.

