Use SQL Command LIST Volume for Alerts

RobinK
New Contributor III

Hi,

We have implemented a Databricks Workflow that saves an Excel sheet to a Databricks Volume. Now we want to notify users with an Alert when new data arrives in the volume.

In the docs I found the SQL command LIST, which returns the columns path, name, size, and modification_time for all files located in a volume:

LIST '/Volumes/my-catalog/my-schema/my-volume'

But I cannot find a way to integrate the LIST command into a "normal" SQL statement, which is required for Databricks Alerts. My approach (which returns a [PARSE_SYNTAX_ERROR]):

WITH v AS (
  SELECT *
  FROM LIST '/Volumes/my-catalog/my-schema/my-volume'
)
SELECT *
FROM v

2 REPLIES

gabsylvain
New Contributor III
Accepted Solution

Hi @RobinK,

I've tested your code and I was able to reproduce the error. Unfortunately, I haven't found a pure SQL alternative to selecting the results of the LIST command as part of a subquery or CTE and creating an alert based on that.

Fortunately, the content of a Volume (with the info you need, i.e. path, modification time, etc.) can be obtained via other means.

My assumption here is that not every Workflow run ingests a new Excel sheet into the Volume, and that this is why you need additional logic to know when to send the alert. If that is the case, then my proposition to you is the following:

First, you can use Databricks Utilities to list the content of the volume in a Python notebook:

files_in_volume = dbutils.fs.ls("/Volumes/<catalog>/<schema>/<volume>/")

Then, you can iterate through the files and collect each element's information into a list of dictionaries:

data = [{"path": x.path, "name": x.name, "size": x.size, "modificationTime": x.modificationTime} for x in files_in_volume]

And finally, you can create a DataFrame from this list and save it as a table:

df = spark.createDataFrame(data)
df.write.mode("overwrite").saveAsTable("<catalog_name>.<schema_name>.<table_name>")

🔄 This could be run as an additional step in your Databricks Workflow, so that every time the workflow runs, the table containing the volume content is updated as well.

Once the results are loaded in a table, you can run SELECT queries on it, and therefore you can set alerts as well.
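
For example, here is a minimal sketch of an Alert query against that table (the placeholder names are the ones used above; dbutils.fs.ls reports modificationTime as epoch milliseconds, so the threshold is expressed in milliseconds too):

-- Hypothetical Alert query: count files modified within the last 24 hours.
-- Assumes modificationTime holds epoch milliseconds, as returned by dbutils.fs.ls.
SELECT COUNT(*) AS new_files
FROM <catalog_name>.<schema_name>.<table_name>
WHERE modificationTime > unix_millis(current_timestamp()) - 24 * 60 * 60 * 1000

The Alert could then be configured to trigger whenever new_files is greater than 0.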

If, on the contrary, a new Excel sheet is loaded into the Volume every time the Workflow runs, then you could simply set an alert on success directly within your workflow.

Hope that can be useful to you. 

Thanks,

Gab

RobinK
New Contributor III

Hi @gabsylvain,

Thanks for the reply! That could work for us.

You don't even need to use dbutils; SQL in a notebook works just fine:

df = spark.sql("LIST '/Volumes/my-catalog/my-schema/my-volume'")
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.my_volume_metadata_table")
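
For the Alert itself, a query along these lines could work, as a sketch only (it assumes LIST returns modification_time as a timestamp, which is worth verifying against your table):

-- Hypothetical Alert query: count files modified within the last 24 hours.
-- Assumes modification_time is a TIMESTAMP; adjust the comparison if it
-- arrives as epoch milliseconds instead.
SELECT COUNT(*) AS new_files
FROM my_catalog.my_schema.my_volume_metadata_table
WHERE modification_time > current_timestamp() - INTERVAL 24 HOURS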

We are using Spark Connect and VS Code for development, so using dbutils is always a bit painful in local development. And the Databricks WorkspaceClient() class does not seem to support listing volumes right now.

Kind regards

Robin