cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
cancel
Showing results for 
Search instead for 
Did you mean: 

Use SQL Command LIST Volume for Alerts

RobinK
Contributor

Hi,

we have implemented a Databricks Workflow that saves an Excel Sheet to a Databricks Volume. Now we want to notify users with an Alert, when new data arrives in the volume.

In the docs I found the SQL command LIST which returns the columns path, name, size and modification_time for all files located in a volume.

 

 

LIST '/Volumes/my-catalog/my-schema/my-volume'

 

 

 But I cannot find a way to integrate the LIST command into a "normal" SQL Statement, which is required for Databricks Alerts. My approach (which returns an [PARSE_SYNTAX_ERROR]):

 

WITH v AS (
  SELECT *
  FROM LIST '/Volumes/my-catalog/my-schema/my-volume'
)
SELECT *
FROM v

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

gabsylvain
New Contributor III
New Contributor III

Hi @RobinK ,

I've tested your code and I was able to reproduce the error. Unfortunately, I haven't found a pure SQL alternative to selecting the results of the LIST command as part of a subquery or CTE, and create an alert based on that. 

Fortunately, the content of a Volume (with the info you need, i.e. path, modification time, etc.) can be obtained via other means.

My assumption here is that not every Workflow run will ingest a new Excel sheet in the Volume? And that is why you need additional logic to know when to send the alert? If that is the case, then my proposition to you is the following:

First, you can use Databricks Utils to list the content of the volume in a Python notebook:

dbutils.fs.ls("/Volumes/<catalog>/<schema>/<volume>/")

Then, you can iterate through the files and get each element's information into a JSON object:

data = [{"path": x.path, "name": x.name, "size": x.size, "modificationTime": x.modificationTime} for x in files_in_volume]

And finally, you can create a dataframe from this JSON object and save it as a table:

df = spark.createDataFrame(data)
df.write.mode("overwrite").saveAsTable("<catalog_name>.<schema_name>.<table_name>")

🔄 This could be ran as an additional step to your Databricks Workflow, so that every time the workflow runs, the data table containing the volume content is updated as well.

 Once the results are loaded in a table, you can run SELECT queries on it and therefore you can set alerts as well.

If, on the contrary, a new Excel sheet is loaded in the Volume everytime the Workflow runs, then you could simply set an alert on success directly within your workflow.

Hope that can be useful to you. 

Thanks,

Gab

 

 

View solution in original post

2 REPLIES 2

gabsylvain
New Contributor III
New Contributor III

Hi @RobinK ,

I've tested your code and I was able to reproduce the error. Unfortunately, I haven't found a pure SQL alternative to selecting the results of the LIST command as part of a subquery or CTE, and create an alert based on that. 

Fortunately, the content of a Volume (with the info you need, i.e. path, modification time, etc.) can be obtained via other means.

My assumption here is that not every Workflow run will ingest a new Excel sheet in the Volume? And that is why you need additional logic to know when to send the alert? If that is the case, then my proposition to you is the following:

First, you can use Databricks Utils to list the content of the volume in a Python notebook:

dbutils.fs.ls("/Volumes/<catalog>/<schema>/<volume>/")

Then, you can iterate through the files and get each element's information into a JSON object:

data = [{"path": x.path, "name": x.name, "size": x.size, "modificationTime": x.modificationTime} for x in files_in_volume]

And finally, you can create a dataframe from this JSON object and save it as a table:

df = spark.createDataFrame(data)
df.write.mode("overwrite").saveAsTable("<catalog_name>.<schema_name>.<table_name>")

🔄 This could be ran as an additional step to your Databricks Workflow, so that every time the workflow runs, the data table containing the volume content is updated as well.

 Once the results are loaded in a table, you can run SELECT queries on it and therefore you can set alerts as well.

If, on the contrary, a new Excel sheet is loaded in the Volume everytime the Workflow runs, then you could simply set an alert on success directly within your workflow.

Hope that can be useful to you. 

Thanks,

Gab

 

 

Hi @gabsylvain ,

thanks for the reply! That could work for us. 

You don't even need to use dbutils, SQL in a Notebook works just fine:

df = spark.sql(f"LIST '/Volumes/my-catalog/my-schema/my-volume'")
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.my_volume_metadata_table")

We are using SparkConnect and VSCode for development, so using dbutils is always a bit painful in local development. And the Databricks WorkspaceClient() Class does not seem to support listing volumes right now.

Kind regards

Robin

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group