Use SQL Command LIST Volume for Alerts

RobinK
New Contributor III

Hi,

We have implemented a Databricks Workflow that saves an Excel sheet to a Databricks Volume. Now we want to notify users with an Alert when new data arrives in the volume.

In the docs I found the SQL command LIST, which returns the columns path, name, size, and modification_time for all files located in a volume:

LIST '/Volumes/my-catalog/my-schema/my-volume'

But I cannot find a way to integrate the LIST command into a "normal" SQL statement, which is required for Databricks Alerts. My approach (which returns a [PARSE_SYNTAX_ERROR]):

WITH v AS (
  SELECT *
  FROM LIST '/Volumes/my-catalog/my-schema/my-volume'
)
SELECT *
FROM v

2 REPLIES

gabsylvain
New Contributor III
Accepted Solution

Hi @RobinK,

I've tested your code and I was able to reproduce the error. Unfortunately, I haven't found a pure SQL alternative to selecting the results of the LIST command as part of a subquery or CTE and creating an alert based on that.

Fortunately, the content of a Volume (with the info you need, i.e. path, modification time, etc.) can be obtained via other means.

My assumption here is that not every Workflow run ingests a new Excel sheet into the Volume, and that this is why you need additional logic to know when to send the alert. If that is the case, then my proposition to you is the following:

First, you can use Databricks Utilities to list the content of the volume in a Python notebook:

files_in_volume = dbutils.fs.ls("/Volumes/<catalog>/<schema>/<volume>/")

Then, you can iterate through the files and collect each element's information into a list of dictionaries:

data = [{"path": x.path, "name": x.name, "size": x.size, "modificationTime": x.modificationTime} for x in files_in_volume]

And finally, you can create a DataFrame from this list and save it as a table:

df = spark.createDataFrame(data)
df.write.mode("overwrite").saveAsTable("<catalog_name>.<schema_name>.<table_name>")

🔄 This could be run as an additional step in your Databricks Workflow, so that every time the workflow runs, the table containing the volume content is updated as well.

Once the results are loaded in a table, you can run SELECT queries on it, and therefore you can set alerts as well.
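
For example, here is a minimal sketch of an Alert query against that table (the placeholder names are the ones used above; dbutils.fs.ls reports modificationTime as epoch milliseconds, so the threshold is expressed in milliseconds too):

-- Hypothetical Alert query: count files modified within the last 24 hours.
-- Assumes modificationTime holds epoch milliseconds, as returned by dbutils.fs.ls.
SELECT COUNT(*) AS new_files
FROM <catalog_name>.<schema_name>.<table_name>
WHERE modificationTime > unix_millis(current_timestamp()) - 24 * 60 * 60 * 1000

The Alert could then be configured to trigger whenever new_files is greater than 0.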

If, on the contrary, a new Excel sheet is loaded into the Volume every time the Workflow runs, then you could simply set an alert on success directly within your workflow.

Hope that can be useful to you. 

Thanks,

Gab

RobinK
New Contributor III

Hi @gabsylvain,

Thanks for the reply! That could work for us.

You don't even need to use dbutils; SQL in a notebook works just fine:

df = spark.sql("LIST '/Volumes/my-catalog/my-schema/my-volume'")
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.my_volume_metadata_table")
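
For the Alert itself, a query along these lines could work, as a sketch only (it assumes LIST returns modification_time as a timestamp, which is worth verifying against your table):

-- Hypothetical Alert query: count files modified within the last 24 hours.
-- Assumes modification_time is a TIMESTAMP; adjust the comparison if it
-- arrives as epoch milliseconds instead.
SELECT COUNT(*) AS new_files
FROM my_catalog.my_schema.my_volume_metadata_table
WHERE modification_time > current_timestamp() - INTERVAL 24 HOURS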

We are using Spark Connect and VS Code for development, so using dbutils is always a bit painful in local development. And the Databricks WorkspaceClient() class does not seem to support listing volumes right now.

Kind regards

Robin