topic DLT cloudfiles trigger interval not working in Data Engineering

DLT cloudfiles trigger interval not working

elifa — Mon, 07 Aug 2023 17:58:00 GMT

I have the following streaming table definition using cloudfiles format and pipelines.trigger.interval setting to reduce file discovery costs but the query is triggering every 12 seconds instead of every 5 minutes.

Is there another configuration I am missing or DLT cloudfiles does not work with that setting?

@dlt.table def s3_data( spark_conf={"pipelines.trigger.interval" : "5 minutes"}, table_properties={ "quality": "bronze", "pipelines.reset.allowed": "false" # preserves the data in the delta table if you do full refresh } ): return ( spark.readStream.format("cloudFiles") .option("cloudFiles.format", "json") .load("s3://my-bucket/") .withColumn("filePath", input_file_name()) )

Re: DLT cloudfiles trigger interval not working

Timothydickers — Tue, 08 Aug 2023 04:39:51 GMT

@elifa wrote:
I have the following streaming table definition using cloudfiles format and pipelines.trigger.interval setting to reduce file discovery costs but the query is triggering every 12 seconds instead of every 5 minutes.
Is there another configuration I am missing or DLT cloudfiles does not work with that setting? Sheetz Listens

Hello,

The pipelines.trigger.interval setting is designed to control the discovery interval for new files in the input path when using Delta Lake Time Travel with CloudFiles as the streaming source in Databricks. However, there seems to be an issue with the trigger interval not being honored as expected.

First, verify that you are using the correct syntax for the pipelines.trigger.interval setting. The correct format is "5m", representing 5 minutes, instead of "5 minutes". Update your spark_conf to set the trigger interval as follows:

spark_conf={"pipelines.trigger.interval": "5m"}

If the issue persists, consider checking for any potential limitations or updates in the Databricks version you are using. It is possible that there might be a bug or a compatibility issue in the version you have installed.

Another possible approach is to check the Databricks documentation or forums to see if there are any known issues or workarounds related to the pipelines.trigger.interval setting when using CloudFiles as the streaming source.

In case you are still facing the problem, you may consider reaching out to Databricks support for assistance and further investigation. They can provide specific insights into the behavior of the pipelines.trigger.interval setting with CloudFiles in your Databricks environment and offer guidance on resolving the issue.

Re: DLT cloudfiles trigger interval not working

Tharun-Kumar — Tue, 08 Aug 2023 04:39:52 GMT

@elifa

Could you check for this message in the log file?

INFO EnzymePlanner: Planning for flow: s3_data

According to the config pipelines.trigger.interval, the planning should happen once in every 5 minutes.

Re: DLT cloudfiles trigger interval not working

elifa — Tue, 08 Aug 2023 12:44:01 GMT

The log below is how I can see that it is running every 12 seconds. I am using the same configuration on other tables that are not cloudfiles format and it works fine on them.

23/08/08 04:59:00 INFO MicroBatchExecution: Streaming query made progress: { "name" : "s3_data", "timestamp" : "2023-08-08T04:59:00.005Z", "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, } 23/08/08 04:59:12 INFO MicroBatchExecution: Streaming query made progress: { "name" : "s3_data", "timestamp" : "2023-08-08T04:59:12.000Z", "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, } 23/08/08 04:59:36 INFO MicroBatchExecution: Streaming query made progress: { "name" : "s3_data", "timestamp" : "2023-08-08T04:59:36.000Z", "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, } 23/08/08 04:59:48 INFO MicroBatchExecution: Streaming query made progress: { "name" : "s3_data", "timestamp" : "2023-08-08T04:59:48.002Z", "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, }