<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic SAS token issue for long running micro-batches in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/sas-token-issue-for-long-running-micro-batches/m-p/98093#M39604</link>
    <description>Databricks Community thread: a Structured Streaming foreachBatch workload fails to acquire a SAS token for its checkpoint whenever a single micro-batch takes longer than one hour; worked around by tuning micro-batch size with maxFilesPerTrigger.</description>
    <pubDate>Thu, 07 Nov 2024 15:16:51 GMT</pubDate>
    <dc:creator>deecee</dc:creator>
    <dc:date>2024-11-07T15:16:51Z</dc:date>
    <item>
      <title>SAS token issue for long running micro-batches</title>
      <link>https://community.databricks.com/t5/data-engineering/sas-token-issue-for-long-running-micro-batches/m-p/98093#M39604</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;
&lt;P&gt;I'm having an issue with some of our Databricks workloads, which we process using the foreachBatch stream processing method. Whenever we perform a full reload of one of our data sources, we get the following error.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;[STREAM_FAILED] Query [id = 00000000-0000-0000-0000-000000000000, runId = 00000000-0000-0000-0000-000000000000] terminated with exception: Failed to acquire a SAS token for get-status on /checkpoints/commits/0 due to java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.UnityCatalogServiceException: [RequestId=00000000-0000-0000-0000-000000000000 ErrorClass=INVALID_PARAMETER_VALUE.INVALID_PARAMETER_VALUE] Input path abfss://some-container@somestorageaccount.dfs.core.windows.net/ overlaps with other external tables or volumes. Conflicting tables/volumes: some_catalog.some_schema.some_table SQLSTATE: XXKST&lt;/LI-CODE&gt;
&lt;P&gt;The error message is quite strange, since we don't have any overlapping tables or checkpoints. We have noticed that this only happens when the micro-batches grow so large that a single one takes more than an hour to complete.&lt;/P&gt;
&lt;P&gt;Could it be that the SAS token expires after one hour, causing the checkpoint commit to fail?&lt;/P&gt;
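&lt;P&gt;For reference, here is a simplified sketch of the job; the table, column, and path names are placeholders, not our real ones.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Simplified sketch (placeholder names): a stream merged into a target
# table via foreachBatch; on a full reload a single micro-batch can run
# for well over an hour.
def merge_batch(batch_df, batch_id):
    batch_df.createOrReplaceTempView("updates")
    # The MERGE below is the long-running part of each micro-batch.
    batch_df.sparkSession.sql("""
        MERGE INTO some_catalog.some_schema.some_table AS t
        USING updates AS s ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

(spark.readStream
    .table("some_catalog.some_schema.source_table")
    .writeStream
    .foreachBatch(merge_batch)
    .option("checkpointLocation",
            "abfss://some-container@somestorageaccount.dfs.core.windows.net/checkpoints")
    .start())&lt;/LI-CODE&gt;
&lt;P&gt;Thanks&lt;/P&gt;</description>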
      <pubDate>Thu, 07 Nov 2024 15:16:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sas-token-issue-for-long-running-micro-batches/m-p/98093#M39604</guid>
      <dc:creator>deecee</dc:creator>
      <dc:date>2024-11-07T15:16:51Z</dc:date>
    </item>
    <item>
      <title>Re: SAS token issue for long running micro-batches</title>
      <link>https://community.databricks.com/t5/data-engineering/sas-token-issue-for-long-running-micro-batches/m-p/99861#M40117</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/118118"&gt;@deecee&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;Can you please confirm there are no external locations or volumes that could lead to this overlap? What do you actually have in "some_catalog.some_schema.some_table" and at "abfss://some-container@somestorageaccount.dfs.core.windows.net/"?&lt;BR /&gt;Also, just curious: are you saying a micro-batch in your streaming application is expected to take more than an hour? Could you please clarify the use case if possible?&lt;/P&gt;
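&lt;P&gt;For example, something like this would show what actually lives at each path (a sketch; it assumes a Unity Catalog-enabled workspace and uses the table name from your error message):&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# List the external locations defined in the metastore and their URLs.
spark.sql("SHOW EXTERNAL LOCATIONS").show(truncate=False)

# DESCRIBE DETAIL returns the storage location of the conflicting table.
spark.sql("DESCRIBE DETAIL some_catalog.some_schema.some_table") \
    .select("location").show(truncate=False)&lt;/LI-CODE&gt;</description>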
      <pubDate>Sat, 23 Nov 2024 18:23:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sas-token-issue-for-long-running-micro-batches/m-p/99861#M40117</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2024-11-23T18:23:31Z</dc:date>
    </item>
    <item>
      <title>Re: SAS token issue for long running micro-batches</title>
      <link>https://community.databricks.com/t5/data-engineering/sas-token-issue-for-long-running-micro-batches/m-p/99954#M40156</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34618"&gt;@VZLA&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;I can confirm there are no overlapping locations. We eventually got a successful run by increasing the cluster size until the micro-batches stayed under one hour. I was really thrown off by the error message, though, so I was wondering if and how it is related to micro-batch size.&lt;/P&gt;
&lt;P&gt;What we are trying to do is process a table's CDF stream and merge the changes into another table. In this particular case we had to reprocess the whole table, which resulted in some micro-batches of over 40 billion records. Looking at the Spark UI, I noticed that it reads 1,000 files per micro-batch, so the approach now is to use the &lt;FONT face="courier new,courier"&gt;maxFilesPerTrigger&lt;/FONT&gt; option to tune the micro-batch size.&lt;/P&gt;
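&lt;P&gt;The tuned read looks roughly like this (a sketch with placeholder names; the exact file cap is still something we're experimenting with):&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch (placeholder names): cap the number of files per micro-batch so
# a single batch finishes well within the one-hour token lifetime.
cdf_stream = (spark.readStream
    .option("readChangeFeed", "true")   # read the table's change data feed
    .option("maxFilesPerTrigger", 250)  # tune down from the 1,000 we observed
    .table("some_catalog.some_schema.source_table"))&lt;/LI-CODE&gt;
&lt;P&gt;Thanks&lt;/P&gt;</description>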
      <pubDate>Mon, 25 Nov 2024 13:06:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sas-token-issue-for-long-running-micro-batches/m-p/99954#M40156</guid>
      <dc:creator>deecee</dc:creator>
      <dc:date>2024-11-25T13:06:05Z</dc:date>
    </item>
  </channel>
</rss>

