04-18-2024 07:54 PM
Hello all.
We are a new team implementing DLT and have set up a number of tables in a pipeline loading from S3 with UC as the target. I'm noticing that if any of the 20 or so tables fails to load, the entire pipeline fails, even when there are no dependencies between the tables. In our case, a new table was added to the DLT notebook but the source S3 directory is empty. This caused the pipeline to fail with the error "org.apache.spark.sql.catalyst.ExtendedAnalysisException: Unable to process statement for Table 'table_name'".
Is there a way to change this behavior in the pipeline configuration so that one table failing doesn't impact the rest of the pipeline?
06-25-2024 12:00 AM
@Retired_mod , could you please elaborate more on how to "allow other tables to continue processing even if one table encounters an error"?
04-30-2024 05:19 PM
Thank you for sharing this @Retired_mod. @dashawn, were you able to check Kaniz's docs? Do you still need help, or can you accept Kaniz's solution?
08-29-2024 12:21 PM
Could you please provide a link to the docs?
3 weeks ago
Can you please provide the docs?
3 weeks ago - last edited 3 weeks ago
DLT treats the whole pipeline as one unit, so if any table definition throws an error during the planning phase (not just execution), the entire update fails. An empty S3 directory causing a schema inference failure is exactly the kind of thing that kills the whole run.
The most practical fix for the empty-source problem specifically is to add a guard in your table definition. If you're using Auto Loader, you can pass an explicit schema with .schema() on the reader (or set cloudFiles.schemaLocation together with cloudFiles.schemaHints) so Spark doesn't try to infer a schema from an empty directory. That way the table definition stays valid even when there's nothing to read yet. It just processes zero rows.
For the broader "one table shouldn't tank the pipeline" concern, DLT doesn't have a built-in "skip on error" flag. It's a known pain point. What some teams do is split tables into separate pipelines grouped by criticality or source reliability, then orchestrate them through a Databricks Workflow. That way a flaky source only affects its own pipeline.
If splitting pipelines feels heavy-handed, you can also wrap the source read in a try/except within a Python DLT definition and return an empty DataFrame with the correct schema on failure. Something like:
@dlt.table
def my_table():
    try:
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .schema(my_schema)
            .load("s3://bucket/path")
        )
    except Exception:
        # Fall back to an empty DataFrame with the expected schema
        return spark.createDataFrame([], my_schema)
Not the prettiest, but it keeps the rest of the pipeline running. The table just stays empty until the source shows up.
One other thing worth knowing: Databricks recently added the ability to selectively refresh specific tables or retry just the failed tables from the pipeline UI. That doesn't prevent the initial failure, but it helps with recovery so you don't have to reprocess everything from scratch.
Please mark it as a Solution if this helps/resolves your issue so that others can benefit from it!