Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually

databricksero
New Contributor

Hi everyone,

I’m running into an issue with a Delta Live Tables (DLT) pipeline that processes a few transformation layers (raw → intermediate → primary → feature).

When I trigger the entire pipeline, it fails with the following error:
can not infer schema from empty dataset

The error happens at this line:

 
df_spark = spark.createDataFrame(df_cleaned) 

However, if I run the steps manually (table by table), everything works perfectly. Even more strangely, once I’ve run the layers manually, the full pipeline runs successfully afterward. This makes me think the issue is related to dependency resolution or execution timing in DLT.


Simplified example

Here’s a simplified version of my code:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="bronze_table")
def bronze_table():
    return spark.read.table("source_table")

@dlt.table(name="silver_intermediate")
def silver_intermediate():
    df = dlt.read("bronze_table")
    return df.withColumn("processed_col", F.upper(F.col("some_col")))

@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    df = df.withColumn("year", F.substring(F.col("date_col"), 0, 4))
    pdf = df.pandas_api()
    pdf_filtered = pdf[pdf["year"].notnull()]
    return pdf_filtered.to_spark()

@dlt.table(name="silver_feature")
def silver_feature():
    df = dlt.read("silver_primary").pandas_api()
    pdf = df.to_pandas()
    pdf_cleaned = pdf.dropna()
    # This line fails when the pipeline runs end-to-end
    df_spark = spark.createDataFrame(pdf_cleaned)
    return df_spark
 
 

What I suspect

It seems that DLT might be running silver_feature before silver_primary has finished materializing, causing dlt.read("silver_primary") to return an empty dataset. When I run things manually, each dependency already exists, so it works fine.


Questions

  1. Is there a known timing or dependency issue in DLT when chaining multiple transformations that mix Spark and Pandas API on Spark operations (and even pandas ops)?

  2. Is there a way to ensure that DLT waits until an upstream table has data before running the next step?

8 REPLIES

ManojkMohan
Honored Contributor

@databricksero  

The error occurs right at this line:

df_spark = spark.createDataFrame(df_cleaned)

This issue arises because, during end-to-end execution of the pipeline, df_cleaned can end up being an empty pandas DataFrame. This can happen if the upstream table (silver_primary) hasn't been fully materialized or populated yet.

I'll try a few code snippets and get back to you with the exact code later today, but I would try implementing empty DataFrame handling and using Spark-only transformations.
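In the meantime, here's a rough sketch of what a Spark-only version of the last two tables could look like, based on the simplified example above (untested, column names taken from that example):

import dlt
from pyspark.sql import functions as F

@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    # Derive the year and keep only rows where it is present, without converting to pandas
    df = df.withColumn("year", F.substring(F.col("date_col"), 1, 4))
    return df.filter(F.col("year").isNotNull())

@dlt.table(name="silver_feature")
def silver_feature():
    # Spark-native equivalent of pandas dropna(): drop rows containing any null
    return dlt.read("silver_primary").na.drop()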

BS_THE_ANALYST
Esteemed Contributor II

@databricksero seems like you've identified the issue. It's certainly leaning towards the order of execution. 

Firstly, here's some great documentation on how DLT works conceptually: https://docs.databricks.com/aws/en/ldp/concepts 

Here's a 6 video Youtube playlist on Lakeflow Declarative Pipelines: https://youtube.com/playlist?list=PL7S7dD8r4QdU5FZzMNS7qlUkTEby6I9VK&si=kTN4bHCfbjHAAHyK it even has a project in there 😀.

@databricksero once you've created the LDP, I'm sure there's a way to export it as YAML etc. You can see how to string it together through code that way 🙂.

All the best,
BS

Just updating my previous comment. I wasn't too sure about the order of execution with Lakeflow Declarative Pipelines; I'm just learning about them now. I didn't know the execution order is handled implicitly (which is freaking awesome by the way, kudos to LDP/DLT), so I retract my previous comment about that being a root cause. Below is a screenshot from a lecture I'm currently on. I appreciate it's about SQL, but it shows the theory, for anyone else who was curious 🙂

BS_THE_ANALYST_0-1760698737632.png

All the best,
BS

szymon_dybczak
Esteemed Contributor III

Hi @databricksero ,

This is a well-known limitation of DLT/Declarative Pipelines. You just shouldn't use toPandas() as part of your Lakeflow Declarative Pipelines code:

szymon_dybczak_0-1760553368587.png

 

But the following excerpt from an old version of the documentation is interesting:

szymon_dybczak_0-1760554908781.png

 

@databricksero I wonder if the following workaround could work. I haven't tested it, and there might be some typos since I wrote it from memory, but I hope you get the idea.

def pandas_function(spark_df):
    # Convert to pandas, filter out null years, and convert back to a Spark DataFrame
    pdf = spark_df.toPandas()
    pdf_filtered = pdf[pdf["year"].notnull()]
    return spark.createDataFrame(pdf_filtered)


@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    df = df.withColumn("year", F.substring(F.col("date_col"), 0, 4))
    # pandas_function already returns a Spark DataFrame, so no further conversion is needed
    return pandas_function(df)

 

Thanks for your reply! I also tried this, but unfortunately it doesn't work either.

Is there by chance a workaround or "hack" to explicitly state the dependency such that the Databricks planner can still figure out the proper order of execution?

 

 

szymon_dybczak
Esteemed Contributor III

Hi @databricksero ,

Unfortunately, I don't think so. That's probably why the docs say we shouldn't use certain operations in declarative pipelines 😕

ManojkMohan
Honored Contributor

@databricksero  

Explicit Schema Definition: When calling spark.createDataFrame(pdf_cleaned), explicitly provide the schema even if the DataFrame is empty. That way Spark doesn't have to infer the types, which prevents the "can not infer schema from empty dataset" error.

ManojkMohan_0-1760610930269.png
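A minimal sketch of what that could look like (the column names and types below are placeholders, not the real schema of your table):

from pyspark.sql import types as T

# Placeholder schema - replace with the actual columns of silver_primary
schema = T.StructType([
    T.StructField("some_col", T.StringType(), True),
    T.StructField("date_col", T.StringType(), True),
    T.StructField("year", T.StringType(), True),
])

# With an explicit schema, Spark doesn't need to infer types,
# so an empty pandas DataFrame no longer triggers the error
df_spark = spark.createDataFrame(pdf_cleaned, schema=schema)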

Guard Against Empty DataFrames: Check if pdf_cleaned is empty before creating a Spark DataFrame. If it's empty, create a dummy DataFrame (with the right schema) instead.

ManojkMohan_1-1760610971213.png
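Again just a sketch, reusing the placeholder schema from above:

# Fall back to an empty Spark DataFrame with the expected schema when there are no rows
if pdf_cleaned.empty:
    df_spark = spark.createDataFrame([], schema=schema)
else:
    df_spark = spark.createDataFrame(pdf_cleaned, schema=schema)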

I agree with @szymon_dybczak and @BS_THE_ANALYST. There isn't a safe "hack" to force DLT dependency order when mixing Spark and pandas APIs inside declarative tables. DLT (and Lakeflow Pipelines) relies on dependency inference from dlt.read() calls, and it doesn't always guarantee that an upstream table is materialized and populated before a downstream step runs, particularly when converting to/from pandas.

Knowledge base article covering this limitation: https://kb.databricks.com/delta-live-tables