Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually

databricksero
New Contributor

Hi everyone,

I’m running into an issue with a Delta Live Tables (DLT) pipeline that processes a few transformation layers (raw → intermediate → primary → feature).

When I trigger the entire pipeline, it fails with the following error:
can not infer schema from empty dataset

The error happens at this line:

 
df_spark = spark.createDataFrame(df_cleaned) 

However, if I run the steps manually (table by table), everything works perfectly. Even more strangely, once I’ve run the layers manually, the full pipeline runs successfully afterward. This makes me think the issue is related to dependency resolution or execution timing in DLT.


Simplified example

Here’s a simplified version of my code:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="bronze_table")
def bronze_table():
    return spark.read.table("source_table")

@dlt.table(name="silver_intermediate")
def silver_intermediate():
    df = dlt.read("bronze_table")
    return df.withColumn("processed_col", F.upper(F.col("some_col")))

@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    df = df.withColumn("year", F.substring(F.col("date_col"), 0, 4))
    pdf = df.pandas_api()
    pdf_filtered = pdf[pdf["year"].notnull()]
    return pdf_filtered.to_spark()

@dlt.table(name="silver_feature")
def silver_feature():
    df = dlt.read("silver_primary").pandas_api()
    pdf = df.to_pandas()
    pdf_cleaned = pdf.dropna()
    # This line fails when the pipeline runs end-to-end
    df_spark = spark.createDataFrame(pdf_cleaned)
    return df_spark
 
 

What I suspect

It seems that DLT might be running silver_feature before silver_primary has finished materializing, causing dlt.read("silver_primary") to return an empty dataset. When I run things manually, each dependency already exists, so it works fine.


Questions

  1. Is there a known timing or dependency issue in DLT when chaining multiple transformations that mix Spark and Pandas API on Spark operations (and even pandas ops)?

  2. Is there a way to ensure that DLT waits until an upstream table has data before running the next step?

8 REPLIES

ManojkMohan
Honored Contributor

@databricksero  

The error occurs right at this line:

df_spark = spark.createDataFrame(df_cleaned)

This issue arises because, during end-to-end execution of the pipeline, df_cleaned can end up being an empty pandas DataFrame. This can happen if the upstream table (silver_primary) hasn't been fully materialized or populated yet.

I'll try a few code snippets and get back to you with the exact code later today, but I would try implementing empty DataFrame handling and using Spark-only transformations.
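In the meantime, here's a rough sketch of what a Spark-only version of the last two tables could look like, based on the simplified example above (untested, column names taken from that example):

import dlt
from pyspark.sql import functions as F

@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    # Derive the year and keep only rows where it is present, without converting to pandas
    df = df.withColumn("year", F.substring(F.col("date_col"), 1, 4))
    return df.filter(F.col("year").isNotNull())

@dlt.table(name="silver_feature")
def silver_feature():
    # Spark-native equivalent of pandas dropna(): drop rows containing any null
    return dlt.read("silver_primary").na.drop()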

BS_THE_ANALYST
Esteemed Contributor II

@databricksero seems like you've identified the issue. It's certainly leaning towards the order of execution. 

Firstly, here's some great documentation on how DLT works conceptually: https://docs.databricks.com/aws/en/ldp/concepts 

Here's a 6 video Youtube playlist on Lakeflow Declarative Pipelines: https://youtube.com/playlist?list=PL7S7dD8r4QdU5FZzMNS7qlUkTEby6I9VK&si=kTN4bHCfbjHAAHyK it even has a project in there 😀.

@databricksero once you've created the LDP, I'm sure there's a way to export it as YAML etc. You can see how to string it together through code that way 🙂.

All the best,
BS

Just updating my previous comment. I wasn't too sure about the order of execution with Lakeflow Declarative Pipelines; I'm just learning about them now. I didn't know the execution order is handled implicitly (which is freaking awesome by the way, kudos to LDP/DLT), so I retract my previous comment about that being a root cause. Below is a screenshot from a lecture I'm currently on. I appreciate it's about SQL, but it shows the theory, for anyone else who was curious 🙂

BS_THE_ANALYST_0-1760698737632.png

All the best,
BS

szymon_dybczak
Esteemed Contributor III

Hi @databricksero ,

This is a well-known limitation of DLT/Declarative Pipelines. You just shouldn't use toPandas() as part of your Lakeflow Declarative Pipelines code:

szymon_dybczak_0-1760553368587.png

 

But the following excerpt from an old version of the documentation is interesting:

szymon_dybczak_0-1760554908781.png

 

@databricksero I wonder if the following workaround could work. I haven't tested it, and there might be some typos since I wrote it from memory, but I hope you get the idea.

def pandas_function(spark_df):
    # Convert to pandas, filter out null years, and convert back to a Spark DataFrame
    pdf = spark_df.toPandas()
    pdf_filtered = pdf[pdf["year"].notnull()]
    return spark.createDataFrame(pdf_filtered)


@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    df = df.withColumn("year", F.substring(F.col("date_col"), 0, 4))
    # pandas_function already returns a Spark DataFrame, so no further conversion is needed
    return pandas_function(df)

 

Thanks for your reply! I also tried this, but unfortunately it doesn't work either.

Is there by chance a workaround or "hack" to explicitly state the dependency such that the Databricks planner can still figure out the proper order of execution?

 

 

szymon_dybczak
Esteemed Contributor III

Hi @databricksero ,

Unfortunately, I don't think so. That's probably why the docs say we shouldn't use certain operations in declarative pipelines 😕

ManojkMohan
Honored Contributor

@databricksero  

Explicit Schema Definition: When calling spark.createDataFrame(pdf_cleaned), explicitly provide the schema even if the DataFrame is empty. That way Spark doesn't have to infer the types, which prevents the "can not infer schema from empty dataset" error.

ManojkMohan_0-1760610930269.png
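A minimal sketch of what that could look like (the column names and types below are placeholders, not the real schema of your table):

from pyspark.sql import types as T

# Placeholder schema - replace with the actual columns of silver_primary
schema = T.StructType([
    T.StructField("some_col", T.StringType(), True),
    T.StructField("date_col", T.StringType(), True),
    T.StructField("year", T.StringType(), True),
])

# With an explicit schema, Spark doesn't need to infer types,
# so an empty pandas DataFrame no longer triggers the error
df_spark = spark.createDataFrame(pdf_cleaned, schema=schema)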

Guard Against Empty DataFrames: Check if pdf_cleaned is empty before creating a Spark DataFrame. If it's empty, create a dummy DataFrame (with the right schema) instead.

ManojkMohan_1-1760610971213.png
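Again just a sketch, reusing the placeholder schema from above:

# Fall back to an empty Spark DataFrame with the expected schema when there are no rows
if pdf_cleaned.empty:
    df_spark = spark.createDataFrame([], schema=schema)
else:
    df_spark = spark.createDataFrame(pdf_cleaned, schema=schema)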

I agree with @szymon_dybczak and @BS_THE_ANALYST. There isn't a safe "hack" to force DLT dependency order when mixing Spark and pandas APIs inside declarative tables. DLT (and Lakeflow Pipelines) relies on dependency inference from dlt.read() calls, and it doesn't always guarantee that an upstream table is materialized and populated before a downstream step runs, particularly when converting to/from pandas.

Knowledge base article covering this limitation: https://kb.databricks.com/delta-live-tables