<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135045#M50263</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;This is a well-known limitation of DLT/Declarative Pipelines. You just shouldn't use toPandas() as part of your Lakeflow Declarative Pipelines code:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1760553368587.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20756i7DB56D69B6E21A14/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1760553368587.png" alt="szymon_dybczak_0-1760553368587.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 15 Oct 2025 18:36:40 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2025-10-15T18:36:40Z</dc:date>
    <item>
      <title>DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135007#M50254</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m running into an issue with a Delta Live Tables (DLT) pipeline that processes a few transformation layers (raw → intermediate → primary → feature).&lt;/P&gt;&lt;P&gt;When I trigger the entire pipeline, it fails with the following error:&lt;BR /&gt;&lt;EM&gt;can not infer schema from empty dataset&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The error happens at this line:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df_spark = spark.createDataFrame(df_cleaned)&lt;/LI-CODE&gt;&lt;P&gt;However, if I run the steps manually (table by table), everything works perfectly. Even more strangely, once I’ve run the layers manually, the full pipeline runs successfully afterward. This makes me think the issue is related to dependency resolution or execution timing in DLT.&lt;/P&gt;&lt;HR /&gt;&lt;H3&gt;Simplified example&lt;/H3&gt;&lt;P&gt;Here’s a simplified version of my code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
from pyspark.sql import functions as F

@dlt.table(name="bronze_table")
def bronze_table():
    return spark.read.table("source_table")

@dlt.table(name="silver_intermediate")
def silver_intermediate():
    df = dlt.read("bronze_table")
    return df.withColumn("processed_col", F.upper(F.col("some_col")))

@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    df = df.withColumn("year", F.substring(F.col("date_col"), 1, 4))
    pdf = df.pandas_api()
    pdf_filtered = pdf[pdf["year"].notnull()]
    return pdf_filtered.to_spark()

@dlt.table(name="silver_feature")
def silver_feature():
    df = dlt.read("silver_primary").pandas_api()
    pdf = df.to_pandas()
    pdf_cleaned = pdf.dropna()
    # This line fails when the pipeline runs end-to-end
    df_spark = spark.createDataFrame(pdf_cleaned)
    return df_spark&lt;/LI-CODE&gt;&lt;HR /&gt;&lt;H3&gt;What I suspect&lt;/H3&gt;&lt;P&gt;It seems that DLT might be running silver_feature before silver_primary has finished materializing, causing dlt.read("silver_primary") to return an empty dataset. When I run things manually, each dependency already exists, so it works fine.&lt;/P&gt;&lt;HR /&gt;&lt;H3&gt;Questions&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Is there a known timing or dependency issue in DLT when chaining multiple transformations that mix Spark and Pandas API on Spark operations (and even pandas ops)?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is there a way to ensure that DLT waits until an upstream table has data before running the next step?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Wed, 15 Oct 2025 14:01:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135007#M50254</guid>
      <dc:creator>databricksero</dc:creator>
      <dc:date>2025-10-15T14:01:31Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135009#M50256</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;&lt;/P&gt;&lt;P&gt;The error occurs at this line:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df_spark = spark.createDataFrame(df_cleaned)&lt;/LI-CODE&gt;&lt;P&gt;This happens because, during end-to-end execution of the pipeline, df_cleaned can end up being an empty pandas DataFrame, for example if the upstream table (silver_primary) hasn't been fully materialized or populated yet.&lt;/P&gt;&lt;P&gt;I'll try a few code snippets and get back to you with exact code later today, but I would start by handling the empty DataFrame case and using only Spark transformations.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Oct 2025 14:12:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135009#M50256</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-10-15T14:12:52Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135041#M50261</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;&amp;nbsp;it seems like you've identified the issue. It's certainly leaning towards the order of execution.&lt;BR /&gt;&lt;BR /&gt;Firstly, here's some great documentation on how DLT works conceptually:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/ldp/concepts" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/ldp/concepts&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Here's a six-video YouTube playlist on Lakeflow Declarative Pipelines:&amp;nbsp;&lt;A href="https://youtube.com/playlist?list=PL7S7dD8r4QdU5FZzMNS7qlUkTEby6I9VK&amp;amp;si=kTN4bHCfbjHAAHyK" target="_blank" rel="noopener"&gt;https://youtube.com/playlist?list=PL7S7dD8r4QdU5FZzMNS7qlUkTEby6I9VK&amp;amp;si=kTN4bHCfbjHAAHyK&lt;/A&gt;. It even has a project in there&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":grinning_face:"&gt;😀&lt;/span&gt;.&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;&amp;nbsp;once you've created the LDP, I'm sure there's a way to export it as YAML etc. You can see how to string it together through code that way&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;.&lt;/P&gt;&lt;P&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Wed, 15 Oct 2025 18:10:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135041#M50261</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-10-15T18:10:58Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135045#M50263</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;This is a well-known limitation of DLT/Declarative Pipelines. You just shouldn't use toPandas() as part of your Lakeflow Declarative Pipelines code:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1760553368587.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20756i7DB56D69B6E21A14/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1760553368587.png" alt="szymon_dybczak_0-1760553368587.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 15 Oct 2025 18:36:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135045#M50263</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-10-15T18:36:40Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135052#M50265</link>
      <description>&lt;P&gt;But the following excerpt from an old version of the documentation is interesting:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1760554908781.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20760iDB3729BBA257E363/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1760554908781.png" alt="szymon_dybczak_0-1760554908781.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;&amp;nbsp;I wonder if the following workaround could work. I haven’t tested it, and there might be some typos since I wrote it from memory, but I hope you get the idea.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
from pyspark.sql import functions as F

def pandas_function(spark_df):
    pdf = spark_df.toPandas()
    pdf_filtered = pdf[pdf["year"].notnull()]
    return spark.createDataFrame(pdf_filtered)
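
# A Spark-only sketch of the same filter, offered as an untested assumption
# rather than something from the thread: it stays lazy, so nothing is pulled
# to the driver while the pipeline graph is resolved. spark_only_function is
# a hypothetical name; spark_df is expected to be a Spark DataFrame.
def spark_only_function(spark_df):
    # DataFrame.filter accepts a SQL expression string, so no extra import
    # is needed for this null check.
    return spark_df.filter("year IS NOT NULL")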


@dlt.table(name="silver_primary")
def silver_primary():
    df = dlt.read("silver_intermediate")
    df = df.withColumn("year", F.substring(F.col("date_col"), 1, 4))
    df_transformed = pandas_function(df)
    # pandas_function already returns a Spark DataFrame, so no to_spark() call is needed
    return df_transformed&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 15 Oct 2025 19:07:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135052#M50265</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-10-15T19:07:51Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135093#M50274</link>
      <description>&lt;P&gt;Thanks for your reply! I tried this as well, but unfortunately it doesn't work either.&lt;/P&gt;&lt;P&gt;Is there by chance a workaround or "hack" to explicitly state the dependency so that the Databricks planner can still figure out the proper order of execution?&lt;/P&gt;</description>
      <pubDate>Thu, 16 Oct 2025 08:34:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135093#M50274</guid>
      <dc:creator>databricksero</dc:creator>
      <dc:date>2025-10-16T08:34:11Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135097#M50275</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Unfortunately, I don't think so. That's probably why the docs say we should not use certain operations in declarative pipelines &lt;span class="lia-unicode-emoji" title=":confused_face:"&gt;😕&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 Oct 2025 09:41:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135097#M50275</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-10-16T09:41:53Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135105#M50277</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191983"&gt;@databricksero&lt;/a&gt;&lt;/P&gt;&lt;P&gt;Explicit Schema Definition: When calling spark.createDataFrame(pdf_cleaned), explicitly provide the schema even if the DataFrame is empty. With an explicit schema, Spark does not have to infer the types, which avoids the “can not infer schema from empty dataset” error.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_0-1760610930269.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20766i042F6895EA95CBA4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_0-1760610930269.png" alt="ManojkMohan_0-1760610930269.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Guard Against Empty DataFrames: Check whether pdf_cleaned is empty before creating a Spark DataFrame. If it is empty, create a placeholder DataFrame (with the right schema) instead.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_1-1760610971213.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20767i5A58C23497FC2A94/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_1-1760610971213.png" alt="ManojkMohan_1-1760610971213.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I agree with&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;.&amp;nbsp;&lt;SPAN&gt;There isn’t a safe “hack” to force DLT dependency order when mixing Spark and pandas APIs inside declarative tables, because DLT (and Lakeflow Declarative Pipelines) infers dependencies from dlt.read() calls and doesn’t always guarantee that upstream tables are materialized and populated before a function runs, particularly when converting to or from pandas.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Knowledge base article describing this limitation:&amp;nbsp;&lt;A href="https://kb.databricks.com/delta-live-tables" target="_blank"&gt;https://kb.databricks.com/delta-live-tables&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 Oct 2025 10:42:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135105#M50277</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-10-16T10:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: DLT pipeline fails with “can not infer schema from empty dataset” — works fine when run manually</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135234#M50311</link>
      <description>&lt;P&gt;Just updating my previous comment. I wasn't too sure about the order of execution with Lakeflow Declarative Pipelines, as I'm just learning about them now. I didn't know the execution order is handled implicitly (which is freaking awesome by the way, kudos to LDP/DLT). I retract my previous comment about that being a root cause. Below is a screenshot from a lecture I'm currently taking. I appreciate it relates to SQL, but it shows the theory for anyone else who was curious&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_0-1760698737632.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20808iF05679DC76A6203F/image-size/large?v=v2&amp;amp;px=999" role="button" title="BS_THE_ANALYST_0-1760698737632.png" alt="BS_THE_ANALYST_0-1760698737632.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Fri, 17 Oct 2025 11:02:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-pipeline-fails-with-can-not-infer-schema-from-empty-dataset/m-p/135234#M50311</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-10-17T11:02:24Z</dc:date>
    </item>
  </channel>
</rss>

