Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @cdn_yyz_yul,

My best guess is that the issue is being triggered by the array-flattening step rather than by the join itself. In Databricks, materialized views are maintained via refresh logic, and the public documentation on incremental refresh for materialized views explains that not every query shape can be processed incrementally. The same documentation also notes that when a query contains unsupported expressions or operators, Databricks may fall back to a full recompute instead of using the incremental refresh path.

That lines up with the behaviour you described. When the flattening step is removed, the rows are present. When the flattening logic is added back, the first full refresh returns zero rows, but the next normal run produces the expected flattened output. That usually points to the query shape around the array expansion being the part that is not behaving well during the full refresh or recompute path, even though the upstream join itself is valid and the source data is available.

As a practical workaround, I would try separating the pipeline into two stages. First, materialise trans_with_description as-is. Then perform the array expansion in a separate downstream table or processing step. This usually makes it easier to confirm whether the full-refresh issue is specifically tied to the flattening logic. If you want to validate further, you can also review the public docs for pipeline refresh semantics, materialized views, and incremental refresh behavior, and test the query with EXPLAIN CREATE MATERIALIZED VIEW.

If useful, a simplified version of that pattern would look like this:

from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.materialized_view(name="trans_with_description")
def trans_with_description():
    tx = spark.read.table("transactions")
    desc = spark.read.table("description")
    return tx.join(desc, "id", "left")

@dp.table(name="final_result")
def final_result():
    df = spark.read.table("trans_with_description")
    return (
        df
        .withColumn("col_a_item", F.explode_outer("col_a"))
        .withColumn("col_b_item", F.explode_outer("col_b"))
    )

This is not necessarily the only valid implementation, but it is a useful way to isolate whether the refresh problem is specifically coming from the flattening step.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

View solution in original post