Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Serverless compute throws OUT_OF_MEMORY exception

bi_123
New Contributor II

I'm running a Lakeflow Declarative Pipeline that reads data from a bronze table ingested by Auto Loader and writes it to a silver table with simple transformations.

The source data contains struct columns with many deeply nested fields. The table currently has approximately 593,500 rows.

The pipeline was originally running on serverless compute, but it recently started failing with OUT_OF_MEMORY errors. The failure message includes:

`Job aborted due to stage failure: Task 0 in stage 303.0 failed 4 times ...`

What is the recommended way to resolve this? If I switch from serverless to pipeline compute, what should the cluster configuration look like, and what settings or design considerations should I pay attention to for this type of nested-struct workload?

2 REPLIES

anmolhhns
New Contributor


Hi @bi_123 , before changing compute configurations, I would first try to narrow down where the memory pressure is coming from. With only ~593K rows, the issue is likely not the row count alone, but the width/depth of the nested struct columns or how they are being transformed. A few checks I would do first:

  1. Check the failed update / event log
    Look at the Lakeflow pipeline update details and event log to identify which table/flow failed and whether the failure is tied to a specific transformation step.
  2. Inspect the schema width and nesting depth
    Check how many nested fields are present and whether the silver transformation is carrying the full struct forward, e.g. with:

    df.printSchema()

  3. Identify heavy transformations
    Review whether the pipeline is doing wide select("*"), from_json on a large schema, explode on arrays of structs, joins, aggregations, or shuffles on nested data.
  4. Check partition/task behavior
    If one task is repeatedly failing, it may indicate skew or that a single partition has too much nested data to process.
  5. Validate whether all nested fields are required
    If silver only needs a subset of fields, project those fields early instead of carrying the entire nested payload.

      Once you identify the failing flow and transformation, the fix becomes clearer: for example, projecting only required nested fields, flattening in smaller stages, dropping raw nested structs after extraction, or adjusting partitioning before the heavy step.
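As a quick illustration of the schema-width check in step 2, here is a small pure-Python sketch (the count_leaves helper and the sample schema dict are illustrative stand-ins, not a Databricks or Spark API) that counts leaf fields and the maximum nesting depth:

```python
def count_leaves(schema, depth=0):
    """Recursively count leaf fields and the maximum nesting depth
    in a dict-based schema representation."""
    if not isinstance(schema, dict):
        return 1, depth  # a non-dict value is a leaf field
    total, max_depth = 0, depth
    for value in schema.values():
        leaves, d = count_leaves(value, depth + 1)
        total += leaves
        max_depth = max(max_depth, d)
    return total, max_depth

# Illustrative nested schema resembling a struct-heavy payload
sample = {
    "id": "long",
    "payload": {
        "customer": {"id": "string", "name": "string"},
        "order": {"amount": "double",
                  "items": {"sku": "string", "qty": "int"}},
    },
}

fields, depth = count_leaves(sample)
print(fields, depth)  # 6 leaf fields, max nesting depth 4
```

A real payload with hundreds of leaves at depth 5+ is a very different memory profile than the row count suggests, which is why this check comes before any cluster resizing.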

      If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.



amirabedhiafi
New Contributor III

Hello @bi_123!

Serverless is normally the recommended default for Lakeflow Declarative Pipelines because Databricks manages the infrastructure and applies enhanced autoscaling, including both horizontal and vertical scaling. However, you may need classic compute, which lets you explicitly choose worker and driver instance types and is useful for memory-heavy workloads.

For this case, I would treat the nested structs as the main issue, not the number of rows. Avoid SELECT * or expanding all nested fields unless they are really needed; instead, project only the required nested fields using dot notation, for example:

df.select(
    "id",
    "event_date",
    "payload.customer.id",
    "payload.order.amount"
)

https://docs.databricks.com/gcp/en/semi-structured/complex-types

If you want to switch to classic compute, you can use enhanced autoscaling and a memory optimized worker type and start with something like:

{
  "clusters": [
    {
      "label": "updates",
      "autoscale": {
        "min_workers": 2,
        "max_workers": 8,
        "mode": "ENHANCED"
      },
      "node_type_id": "<memory_optimized_worker>",
      "driver_node_type_id": "<memory_optimized_or_general_purpose_driver>"
    }
  ]
}

For Azure, that could be an E-series VM such as Standard_E8ds_v5 or Standard_E16ds_v5, and for AWS an R-series type such as r6i/r7i. The exact size depends on the width of the nested structs and on Spark UI metrics; Databricks allows instance type selection specifically to improve performance or address memory issues, and the "updates" label ensures the larger instance type applies to the pipeline update cluster rather than unnecessarily to maintenance compute.
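To sanity-check whether a given worker size is plausible, a rough back-of-envelope memory estimate can help. A minimal sketch, where the 20 KB per-row figure is a hypothetical assumption for a wide nested row (measure the real value from the Spark UI or table statistics) and the partition count is just an example:

```python
# Rough memory estimate for the workload described in the question.
rows = 593_500                  # approximate row count from the question
avg_row_bytes = 20 * 1024       # HYPOTHETICAL: ~20 KB per wide nested row
partitions = 8                  # example partition count; check the Spark UI for yours

total_gib = rows * avg_row_bytes / 1024**3
per_task_gib = total_gib / partitions
print(f"~{total_gib:.1f} GiB total, ~{per_task_gib:.2f} GiB per task if evenly split")
# With skew, a single partition can hold far more than the even-split figure,
# which is consistent with one task (Task 0) repeatedly failing.
```

If the per-task figure is close to the executor memory of the instance type you pick, either size up or repartition before the heavy transformation.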

If this answer resolves your question, could you please mark it as “Accept as Solution”? It will help other users quickly find the correct fix.

Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP