Hello @bi_123!
Serverless is normally the recommended default for Lakeflow Declarative Pipelines because Databricks manages the infrastructure and uses enhanced autoscaling, including both horizontal and vertical scaling. However, you may require classic compute, which lets you explicitly choose worker and driver instance types; that is useful for memory-heavy workloads.
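For reference, switching between the two comes down to the pipeline settings; a minimal sketch of the relevant fragment of the pipeline JSON (everything else omitted):

{
  "serverless": true
}

Set it to false and define a "clusters" array instead (as in the example further below) to use classic compute.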
For this case, I would first treat the nested structs as the main issue, not the number of rows. Avoid SELECT * and avoid expanding all nested fields unless they are really needed; instead, project only the required nested fields using dot notation, for example:
# Project only the nested fields that are actually needed, using dot notation
df.select(
    "id",
    "event_date",
    "payload.customer.id",
    "payload.order.amount"
)

Docs: https://docs.databricks.com/gcp/en/semi-structured/complex-types
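If the downstream table should have flat column names, you can also alias the projected nested fields; a small sketch (the column names are illustrative, taken from the example above):

from pyspark.sql import functions as F

# df is assumed to be the DataFrame from the example above
df.select(
    F.col("id"),
    F.col("event_date"),
    F.col("payload.customer.id").alias("customer_id"),
    F.col("payload.order.amount").alias("order_amount"),
)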
If you want to switch to classic compute, you can still use enhanced autoscaling with a memory-optimized worker type and start with something like:
{
  "clusters": [
    {
      "label": "updates",
      "autoscale": {
        "min_workers": 2,
        "max_workers": 8,
        "mode": "ENHANCED"
      },
      "node_type_id": "<memory_optimized_worker>",
      "driver_node_type_id": "<memory_optimized_or_general_purpose_driver>"
    }
  ]
}

For Azure, that could be an E-series VM such as Standard_E8ds_v5 or Standard_E16ds_v5, and for AWS an R-series type such as r6i/r7i. The exact size depends on the width of the nested structs and on the Spark UI metrics. Databricks allows instance type selection specifically to improve performance or address memory issues, and the "updates" label ensures the larger instance type applies to the pipeline update cluster rather than unnecessarily to maintenance compute.
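To gauge how wide the payload struct actually is before committing to an instance size, a quick sketch you could run in a notebook (assuming df reads a sample of the source data, as in the example above):

# Show only the payload branch of the schema to see how many nested fields it carries
df.select("payload").printSchema()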
If this answer resolves your question, could you please mark it as "Accept as Solution"? It will help other users quickly find the correct fix.
Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP