Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT: Autoloader Perf

Gilg
Contributor II

Hi Team,

I am looking for some advice to perf tune my bronze layer using DLT.

I have the following code, which is very simple and yet very effective.

 

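import dlt

# `schema` (the StructType of the incoming JSON) is assumed to be defined earlier in the notebook.
# `dlt.table` is the current name of the decorator; `dlt.create_table` is the older alias.
@dlt.table(name="bronze_events",
           comment="New raw data ingested from storage account landing zone.")
def bronze_events():
    # Auto Loader (cloudFiles) incrementally picks up new JSON files from the landing zone.
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(schema)
        .load("abfss://data@<some storage account>.dfs.core.windows.net/0_Landing")
    )

    return df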

 

that generates this DAG.

[Screenshot of the pipeline DAG: Gilg_0-1696561163925.png]

At first it executed quite fast, but as the days go by it is getting slower and slower: from 2 to 5 to 12 minutes. Silver and Gold both execute in less than a minute, so I am wondering what performance tuning I should do on the bronze layer.

Cheers,

G

4 REPLIES

Kaniz_Fatma
Community Manager

Hi @Gilg, you can tune bronze layer performance by:

- **Batch Size Tuning:** Adjust batch size to optimize GPU utilization and avoid CUDA out of memory errors.
- **Stage-Level Scheduling:** Increase parallelism by scheduling multiple tasks per GPU using Spark.
- **Repartition Data:** Use all hardware by repartitioning data. Check the partition count, then repartition with repartitioned_df = df.repartition(desired_partition_count).
- **Cache the Model:** If frequently loading a model from different or restarted clusters, consider caching the model in DBFS root volume.

Performance tuning is iterative and may require different strategy combinations.

Tharun-Kumar
Honored Contributor II

Hi @Gilg 

Is it ingesting the same number of files as before?

Also, you could try using Auto Loader in file notification mode. If there are too many files in the source directory, a significant amount of time is spent listing the directory. We can validate this by analyzing the logs.
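
As a rough sketch of the file notification suggestion: Auto Loader switches from directory listing to file notifications when `cloudFiles.useNotifications` is set to `true`. The helper names `notification_options` and `bronze_reader` are made up for illustration, and on Azure this mode usually also needs permissions or service principal settings that are not shown here, so treat it as a starting point rather than a finished configuration.

```python
def notification_options(file_format="json"):
    """Auto Loader options that enable file notification mode (hypothetical helper)."""
    return {
        "cloudFiles.format": file_format,
        # Discover new files through storage events instead of listing the directory.
        "cloudFiles.useNotifications": "true",
    }


def bronze_reader(spark, schema, path):
    """Build the Auto Loader stream; intended to be called inside the DLT table function."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**notification_options())
        .schema(schema)
        .load(path)
    )
```

Inside the DLT table function, `return bronze_reader(spark, schema, "<landing path>")` would replace the hand-built reader; the path and schema stay whatever the pipeline already uses.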

It ranges from 600 files up to 1.5k files. The DLT pipeline is set to Triggered in Pipeline mode, and the Trigger Type in Workflows is set to Continuous.

Tharun-Kumar
Honored Contributor II

Hi @Gilg 

You mentioned that the micro-batch time has recently been around 12 minutes. Do we also see jobs/stages taking 12 minutes in the Spark UI? If so, then processing the files themselves takes 12 minutes. If not, the 12 minutes are being spent on listing the directory and maintaining the checkpoint.
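
One way to make this comparison concrete is to look at the `durationMs` breakdown that Structured Streaming reports for each micro-batch. The sketch below assumes you already have that breakdown as a plain dict; how you obtain it in a DLT pipeline (driver logs, event log) is not covered here. The keys `triggerExecution`, `addBatch` and `latestOffset` are the ones Structured Streaming normally reports, with file discovery typically showing up under `latestOffset`, but treat that mapping as an assumption for your runtime. `summarize_batch` is a hypothetical helper and the sample numbers are invented.

```python
def summarize_batch(duration_ms):
    """Split one micro-batch's wall time into processing, file discovery and other overhead."""
    total = duration_ms.get("triggerExecution", 0)
    processing = duration_ms.get("addBatch", 0)    # running the Spark jobs for the batch
    discovery = duration_ms.get("latestOffset", 0) # finding new files (assumed: listing)
    other = max(total - processing - discovery, 0) # planning, checkpoint commits, etc.

    def share(part):
        return round(100.0 * part / total, 1) if total else 0.0

    return {
        "processing_pct": share(processing),
        "discovery_pct": share(discovery),
        "other_pct": share(other),
    }


# Invented numbers for a 12-minute batch dominated by file discovery.
sample = {"triggerExecution": 720_000, "addBatch": 60_000, "latestOffset": 600_000}
print(summarize_batch(sample))
```

If `processing_pct` dominates, the files themselves are slow to process; if `discovery_pct` dominates, directory listing is the more likely culprit.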
