<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance improvement of Databricks Spark Job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83988#M37092</link>
    <description>&lt;P&gt;Hi,&lt;BR /&gt;I need to improve the performance of a Databricks job in my project. These are the steps the job performs:&lt;BR /&gt;1. Read small CSV/JSON files (100 MB, 50 MB) from multiple locations in S3&lt;BR /&gt;2. Write the data to the bronze layer in Delta/Parquet format&lt;BR /&gt;3. Read from the bronze layer&lt;BR /&gt;4. Apply filters for data cleaning&lt;BR /&gt;5. Write to the silver layer in Delta/Parquet format&lt;BR /&gt;6. Read from the silver layer&lt;BR /&gt;7. Perform many joins and other transformations such as union and distinct&lt;BR /&gt;8. Write the final data to AWS RDS&lt;BR /&gt;&lt;BR /&gt;I'm not seeing enough performance improvement: for 5 KB of data it takes almost 1 min 30 sec. I also observed that there isn't enough parallelism, and not all cores are being utilized (I have 4 cores).&lt;BR /&gt;&lt;BR /&gt;Please give some suggestions on this.&lt;/P&gt;</description>
    <pubDate>Fri, 23 Aug 2024 05:35:53 GMT</pubDate>
    <dc:creator>pinaki1</dc:creator>
    <dc:date>2024-08-23T05:35:53Z</dc:date>
    <item>
      <title>Performance improvement of Databricks Spark Job</title>
      <link>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83988#M37092</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;I need to improve the performance of a Databricks job in my project. These are the steps the job performs:&lt;BR /&gt;1. Read small CSV/JSON files (100 MB, 50 MB) from multiple locations in S3&lt;BR /&gt;2. Write the data to the bronze layer in Delta/Parquet format&lt;BR /&gt;3. Read from the bronze layer&lt;BR /&gt;4. Apply filters for data cleaning&lt;BR /&gt;5. Write to the silver layer in Delta/Parquet format&lt;BR /&gt;6. Read from the silver layer&lt;BR /&gt;7. Perform many joins and other transformations such as union and distinct&lt;BR /&gt;8. Write the final data to AWS RDS&lt;BR /&gt;&lt;BR /&gt;I'm not seeing enough performance improvement: for 5 KB of data it takes almost 1 min 30 sec. I also observed that there isn't enough parallelism, and not all cores are being utilized (I have 4 cores).&lt;BR /&gt;&lt;BR /&gt;Please give some suggestions on this.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2024 05:35:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83988#M37092</guid>
      <dc:creator>pinaki1</dc:creator>
      <dc:date>2024-08-23T05:35:53Z</dc:date>
    </item>
    <item>
      <title>Re: Performance improvement of Databricks Spark Job</title>
      <link>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83993#M37094</link>
      <description>&lt;P&gt;In case of performance issues, always look for 'expensive' operations: mainly wide transformations (shuffles) and collecting data to the driver.&lt;BR /&gt;Start by checking how long the bronze part takes, then silver, and so on.&lt;BR /&gt;Pinpoint where it starts to get slow, then dig into the query plan.&lt;BR /&gt;Chances are that some join is slowing things down.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2024 06:30:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83993#M37094</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-08-23T06:30:35Z</dc:date>
    </item>
  </channel>
</rss>

