Data Engineering

Performance improvement of Databricks Spark job

pinaki1
New Contributor III

Hi,
I need to improve the performance of a Databricks job in my project. Here are the steps the job performs (a simplified sketch follows the list):
1. Read small CSV/JSON files (around 100 MB and 50 MB) from multiple locations in S3
2. Write the data to the bronze layer in Delta/Parquet format
3. Read from the bronze layer
4. Apply filters for data cleaning
5. Write to the silver layer in Delta/Parquet format
6. Read from the silver layer
7. Perform many joins and other transformations such as union and distinct
8. Write the final data to AWS RDS

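For reference, this is a minimal sketch of what the job does; the S3 paths, column names, and RDS connection details below are placeholders, not the real project values:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 1-2. Read small CSV files from S3 and land them in the bronze layer as Delta
raw_df = spark.read.option("header", "true").csv("s3://my-bucket/raw/csv/")
raw_df.write.format("delta").mode("append").save("s3://my-bucket/bronze/events/")

# 3-5. Read bronze, apply cleaning filters, write to the silver layer
bronze_df = spark.read.format("delta").load("s3://my-bucket/bronze/events/")
silver_df = bronze_df.filter(F.col("value").isNotNull())
silver_df.write.format("delta").mode("overwrite").save("s3://my-bucket/silver/events/")

# 6-8. Read silver, do joins/union/distinct, push the final result to AWS RDS over JDBC
events_df = spark.read.format("delta").load("s3://my-bucket/silver/events/")
dim_df = spark.read.format("delta").load("s3://my-bucket/silver/dim/")
final_df = (events_df.join(dim_df, "id", "left")
            .select("id", "value", "category")
            .distinct())
(final_df.write.format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-host:5432/mydb")
    .option("dbtable", "public.final_table")
    .option("user", "user")
    .option("password", "password")
    .mode("overwrite")
    .save())
```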
I'm not getting enough of a performance improvement: even for about 5 KB of data the job takes almost 1 minute 30 seconds.
I have also observed that there is not enough parallelism and not all cores are being utilized (the cluster has 4 cores).

Please give some suggestions on this

1 REPLY

-werners-
Esteemed Contributor III

In case of performance issues, always look for 'expensive' operations, mainly wide operations (shuffles) and collecting data to the driver.
Start by checking how long the bronze part takes, then silver, etc.
Pinpoint where it starts to get slow, then dig into the query plan.
Chances are that some join slows things down.
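As a rough way to isolate the slow layer, assuming a PySpark notebook (the paths and join key below are placeholders, not the actual job code), you can time each step separately and print the physical plan of the final query:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def timed(label, fn):
    """Run fn(), which should trigger an action, and report its wall-clock time."""
    start = time.time()
    result = fn()
    print(f"{label}: {time.time() - start:.1f}s")
    return result

# Time the bronze write on its own, then do the same for silver and the final join.
timed("bronze write", lambda: (
    spark.read.csv("s3://my-bucket/raw/", header=True)
         .write.format("delta").mode("overwrite").save("s3://my-bucket/bronze/")
))

final_df = (spark.read.format("delta").load("s3://my-bucket/silver/a/")
            .join(spark.read.format("delta").load("s3://my-bucket/silver/b/"), "id"))

# Inspect the physical plan for expensive exchanges (shuffles) before writing out.
final_df.explain("formatted")
```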
