Performance improvement of Databricks Spark job
08-22-2024 10:35 PM
Hi,
I need to improve the performance of a Databricks job in my project. Here are the steps the job performs (a rough sketch of the pipeline follows the list):
1. Read small CSV/JSON files (roughly 50-100 MB each) from multiple locations in S3
2. Write the data to the bronze layer in Delta/Parquet format
3. Read from the bronze layer
4. Apply some filters for data cleaning
5. Write to the silver layer in Delta/Parquet format
6. Read from the silver layer
7. Perform many joins and other transformations such as union and distinct
8. Write the final data to AWS RDS
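For reference, a minimal sketch of what the pipeline looks like; bucket names, paths, column names, and the JDBC connection details are placeholders rather than the actual project values, and `spark` is the session Databricks provides in a notebook:

```python
from pyspark.sql import functions as F

# 1-2. Read small CSV/JSON files from S3 and land them in the bronze layer
raw = spark.read.option("header", "true").csv("s3://my-bucket/raw/events/")  # placeholder path
raw.write.format("delta").mode("append").save("s3://my-bucket/bronze/events/")

# 3-5. Read bronze, filter out bad records, write silver
bronze = spark.read.format("delta").load("s3://my-bucket/bronze/events/")
silver = bronze.filter(F.col("id").isNotNull())  # placeholder cleaning rule
silver.write.format("delta").mode("overwrite").save("s3://my-bucket/silver/events/")

# 6-8. Read silver, join/union/distinct, then push the result to AWS RDS over JDBC
events = spark.read.format("delta").load("s3://my-bucket/silver/events/")
dims = spark.read.format("delta").load("s3://my-bucket/silver/dims/")
final = events.join(dims, "id").distinct()
(final.write.format("jdbc")
    .option("url", "jdbc:postgresql://<rds-host>:5432/<db>")  # placeholder connection
    .option("dbtable", "public.final_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("overwrite")
    .save())
```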
I'm not getting enough of an improvement: even for 5 KB of data the job takes almost 1 min 30 sec.
I also observed that there is not enough parallelism and not all cores are being utilized (I have 4 cores).
Please give me some suggestions on this.
08-22-2024 11:30 PM
In case of performance issues, always look for 'expensive' operations, mainly wide operations (shuffles) and collecting data to the driver.
Start with checking how long the bronze part takes, then silver, etc.
Pinpoint where it starts to get slow, then dig into the query plan (a sketch of how to inspect it is below).
Chances are that some join slows things down.
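One way to dig into the plan is `explain()` on the DataFrame from the slow stage; `Exchange` nodes in the output mark shuffles. A minimal sketch, assuming `final` is the DataFrame produced by the join-heavy step:

```python
# Print the physical plan; "Exchange" nodes mark shuffles (wide operations).
final.explain(mode="formatted")
```

The Spark UI's SQL tab shows the same plan with per-stage timings, which helps pinpoint which join or aggregation dominates the runtime.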
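If the plan shows a sort-merge join where one side is small (likely here, given the input sizes), broadcasting that side avoids the shuffle entirely. A sketch, with `large_df` and `small_df` as hypothetical names:

```python
from pyspark.sql.functions import broadcast

# Broadcasting the small side replaces a shuffle-heavy sort-merge join with a
# broadcast hash join: each executor gets a full copy of the small table.
result = large_df.join(broadcast(small_df), "id")
```

Spark can also broadcast automatically when the small side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default).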