Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Understand if the configs I pass to SparkSession.builder still make sense for Databricks 10+

alejandrofm
Valued Contributor

Hi! I currently have this old generic template, amended over time to optimize Spark execution on Databricks. Can you help me figure out whether it still makes sense for Runtime 10, 11, and 12, or whether there are new recommendations? Maybe some of these settings are making my processes slower, but I did not get any deprecation warning or suggestion when creating the session.

Thanks!

from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .config('spark.speculation', 'false') \
    .config('spark.yarn.maxAppAttempts', '1') \
    .config('spark.databricks.delta.preview.enabled', 'true') \
    .config('spark.databricks.delta.merge.joinBasedMerge.enabled', 'true') \
    .config('spark.databricks.delta.multiClusterWrites.enabled', 'false') \
    .config('spark.databricks.adaptive.autoOptimizeShuffle.enabled', 'true') \
    .getOrCreate()
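
One quick way to sanity-check a template like this on a newer runtime is to read the effective values back from the running session; conf.get returns the supplied default when a key is unset. A minimal sketch, using keys from the template above:

    # Sketch: print what the current runtime actually resolves for each setting.
    for key in [
        'spark.speculation',
        'spark.databricks.delta.preview.enabled',
        'spark.sql.adaptive.enabled',
    ]:
        print(key, '=', spark_session.conf.get(key, '<unset>'))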


Anonymous
Not applicable

@Alejandro Martinez:

Hi! Your template seems to be a good starting point for configuring a SparkSession in Databricks. However, there are some new recommendations that you can consider for Databricks runtime versions v10-11-12. Here are some suggestions:

  1. Use Databricks Delta 1.0.0 or higher - Databricks Delta has been updated and improved since its preview release. Delta 1.0.0 or higher includes many stability and performance improvements (see the protocol sketch after this list).
  2. Configure Spark shuffle partitions - Configuring the number of partitions in Spark shuffle can significantly improve the performance of your Spark jobs. You can set the number of shuffle partitions based on the size of your data and the size of your cluster.
  3. Enable automatic coalescing of small files - Automatic coalescing reduces the number of small files in your Delta tables, which can improve query performance (see the table-property sketch after the updated template below).
  4. Use the optimal file format - Different file formats have different performance characteristics. For example, Delta tables are optimized for performance and reliability, whereas Parquet is optimized for storage efficiency. Consider using the file format that best meets your needs.
  5. Use adaptive query execution - Adaptive query execution is a feature that can automatically adjust the execution plan of a query based on the characteristics of the data and the cluster. It can improve the performance of your Spark jobs in many cases.
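
For point 1, the Delta protocol version a table uses can be inspected and, if needed, raised in place. A minimal sketch, assuming a Delta table named events (the table name is illustrative):

    # Sketch: inspect and upgrade a Delta table's protocol version.
    # "events" is a placeholder table name.
    spark_session.sql("DESCRIBE DETAIL events") \
        .select("minReaderVersion", "minWriterVersion") \
        .show()

    # Raising the protocol is one-way: clusters running older Delta versions
    # can no longer read or write the table afterwards.
    spark_session.sql("""
        ALTER TABLE events SET TBLPROPERTIES (
            'delta.minReaderVersion' = '1',
            'delta.minWriterVersion' = '2'
        )
    """)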

Here is an updated template that includes these recommendations:

spark_session = (
    SparkSession.builder
    # Allow VACUUM with a shorter-than-default retention window
    # (this disables a safety check, so use it deliberately).
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    # Point 2: fixed shuffle partition count; with AQE enabled below,
    # this acts mainly as an upper bound.
    .config("spark.sql.shuffle.partitions", "500")
    # Point 3: optimized writes and auto-compaction to reduce small files.
    .config("spark.databricks.delta.optimizeWrite.enabled", "true")
    .config("spark.databricks.delta.autoCompact.enabled", "true")
    # Prefer broadcast hash joins for Delta operations where possible.
    .config("spark.databricks.delta.join.preferBroadcastHashJoin", "true")
    # Point 5: adaptive query execution.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
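
Optimized writes and auto-compaction can also be pinned to individual tables rather than set session-wide, so they persist no matter which cluster writes to the table. A minimal sketch, again assuming a Delta table named events:

    # Sketch: enable small-file handling as Delta table properties; these
    # survive across sessions and clusters, unlike session-level configs.
    spark_session.sql("""
        ALTER TABLE events SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact' = 'true'
        )
    """)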

Please note that these recommendations may not be suitable for all use cases, so you should evaluate them based on your specific requirements and workload characteristics.

Hi, will try that config! Only one question, about your first point, "Use Databricks Delta 1.0.0 or higher":

What do you mean? Should I upgrade the table manually? I didn't find related documentation.

Thanks!
