Understand if the configs I pass to SparkSession.builder still make sense for Databricks 10+

alejandrofm
Valued Contributor

Hi! I currently have this old generic template, amended over time to optimize Databricks Spark execution. Can you help me figure out whether it still makes sense for v10-11-12, or whether there are newer recommendations? Maybe some of this is making my processes slower, but I did not get any deprecation warning or suggestion when creating the session.

Thanks!

spark_session = SparkSession.builder \
    .config('spark.speculation', 'false') \
    .config('spark.yarn.maxAppAttempts', '1') \
    .config('spark.databricks.delta.preview.enabled', 'true') \
    .config('spark.databricks.delta.merge.joinBasedMerge.enabled', 'true') \
    .config('spark.databricks.delta.multiClusterWrites.enabled', 'false') \
    .config('spark.databricks.adaptive.autoOptimizeShuffle.enabled', 'true') \
    .getOrCreate()


Anonymous
Not applicable

@Alejandro Martinez:

Hi! Your template is a reasonable starting point for configuring a SparkSession in Databricks. However, there are some newer recommendations you can consider for Databricks Runtime versions 10, 11, and 12. Here are some suggestions:

  1. Use Databricks Delta 1.0.0 or higher - Databricks Delta has been updated and improved since its preview release, and it is now recommended to use Delta version 1.0.0 or higher, which includes many stability and performance improvements.
  2. Configure Spark shuffle partitions - Tuning the number of shuffle partitions can significantly improve the performance of your Spark jobs. Set it based on the size of your data and the size of your cluster.
  3. Enable automatic coalescing of small files - Automatically coalescing small files reduces the number of small files in your Delta tables, which can improve query performance.
  4. Use the optimal file format - Different file formats have different performance characteristics. Delta tables, for example, add ACID transactions and performance optimizations on top of plain Parquet files, which on their own are optimized for storage efficiency. Consider the format that best meets your needs (see the sketch after this list).
  5. Use adaptive query execution - Adaptive query execution automatically adjusts the execution plan of a query based on the characteristics of the data and the cluster, and can improve the performance of your Spark jobs in many cases.
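
To make point 4 concrete, here is a minimal sketch of writing the same DataFrame as a Delta table versus plain Parquet. The DataFrame df and the /tmp paths are illustrative placeholders, not from the original post:

# Delta: Parquet data files plus a transaction log (ACID writes, MERGE, time travel)
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Plain Parquet: columnar data files only, no transaction log
df.write.format("parquet").mode("overwrite").save("/tmp/events_parquet")

# Delta paths are read back with the delta format
delta_df = spark.read.format("delta").load("/tmp/events_delta")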

Here is an updated template that includes these recommendations:

spark_session = SparkSession.builder \
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false") \
    .config("spark.sql.shuffle.partitions", "500") \
    .config("spark.databricks.delta.optimizeWrite.enabled", "true") \
    .config("spark.databricks.delta.autoCompact.enabled", "true") \
    .config("spark.databricks.delta.join.preferBroadcastHashJoin", "true") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()
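
One caveat: in a Databricks notebook a SparkSession already exists, so getOrCreate() returns that session and some builder-time configs may not take effect. A quick way to confirm what actually applied is to set and read runtime-settable values through spark.conf; this sketch assumes the spark object that Databricks notebooks provide:

# Set runtime-settable SQL configs on the live session
spark.conf.set("spark.sql.shuffle.partitions", "500")
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Read them back to verify what the session is actually using
print(spark.conf.get("spark.sql.shuffle.partitions"))  # expect '500'
print(spark.conf.get("spark.sql.adaptive.enabled"))    # expect 'true'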

Please note that these recommendations may not be suitable for all use cases, so you should evaluate them based on your specific requirements and workload characteristics.

Hi, will try that config! Only one question: when you say in the first point "Use Databricks Delta 1.0.0 or higher", what do you mean? Should I upgrade the table manually? I didn't find any related documentation.

Thanks!
