Understand if the configs I pass to SparkSession.builder still make sense for Databricks 10+

alejandrofm
Valued Contributor

Hi! I currently have this old generic template, amended over time to optimize Databricks Spark execution. Can you help me figure out whether it still makes sense for v10-11-12, or whether there are newer recommendations? Maybe some of this is making my processes slower, but I did not get any deprecation warning or suggestion when creating the session.

Thanks!

spark_session = SparkSession.builder \
    .config('spark.speculation', 'false') \
    .config('spark.yarn.maxAppAttempts', '1') \
    .config('spark.databricks.delta.preview.enabled', 'true') \
    .config('spark.databricks.delta.merge.joinBasedMerge.enabled', 'true') \
    .config('spark.databricks.delta.multiClusterWrites.enabled', 'false') \
    .config('spark.databricks.adaptive.autoOptimizeShuffle.enabled', 'true') \
    .getOrCreate()


Anonymous
Not applicable

@Alejandro Martinez:

Hi! Your template is a reasonable starting point for configuring a SparkSession in Databricks. However, there are some newer recommendations you can consider for Databricks Runtime versions 10, 11, and 12. Here are some suggestions:

  1. Use Databricks Delta 1.0.0 or higher - Databricks Delta has been updated and improved since its preview release, and it is now recommended to use Delta version 1.0.0 or higher, which includes many stability and performance improvements.
  2. Configure Spark shuffle partitions - Tuning the number of shuffle partitions can significantly improve the performance of your Spark jobs. Set it based on the size of your data and the size of your cluster.
  3. Enable automatic coalescing of small files - Automatically coalescing small files reduces the number of small files in your Delta tables, which can improve query performance.
  4. Use the optimal file format - Different file formats have different performance characteristics. Delta tables, for example, add ACID transactions and performance optimizations on top of plain Parquet files, which on their own are optimized for storage efficiency. Consider the format that best meets your needs (see the sketch after this list).
  5. Use adaptive query execution - Adaptive query execution automatically adjusts the execution plan of a query based on the characteristics of the data and the cluster, and can improve the performance of your Spark jobs in many cases.
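
To make point 4 concrete, here is a minimal sketch of writing the same DataFrame as a Delta table versus plain Parquet. The DataFrame df and the /tmp paths are illustrative placeholders, not from the original post:

# Delta: Parquet data files plus a transaction log (ACID writes, MERGE, time travel)
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Plain Parquet: columnar data files only, no transaction log
df.write.format("parquet").mode("overwrite").save("/tmp/events_parquet")

# Delta paths are read back with the delta format
delta_df = spark.read.format("delta").load("/tmp/events_delta")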

Here is an updated template that includes these recommendations:

spark_session = SparkSession.builder \
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false") \
    .config("spark.sql.shuffle.partitions", "500") \
    .config("spark.databricks.delta.optimizeWrite.enabled", "true") \
    .config("spark.databricks.delta.autoCompact.enabled", "true") \
    .config("spark.databricks.delta.join.preferBroadcastHashJoin", "true") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()
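
One caveat: in a Databricks notebook a SparkSession already exists, so getOrCreate() returns that session and some builder-time configs may not take effect. A quick way to confirm what actually applied is to set and read runtime-settable values through spark.conf; this sketch assumes the spark object that Databricks notebooks provide:

# Set runtime-settable SQL configs on the live session
spark.conf.set("spark.sql.shuffle.partitions", "500")
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Read them back to verify what the session is actually using
print(spark.conf.get("spark.sql.shuffle.partitions"))  # expect '500'
print(spark.conf.get("spark.sql.adaptive.enabled"))    # expect 'true'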

Please note that these recommendations may not be suitable for all use cases, so you should evaluate them based on your specific requirements and workload characteristics.

Hi, will try that config! Only one question: when you say in the first point "Use Databricks Delta 1.0.0 or higher", what do you mean? Should I upgrade the table manually? I didn't find any related documentation.

Thanks!
