There are several possible ways to improve the performance of your Spark streaming job for ingesting a large volume of S3 files. Here are a few suggestions:
- Tune the spark.sql.shuffle.partitions config parameter:
By default, the number of shuffle partitions is set to 200, which may be too low for your workload. This parameter controls how many partitions Spark uses when shuffling data between stages; too few partitions means low parallelism and slow stages. Try increasing it based on the size of your data and the number of cores on your cluster nodes.
For example:
spark.conf.set("spark.sql.shuffle.partitions", "1000")
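A common heuristic (an assumption, not a fixed rule) is to size it as a small multiple of the total cores available to the executors:
# Heuristic sketch: roughly 2x the cluster's default parallelism; tune for your data volume
spark.conf.set("spark.sql.shuffle.partitions", str(spark.sparkContext.defaultParallelism * 2))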
- Coalesce the output DataFrame before writing to Delta Lake:
When writing to Delta Lake, each micro-batch commits a new set of Parquet files to the table's transaction log. If every batch produces many small files, reads and downstream maintenance jobs slow down (the classic small-file problem). To help mitigate this, you can coalesce the output DataFrame before the write to reduce the number of files created per batch.
For example:
df.coalesce(16).write.mode("append").format("delta")...
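In a streaming job the coalesce typically happens inside the batch writer. Here is a minimal sketch, assuming an AutoLoader source reading JSON and a Delta target; the S3 path, checkpoint location, and output path are placeholders:
def write_batch(batch_df, batch_id):
    # Reduce the number of files written per micro-batch; 16 is a starting point, not a rule
    batch_df.coalesce(16).write.mode("append").format("delta").save("/delta/events")

(spark.readStream
    .format("cloudFiles")                     # AutoLoader source
    .option("cloudFiles.format", "json")      # assumed input format
    .load("s3://my-bucket/input/")            # placeholder input path
    .writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/delta/_checkpoints/events")  # placeholder checkpoint path
    .start())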
- Simplify/optimize your UDFs:
Performing row-level transformations with UDFs can be expensive, especially if the functions are not optimized. You can try optimizing your UDFs by:
- Broadcasting small lookup DataFrames or variables so they are available on every executor instead of being shipped with each task.
- Using vectorized (pandas) UDFs, which operate on batches of rows rather than one row at a time (see the sketch after this list).
- Avoiding Python UDFs altogether where a built-in Spark SQL function can do the job.
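As a sketch of the vectorized option, assuming a hypothetical numeric column named amount:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def add_tax(amount: pd.Series) -> pd.Series:
    # Hypothetical transformation; runs on a whole pandas Series per batch of rows
    return amount * 1.08

df = df.withColumn("amount_with_tax", add_tax("amount"))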
- Use larger instance types or adjust autoscaling:
If the cluster is CPU- or memory-bound, consider larger instance types. Alternatively, adjust your autoscaling policy so the cluster scales with the incoming workload; Databricks autoscaling can help maximize utilization and reduce overall cluster cost.
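For illustration, a Databricks cluster spec can carry an autoscale range instead of a fixed worker count; the instance type and bounds below are placeholders:
# Fragment of a cluster spec for the Databricks Clusters API (values are placeholders)
cluster_spec = {
    "node_type_id": "i3.2xlarge",   # example instance type; choose based on workload profile
    "autoscale": {
        "min_workers": 2,           # lower bound during quiet periods
        "max_workers": 8,           # upper bound for ingest spikes
    },
}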
- Use Delta Lake's auto-optimize features:
Consider turning on Delta Lake's auto-optimize features (optimized writes and auto compaction) explicitly. As you keep ingesting data, these options compact small files as they are written, which keeps query performance steady; you can also run OPTIMIZE on the table periodically.
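A minimal sketch, assuming a Delta table named events (the table name is a placeholder):
# Enable optimized writes and auto compaction on an existing Delta table
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")

# Periodic manual compaction is also an option
spark.sql("OPTIMIZE events")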
These are some of the best practices for improving the performance of a Spark streaming application that ingests large volumes of data from S3 using AutoLoader.
I hope that helps!