02-28-2022 05:43 AM
Hi! I'm starting to test configs on Databricks, for example, to avoid corrupting data if two processes try to write at the same time:
.config('spark.databricks.delta.multiClusterWrites.enabled', 'false')
Or, if I need more partitions than the default:
.config('spark.databricks.adaptive.autoOptimizeShuffle.enabled', 'true')
Are there other recommended default settings? (Per-job tuning comes after that.)
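For reference, here is roughly how I'm wiring these options into the session; a minimal sketch, assuming a fresh PySpark session (the app name is just a placeholder):

from pyspark.sql import SparkSession

# Build a session with the two settings above (values under test).
spark = (
    SparkSession.builder
    .appName('config-test')  # placeholder name
    .config('spark.databricks.delta.multiClusterWrites.enabled', 'false')
    .config('spark.databricks.adaptive.autoOptimizeShuffle.enabled', 'true')
    .getOrCreate()
)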
Thanks!
02-28-2022 09:39 AM
Delta tables have optimistic concurrency control. If two processes try to write to the same table, Delta does its best to handle both, but if the transactions conflict, one of them will fail. You can also change the isolation level if you want to enforce more control over that.
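For example, a minimal sketch of tightening the isolation level on one table (the table name is a placeholder; Serializable is stricter than the default WriteSerializable):

# Serializable rejects more concurrent transactions instead of
# letting their commits interleave (the default is WriteSerializable).
spark.sql(
    "ALTER TABLE events "  # 'events' is a placeholder table name
    "SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')"
)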
03-01-2022 12:29 AM
Exactly. You can easily verify that, since commits are written as separate files in the Delta log.
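If you want to see that yourself, a quick sketch from a Databricks notebook (the table path is a placeholder):

# Each commit is a separate numbered JSON file under _delta_log;
# the path below is a placeholder for your table's storage location.
for f in dbutils.fs.ls('/mnt/data/my_table/_delta_log'):
    print(f.name)  # 00000000000000000000.json, 00000000000000000001.json, ...

Running DESCRIBE HISTORY on the same table shows those commits along with their operation metadata.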
Regarding:
.config('spark.databricks.adaptive.autoOptimizeShuffle.enabled', 'true')
and other Spark optimization topics, please watch this Databricks video: https://www.youtube.com/watch?v=daXEp4HmS-E
03-17-2022 06:07 AM
It helped, but I'm still testing different configurations. Thank you!
04-28-2022 09:27 AM
Hey there @Alejandro Martinez,
Hope everything is going well.
Just wanted to see if you were able to find an answer to your question. If yes, would you be happy to let us know and mark it as best so that other members can find the solution more quickly?
Cheers!