06-25-2022 09:56 PM
Here's the code:
val result = spark
  .createDataset(List("test"))
  .rdd
  .repartition(100000)
  .map { _ => "test" }
  .collect()
  .toList
println(result)
I write tests to check correctness, so I wonder whether there's a way to disable repartition in unit tests, because I don't care about partitioning there.
06-27-2022 12:13 AM
Can you make the partition count variable, e.g. with a parameter? Then you can pass any value you want, and for your unit test it could be 1, for example.
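A minimal sketch of that suggestion, assuming the test harness can pass a flag; `partitionCount` and `forTests` are illustrative names, not from the original code:

```scala
object PartitionCount {
  // Hypothetical helper: the production default matches the job's
  // repartition(100000); unit tests pass forTests = true to get 1.
  def partitionCount(forTests: Boolean, production: Int = 100000): Int =
    if (forTests) 1 else production
}
// In the Spark job (not runnable here without a SparkSession):
//   .repartition(PartitionCount.partitionCount(forTests = false))
```

The production code path stays unchanged; only the test passes a different value.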
06-27-2022 01:23 AM
We definitely could. But we'd like to know if there's a better way, because adding a parameter would litter our code.
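One way to avoid threading a parameter through every call site is to read the partition count from configuration instead. A sketch, assuming a hypothetical conf key `myapp.repartition.count`; in Spark, `spark.conf.get(key, default)` would play the role of the `Map` lookup shown here:

```scala
object RepartitionConf {
  // Hypothetical conf key; falls back to the production default when unset.
  val Key = "myapp.repartition.count"

  def partitionsFrom(conf: Map[String, String], default: Int = 100000): Int =
    conf.get(Key).map(_.toInt).getOrElse(default)
}
// In the job: val n = spark.conf.get(RepartitionConf.Key, "100000").toInt
// A unit test sets the conf key to "1" and leaves the job code untouched.
```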
06-27-2022 01:47 AM
Removing the repartition is also possible; it does not influence the program logic, only the write operation.
But that might not be an option.
06-27-2022 07:43 AM
Repartitioning to 100,000 partitions with no logic in between is bound to be slow, and then collect() runs on the driver and shouldn't be used in production.
That code divides your dataset into 100,000 parts on the workers (which also causes unnecessary data exchange between workers), then transfers it back to the driver and merges the parts for the collect. Using the RDD directly is also superfluous: you don't benefit from Adaptive Query Execution when you work with RDDs.
06-27-2022 02:06 PM
We want to switch to Dataset, but Dataset also has the problem of slow unit tests.
When we convert from RDD to Dataset, the test takes 3-5x longer.
We've tried to investigate, and we think the Dataset query planning is slow because our data job contains more than 100 joins.
I posted here before but it seems there's no solution to this: https://community.databricks.com/s/question/0D53f00001gEjCdCAK/converting-from-rdd-to-dataset-and-un...
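A commonly suggested mitigation for slow planning over long join chains (not from this thread) is to truncate the lineage periodically with `Dataset.localCheckpoint()`, so the analyzer never sees a plan with 100+ joins at once. A sketch of the idea; `checkpointEvery`, the cadence of 10, and the `"id"` join column are assumptions, and only the pure scheduling helper is runnable here:

```scala
object PlanTruncation {
  // After which join steps (1-based) to cut the query plan; the cadence
  // is purely illustrative and would need tuning.
  def checkpointSteps(totalJoins: Int, checkpointEvery: Int): Seq[Int] =
    (1 to totalJoins).filter(_ % checkpointEvery == 0)
}
// In Spark (hypothetical shape, assuming a common "id" join column):
//   dfs.zipWithIndex.foldLeft(base) { case (acc, (df, i)) =>
//     val joined = acc.join(df, "id")
//     if (PlanTruncation.checkpointSteps(dfs.size, 10).contains(i + 1))
//       joined.localCheckpoint() // materializes and truncates the plan
//     else joined
//   }
```

The trade-off is that each checkpoint materializes intermediate data, so this helps planning time at the cost of some execution work.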
08-10-2022 06:13 AM
@tanin
If this helps, please upvote the answer.
08-10-2022 10:47 AM
Thank you for the explanation. It is insightful.
I suppose this is more like a feature request then.
Right now we cannot use repartition(10000) in a unit test because it makes the test run a lot slower. Switching to Dataset has the same slow-unit-test issue.
This makes it harder to develop on Spark, because unit tests become too slow for complex Spark logic.
09-01-2022 08:20 PM
Hey there @tanin
Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!