06-25-2022 09:56 PM
Here's the code:
val result = spark
  .createDataset(List("test"))
  .rdd
  .repartition(100000)
  .map { _ => "test" }
  .collect()
  .toList
println(result)
I write tests to verify correctness, so I wonder if there's a way to disable repartition in unit tests, since I don't care about partitioning there.
06-27-2022 12:13 AM
Can you make the partitioning configurable, e.g. via a parameter? Then you can pass any value you want, and in your unit test it could be 1, for example.
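A minimal sketch of that suggestion (the function name, signature, and default value are assumptions, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: take the partition count as a parameter so production
// can use a large value while unit tests pass 1.
def buildResult(spark: SparkSession, numPartitions: Int = 100000): List[String] = {
  import spark.implicits._
  spark
    .createDataset(List("test"))
    .rdd
    .repartition(numPartitions)
    .map(_ => "test")
    .collect()
    .toList
}

// Production:  buildResult(spark)
// Unit test:   buildResult(spark, numPartitions = 1)
```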
06-27-2022 01:23 AM
We definitely could, but we'd like to know if there's a better way, because adding a parameter litters our code.
06-27-2022 01:47 AM
Removing the repartition is also possible. It does not affect the program logic, only the write operation.
But that might not be an option.
06-27-2022 07:43 AM
Repartitioning into 100,000 parts with no logic in between is bound to be slow, and collect() runs on the driver and shouldn't be used in production.
That code divides your dataset into 100,000 partitions on the workers (which also causes unnecessary data exchange between workers), then transfers it back to the driver and merges it for the collect() call. Using the RDD API is also superfluous: you don't benefit from Adaptive Query Execution when you use RDDs directly.
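For contrast, here is a sketch of the same pipeline kept entirely in the Dataset API, so Catalyst and Adaptive Query Execution can optimize the plan (the parameterized partition count is an assumption carried over from the earlier suggestion):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the original pipeline without dropping to the RDD API.
def buildResultDs(spark: SparkSession, numPartitions: Int): List[String] = {
  import spark.implicits._
  spark
    .createDataset(List("test"))
    .repartition(numPartitions) // stays a Dataset; no RDD round-trip
    .map(_ => "test")
    .collect()                  // still driver-side; avoid in production paths
    .toList
}
```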
06-27-2022 02:06 PM
We want to switch to Dataset, but Dataset also has the problem of slow unit tests.
When we convert from RDD to Dataset, the test takes 3-5x longer.
We tried to investigate, and we think Dataset query planning is slow because our data job contains more than 100 joins.
I posted here before but it seems there's no solution to this: https://community.databricks.com/s/question/0D53f00001gEjCdCAK/converting-from-rdd-to-dataset-and-un...
08-10-2022 06:13 AM
@tanin
If this helps, please upvote the answer
08-10-2022 10:47 AM
Thank you for the explanation. It is insightful.
I suppose this is more of a feature request, then.
Right now we cannot use repartition(10000) in a unit test because it makes that test run much slower. Switching to Dataset has the same slow-unit-test issue.
This makes it harder to develop on Spark, because the unit tests become too slow for complex Spark logic.
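As a general tip (not a fix for the planning cost of 100+ joins discussed above), unit-test SparkSessions are often configured to minimize shuffle and startup overhead; a sketch:

```scala
import org.apache.spark.sql.SparkSession

// A local SparkSession tuned for fast unit tests.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-test")
  .config("spark.sql.shuffle.partitions", "1") // fewer shuffle partitions in tests
  .config("spark.ui.enabled", "false")         // skip the web UI during tests
  .getOrCreate()
```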
09-01-2022 08:20 PM
Hey there @tanin
Hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!