Databricks Community

tanin · 10-25-2022

I profiled it, and the slowness seems to come from Spark planning, especially for a more complex job (e.g. 100+ joins). Is there a way to speed it up (e.g. by disabling certain optimizations)?
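One way to experiment with this is through the test's `SparkSession` configuration. The sketch below is a minimal example, not a verified fix: `TestSpark`, the `local[2]` master, and the choice of `ConstantFolding` as the excluded rule are assumptions made for illustration. `spark.sql.optimizer.excludedRules` takes a comma-separated list of optimizer rule names, and Spark does not allow some rules to be excluded. Whether excluding any given rule shortens planning for a 100+ join job would have to be measured.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object for tests; the name is not from the original post.
object TestSpark {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")                           // assumed: small local master for unit tests
    .appName("unit-test")
    .config("spark.ui.enabled", "false")          // skip starting the web UI in tests
    .config("spark.sql.shuffle.partitions", "1")  // fewer shuffle tasks on tiny test data
    // Illustrative only: exclude one optimizer rule. Which rules (if any)
    // are worth excluding for a given job is an assumption to verify by profiling.
    .config(
      "spark.sql.optimizer.excludedRules",
      "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")
    .getOrCreate()
}
```

Excluding rules changes the optimized plan, so a test that runs under this configuration no longer exercises exactly the plan that production would.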

tanin · 06-25-2022

Here's the code:

    val result = spark
      .createDataset(List("test"))
      .rdd
      .repartition(100000)
      .map { _ => "test" }
      .collect()
      .toList
    println(result)

I write tests to test for correctness, so I wonde...

tanin · 02-06-2022

I converted a data job from RDD to Dataset, and I've found that, in prod, the data job runs faster, which is nice. But the unit test runs 3x slower than before. My best guess is that Dataset spends time doing a lot of things like encoding, optimizing, query...

tanin · 11-30-2022

This is a unit test in Scala/Spark that is not in notebooks. It's in our repo.

tanin · 08-10-2022

Thank you for the explanation. It is insightful. I suppose this is more like a feature request then. Right now we cannot use repartition(10000) in a unit test because it makes that test run a lot slower. Switching to Dataset also has the same issue wit...

tanin · 06-27-2022

We want to switch to Dataset, but Dataset also has the problem of slow unit tests. When we convert RDD to Dataset, the test takes 3-5x longer. We tried to investigate, and we think the Dataset planning is slow because our data job contains more than 100 j...

tanin · 06-27-2022

We definitely could. But we'd like to know if there's a better way, because adding a param litters our code.
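For context, the parameter approach being discussed presumably looks something like the sketch below, where the partition count is passed in so a test can use a small value. `RepartitionJob`, `run`, and `numPartitions` are hypothetical names, and the body is adapted from the snippet posted in the original question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper around the snippet from the question.
object RepartitionJob {
  // numPartitions is an assumed parameter: production keeps the default,
  // while a unit test passes a small value.
  def run(spark: SparkSession, numPartitions: Int = 100000): List[String] = {
    import spark.implicits._ // provides the Encoder needed by createDataset
    spark
      .createDataset(List("test"))
      .rdd
      .repartition(numPartitions)
      .map(_ => "test")
      .collect()
      .toList
  }
}
```

A test would then call `RepartitionJob.run(spark, numPartitions = 4)`, which is the extra plumbing the reply objects to: every job that repartitions needs the parameter threaded through.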

Does anybody feel the unit test on Dataset is slow? (much slower than RDD). This is in Scala.

Using .repartition(100000) causes the unit test to be extremely slow (>20 mins). Is there a way to speed it up?

Converting from RDD to Dataset, and unit test takes 3x slower. (but prod is faster)

Re: Does anybody feel the unit test on Dataset is slow? (much slower than RDD). This is in Scala.

Re: Using .repartition(100000) causes the unit test to be extremely slow (>20 mins). Is there a way to speed it up?

Re: Using .repartition(100000) causes the unit test to be extremely slow (>20 mins). Is there a way to speed it up?

Re: Using .repartition(100000) causes the unit test to be extremely slow (>20 mins). Is there a way to speed it up?