06-25-2022 09:56 PM
Here's the code:
val result = spark
  .createDataset(List("test"))
  .rdd
  .repartition(100000)
  .map { _ => "test" }
  .collect()
  .toList
println(result)
I write tests to verify correctness, so I wonder if there's a way to disable repartition in unit tests, since I don't care about partitioning there.
06-27-2022 12:13 AM
Can you make the partitioning configurable, e.g. via a parameter? Then you can pass any value you want, and in your unit test it could be 1, for example.
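A minimal sketch of that suggestion (the function name, signature, and default value are assumptions, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: take the partition count as a parameter so production
// can use a large value while unit tests pass 1.
def buildResult(spark: SparkSession, numPartitions: Int = 100000): List[String] = {
  import spark.implicits._
  spark
    .createDataset(List("test"))
    .rdd
    .repartition(numPartitions)
    .map(_ => "test")
    .collect()
    .toList
}

// Production:  buildResult(spark)
// Unit test:   buildResult(spark, numPartitions = 1)
```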
06-27-2022 01:23 AM
We definitely could, but we'd like to know if there's a better way, because adding a parameter litters our code.
06-27-2022 01:47 AM
Removing the repartition is also possible. It does not affect the program logic, only the write operation.
But that might not be an option.
06-27-2022 07:43 AM
Repartitioning into 100,000 parts with no logic in between is bound to be slow, and collect() runs on the driver and shouldn't be used in production.
That code divides your dataset into 100,000 partitions on the workers (which also causes unnecessary data exchange between workers), then transfers it back to the driver and merges it for the collect() call. Using the RDD API is also superfluous: you don't benefit from Adaptive Query Execution when you use RDDs directly.
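For contrast, here is a sketch of the same pipeline kept entirely in the Dataset API, so Catalyst and Adaptive Query Execution can optimize the plan (the parameterized partition count is an assumption carried over from the earlier suggestion):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the original pipeline without dropping to the RDD API.
def buildResultDs(spark: SparkSession, numPartitions: Int): List[String] = {
  import spark.implicits._
  spark
    .createDataset(List("test"))
    .repartition(numPartitions) // stays a Dataset; no RDD round-trip
    .map(_ => "test")
    .collect()                  // still driver-side; avoid in production paths
    .toList
}
```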
06-27-2022 02:06 PM
We want to switch to Dataset, but Dataset also has the problem of slow unit tests.
When we convert from RDD to Dataset, the test takes 3-5x longer.
We tried to investigate, and we think Dataset query planning is slow because our data job contains more than 100 joins.
I posted here before but it seems there's no solution to this: https://community.databricks.com/s/question/0D53f00001gEjCdCAK/converting-from-rdd-to-dataset-and-un...
08-10-2022 06:13 AM
@tanin
If this helps, please upvote the answer
08-10-2022 10:47 AM
Thank you for the explanation. It is insightful.
I suppose this is more of a feature request, then.
Right now we cannot use repartition(10000) in a unit test because it makes that test run much slower. Switching to Dataset has the same slow-unit-test issue.
This makes it harder to develop on Spark, because the unit tests become too slow for complex Spark logic.
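As a general tip (not a fix for the planning cost of 100+ joins discussed above), unit-test SparkSessions are often configured to minimize shuffle and startup overhead; a sketch:

```scala
import org.apache.spark.sql.SparkSession

// A local SparkSession tuned for fast unit tests.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-test")
  .config("spark.sql.shuffle.partitions", "1") // fewer shuffle partitions in tests
  .config("spark.ui.enabled", "false")         // skip the web UI during tests
  .getOrCreate()
```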
09-01-2022 08:20 PM
Hey there @tanin
Hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!