Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Using .repartition(100000) causes the unit test to be extremely slow (>20 mins). Is there a way to speed it up?

tanin
Contributor

Here's the code:

val result = spark
  .createDataset(List("test"))
  .rdd
  .repartition(100000)
  .map { _ =>
    "test"
  }
  .collect()
  .toList

println(result)

I write tests to check for correctness, so I wonder if there's a way to disable repartition in unit tests, since I don't care about partitioning there.

8 REPLIES

-werners-
Esteemed Contributor III

Can you make the partitioning variable, e.g. with a parameter? Then you can pass any value you want, and for your unit test it could be 1, for example. A minimal sketch of that idea follows.
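Here is one way that could look, as a sketch; the helper name buildResult and its shape are illustrative, not from the thread:

import org.apache.spark.sql.SparkSession

// numPartitions is injected, so production code can pass 100000
// while a unit test can pass 1 and skip the expensive shuffle.
def buildResult(spark: SparkSession, numPartitions: Int): List[String] = {
  import spark.implicits._
  spark
    .createDataset(List("test"))
    .rdd
    .repartition(numPartitions)
    .map(_ => "test")
    .collect()
    .toList
}

// Production: buildResult(spark, 100000)
// Unit test:  buildResult(spark, 1)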

We definitely could, but we'd like to know if there's a better way, because adding a parameter litters our code.

-werners-
Esteemed Contributor III

Removing the repartition is also possible. It does not influence the program logic, only the write operation.

But that might not be an option.

Hubert-Dudek
Esteemed Contributor III

Repartitioning to 100,000 partitions without any supporting logic is bound to be slow... and then collect(), which runs on the driver, shouldn't be used in production anyway.

That code splits your dataset into 100,000 parts on the workers (causing unnecessary data exchange between them), then transfers everything back to the driver and merges it for the collect. Using the RDD API is also superfluous here: you don't benefit from Adaptive Query Execution when you use RDDs directly.
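For reference, a Dataset-only version of the original snippet, which stays in the Dataset API and therefore keeps Adaptive Query Execution available, might look like this sketch (the local SparkSession setup is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("repartition-example")
  // AQE applies to the Dataset/DataFrame API, not to direct RDD operations.
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
import spark.implicits._

val result = spark
  .createDataset(List("test"))
  .repartition(100000)   // stays a Dataset; no .rdd hop
  .map(_ => "test")      // Dataset.map, planned by Catalyst
  .collect()
  .toList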

We want to switch to the Dataset API, but Datasets also suffer from slow unit tests.

When we convert from RDD to Dataset, the tests take 3-5x longer.

We tried to investigate, and we think Dataset query planning is slow because our data job contains more than 100 joins.

I posted here before but it seems there's no solution to this: https://community.databricks.com/s/question/0D53f00001gEjCdCAK/converting-from-rdd-to-dataset-and-un...

Deepak_Bhutada
Contributor III

@tanin

  • When you do repartition(10000), it creates 10,000 tasks, which take time to launch, especially if the number of cores available in your cluster is small. For example, with 100 cores available, the cluster has to run 100 waves of tasks (100 cores × 100 waves = 10,000 tasks) to complete them all.
  • Choose the repartition count so that data is distributed evenly across tasks and each partition is roughly 128 MB (see the sketch after this list).
  • Avoid collect: after a repartition (which distributes data among the executors), collect pulls everything back to the single driver, which is a heavy operation and uses a lot of I/O.
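As a rough illustration of the ~128 MB sizing rule above (the helper name and the input size are illustrative, not from the thread):

// Hypothetical helper: pick a partition count targeting ~128 MB per partition.
def targetPartitions(totalBytes: Long, targetBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

val numPartitions = targetPartitions(50L * 1024 * 1024 * 1024) // ~50 GB of input => 400 partitions
// df.repartition(numPartitions)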

If this helps, please upvote the answer.

Thank you for the explanation. It is insightful.

I suppose this is more like a feature request then.

Right now we cannot use repartition(10000) in a unit test because it makes that test run a lot slower. Switching to Dataset has the same problem of slow unit tests.

This makes it harder to develop on Spark, because unit tests become too slow for complex Spark logic.

Vidula
Honored Contributor

Hey there @tanin

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
