Data Engineering
Why is Spark very slow with a large number of DataFrame columns?

z160896
New Contributor II

Scala Spark app: I have a dataset of 130 x 14000. I read it from a Parquet file with SparkSession, then use it for a Spark ML Random Forest model (using a Pipeline). It takes 7 hours to complete! Reading the Parquet file takes about 1 minute. If I implement it, it takes less than 3 seconds to complete (from reading the data to completing the modeling). What could be the problem with the Scala / Spark DataFrame implementation?
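
To make the setup concrete, here is a minimal sketch of what such a job presumably looks like (the path "data.parquet", the column name "label", and the choice of RandomForestClassifier are illustrative assumptions, not the original code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().appName("rf-wide-data").getOrCreate()

// 130 rows x ~14000 columns read from Parquet (fast: ~1 minute)
val df = spark.read.parquet("data.parquet")

// Assemble every non-label column into a single feature vector
val featureCols = df.columns.filter(_ != "label")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Fitting this pipeline is the step that reportedly takes ~7 hours
val model = new Pipeline().setStages(Array(assembler, rf)).fit(df)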

2 REPLIES

z160896
New Contributor II

Sorry, I meant: when I implemented it in Python, it took less than 3 seconds to complete (from reading the data to completing the modeling).

EliasHaydar
New Contributor II

I've already answered a similar question on StackOverflow, so I'll repeat what I said there.

The following may not solve your problem completely, but it should give you some pointers to start with.

The first problem you are facing is the disproportion between the amount of data and the resources.

Since you are probably parallelizing a local collection (a pandas DataFrame), Spark will use the default parallelism configuration, which most likely results in 48 partitions with less than 0.5 MB per partition. (Spark doesn't do well with small files or small partitions.)
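
A rough sketch of how you could check and fix that, assuming the same illustrative path as above and that a single partition is enough for 130 rows:

// Inspect how many partitions the 130-row dataset ended up with
val df = spark.read.parquet("data.parquet")
println(s"partitions: ${df.rdd.getNumPartitions}")

// For data this small, collapsing to one partition and caching it
// avoids most of the per-partition scheduling overhead
val compact = df.coalesce(1).cache()
compact.count()  // force materialization of the cache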

The second problem is related to the expensive optimization/approximation techniques used by tree models in Spark.

Spark tree models use some tricks to bucket continuous variables optimally; they mainly rely on approximate quantiles for this. With small data it is much cheaper to just compute the exact splits.

Usually, in a single-machine framework such as scikit-learn, the tree model uses the unique values of continuous features as split candidates for the best-fit calculation, whereas in Apache Spark the tree model uses quantiles for each feature as split candidates.
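
For illustration, the size of that quantile-based bucketing is controlled by maxBins on the estimator (a sketch; the parameter values shown are assumptions):

import org.apache.spark.ml.classification.RandomForestClassifier

// maxBins caps how many buckets the quantile-based discretization builds
// per continuous feature (default 32); more bins means more work per split
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(32)
  .setNumTrees(100)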

Another factor might be model tuning: don't forget that cross-validation is a heavy, long-running task, since its cost is proportional to the number of combinations of your 3 hyper-parameters, times the number of folds, times the time spent training each model (grid-search approach). You might want to cache your data for a start, but it will still not gain you much time. I believe Spark is overkill for this amount of data; you might want to use scikit-learn instead, and perhaps use spark-sklearn to distribute local model training.

Spark will learn each model separately and sequentially, on the assumption that the data is distributed and big.
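
A sketch of how that cost multiplies, reusing the illustrative rf, pipeline and compact names from the sketches above (grid values, evaluator and fold count are assumptions):

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// 3 x 3 x 2 = 18 parameter combinations; with 5 folds that is 90 model fits
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .addGrid(rf.maxDepth, Array(5, 10, 15))
  .addGrid(rf.maxBins, Array(32, 64))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)                  // pipeline from the earlier sketch
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(compact)              // "compact" is the cached DataFrame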

You can of course also improve performance by using columnar file formats like Parquet and by tuning Spark itself, but that is too broad to cover here.

You can read more about the scalability of tree models in spark-mllib in the following blog post:

  • Scalable Decision Trees in MLlib

Unfortunately, the description given in your question doesn't allow us to give further advice, but I hope these pointers help.
