Data Engineering
Optimizing a batch load process, reading with the JDBC driver

huyd
New Contributor III

I am doing a batch load from a database table using the JDBC driver. I am noticing in the Spark UI that there is both memory and disk spill, but only on one executor. I am also noticing that when I try to use the JDBC parallel read, it seems to run slower than leaving it at the default.

Some details:

  • I have 4 workers, 8 GB each
  • The source table is around 80 million rows
  • I am using "dateloaded" as the partition column
  • I set the shuffle partitions with sqlContext.setConf("spark.sql.shuffle.partitions", "4"). Is it correct to set the shuffle partitions to the executor count?
  • numPartitions=12. Is it correct that 3-4 tasks per executor is ideal? (I've put a sketch of my read setup after this list.)
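
For reference, here is roughly what my read looks like. This is a minimal sketch: the JDBC URL, credentials, table name, and date bounds below are placeholders, not my actual values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffle partitions set to the worker count, as described above.
spark.conf.set("spark.sql.shuffle.partitions", "4")

# Parallel JDBC read: Spark splits [lowerBound, upperBound] into
# numPartitions equal strides on the partition column and issues one
# range query per partition.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder URL
    .option("dbtable", "source_table")                    # placeholder table
    .option("user", "username")                           # placeholder
    .option("password", "password")                       # placeholder
    .option("partitionColumn", "dateloaded")
    .option("lowerBound", "2020-01-01")                   # placeholder bounds
    .option("upperBound", "2024-01-01")
    .option("numPartitions", "12")
    .load()
)
```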

The "dateloaded" is not a primary key, but is index. Is the spill a result of data skew? or have I set too few/many partitions for the shuffle or read?
