Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Optimizing a batch load process, reading with the JDBC driver

huyd
New Contributor III

I am doing a batch load from a database table using the JDBC driver. I am noticing in the Spark UI that there is both memory and disk spill, but only on one executor. I am also noticing that when I try to use the JDBC parallel read, it runs slower than leaving it at the default.

Some details:

  • I have 4 workers, 8 GB each
  • The source table is around 80 million rows
  • I am using "dateloaded" as the partition column
  • I set the shuffle partition count with sqlContext.setConf("spark.sql.shuffle.partitions", "4"). Is it correct to set the shuffle partitions to the executor count?
  • numPartitions=12; is it correct that 3-4 tasks per executor is ideal? (A rough sketch of the read follows this list.)
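
For reference, this is roughly what the parallel read looks like. It is a sketch only; the URL, credentials, table name, and date bounds are placeholders, not my real values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Parallel JDBC read: Spark splits the [lowerBound, upperBound] range
    # into numPartitions even strides and issues one query per partition.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder URL
        .option("dbtable", "source_table")                    # placeholder table
        .option("user", "user")                               # placeholder credentials
        .option("password", "password")
        .option("partitionColumn", "dateloaded")
        .option("lowerBound", "2020-01-01")  # assumed min(dateloaded)
        .option("upperBound", "2023-12-31")  # assumed max(dateloaded)
        .option("numPartitions", 12)
        .load()
    )

My understanding is that since Spark computes evenly sized strides over the bounds, if most rows fall in a narrow date range, most of the data will land in a few partitions, which might explain spill on a single executor.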

The "dateloaded" is not a primary key, but is index. Is the spill a result of data skew? or have I set too few/many partitions for the shuffle or read?

0 REPLIES
