I am doing a batch load from a database table using the JDBC driver. I am noticing in the Spark UI that there is both memory and disk spill, but only on one executor. I am also noticing that when I try to use the JDBC parallel read, it seems to run slower than leaving it at the default.
Some details:
- I have 4 workers, 8 GB
- The source table is around 80 million rows
- I am using the "dateloaded" column as the partition column.
- I set the shuffle partitions with sqlContext.setConf("spark.sql.shuffle.partitions", "4"). Is it correct to set the shuffle partitions to the executor count?
- I set numPartitions=12 on the read. Is it correct that it's ideal to have 3-4 tasks per executor? (A simplified sketch of the read setup is shown after this list.)
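
For reference, the read is set up roughly like this. The JDBC URL, table name, credentials, and the lower/upper bounds are simplified placeholders, not the real values:

```scala
// Sketch of the parallel JDBC read (connection details are placeholders).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<host>;databaseName=<db>") // placeholder URL
  .option("dbtable", "dbo.source_table")                      // placeholder table (~80M rows)
  .option("user", "<user>")
  .option("password", "<password>")
  .option("partitionColumn", "dateloaded")                    // indexed date column, not the PK
  .option("lowerBound", "2020-01-01")                         // approx min(dateloaded), placeholder
  .option("upperBound", "2023-12-31")                         // approx max(dateloaded), placeholder
  .option("numPartitions", "12")                              // 12 parallel read tasks
  .load()
```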
The "dateloaded" is not a primary key, but is index. Is the spill a result of data skew? or have I set too few/many partitions for the shuffle or read?