Data Engineering
Optimizing a batch load process, reading with the JDBC driver

huyd
New Contributor III

I am doing a batch load from a database table using the JDBC driver. I am noticing in the Spark UI that there is both memory and disk spill, but only on one executor. I am also noticing that when I try to use the JDBC parallel read, it seems to run slower than leaving it at the default.

Some details:

  • I have 4 workers, 8 GB each
  • The source table is around 80 million rows
  • I am using "dateloaded" as the partition column
  • I set the shuffle partitions with sqlContext.setConf("spark.sql.shuffle.partitions", "4"). Is it correct to set the shuffle partitions to the executor count?
  • numPartitions=12. Is it correct that 3-4 tasks per executor is ideal? (I've put a sketch of my read setup after this list.)
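
For reference, here is roughly what my read looks like. This is a minimal sketch: the JDBC URL, credentials, table name, and date bounds below are placeholders, not my actual values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffle partitions set to the worker count, as described above.
spark.conf.set("spark.sql.shuffle.partitions", "4")

# Parallel JDBC read: Spark splits [lowerBound, upperBound] into
# numPartitions equal strides on the partition column and issues one
# range query per partition.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder URL
    .option("dbtable", "source_table")                    # placeholder table
    .option("user", "username")                           # placeholder
    .option("password", "password")                       # placeholder
    .option("partitionColumn", "dateloaded")
    .option("lowerBound", "2020-01-01")                   # placeholder bounds
    .option("upperBound", "2024-01-01")
    .option("numPartitions", "12")
    .load()
)
```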

The "dateloaded" is not a primary key, but is index. Is the spill a result of data skew? or have I set too few/many partitions for the shuffle or read?
