Optimizing a batch load process, reading with the JDBC driver
11-22-2022 02:47 PM
I am doing a batch load from a database table using the JDBC driver. In the Spark UI I am seeing both memory and disk spill, but only on one executor. I am also noticing that the JDBC parallel read seems to run slower than leaving the read at its defaults.
Some details:
- I have 4 workers, 8 GB
- The source table is around 80 million rows
- I am using "dateloaded" as the partition column.
- sqlContext.setConf("spark.sql.shuffle.partitions","4") sets the number of shuffle partitions. Is it correct to set the number of shuffle partitions to the executor count?
- numPartitions=12. Is it correct that 3-4 tasks per executor is ideal?
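For illustration, a partitioned JDBC read with the settings listed above could be configured roughly as follows. This is a sketch, not the exact code from the job: the URL, table name, bounds, and fetch size are placeholders, and the helper `jdbc_read_options` is a hypothetical name. The option keys themselves (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are the standard Spark JDBC data source options.

```python
def jdbc_read_options(url, table, column, lower, upper, num_partitions, fetch_size=10000):
    """Build the option map for a partitioned Spark JDBC read.

    Hypothetical helper: Spark expects option values as strings, and
    partitionColumn, lowerBound, upperBound and numPartitions must be
    supplied together.
    """
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),
    }


opts = jdbc_read_options(
    url="jdbc:<vendor>://<host>/<database>",  # placeholder connection string
    table="<schema>.<table>",                 # placeholder table name
    column="dateloaded",
    lower="2022-01-01",                       # assumed bounds, for illustration only
    upper="2022-12-31",
    num_partitions=12,
)

# With a live SparkSession this would be used as:
#   df = spark.read.format("jdbc").options(**opts).load()
```

Note that lowerBound and upperBound only determine how the range is cut into strides. They do not filter rows, so values outside the bounds still land in the first or last partition.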
The "dateloaded" column is not a primary key, but it is indexed. Is the spill a result of data skew, or have I set too few or too many partitions for the shuffle or the read?
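One way to reason about the skew question is to estimate how many rows each read partition would receive. The sketch below is a simplified approximation of equal-width stride partitioning over a date column, not Spark's exact implementation. The load dates, row counts, and bounds are made up for illustration, and `partition_index` is a hypothetical helper.

```python
from collections import Counter
from datetime import date


def partition_index(value, lower, upper, num_partitions):
    """Approximate which read partition a date falls into.

    The range [lower, upper) is cut into equal-width strides. Values
    below the lower bound go to the first partition and values at or
    above the upper bound go to the last one.
    """
    span_days = (upper - lower).days
    stride = max(span_days // num_partitions, 1)
    idx = (value - lower).days // stride
    return min(max(idx, 0), num_partitions - 1)


lower, upper, num_partitions = date(2022, 1, 1), date(2022, 12, 31), 12

# Hypothetical (load date, row count) pairs: one large initial load
# followed by two small incremental loads.
loads = [
    (date(2022, 1, 1), 1000),
    (date(2022, 6, 1), 10),
    (date(2022, 11, 1), 10),
]

rows_per_partition = Counter()
for loaded_on, row_count in loads:
    rows_per_partition[partition_index(loaded_on, lower, upper, num_partitions)] += row_count

for idx in range(num_partitions):
    print(idx, rows_per_partition.get(idx, 0))
```

In this made-up case almost all rows fall into partition 0, so a single task would read nearly the whole table. A pattern like that could explain spill on only one executor, and running a count grouped by "dateloaded" on the source table would show whether the real data looks similar.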
Labels:
- Azure databricks
- BatchJob
- Data
- Spark ui