Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-02-2023 07:56 AM
Hi, it is related to partitioning optimization. By default, the JDBC driver queries the source database with only a single thread. So write was from one partition as one partition was created, so it was using a single core. When you used pandas, it did some transformation/actions and divided your dataset into small partitions, and then every core was writing a chunk of your dataset (partition) to SQL. Please use the below options to optimize the read, as it is auto divided
# a column that can be used that has a uniformly distributed range of values that can be used for parallelization
.option("partitionColumn", "<partition_key>")
# lowest value to pull data for with the partitionColumn
.option("lowerBound", "<min_value>")
# max value to pull data for with the partitionColumn
.option("upperBound", "<max_value>")
# number of partitions to distribute the data into. Set it to number of cores on workers
.option("numPartitions", 8)
My blog: https://databrickster.medium.com/