Re: Why does chaining spark.read from one system/d...

Hubert-Dudek · ‎01-02-2023

Hi, it is related to partitioning optimization. By default, the JDBC driver queries the source database with only a single thread. So write was from one partition as one partition was created, so it was using a single core. When you used pandas, it did some transformation/actions and divided your dataset into small partitions, and then every core was writing a chunk of your dataset (partition) to SQL. Please use the below options to optimize the read, as it is auto divided

  # a column that can be used that has a uniformly distributed range of values that can be used for parallelization
  .option("partitionColumn", "<partition_key>")
  # lowest value to pull data for with the partitionColumn
  .option("lowerBound", "<min_value>")
  # max value to pull data for with the partitionColumn
  .option("upperBound", "<max_value>")
  # number of partitions to distribute the data into. Set it to number of cores on workers
  .option("numPartitions", 8)

My blog: https://databrickster.medium.com/

View solution in original post