<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Why does chaining spark.read from one system/driver and .write to another system/driver take so much longer than doing each piece individually? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14117#M8666</link>
    <description>&lt;P&gt;You are welcome. Thank you for choosing my answer as the best one.&lt;/P&gt;</description>
    <pubDate>Tue, 03 Jan 2023 15:10:39 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2023-01-03T15:10:39Z</dc:date>
    <item>
      <title>Why does chaining spark.read from one system/driver and .write to another system/driver take so much longer than doing each piece individually?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14114#M8663</link>
      <description>&lt;P&gt;i am reading data from IBM DB2 and saving into a MS SQL server (the first step is moving the code itself to databricks, and then we will move the databases to databricks itself). &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Problem I'm running into is doing something like the below will take &amp;gt; 1 hour before I stop it, but doing each step individually (using a pandas dataframe in the middle) results in the same thing taking maybe 15-20 minutes. I was wondering why, and what I can do to avoid using pandas.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;code that doesn't work/takes forever:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;(
    (
        spark.read.format("jdbc")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("url", connection_url)
        .option("query", query)
        .load()
    )
    .write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save(mode=mode)
)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Dec 2022 18:56:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14114#M8663</guid>
      <dc:creator>jonathan-dufaul</dc:creator>
      <dc:date>2022-12-30T18:56:02Z</dc:date>
    </item>
    <item>
      <title>Re: Why does chaining spark.read from one system/driver and .write to another system/driver take so much longer than doing each piece individually?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14115#M8664</link>
      <description>&lt;P&gt;Hi, it is related to partitioning optimization. By default, the JDBC driver queries the source database with only a single thread. So write was from one partition as one partition was created, so it was using a single core. When you used pandas, it did some transformation/actions and divided your dataset into small partitions, and then every core was writing a chunk of your dataset (partition) to SQL. Please use the below options to optimize the read, as it is auto divided&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;  # a column that can be used that has a uniformly distributed range of values that can be used for parallelization
  .option("partitionColumn", "&amp;lt;partition_key&amp;gt;")
  # lowest value to pull data for with the partitionColumn
  .option("lowerBound", "&amp;lt;min_value&amp;gt;")
  # max value to pull data for with the partitionColumn
  .option("upperBound", "&amp;lt;max_value&amp;gt;")
  # number of partitions to distribute the data into. Set it to number of cores on workers
  .option("numPartitions", 8)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jan 2023 15:56:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14115#M8664</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2023-01-02T15:56:11Z</dc:date>
    </item>
    <item>
      <title>Re: Why does chaining spark.read from one system/driver and .write to another system/driver take so much longer than doing each piece individually?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14116#M8665</link>
      <description>&lt;P&gt;I am always amazed at how good the answers can be on this site. Thank you so so much.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Jan 2023 14:28:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14116#M8665</guid>
      <dc:creator>jonathan-dufaul</dc:creator>
      <dc:date>2023-01-03T14:28:07Z</dc:date>
    </item>
    <item>
      <title>Re: Why does chaining spark.read from one system/driver and .write to another system/driver take so much longer than doing each piece individually?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14117#M8666</link>
      <description>&lt;P&gt;You are welcome. Thank you for choosing my answer as the best one.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Jan 2023 15:10:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-does-chaining-spark-read-from-one-system-driver-and-write-to/m-p/14117#M8666</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2023-01-03T15:10:39Z</dc:date>
    </item>
  </channel>
</rss>

