<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: spark sql update really slow in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-sql-update-really-slow/m-p/23237#M16002</link>
    <description>&lt;P&gt;@Pat Sienkiewicz, those are good tips. Thanks.&lt;/P&gt;</description>
    <pubDate>Wed, 09 Nov 2022 02:26:23 GMT</pubDate>
    <dc:creator>gideont</dc:creator>
    <dc:date>2022-11-09T02:26:23Z</dc:date>
    <item>
      <title>spark sql update really slow</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-sql-update-really-slow/m-p/23235#M16000</link>
      <description>&lt;P&gt;I am trying to use Spark as much as possible, but I have run into a performance regression. I am hoping to get some direction on how to use it correctly.&lt;/P&gt;&lt;P&gt;I've created a Databricks table using spark.sql:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.sql('select * from example_view ') \
    .write \
    .mode('overwrite') \
    .saveAsTable('example_table')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then I need to patch some values:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sql
update example_table set create_date = '2022-02-16' where id = '123';
update example_table set create_date = '2022-02-17' where id = '124';
update example_table set create_date = '2022-02-18' where id = '125';
update example_table set create_date = '2022-02-19' where id = '126';&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, I found this awfully slow, since it created hundreds of Spark jobs:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1236i9D3E24F72DF2C3C8/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Why is Spark doing this, and how can I improve my code? The last thing I want to do is convert it back to Pandas and update the cell values individually. Any suggestion is appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2022 04:11:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-sql-update-really-slow/m-p/23235#M16000</guid>
      <dc:creator>gideont</dc:creator>
      <dc:date>2022-11-08T04:11:31Z</dc:date>
    </item>
    <item>
      <title>Re: spark sql update really slow</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-sql-update-really-slow/m-p/23236#M16001</link>
      <description>&lt;P&gt;Hi @Vincent Doe,&lt;/P&gt;&lt;P&gt;Updates are available in Delta tables, but under the hood you are updating Parquet files. This means that each update needs to find the files where the matching records are stored, rewrite those files as a new version, and make the new files the current version. Running four separate UPDATE statements repeats this work four times.&lt;/P&gt;&lt;P&gt;In your case, maybe you could try something like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    spark.sql("""
select 
col1,
col2,
col3,
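case 
when id = '123' then '2022-02-16'
when id = '124' then '2022-02-17'
when id = '125' then '2022-02-18'
when id = '126' then '2022-02-19'
-- keep the existing value for every other row
else create_date
end as create_date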
...
 from example_view
""") \
        .write \
        .mode('overwrite') \
        .saveAsTable('example_table')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2022 08:00:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-sql-update-really-slow/m-p/23236#M16001</guid>
      <dc:creator>Pat</dc:creator>
      <dc:date>2022-11-08T08:00:14Z</dc:date>
    </item>
    <item>
      <title>Re: spark sql update really slow</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-sql-update-really-slow/m-p/23237#M16002</link>
      <description>&lt;P&gt;@Pat Sienkiewicz, those are good tips. Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Nov 2022 02:26:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-sql-update-really-slow/m-p/23237#M16002</guid>
      <dc:creator>gideont</dc:creator>
      <dc:date>2022-11-09T02:26:23Z</dc:date>
    </item>
  </channel>
</rss>

