<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: OPTIMIZE in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimize/m-p/11036#M6077</link>
    <description>&lt;P&gt;That depends on the query, the table, and which kind of optimization you use (bin-packing or Z-ordering).&lt;/P&gt;&lt;P&gt;By default, Delta Lake collects statistics for the first 32 columns of a table (this number can be changed).&lt;/P&gt;&lt;P&gt;Building statistics for long strings is also more expensive than, for example, for integers.&lt;/P&gt;&lt;P&gt;In addition, evaluating numbers is generally faster than evaluating strings.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-into#load-csv-files" alt="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-into#load-csv-files" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-into#load-csv-files&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Cluster autoscaling or evicted spot instances could also play a role.&lt;/P&gt;&lt;P&gt;So it is not easy to pinpoint the cause of the difference.&lt;/P&gt;</description>
    <pubDate>Thu, 11 Nov 2021 07:54:16 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-11-11T07:54:16Z</dc:date>
    <item>
      <title>OPTIMIZE</title>
      <link>https://community.databricks.com/t5/data-engineering/optimize/m-p/11034#M6075</link>
      <description>&lt;P&gt;I have been testing OPTIMIZE on a large dataset (about 775 million rows) and getting mixed results. When I queried the column as a 'string', the query returned in 2.5 minutes; with the same column as an 'integer', the same query returned in 9.7 seconds. Please advise.&lt;/P&gt;&lt;P&gt;I am using Databricks Runtime 9.1 LTS on Azure.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Nov 2021 02:03:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimize/m-p/11034#M6075</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-11-11T02:03:54Z</dc:date>
    </item>
    <item>
      <title>Re: OPTIMIZE</title>
      <link>https://community.databricks.com/t5/data-engineering/optimize/m-p/11036#M6077</link>
      <description>&lt;P&gt;That depends on the query, the table, and which kind of optimization you use (bin-packing or Z-ordering).&lt;/P&gt;&lt;P&gt;By default, Delta Lake collects statistics for the first 32 columns of a table (this number can be changed).&lt;/P&gt;&lt;P&gt;Building statistics for long strings is also more expensive than, for example, for integers.&lt;/P&gt;&lt;P&gt;In addition, evaluating numbers is generally faster than evaluating strings.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-into#load-csv-files" alt="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-into#load-csv-files" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-into#load-csv-files&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Cluster autoscaling or evicted spot instances could also play a role.&lt;/P&gt;&lt;P&gt;So it is not easy to pinpoint the cause of the difference.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Nov 2021 07:54:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimize/m-p/11036#M6077</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-11T07:54:16Z</dc:date>
    </item>
    <item>
      <title>Re: OPTIMIZE</title>
      <link>https://community.databricks.com/t5/data-engineering/optimize/m-p/11037#M6078</link>
      <description>&lt;P&gt;@Werner Stinckens Thanks for your explanation.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Nov 2021 05:52:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimize/m-p/11037#M6078</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-11-12T05:52:06Z</dc:date>
    </item>
  </channel>
</rss>

