<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What is the best way to handle big data sets? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24976#M17385</link>
    <description>&lt;UL&gt;&lt;LI&gt;Look for data skew: some partitions can be very big and some small because of incorrect partitioning. You can use the Spark UI to spot this, but also debug your code a bit (call getNumPartitions()); JDBC reads from SQL in particular can divide data unequally across partitions (the connector has settings such as lowerBound and upperBound). You could try setting the number of partitions to the workers' cores multiplied by X, so tasks are processed step by step in a queue.&lt;/LI&gt;&lt;LI&gt;Increase the shuffle partition count: &lt;I&gt;spark.sql.shuffle.partitions&lt;/I&gt; defaults to 200; try a bigger value, calculated as data size divided by the target partition size.&lt;/LI&gt;&lt;LI&gt;Increase the driver size to about twice the executor size (but to find the optimal size, analyze the load: in Databricks, on the cluster tab, look at Metrics, where there is Ganglia, or better yet integrate Datadog with the cluster).&lt;/LI&gt;&lt;LI&gt;Check wide transformations, the ones which need to shuffle data between partitions, and group them together so only one shuffle is needed.&lt;/LI&gt;&lt;LI&gt;If you need to filter data, do it if possible right after the read from SQL, so predicate pushdown adds a WHERE clause to the SQL query.&lt;/LI&gt;&lt;LI&gt;Make sure that everything runs in a distributed way, especially UDFs; use vectorized pandas UDFs so they run on the executors, and don't use collect() etc.&lt;/LI&gt;&lt;LI&gt;Regarding infrastructure, use more workers and check that your ADLS is connected via Private Link. Monitor save progress in the target folder. You can also use premium ADLS, which is faster.&lt;/LI&gt;&lt;LI&gt;Sometimes I process big data as a stream, as it is easier with big data sets; in that scenario you would need Kafka (e.g. Confluent Cloud) between SQL and Databricks.&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Tue, 22 Mar 2022 10:49:45 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-03-22T10:49:45Z</dc:date>
    <item>
      <title>What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24974#M17383</link>
      <description>&lt;P&gt;I'm trying to find the best strategy for handling big data sets. In this case I have something that is 450 million records. I'm pulling the data from SQL Server very quickly, but when I try to push the data to a Delta table or an Azure container, the compute resource locks up and never completes. I end up canceling the process after an hour. Looking at the logs, it looks like the compute resource keeps hitting memory issues.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 04:59:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24974#M17383</guid>
      <dc:creator>Chris_Shehu</dc:creator>
      <dc:date>2022-03-22T04:59:31Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24975#M17384</link>
      <description>&lt;P&gt;@Christopher Shehu​&amp;nbsp;if your clusters are hitting the memory limit, you may try increasing the cluster size.&lt;/P&gt;&lt;P&gt;Other points to consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Avoid memory-intensive operations such as:&lt;UL&gt;&lt;LI&gt;the collect() operator, which brings a large amount of data to the driver,&lt;/LI&gt;&lt;LI&gt;conversion of a large DataFrame to pandas.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Please find more details here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://kb.databricks.com/jobs/driver-unavailable.html" target="_blank"&gt;https://kb.databricks.com/jobs/driver-unavailable.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You may consider reading this too:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 05:39:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24975#M17384</guid>
      <dc:creator>Atanu</dc:creator>
      <dc:date>2022-03-22T05:39:00Z</dc:date>
    </item>
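A back-of-the-envelope check makes the advice above concrete: collecting 450 million rows to the driver cannot fit in a typical driver's memory, which is why collect() and toPandas() lock the cluster up. A minimal sketch; the ~200 bytes per row figure is an illustrative assumption, not a value from the thread:

```python
def estimated_collect_size_gb(row_count: int, avg_row_bytes: int) -> float:
    """Rough in-memory size if a DataFrame were pulled to the driver with collect()."""
    return row_count * avg_row_bytes / 1024 ** 3

# 450 million rows at an assumed ~200 bytes per row
size_gb = estimated_collect_size_gb(450_000_000, 200)
print(f"~{size_gb:.0f} GB")  # far larger than a typical driver heap
```

Writing the DataFrame out with distributed writers (df.write) avoids this entirely, since no single node ever holds the full dataset.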
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24976#M17385</link>
      <description>&lt;UL&gt;&lt;LI&gt;Look for data skew: some partitions can be very big and some small because of incorrect partitioning. You can use the Spark UI to spot this, but also debug your code a bit (call getNumPartitions()); JDBC reads from SQL in particular can divide data unequally across partitions (the connector has settings such as lowerBound and upperBound). You could try setting the number of partitions to the workers' cores multiplied by X, so tasks are processed step by step in a queue.&lt;/LI&gt;&lt;LI&gt;Increase the shuffle partition count: &lt;I&gt;spark.sql.shuffle.partitions&lt;/I&gt; defaults to 200; try a bigger value, calculated as data size divided by the target partition size.&lt;/LI&gt;&lt;LI&gt;Increase the driver size to about twice the executor size (but to find the optimal size, analyze the load: in Databricks, on the cluster tab, look at Metrics, where there is Ganglia, or better yet integrate Datadog with the cluster).&lt;/LI&gt;&lt;LI&gt;Check wide transformations, the ones which need to shuffle data between partitions, and group them together so only one shuffle is needed.&lt;/LI&gt;&lt;LI&gt;If you need to filter data, do it if possible right after the read from SQL, so predicate pushdown adds a WHERE clause to the SQL query.&lt;/LI&gt;&lt;LI&gt;Make sure that everything runs in a distributed way, especially UDFs; use vectorized pandas UDFs so they run on the executors, and don't use collect() etc.&lt;/LI&gt;&lt;LI&gt;Regarding infrastructure, use more workers and check that your ADLS is connected via Private Link. Monitor save progress in the target folder. You can also use premium ADLS, which is faster.&lt;/LI&gt;&lt;LI&gt;Sometimes I process big data as a stream, as it is easier with big data sets; in that scenario you would need Kafka (e.g. Confluent Cloud) between SQL and Databricks.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 22 Mar 2022 10:49:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24976#M17385</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-22T10:49:45Z</dc:date>
    </item>
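The shuffle-partition sizing rule in the reply above (data size divided by target partition size) can be sketched as a small helper. The 128 MB default target is a commonly used rule of thumb, not a value from the thread:

```python
import math

def recommended_shuffle_partitions(data_size_bytes: int,
                                   target_partition_bytes: int = 128 * 1024 ** 2) -> int:
    """Suggested spark.sql.shuffle.partitions: data size / target partition size."""
    return max(1, math.ceil(data_size_bytes / target_partition_bytes))

# e.g. ~90 GB of shuffle data -> 720 partitions of ~128 MB each
n = recommended_shuffle_partitions(90 * 1024 ** 3)
# then: spark.conf.set("spark.sql.shuffle.partitions", n)
```

Compared with the default of 200, a larger value keeps each shuffle partition small enough to fit comfortably in executor memory.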
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24977#M17386</link>
      <description>&lt;P&gt;This is helpful. I think I need to look closer at the process and see what needs to be done. The Azure Databricks documentation on PySpark partitioning is lacking.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 13:43:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24977#M17386</guid>
      <dc:creator>Chris_Shehu</dc:creator>
      <dc:date>2022-03-22T13:43:46Z</dc:date>
    </item>
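Putting the earlier partitioning advice together for this specific case (SQL Server to Delta): a sketch of a partitioned JDBC read, assuming a numeric `id` column whose values span the 450 million rows; the host, database, table name, and bounds are hypothetical placeholders, not details from the thread:

```python
# Options for a partitioned JDBC read from SQL Server.
# lowerBound/upperBound should come from min(id)/max(id) in the source table.
jdbc_options = {
    "url": "jdbc:sqlserver://<host>:1433;databaseName=<db>",
    "dbtable": "dbo.big_table",
    "partitionColumn": "id",      # must be numeric, date, or timestamp
    "lowerBound": "1",
    "upperBound": "450000000",
    "numPartitions": "64",        # parallel read slices; ~ worker cores * X
}

def copy_to_delta(spark, target_path: str):
    """Read SQL Server in parallel slices and write straight to Delta,
    never collecting anything to the driver."""
    df = spark.read.format("jdbc").options(**jdbc_options).load()
    df.write.format("delta").mode("overwrite").save(target_path)
```

Without partitionColumn and bounds, the JDBC source reads the whole table through a single partition, which matches the single-node memory pressure described in the original question.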
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24978#M17387</link>
      <description>&lt;P&gt;Cherish your data. “Keep your raw data raw: don't manipulate it without having a copy,” says Teal. Visualize the information. Show your workflow. Use version control. Record metadata. Automate, automate, automate. Make computing time count. Capture your environment.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Mar 2022 12:04:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24978#M17387</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-03-25T12:04:26Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/39655#M27048</link>
      <description>&lt;P&gt;I think you should consult Big Data experts for advice on this issue.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Aug 2023 13:41:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/39655#M27048</guid>
      <dc:creator>Wilynan</dc:creator>
      <dc:date>2023-08-11T13:41:05Z</dc:date>
    </item>
  </channel>
</rss>

