<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: all-purpose compute for Oracle queries in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/all-purpose-compute-for-oracle-queries/m-p/97154#M39438</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/127154"&gt;@ElaPG1&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;The cluster configuration sounds reasonable, especially with autoscaling enabled, but the right setup ultimately depends on your workload.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The &lt;CODE&gt;Standard_D8s_v5&lt;/CODE&gt; instances you are using have 32GB memory and 8 cores. While these are generally good, you might want to experiment with different instance types that offer a better balance of CPU and memory for your specific workload. For example, instances with higher memory might help if your tasks are memory-intensive.&lt;/LI&gt;
&lt;LI&gt;Adjust the batch size for data extraction from Oracle. Larger batch sizes can reduce the number of round trips to the database, but they also require more memory.&lt;/LI&gt;
&lt;LI&gt;Ensure that the data extraction process is parallelized effectively. Use multiple connections to the Oracle database to extract data from different tables simultaneously.&lt;/LI&gt;
&lt;LI&gt;Check if you are setting the appropriate block sizes and compression codecs. For example, using &lt;CODE&gt;snappy&lt;/CODE&gt; compression can speed up the writing process.&lt;/LI&gt;
&lt;LI&gt;Partition the data appropriately when writing to Parquet files. This can improve both the writing and subsequent reading performance.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;I would also suggest opening the Spark UI to identify which stage or task is taking the most time and what operation it is performing; check the metrics on the DAG, and enable the additional-metrics checkbox for more detail.&lt;/P&gt;
&lt;P&gt;You can also review memory and CPU utilization there.&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 01 Nov 2024 05:46:43 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2024-11-01T05:46:43Z</dc:date>
    <item>
      <title>all-purpose compute for Oracle queries</title>
      <link>https://community.databricks.com/t5/data-engineering/all-purpose-compute-for-oracle-queries/m-p/94195#M38834</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am looking for any guidelines, best practices regarding compute configuration for extracting data from Oracle db and saving it as parquet files. Right now I have a DBR workflow with for each task, concurrency = 31 (as I need to copy the data from 31 tables). I use Standard_D8s_v5 for both - worker and driver (32GB memory, 8 cores, min workers 2, max workers 31, enable autoscaling - checked). It takes over 1,5h to save the result from all 31 tables.&lt;/P&gt;&lt;P&gt;Any ideas what could potentially speed up the process?&lt;/P&gt;</description>
      <pubDate>Tue, 15 Oct 2024 19:59:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/all-purpose-compute-for-oracle-queries/m-p/94195#M38834</guid>
      <dc:creator>ElaPG1</dc:creator>
      <dc:date>2024-10-15T19:59:07Z</dc:date>
    </item>
    <item>
      <title>Re: all-purpose compute for Oracle queries</title>
      <link>https://community.databricks.com/t5/data-engineering/all-purpose-compute-for-oracle-queries/m-p/97154#M39438</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/127154"&gt;@ElaPG1&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;The cluster configuration sounds reasonable, especially with autoscaling enabled, but the right setup ultimately depends on your workload.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The &lt;CODE&gt;Standard_D8s_v5&lt;/CODE&gt; instances you are using have 32GB memory and 8 cores. While these are generally good, you might want to experiment with different instance types that offer a better balance of CPU and memory for your specific workload. For example, instances with higher memory might help if your tasks are memory-intensive.&lt;/LI&gt;
&lt;LI&gt;Adjust the batch size for data extraction from Oracle. Larger batch sizes can reduce the number of round trips to the database, but they also require more memory.&lt;/LI&gt;
&lt;LI&gt;Ensure that the data extraction process is parallelized effectively. Use multiple connections to the Oracle database to extract data from different tables simultaneously.&lt;/LI&gt;
&lt;LI&gt;Check if you are setting the appropriate block sizes and compression codecs. For example, using &lt;CODE&gt;snappy&lt;/CODE&gt; compression can speed up the writing process.&lt;/LI&gt;
&lt;LI&gt;Partition the data appropriately when writing to Parquet files. This can improve both the writing and subsequent reading performance.&lt;/LI&gt;
&lt;/UL&gt;
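&lt;P&gt;The parallel-read, fetch-size, and Parquet options above can be combined in PySpark roughly like this. This is a minimal sketch: the connection URL, table, partition column, bounds, and output path are all placeholders you would replace with your own values, and they assume a roughly uniformly distributed numeric key.&lt;/P&gt;

```python
# Sketch: parallel JDBC extract from Oracle, written as partitioned Parquet.
# All connection details and column names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-extract").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE")
    .option("dbtable", "SCHEMA.MY_TABLE")
    .option("user", "oracle_user")
    .option("password", "oracle_password")
    # Split the read into parallel tasks across the value range of a
    # numeric column; one JDBC connection is opened per partition.
    .option("partitionColumn", "ID")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    # Rows fetched per round trip: larger values mean fewer round trips
    # to Oracle, but more memory held per task.
    .option("fetchsize", "10000")
    .load()
)

(
    df.write.mode("overwrite")
    .option("compression", "snappy")  # fast codec, cheap to decompress
    .partitionBy("LOAD_DATE")         # partition column is an assumption
    .parquet("/mnt/raw/my_table")
)
```

&lt;P&gt;Note that &lt;CODE&gt;numPartitions&lt;/CODE&gt; here controls parallelism within a single table read; it is independent of the for-each concurrency across the 31 tables, so tune them together against the cluster's total core count.&lt;/P&gt;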
&lt;P&gt;I would also suggest opening the Spark UI to identify which stage or task is taking the most time and what operation it is performing; check the metrics on the DAG, and enable the additional-metrics checkbox for more detail.&lt;/P&gt;
&lt;P&gt;You can also review memory and CPU utilization there.&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2024 05:46:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/all-purpose-compute-for-oracle-queries/m-p/97154#M39438</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-11-01T05:46:43Z</dc:date>
    </item>
  </channel>
</rss>

