<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads? in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141056#M11122</link>
    <description>&lt;P&gt;Yes, +1 to&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This is the best doc to start with on the optimization part -&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide" target="_blank"&gt;https://www.databricks.com/discover/pages/optimize-data-workloads-guide &lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 03 Dec 2025 18:44:23 GMT</pubDate>
    <dc:creator>iyashk-DB</dc:creator>
    <dc:date>2025-12-03T18:44:23Z</dc:date>
    <item>
      <title>How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/140856#M11101</link>
      <description>&lt;P&gt;Hi everyone,&lt;BR /&gt;I’m working on building and optimizing data pipelines in Databricks, especially for large-scale workloads, and I want to learn from others who have hands-on experience with performance tuning, architecture decisions, and best practices.&lt;/P&gt;&lt;P&gt;I’d appreciate insights on the following:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Best practices for designing scalable pipelines in Databricks&lt;/LI&gt;&lt;LI&gt;How to optimize Spark jobs (partitioning, caching, cluster configs, shuffling, etc.)&lt;/LI&gt;&lt;LI&gt;Recommended cluster settings for heavy workloads&lt;/LI&gt;&lt;LI&gt;How to reduce runtime and cost while processing massive datasets&lt;/LI&gt;&lt;LI&gt;Tips for handling data skew, shuffle issues, and memory errors&lt;/LI&gt;&lt;LI&gt;Which Delta Lake features help most (Z-order, Optimize, Auto Compaction, etc.)&lt;/LI&gt;&lt;LI&gt;Workflow orchestration approaches — using Jobs, Workflows, or external tools&lt;/LI&gt;&lt;LI&gt;Monitoring &amp;amp; debugging strategies (metrics, logs, Ganglia, event logs)&lt;/LI&gt;&lt;LI&gt;Libraries, patterns, or design approaches that improved your pipeline performance&lt;/LI&gt;&lt;LI&gt;Common bottlenecks you've faced and how you solved them&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Basically, I’m asking experienced Databricks users to share optimization tips, common pitfalls, and real-world strategies that make large-scale data pipeline development more efficient.&lt;/P&gt;&lt;P&gt;Looking forward to your input and practical advice!&lt;/P&gt;</description>
      <pubDate>Tue, 02 Dec 2025 10:36:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/140856#M11101</guid>
      <dc:creator>tarunnagar</dc:creator>
      <dc:date>2025-12-02T10:36:09Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/140991#M11112</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/191263"&gt;@tarunnagar&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;There's a really good guide prepared by Databricks about performance &lt;SPAN&gt;optimization and tuning that you can use. It covers the important aspects you should keep in mind to build performant workloads.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-purging" target="_blank"&gt;Comprehensive Guide to Optimize Data Workloads | Databricks&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Also, you can take a look at the recommendations in their docs:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/optimizations/" target="_blank"&gt;Optimization recommendations on Databricks | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 03 Dec 2025 09:50:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/140991#M11112</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-12-03T09:50:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141056#M11122</link>
      <description>&lt;P&gt;Yes, +1 to&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This is the best doc to start with on the optimization part -&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide" target="_blank"&gt;https://www.databricks.com/discover/pages/optimize-data-workloads-guide &lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 03 Dec 2025 18:44:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141056#M11122</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2025-12-03T18:44:23Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141108#M11127</link>
      <description>&lt;P&gt;Optimizing data pipeline development on Databricks for large-scale workloads involves a mix of architectural design, performance tuning, and automation:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Leverage Delta Lake:&lt;/STRONG&gt; Use Delta tables for ACID transactions, schema enforcement, and efficient updates/merges.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Partition and Cluster Data:&lt;/STRONG&gt; Partition large datasets on low-cardinality columns that queries commonly filter on (date, region, etc.) and use Z-Ordering on frequently filtered high-cardinality columns for faster queries.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Use Auto-scaling &amp;amp; Spot Instances:&lt;/STRONG&gt; Let clusters scale dynamically with the workload, and consider spot instances for fault-tolerant jobs, to balance performance and cost.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Optimize Spark Jobs:&lt;/STRONG&gt; Cache intermediate data that is reused, avoid unnecessary shuffles, and use efficient joins (for example, broadcast joins when one side is small).&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Orchestrate Pipelines:&lt;/STRONG&gt; Use Databricks Workflows or orchestration tools like Airflow for reliable and repeatable ETL processes.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Monitor &amp;amp; Profile:&lt;/STRONG&gt; Use the Spark UI, cluster metrics (which replaced Ganglia on newer Databricks Runtime versions), and Databricks monitoring to identify bottlenecks and optimize job performance.&lt;/P&gt;&lt;P&gt;In short, combine Delta Lake features, smart partitioning, job optimization, and monitoring to handle large-scale workloads efficiently on Databricks.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 07:30:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141108#M11127</guid>
      <dc:creator>Suheb</dc:creator>
      <dc:date>2025-12-04T07:30:35Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141118#M11128</link>
      <description>&lt;P&gt;To optimize data pipeline development on Databricks for large-scale workloads, focus on efficient data processing and resource management. Leverage &lt;STRONG&gt;Apache Spark&lt;/STRONG&gt;'s distributed computing capabilities to handle massive datasets. Use &lt;STRONG&gt;Delta Lake&lt;/STRONG&gt; for reliable, ACID-compliant storage and faster query performance. Implement partitioning, caching, and parallel processing to improve speed and reduce latency. Automate scaling with Databricks' autoscaling clusters and tune ETL jobs with appropriate Spark configurations. Monitoring and fine-tuning resource usage further improve pipeline efficiency and reduce costs.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 08:51:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141118#M11128</guid>
      <dc:creator>ShaneCorn</dc:creator>
      <dc:date>2025-12-04T08:51:35Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141129#M11129</link>
      <description>&lt;P&gt;Optimizing Databricks pipelines for large-scale workloads mostly comes down to smart architecture + efficient Spark practices.&lt;/P&gt;&lt;P&gt;Key tips from real-world users:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Use Delta Lake – for ACID transactions, incremental updates, and schema enforcement.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Partition &amp;amp; optimize storage – partition by low-cardinality columns (such as date), and use Z-Ordering on high-cardinality columns that queries frequently filter on.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Cache wisely – cache hot data when repeatedly accessed, but avoid over-caching large datasets.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Leverage auto-scaling clusters – Databricks clusters can scale dynamically to handle large jobs efficiently.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Optimize Spark configs – tune spark.sql.shuffle.partitions, spark.memory.fraction, and adaptive query execution (AQE).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Modular pipelines – break complex ETL into smaller, testable jobs; reuse notebooks or jobs where possible.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Monitor &amp;amp; profile – use the Spark UI and Databricks Job metrics to identify bottlenecks.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Use vectorized operations and built-in functions – avoid row-by-row UDFs when possible.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Short take:&lt;BR /&gt;Use Delta Lake + smart partitioning + cluster autoscaling + Spark tuning and modular pipelines; profile and iterate to handle large-scale workloads efficiently.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 10:10:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141129#M11129</guid>
      <dc:creator>jameswood32</dc:creator>
      <dc:date>2025-12-04T10:10:55Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141217#M11134</link>
      <description>&lt;P&gt;Thanks for sharing! I’ll check out the Databricks guide.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2025 07:24:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141217#M11134</guid>
      <dc:creator>tarunnagar</dc:creator>
      <dc:date>2025-12-05T07:24:29Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Data Pipeline Development on Databricks for Large-Scale Workloads?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141219#M11136</link>
      <description>&lt;P&gt;No problem. It's a great resource. If you have any questions, just ask here &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2025 08:07:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-optimize-data-pipeline-development-on-databricks-for/m-p/141219#M11136</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-12-05T08:07:46Z</dc:date>
    </item>
  </channel>
</rss>

