<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Lakehouse Federation - fetch size parameter for optimization in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/138045#M50850</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179916"&gt;@dbdev&lt;/a&gt;&amp;nbsp;, I did some digging and here are some suggestions.&lt;/P&gt;
&lt;P&gt;The `fetchSize` parameter in Lakehouse Federation is currently &lt;STRONG&gt;only available through SQL syntax&lt;/STRONG&gt;&amp;nbsp;using the `WITH` clause, as documented in the performance recommendations. Unfortunately, there is no native PySpark API for specifying `fetchSize` when reading from federated catalogs directly.&lt;/P&gt;
&lt;H3&gt;Workarounds for PySpark&lt;/H3&gt;
&lt;P&gt;There are two approaches you can consider:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Use spark.sql() with the WITH clause&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;You can execute SQL with the `fetchSize` parameter and convert the result to a PySpark DataFrame for further transformations:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;df = spark.sql("""
SELECT *
FROM mySqlCatalog.schema.table
WITH ('fetchSize' 100000)
""")

# Now perform your PySpark transformations
df_transformed = df.filter(...).select(...)&lt;/LI-CODE&gt;
&lt;P&gt;This approach allows you to benefit from the `fetchSize` optimization while still performing column transformations in PySpark.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Use JDBC format directly (with caveats)&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;While Lakehouse Federation uses JDBC under the hood, using the traditional `spark.read.format("jdbc")` approach bypasses the federation layer entirely. This means you would need to:&lt;/P&gt;
&lt;P&gt;- Provide direct JDBC connection strings&lt;BR /&gt;- Manage credentials separately (not using Unity Catalog connections)&lt;BR /&gt;- Lose the benefits of Unity Catalog governance&lt;/P&gt;
&lt;P&gt;This is generally not recommended if you're already using Lakehouse Federation.&lt;/P&gt;
&lt;H3&gt;Important Considerations&lt;/H3&gt;
&lt;P&gt;- &lt;STRONG&gt;Runtime requirements&lt;/STRONG&gt;: The `fetchSize` parameter with the `WITH` clause requires Databricks Runtime 16.1 or above, or SQL warehouses on Pro/Serverless using 2024.50 or later&lt;BR /&gt;- &lt;STRONG&gt;Recommended value&lt;/STRONG&gt;: Databricks recommends using a large `fetchSize` value (e.g., 100,000) to minimize round trips and improve performance&lt;BR /&gt;- &lt;STRONG&gt;Supported connectors&lt;/STRONG&gt;: This optimization works with JDBC-based connectors including MySQL, PostgreSQL, SQL Server, Azure Synapse, Oracle, Redshift, Teradata, and Salesforce Data 360&lt;/P&gt;
&lt;H3&gt;Additional Performance Options&lt;/H3&gt;
&lt;P&gt;If you need even better performance for large datasets, consider enabling &lt;STRONG&gt;parallel reads&lt;/STRONG&gt;&amp;nbsp;with additional parameters like `numPartitions`, `partitionColumn`, `lowerBound`, and `upperBound` (available in DBR 17.1+). For Snowflake specifically, you can use the `partition_size_in_mb` parameter instead.&lt;/P&gt;
&lt;P&gt;The SQL-then-transform approach is currently the most practical solution for combining `fetchSize` optimization with PySpark transformations when using Lakehouse Federation.&lt;/P&gt;
&lt;P&gt;Hope this helps,&lt;BR /&gt;Louis&lt;/P&gt;</description>
    <pubDate>Thu, 06 Nov 2025 22:56:39 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-11-06T22:56:39Z</dc:date>
    <item>
      <title>Lakehouse Federation - fetch size parameter for optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/137529#M50756</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We use lakehouse federation to connect to a database.&lt;BR /&gt;A performance recommendation is to use 'fetchSize':&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/query-federation/performance-recommendations#set-the-size-of-batches-fetched-from-the-remote-database" target="_blank" rel="noopener"&gt;Lakehouse Federation performance recommendations - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;SELECT * 
FROM mySqlCatalog.schema.table 
WITH ('fetchSize' 100000)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But as far as I can find it's only possible in SQL.&lt;/P&gt;&lt;P&gt;As we would like to do some further column transformations on it, we'd like to use pyspark.&lt;/P&gt;&lt;P&gt;I know you can do spark.sql("....") as well, but is there any way to specify the fetch size in pure pyspark, instead of SQL?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Nov 2025 09:59:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/137529#M50756</guid>
      <dc:creator>dbdev</dc:creator>
      <dc:date>2025-11-04T09:59:27Z</dc:date>
    </item>
    <item>
      <title>Re: Lakehouse Federation - fetch size parameter for optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/137530#M50757</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179916"&gt;@dbdev&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;You can try to set fetchSize using spark.read.option as they suggested at below article:&lt;/P&gt;&lt;P&gt;&lt;A href="https://kb.databricks.com/dbsql/redshift-queries-using-lakehouse-federation-taking-longer-than-expected" target="_blank"&gt;Redshift queries using Lakehouse Federation taking longer than expected - Databricks&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 04 Nov 2025 10:04:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/137530#M50757</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-11-04T10:04:58Z</dc:date>
    </item>
    <item>
      <title>Re: Lakehouse Federation - fetch size parameter for optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/137571#M50765</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;I've seen that option, but I wasn't sure it would have any effect, since it's tied to .format("jdbc"), which I don't use. The linked article explicitly uses .format("jdbc").&lt;BR /&gt;&lt;BR /&gt;I know Lakehouse Federation is just JDBC under the hood, but the documentation doesn't specify that it passes these options through.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Nov 2025 13:03:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/137571#M50765</guid>
      <dc:creator>dbdev</dc:creator>
      <dc:date>2025-11-04T13:03:05Z</dc:date>
    </item>
    <item>
      <title>Re: Lakehouse Federation - fetch size parameter for optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/138045#M50850</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179916"&gt;@dbdev&lt;/a&gt;&amp;nbsp;, I did some digging and here are some suggestions.&lt;/P&gt;
&lt;P&gt;The `fetchSize` parameter in Lakehouse Federation is currently &lt;STRONG&gt;only available through SQL syntax&lt;/STRONG&gt;&amp;nbsp;using the `WITH` clause, as documented in the performance recommendations. Unfortunately, there is no native PySpark API for specifying `fetchSize` when reading from federated catalogs directly.&lt;/P&gt;
&lt;H3&gt;Workarounds for PySpark&lt;/H3&gt;
&lt;P&gt;There are two approaches you can consider:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Use spark.sql() with the WITH clause&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;You can execute SQL with the `fetchSize` parameter and convert the result to a PySpark DataFrame for further transformations:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;df = spark.sql("""
SELECT *
FROM mySqlCatalog.schema.table
WITH ('fetchSize' 100000)
""")

# Now perform your PySpark transformations
df_transformed = df.filter(...).select(...)&lt;/LI-CODE&gt;
&lt;P&gt;This approach allows you to benefit from the `fetchSize` optimization while still performing column transformations in PySpark.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Use JDBC format directly (with caveats)&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;While Lakehouse Federation uses JDBC under the hood, using the traditional `spark.read.format("jdbc")` approach bypasses the federation layer entirely. This means you would need to:&lt;/P&gt;
&lt;P&gt;- Provide direct JDBC connection strings&lt;BR /&gt;- Manage credentials separately (not using Unity Catalog connections)&lt;BR /&gt;- Lose the benefits of Unity Catalog governance&lt;/P&gt;
&lt;P&gt;This is generally not recommended if you're already using Lakehouse Federation.&lt;/P&gt;
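For completeness, here is a minimal sketch of what that direct-JDBC path would look like if you ever did need it. All connection values below are placeholders; `fetchsize` (lowercase) is the standard Spark JDBC source option for rows fetched per round trip:

```python
# Sketch of a direct JDBC read that bypasses Lakehouse Federation.
# Host, table, and credential values are placeholders.

def jdbc_read_options(url, table, user, password, fetch_size=100_000):
    """Build the option dict for spark.read.format('jdbc')."""
    return {
        "url": url,                    # e.g. "jdbc:mysql://host:3306/mydb"
        "dbtable": table,              # remote table (or subquery alias)
        "user": user,                  # credentials managed outside Unity Catalog
        "password": password,
        "fetchsize": str(fetch_size),  # rows per driver round trip
    }

def read_via_jdbc(spark, **conn):
    # Note: no Unity Catalog governance applies on this path.
    return spark.read.format("jdbc").options(**jdbc_read_options(**conn)).load()
```

Again, prefer the federated route; this is only worth it when you need a JDBC option that federation does not expose.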
&lt;H3&gt;Important Considerations&lt;/H3&gt;
&lt;P&gt;- &lt;STRONG&gt;Runtime requirements&lt;/STRONG&gt;: The `fetchSize` parameter with the `WITH` clause requires Databricks Runtime 16.1 or above, or SQL warehouses on Pro/Serverless using 2024.50 or later&lt;BR /&gt;- &lt;STRONG&gt;Recommended value&lt;/STRONG&gt;: Databricks recommends using a large `fetchSize` value (e.g., 100,000) to minimize round trips and improve performance&lt;BR /&gt;- &lt;STRONG&gt;Supported connectors&lt;/STRONG&gt;: This optimization works with JDBC-based connectors including MySQL, PostgreSQL, SQL Server, Azure Synapse, Oracle, Redshift, Teradata, and Salesforce Data 360&lt;/P&gt;
&lt;H3&gt;Additional Performance Options&lt;/H3&gt;
&lt;P&gt;If you need even better performance for large datasets, consider enabling &lt;STRONG&gt;parallel reads&lt;/STRONG&gt;&amp;nbsp;with additional parameters like `numPartitions`, `partitionColumn`, `lowerBound`, and `upperBound` (available in DBR 17.1+). For Snowflake specifically, you can use the `partition_size_in_mb` parameter instead.&lt;/P&gt;
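To illustrate, a small helper could combine `fetchSize` with the parallel-read options in one `WITH` clause, mirroring the SQL shown earlier. The table, partition column, and bounds below are placeholders, and this assumes DBR 17.1+; check the performance-recommendations page for the exact option spelling on your connector:

```python
# Illustrative: build a federated query string that sets fetchSize plus
# the parallel-read options, then run it with spark.sql(...).

def federated_query(table, partition_column, lower_bound, upper_bound,
                    num_partitions=8, fetch_size=100_000):
    return (
        f"SELECT * FROM {table} "
        f"WITH ('fetchSize' {fetch_size}, "
        f"'numPartitions' {num_partitions}, "
        f"'partitionColumn' '{partition_column}', "
        f"'lowerBound' '{lower_bound}', "
        f"'upperBound' '{upper_bound}')"
    )

sql = federated_query("mySqlCatalog.schema.table", "id", 0, 10_000_000)
# df = spark.sql(sql)  # then transform df in PySpark as usual
```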
&lt;P&gt;The SQL-then-transform approach is currently the most practical solution for combining `fetchSize` optimization with PySpark transformations when using Lakehouse Federation.&lt;/P&gt;
&lt;P&gt;Hope this helps,&lt;BR /&gt;Louis&lt;/P&gt;</description>
      <pubDate>Thu, 06 Nov 2025 22:56:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakehouse-federation-fetch-size-parameter-for-optimization/m-p/138045#M50850</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-06T22:56:39Z</dc:date>
    </item>
  </channel>
</rss>

