Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Lakehouse Federation - fetch size parameter for optimization

dbdev
Contributor

Hi,

We use lakehouse federation to connect to a database.
A performance recommendation is to use 'fetchSize':
Lakehouse Federation performance recommendations - Azure Databricks | Microsoft Learn

 

SELECT * 
FROM mySqlCatalog.schema.table 
WITH ('fetchSize' 100000)

 

But as far as I can find, it's only possible in SQL.

As we would like to do some further column transformations on it, we'd like to use PySpark.

I know you can do spark.sql("....") as well, but is there any way to specify the fetch size in pure PySpark instead of SQL?

3 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @dbdev ,

You can try to set fetchSize using spark.read.option, as suggested in the article below:

Redshift queries using Lakehouse Federation taking longer than expected - Databricks
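
For reference, here's roughly what that would look like on a federated table (just a sketch; whether the option is actually forwarded to the remote connection for federated reads isn't something I can confirm):

```python
# Sketch: setting fetchSize via a reader option on a federated catalog table.
# Assumption: it's not documented whether this option is passed through to the
# underlying JDBC connection when reading via Lakehouse Federation.
df = (
    spark.read
    .option("fetchSize", 100000)
    .table("mySqlCatalog.schema.table")
)
```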

Hi @szymon_dybczak ,
I've seen that option, but I wasn't sure it would have any effect, since it's closely tied to .format("jdbc"), which I don't use. The linked article explicitly uses .format("jdbc").

I know Lakehouse Federation is just JDBC under the hood, but the documentation doesn't specify that these options are passed through.

Louis_Frolio
Databricks Employee

Hello @dbdev , I did some digging and here are some suggestions.

The `fetchSize` parameter in Lakehouse Federation is currently only available through SQL syntax using the `WITH` clause, as documented in the performance recommendations. Unfortunately, there is no native PySpark API to specify `fetchSize` when reading from federated catalogs directly.

Workarounds for PySpark

There are two approaches you can consider:

1. Use spark.sql() with the WITH clause

You can execute SQL with the `fetchSize` parameter and convert the result to a PySpark DataFrame for further transformations:

```python
df = spark.sql("""
SELECT *
FROM mySqlCatalog.schema.table
WITH ('fetchSize' 100000)
""")

# Now perform your PySpark transformations
df_transformed = df.filter(...).select(...)
```

This approach allows you to benefit from the `fetchSize` optimization while still performing column transformations in PySpark.

2. Use JDBC format directly (with caveats)

While Lakehouse Federation uses JDBC under the hood, using the traditional `spark.read.format("jdbc")` approach bypasses the federation layer entirely. This means you would need to:

- Provide direct JDBC connection strings
- Manage credentials separately (not using Unity Catalog connections)
- Lose the benefits of Unity Catalog governance

This is generally not recommended if you're already using Lakehouse Federation.
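
For completeness, a direct JDBC read would look roughly like this (a sketch only; the URL, credentials, and table name below are placeholders, not values from this thread, and this path bypasses Unity Catalog governance as noted above):

```python
# Sketch: bypassing the federation layer with a plain JDBC read.
# All connection details below are placeholders.
df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://<host>:3306/<database>")  # placeholder connection URL
    .option("dbtable", "schema.table")
    .option("user", "<user>")          # credentials managed outside Unity Catalog
    .option("password", "<password>")
    .option("fetchsize", 100000)       # standard Spark JDBC reader option
    .load()
)
```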

Important Considerations

- Runtime requirements: The `fetchSize` parameter with the `WITH` clause requires Databricks Runtime 16.1 or above, or SQL warehouses on Pro/Serverless using 2024.50 or later
- Recommended value: Databricks recommends using a large `fetchSize` value (e.g., 100,000) to minimize round trips and improve performance
- Supported connectors: This optimization works with JDBC-based connectors including MySQL, PostgreSQL, SQL Server, Azure Synapse, Oracle, Redshift, Teradata, and Salesforce Data 360

Additional Performance Options

If you need even better performance for large datasets, consider enabling parallel reads with additional parameters like `numPartitions`, `partitionColumn`, `lowerBound`, and `upperBound` (available in DBR 17.1+). For Snowflake specifically, you can use the `partition_size_in_mb` parameter instead.
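
As a rough illustration, the parallel-read parameters would go into the same `WITH` clause (a sketch, assuming multiple options can be listed comma-separated and that your runtime meets the version requirement; the partition column and bounds are placeholders):

```python
# Sketch: parallel federated read via the WITH clause (DBR 17.1+ per the note above).
# 'id' and the bound values are placeholders for your own partitioning column.
df = spark.sql("""
    SELECT *
    FROM mySqlCatalog.schema.table
    WITH (
      'fetchSize' 100000,
      'numPartitions' 8,
      'partitionColumn' 'id',
      'lowerBound' 1,
      'upperBound' 1000000
    )
""")
```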

The SQL-then-transform approach is currently the most practical solution for combining `fetchSize` optimization with PySpark transformations when using Lakehouse Federation.

Hope this helps, Louis.

