Hello @dbdev, I did some digging and here are a few suggestions.
The `fetchSize` parameter in Lakehouse Federation is currently only available through SQL syntax using the `WITH` clause, as documented in the performance recommendations. Unfortunately, there is no native PySpark API to specify `fetchSize` when reading from federated catalogs directly.

Workarounds for PySpark
There are two approaches you can consider:
1. Use `spark.sql()` with the `WITH` clause
You can execute SQL with the `fetchSize` parameter and convert the result to a PySpark DataFrame for further transformations:
```python
df = spark.sql("""
SELECT *
FROM mySqlCatalog.schema.table
WITH ('fetchSize' 100000)
""")
# Now perform your PySpark transformations
df_transformed = df.filter(...).select(...)
```
This approach allows you to benefit from the `fetchSize` optimization while still performing column transformations in PySpark.
2. Use the JDBC format directly (with caveats)
While Lakehouse Federation uses JDBC under the hood, using the traditional `spark.read.format("jdbc")` approach bypasses the federation layer entirely. This means you would need to:
- Provide direct JDBC connection strings
- Manage credentials separately (not using Unity Catalog connections)
- Lose the benefits of Unity Catalog governance
This is generally not recommended if you're already using Lakehouse Federation, but for completeness, a rough sketch is shown below.
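If you do decide to go this route, a minimal sketch might look like the following. The connection URL, credentials, and table name are placeholders, not values from your environment, and note that the classic Spark JDBC reader documents the option in lowercase as `fetchsize`:

```python
# Minimal sketch of the classic JDBC path -- this bypasses Unity Catalog governance.
# All connection details below are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<host>:3306/<database>")
    .option("dbtable", "schema.table")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("fetchsize", "100000")  # classic JDBC reader option (lowercase)
    .load()
)
```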
Important Considerations
- Runtime requirements: The `fetchSize` parameter with the `WITH` clause requires Databricks Runtime 16.1 or above, or SQL warehouses on Pro/Serverless using 2024.50 or later
- Recommended value: Databricks recommends using a large `fetchSize` value (e.g., 100,000) to minimize round trips and improve performance
- Supported connectors: This optimization works with JDBC-based connectors including MySQL, PostgreSQL, SQL Server, Azure Synapse, Oracle, Redshift, Teradata, and Salesforce Data 360
Additional Performance Options
If you need even better performance for large datasets, consider enabling parallel reads with additional parameters like `numPartitions`, `partitionColumn`, `lowerBound`, and `upperBound` (available in DBR 17.1+). For Snowflake specifically, you can use the `partition_size_in_mb` parameter instead.
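For illustration, a sketch of combining `fetchSize` with the parallel-read parameters via `spark.sql()` is below. It assumes the `WITH` clause accepts these keys in the same quoted form as `fetchSize`; the column name and bound values are placeholders, so verify the exact syntax against the DBR 17.1+ documentation:

```python
# Sketch only -- assumes the WITH clause takes these keys the same way as 'fetchSize'.
# 'id' and the bound values are placeholders; pick a numeric column with an even spread.
df = spark.sql("""
    SELECT *
    FROM mySqlCatalog.schema.table
    WITH (
      'fetchSize' 100000,
      'numPartitions' 8,
      'partitionColumn' 'id',
      'lowerBound' 1,
      'upperBound' 10000000
    )
""")
```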
The SQL-then-transform approach is currently the most practical solution for combining `fetchSize` optimization with PySpark transformations when using Lakehouse Federation.
Hope this helps, Louis.