4 weeks ago
Hi,
We use Lakehouse Federation to connect to a database.
A performance recommendation is to use 'fetchSize':
Lakehouse Federation performance recommendations - Azure Databricks | Microsoft Learn
SELECT *
FROM mySqlCatalog.schema.table
WITH ('fetchSize' 100000)
But as far as I can find it's only possible in SQL.
As we would like to do some further column transformations on it, we'd like to use pyspark.
I know you can do spark.sql("....") as well, but is there any way to specify the fetch size in pure pyspark, instead of SQL?
3 weeks ago
Hello @dbdev , I did some digging and here are some suggestions.
The `fetchSize` parameter in Lakehouse Federation is currently only available through SQL syntax using the `WITH` clause, as documented in the performance recommendations. Unfortunately, there is no native PySpark API for specifying `fetchSize` when reading from federated catalogs directly.
Workarounds for PySpark
There are two approaches you can consider:
1. Use spark.sql() with the WITH clause
You can execute SQL with the `fetchSize` parameter and convert the result to a PySpark DataFrame for further transformations:
```python
df = spark.sql("""
SELECT *
FROM mySqlCatalog.schema.table
WITH ('fetchSize' 100000)
""")
# Now perform your PySpark transformations
df_transformed = df.filter(...).select(...)
```
This approach allows you to benefit from the `fetchSize` optimization while still performing column transformations in PySpark.
2. Use JDBC format directly (with caveats)
While Lakehouse Federation uses JDBC under the hood, using the traditional `spark.read.format("jdbc")` approach bypasses the federation layer entirely. This means you would need to:
- Provide direct JDBC connection strings
- Manage credentials separately (not using Unity Catalog connections)
- Lose the benefits of Unity Catalog governance
This is generally not recommended if you're already using Lakehouse Federation.
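For completeness, a minimal sketch of what the direct-JDBC route would involve. The URL, table, and credentials below are placeholders, and `jdbc_read_options` is just a hypothetical helper for assembling the reader options; note that `fetchsize` here is Spark's standard JDBC reader option, not the federation `WITH` parameter:

```python
# Hypothetical helper that assembles the option map for spark.read.format("jdbc").
# The URL, user, and password are placeholders you would have to manage yourself
# (outside Unity Catalog) if you bypass the federation layer.
def jdbc_read_options(url, table, user, password, fetch_size=100000):
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Spark's JDBC source takes options as strings; "fetchsize" is the
        # standard Spark JDBC key controlling the driver's fetch size.
        "fetchsize": str(fetch_size),
    }

# Usage (requires a live SparkSession and a reachable database):
# df = (spark.read.format("jdbc")
#       .options(**jdbc_read_options("jdbc:mysql://<host>:3306/mydb",
#                                    "schema.table", "<user>", "<password>"))
#       .load())
```

Again, this trades away Unity Catalog governance, so it only makes sense if you have a reason to leave the federation layer.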
A few important notes:
- Runtime requirements: The `fetchSize` parameter with the `WITH` clause requires Databricks Runtime 16.1 or above, or SQL warehouses on Pro/Serverless using 2024.50 or later
- Recommended value: Databricks recommends using a large `fetchSize` value (e.g., 100,000) to minimize round trips and improve performance
- Supported connectors: This optimization works with JDBC-based connectors including MySQL, PostgreSQL, SQL Server, Azure Synapse, Oracle, Redshift, Teradata, and Salesforce Data 360
If you need even better performance for large datasets, consider enabling parallel reads with additional parameters like `numPartitions`, `partitionColumn`, `lowerBound`, and `upperBound` (available in DBR 17.1+). For Snowflake specifically, you can use the `partition_size_in_mb` parameter instead.
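To combine this with the spark.sql() approach, here is a sketch using a hypothetical `federated_query` helper that builds the query text; it assumes the `WITH` clause takes comma-separated key/value pairs, and the column `id` and bounds are placeholder values (check the Lakehouse Federation docs for your runtime):

```python
# Hypothetical helper building the federated query text. Assumes the WITH
# clause accepts comma-separated key/value pairs; parallel-read parameters
# require DBR 17.1+ per the docs.
def federated_query(table, fetch_size=100000, num_partitions=None,
                    partition_column=None, lower_bound=None, upper_bound=None):
    opts = [f"'fetchSize' {fetch_size}"]
    if num_partitions is not None:
        opts += [f"'numPartitions' {num_partitions}",
                 f"'partitionColumn' '{partition_column}'",
                 f"'lowerBound' '{lower_bound}'",
                 f"'upperBound' '{upper_bound}'"]
    return f"SELECT * FROM {table} WITH ({', '.join(opts)})"

# Usage (requires a live SparkSession and a federated catalog):
# df = spark.sql(federated_query("mySqlCatalog.schema.table",
#                                num_partitions=8, partition_column="id",
#                                lower_bound=1, upper_bound=1_000_000))
```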
The SQL-then-transform approach is currently the most practical solution for combining `fetchSize` optimization with PySpark transformations when using Lakehouse Federation.
Hope this helps, Louis.
4 weeks ago
Hi @dbdev ,
You can try to set fetchSize using spark.read.option, as suggested in the article below:
Redshift queries using Lakehouse Federation taking longer than expected - Databricks
4 weeks ago - last edited 4 weeks ago
Hi @szymon_dybczak ,
I've seen that option, but I wasn't sure it would have an effect, since it's closely tied to .format("jdbc"), which I don't use. The linked article explicitly uses .format("jdbc").
I know Lakehouse Federation is just JDBC under the hood, but the documentation doesn't specify that it passes these options through.