Hello @dbdev, I did some digging and here are a few suggestions.
The `fetchSize` parameter in Lakehouse Federation is currently only available through SQL syntax using the `WITH` clause, as documented in the performance recommendations. Unfortunately, there is no native PySpark API to specify `fetchSize` when reading from federated catalogs directly.

Workarounds for PySpark
There are two approaches you can consider:
1. Use `spark.sql()` with the `WITH` clause
You can execute SQL with the `fetchSize` parameter and convert the result to a PySpark DataFrame for further transformations:
```python
df = spark.sql("""
SELECT *
FROM mySqlCatalog.schema.table
WITH ('fetchSize' 100000)
""")
# Now perform your PySpark transformations
df_transformed = df.filter(...).select(...)
```
This approach allows you to benefit from the `fetchSize` optimization while still performing column transformations in PySpark.
2. Use the JDBC format directly (with caveats)
While Lakehouse Federation uses JDBC under the hood, using the traditional `spark.read.format("jdbc")` approach bypasses the federation layer entirely. This means you would need to:
- Provide direct JDBC connection strings
- Manage credentials separately (not using Unity Catalog connections)
- Lose the benefits of Unity Catalog governance
This is generally not recommended if you're already using Lakehouse Federation, but for completeness, a rough sketch is shown below.
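If you do decide to go this route, a minimal sketch might look like the following. The connection URL, credentials, and table name are placeholders, not values from your environment, and note that the classic Spark JDBC reader documents the option in lowercase as `fetchsize`:

```python
# Minimal sketch of the classic JDBC path -- this bypasses Unity Catalog governance.
# All connection details below are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<host>:3306/<database>")
    .option("dbtable", "schema.table")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("fetchsize", "100000")  # classic JDBC reader option (lowercase)
    .load()
)
```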
Important Considerations
- Runtime requirements: The `fetchSize` parameter with the `WITH` clause requires Databricks Runtime 16.1 or above, or SQL warehouses on Pro/Serverless using 2024.50 or later
- Recommended value: Databricks recommends using a large `fetchSize` value (e.g., 100,000) to minimize round trips and improve performance
- Supported connectors: This optimization works with JDBC-based connectors including MySQL, PostgreSQL, SQL Server, Azure Synapse, Oracle, Redshift, Teradata, and Salesforce Data 360
Additional Performance Options
If you need even better performance for large datasets, consider enabling parallel reads with additional parameters like `numPartitions`, `partitionColumn`, `lowerBound`, and `upperBound` (available in DBR 17.1+). For Snowflake specifically, you can use the `partition_size_in_mb` parameter instead.
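For illustration, a sketch of combining `fetchSize` with the parallel-read parameters via `spark.sql()` is below. It assumes the `WITH` clause accepts these keys in the same quoted form as `fetchSize`; the column name and bound values are placeholders, so verify the exact syntax against the DBR 17.1+ documentation:

```python
# Sketch only -- assumes the WITH clause takes these keys the same way as 'fetchSize'.
# 'id' and the bound values are placeholders; pick a numeric column with an even spread.
df = spark.sql("""
    SELECT *
    FROM mySqlCatalog.schema.table
    WITH (
      'fetchSize' 100000,
      'numPartitions' 8,
      'partitionColumn' 'id',
      'lowerBound' 1,
      'upperBound' 10000000
    )
""")
```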
The SQL-then-transform approach is currently the most practical solution for combining `fetchSize` optimization with PySpark transformations when using Lakehouse Federation.
Hope this helps, Louis.