Saturday - last edited Saturday
Hi Team,
I have some confusion regarding the best approach for ingesting data from Snowflake into Databricks using custom SQL queries.
While evaluating the available options, I found multiple approaches:
From my understanding, the Snowflake Spark Connector appears to provide better performance for data ingestion. However, I noticed some conflicting guidance between the Databricks and Snowflake documentation. Because of this confusion, I am considering using JDBC instead.
Databricks documentation states:
"The legacy query federation documentation has been retired and might not be updated. The configurations mentioned in this content are not officially endorsed or tested by Databricks. If Lakehouse Federation supports your source database, Databricks recommends using that instead."
On the other hand, Snowflake documentation indicates that the Snowflake Connector for Spark is supported on Databricks Runtime 4.2 and above.
This has left me with a few questions:
References:
1. Configuring Snowflake for Spark in Databricks | Snowflake Documentation
2. Read and write data from Snowflake | Databricks on AWS
Any guidance or best practices would be greatly appreciated.
Thanks!
Saturday
Hey @aharisaibabu, I did a little research and here is what I discovered:
Context first. The short answer is that the Snowflake Spark Connector still works and Snowflake still documents it, but it is no longer the strategic Databricks path. Treat it as tactical. Treat Lakehouse Federation plus Lakeflow Connect as the direction to build on.
Let me clear up the confusion first, because it matters. The disclaimer you quoted is not about the Query Federation bullet in your list. It is the archive banner on the Snowflake Spark Connector page itself (now at /archive/connectors/snowflake, marked Experimental). Databricks calls the old Spark connector approach "legacy query federation," which is a different thing from today's Lakehouse Federation (a Unity Catalog connection plus a foreign catalog). So the doc is not telling you to avoid federation. It is telling you to move off the Spark connector and onto Lakehouse Federation.
Now to your four questions.
Is it still recommended and supported? It is usable, yes. Snowflake maintains the connector (net.snowflake:spark-snowflake). But Databricks has archived the page, flagged it Experimental, and points you to Lakehouse Federation. So: supported by Snowflake, de-emphasized by Databricks.
Known limitations and concerns. A few to keep in mind:
For a custom-query pull, Spark connector or JDBC? If those are your only two choices, prefer the Spark connector over raw JDBC. It parallelizes through Snowflake stage unloads and pushes down by default, whereas plain JDBC is single-connection unless you manually set partitionColumn, lowerBound, upperBound, and numPartitions. But the real Databricks answer is "neither for the long term." Run your custom SQL against a Snowflake foreign catalog through Lakehouse Federation (filters, joins, aggregates, and window functions push down) and materialize into Delta with CTAS.
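To make the "manually set partitionColumn, lowerBound, upperBound, and numPartitions" point concrete, here is a minimal sketch of what a parallel JDBC read of a custom query might look like. The helper names, the example URL, and the query are hypothetical, and credentials are deliberately left out (they would normally come from a secret scope). One detail worth knowing: Spark's JDBC source does not accept the `query` option together with `partitionColumn`, so the custom SQL is wrapped as an aliased subquery in `dbtable`.

```python
from typing import Dict


def jdbc_partitioned_options(
    url: str,
    custom_sql: str,
    partition_column: str,
    lower_bound: int,
    upper_bound: int,
    num_partitions: int,
) -> Dict[str, str]:
    """Build options for a parallel Spark JDBC read of a custom query.

    Hypothetical helper for illustration; credentials are not included.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "driver": "net.snowflake.client.jdbc.SnowflakeDriver",
        # `query` cannot be combined with `partitionColumn`, so the custom
        # SQL goes into `dbtable` as an aliased subquery instead.
        "dbtable": f"({custom_sql}) AS src",
        # The partition column must be numeric, date, or timestamp.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def read_jdbc(spark, options: Dict[str, str]):
    """Run the read on an existing SparkSession with the JDBC driver installed."""
    return spark.read.format("jdbc").options(**options).load()
```

Note that the bounds only decide how the range is split across partitions; they do not filter rows, so values outside the range still land in the first or last partition.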
The preferred long-term architecture. This is the documented decision rule:
For your specific goal (custom-query incremental ingestion) here is the path that works today. There is not a dedicated GA Snowflake connector in Lakeflow Connect yet (it is listed as upcoming), but Lakeflow Connect's query-based connectors support all Lakehouse Federation data sources through foreign catalog ingestion. So you can do this in two steps: (1) create a Snowflake Lakehouse Federation connection and foreign catalog, then (2) build a query-based ingestion pipeline against it, using a cursor column (a single monotonically increasing column like updated_at or row_id) for incremental loads. That gets you serverless, UC-governed ingestion with no gateway and no staging volume.
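As a rough illustration of step (1) and of the cursor-column idea, the sketch below builds the SQL statements involved. Every object name (connection, catalog, tables, secret scope) is hypothetical, and the incremental statement is a hand-rolled version of the cursor pattern rather than the Lakeflow Connect pipeline definition itself, which you would configure through the product. The statements could be executed with `spark.sql(...)` in a notebook.

```python
from typing import Optional


def create_connection_sql(name: str, host: str, warehouse: str, user: str,
                          secret_scope: str, secret_key: str) -> str:
    """Unity Catalog connection to Snowflake; password is read from a secret."""
    return (
        f"CREATE CONNECTION IF NOT EXISTS {name} TYPE snowflake\n"
        f"OPTIONS (\n"
        f"  host '{host}',\n"
        f"  port '443',\n"
        f"  sfWarehouse '{warehouse}',\n"
        f"  user '{user}',\n"
        f"  password secret('{secret_scope}', '{secret_key}')\n"
        f")"
    )


def create_foreign_catalog_sql(catalog: str, connection: str, database: str) -> str:
    """Foreign catalog that mirrors one Snowflake database."""
    return (
        f"CREATE FOREIGN CATALOG IF NOT EXISTS {catalog} "
        f"USING CONNECTION {connection} OPTIONS (database '{database}')"
    )


def incremental_insert_sql(target: str, source: str, cursor_column: str,
                           last_value: Optional[str]) -> str:
    """Append only rows whose cursor value is newer than the last one loaded.

    With no stored cursor value (first run) the full table is loaded.
    """
    base = f"INSERT INTO {target} SELECT * FROM {source}"
    if last_value is None:
        return base
    return f"{base} WHERE {cursor_column} > '{last_value}'"
```

The cursor pattern only picks up inserts and updates that advance the cursor column; rows deleted in Snowflake are not detected this way.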
Two honest caveats. Query-based connectors are in Preview, so confirm the current release status before you put a production SLA on it. And for very large one-shot bulk extracts, the Spark connector's parallel stage unload can still beat Federation's JDBC reads. So benchmark if raw full-load throughput is your bottleneck, but weigh that against losing Databricks support and UC governance.
Takeaway: the Spark connector still works, but treat it as tactical only. Build on Lakehouse Federation now, and use Lakeflow Connect query-based ingestion through the Snowflake foreign catalog for your scheduled incremental loads. Don't anchor a new long-term architecture to the archived connector.
Cheers, Lou.
Sources I referenced for my research:
Saturday
Hi Louis,
Thanks for the detailed explanation and research.
I have a follow-up question regarding Lakehouse Federation. When I review the documentation, Databricks describes Query Federation as being intended for scenarios such as on-demand reporting, proof-of-concept work, exploratory ETL, and incremental migrations. The documentation also explicitly states that Query Federation is meant for use cases where "you don't want to ingest data into Databricks."
Given that guidance, I'm trying to reconcile it with the recommendation to use Lakehouse Federation as part of a long-term ingestion strategy from Snowflake to Databricks.
For a production use case where the objective is to ingest large volumes of data from Snowflake into Delta tables using custom SQL queries and scheduled pipelines, would you still recommend the Lakehouse Federation + Lakeflow Connect approach over the Snowflake Spark Connector?
My concern is that Federation appears to be positioned primarily as a query/access layer rather than a high-volume ingestion mechanism. Is the expectation that Lakeflow Connect effectively addresses that gap and therefore becomes the preferred long-term architecture, or are there scenarios where the Snowflake Spark Connector remains the better choice for large-scale ingestion workloads?
I'd appreciate your perspective on how Databricks intends these technologies to be used together for enterprise-scale ingestion patterns.
References:
1. https://docs.databricks.com/aws/en/query-federation/database-federation
Saturday
Thanks for raising this question. I recently evaluated similar options for Snowflake-to-Databricks ingestion and would like to share my perspective.
From my understanding, the choice depends on whether your primary focus is performance, ease of management, or long-term architecture.
Yes, the Snowflake Spark Connector is still actively supported by Snowflake and remains one of the most commonly used approaches for moving data between Snowflake and Spark-based platforms, including Databricks.
I believe the Databricks documentation you referenced is primarily referring to legacy query federation guidance rather than the Snowflake Spark Connector itself.
Snowflake continues to maintain and document the connector:
Snowflake Connector for Spark:
https://docs.snowflake.com/en/user-guide/spark-connector
Databricks also provides guidance on reading and writing Snowflake data:
https://docs.databricks.com/aws/en/archive/connectors/snowflake
For large-scale ingestion workloads, I would generally prefer the Snowflake Spark Connector over JDBC because:
Better parallelism for data movement
Query pushdown support
Higher throughput for larger datasets
Optimized integration with Spark workloads
JDBC works well for:
Smaller datasets
Metadata queries
Administrative operations
Simpler integrations where performance is not a primary concern
For custom-query-based ingestion specifically, the Snowflake Spark Connector supports executing custom SQL while still benefiting from connector optimizations.
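A minimal sketch of that pattern, assuming a Databricks cluster where the connector is available under the short format name `snowflake` (elsewhere the full name `net.snowflake.spark.snowflake` is typically used). The helper names and the target table are hypothetical; the custom SQL is passed through the connector's `query` option in place of `dbtable`.

```python
from typing import Dict


def snowflake_connector_options(
    account_url: str,
    user: str,
    password: str,
    database: str,
    schema: str,
    warehouse: str,
    custom_sql: str,
) -> Dict[str, str]:
    """Options for reading the result of a custom query via the Spark connector."""
    return {
        "sfUrl": account_url,
        "sfUser": user,
        "sfPassword": password,  # in practice, fetch this from a secret scope
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
        # `query` runs arbitrary SQL in Snowflake; use `dbtable` for a whole table.
        "query": custom_sql,
    }


def ingest_to_delta(spark, options: Dict[str, str], target_table: str) -> None:
    """Read through the connector and append the result to a Delta table."""
    df = spark.read.format("snowflake").options(**options).load()
    df.write.format("delta").mode("append").saveAsTable(target_table)
```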
A few areas to consider:
Connector version compatibility with Spark/Scala runtime versions
Dependency management
Network connectivity and security configuration between Snowflake and Databricks
Additional operational overhead compared to fully managed ingestion solutions
That said, many production implementations continue to use the connector successfully for high-volume ingestion.
If I were designing a new solution today, I would consider:
Short to Medium Term
Snowflake → Snowflake Spark Connector → Databricks Delta Tables
Long Term
Evaluate Lakeflow Connect as it matures and reaches GA, since Databricks is clearly investing in managed ingestion and replication capabilities.
Lakeflow Connect documentation:
https://docs.databricks.com/en/ingestion/lakeflow-connect/index.html
For a production-grade ingestion framework that requires custom SQL extraction and high-volume data movement, I would currently choose the Snowflake Spark Connector over JDBC.
JDBC is certainly a viable option, but in most cases I would view it as a simpler connectivity mechanism rather than the preferred approach for large-scale ingestion workloads.
Would be interested to hear from others who have recently compared Snowflake Spark Connector, JDBC, and Lakeflow Connect in production environments.
Saturday
Hi Souryabarnwal,
Thank you for sharing your perspective on both connectors. Your comparison of the Snowflake Spark Connector and JDBC was very helpful.