Databricks Community

aharisaibabu · 3 weeks ago

Hi Team,

I have some confusion regarding the best approach for ingesting data from Snowflake into Databricks using custom SQL queries.

While evaluating the available options, I found multiple approaches:

Snowflake Spark Connector
JDBC
Query Federation
Lakeflow Connect (currently in Preview)

From my understanding, the Snowflake Spark Connector appears to provide better performance for data ingestion. However, I noticed some conflicting guidance between the Databricks and Snowflake documentation. Because of this confusion, I am leaning toward using JDBC.

Databricks documentation states:

"The legacy query federation documentation has been retired and might not be updated. The configurations mentioned in this content are not officially endorsed or tested by Databricks. If Lakehouse Federation supports your source database, Databricks recommends using that instead."

On the other hand, Snowflake documentation indicates that the Snowflake Connector for Spark is supported on Databricks Runtime 4.2 and above.

This has left me with a few questions:

Is the Snowflake Spark Connector still considered a recommended and supported approach for reading data from Snowflake into Databricks?
Are there any known limitations or concerns when using the Snowflake Spark Connector with current Databricks runtimes?
For custom-query-based ingestion from Snowflake, would Databricks recommend using the Snowflake Spark Connector or JDBC?
What is the preferred long-term architecture considering future support and performance?

References:

1. Configuring Snowflake for Spark in Databricks | Snowflake Documentation

2. Read and write data from Snowflake | Databricks on AWS

Any guidance or best practices would be greatly appreciated.

Thanks!

Louis_Frolio · 3 weeks ago

Hey @aharisaibabu, I did a little research and here is what I discovered:

Context first. The short answer is that the Snowflake Spark Connector still works and Snowflake still documents it, but it is no longer the strategic Databricks path. Treat it as tactical. Treat Lakehouse Federation plus Lakeflow Connect as the direction to build on.

Let me clear up one point of confusion, because it matters. The disclaimer you quoted is not about the Query Federation bullet in your list. It is the archive banner on the Snowflake Spark Connector page itself (now at /archive/connectors/snowflake, marked Experimental). Databricks calls the old Spark connector approach "legacy query federation," which is a different thing from today's Lakehouse Federation (a Unity Catalog connection plus a foreign catalog). So the doc is not telling you to avoid federation. It is telling you to move off the Spark connector and onto Lakehouse Federation.

Now to your four questions.

Is it still recommended and supported? It is usable, yes. Snowflake maintains the connector (net.snowflake:spark-snowflake). But Databricks has archived the page, flagged it Experimental, and points you to Lakehouse Federation. So: supported by Snowflake, de-emphasized by Databricks.
Known limitations and concerns. A few to keep in mind:

Databricks will not test or troubleshoot it. You own the library version, and the bundled one can be stale on older runtimes like 10.4 LTS.
The config syntax differs between DBR 11.3 LTS and up versus 10.4 and below.
Write column order is not preserved, so you have to use columnmap.
Types can shift on round trips (INTEGER to NUMBER or DECIMAL).
Identifiers come back uppercase by default, so your schema gets uppercased.
Jobs running longer than 36 hours should use an external location to exchange data.
It is a direct format("snowflake") read, so it is not natively governed by Unity Catalog (no UC lineage or access control over the external data). Put credentials in secrets.

For a custom-query pull, Spark connector or JDBC? If those are your only two choices, prefer the Spark connector over raw JDBC. It parallelizes through Snowflake stage unloads and pushes down by default, whereas plain JDBC is single-connection unless you manually set partitionColumn, lowerBound, upperBound, and numPartitions. But the real Databricks answer is "neither for the long term." Run your custom SQL against a Snowflake foreign catalog through Lakehouse Federation (filters, joins, aggregates, and window functions push down) and materialize into Delta with CTAS.
The preferred long-term architecture. This is the documented decision rule:

For live, no-copy access, ad-hoc reporting, a proof of concept, exploratory ETL, or UC governance, use Lakehouse Federation. Snowflake is a supported source. It requires DBR 13.3 LTS and up, or pro/serverless SQL.
For persistent, scheduled, high-volume, lower-latency managed ingestion, use Lakeflow Connect. Databricks states that when a source supports both, Lakeflow Connect is preferred when volume and latency matter.
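
As a rough illustration of the JDBC point above, here is a minimal sketch of the options a partitioned Spark JDBC read needs. The helper name jdbc_partitioned_options, the account URL, and the table and column names are hypothetical. One thing to watch: Spark's JDBC source does not allow the query option together with partitionColumn, so the custom SQL is wrapped as an aliased subquery in dbtable instead.

```python
def jdbc_partitioned_options(url, custom_sql, partition_column,
                             lower_bound, upper_bound, num_partitions):
    """Build Spark JDBC reader options for a parallel read of a custom query.

    Hypothetical helper for illustration; it only assembles the option map.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    return {
        "url": url,
        "driver": "net.snowflake.client.jdbc.SnowflakeDriver",
        # `query` cannot be combined with `partitionColumn`, so the custom
        # SQL goes through `dbtable` as an aliased subquery.
        "dbtable": f"({custom_sql}) AS src",
        # The partition column must be numeric, date, or timestamp.
        "partitionColumn": partition_column,
        # The bounds only set the partition stride; they do not filter rows.
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


# Example values are placeholders, not a real account or table.
opts = jdbc_partitioned_options(
    url="jdbc:snowflake://<account>.snowflakecomputing.com/",
    custom_sql="SELECT id, amount FROM orders WHERE region = 'EU'",
    partition_column="id",
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=8,
)
```

On a cluster with the Snowflake JDBC driver installed, this map would be passed as spark.read.format("jdbc").options(**opts).load(), with user and password added from a secret scope rather than hard-coded. Without the four partitioning options, the same read runs over a single connection.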

For your specific goal (custom-query incremental ingestion) here is the path that works today. There is not a dedicated GA Snowflake connector in Lakeflow Connect yet (it is listed as upcoming), but Lakeflow Connect's query-based connectors support all Lakehouse Federation data sources through foreign catalog ingestion. So you can do this in two steps: (1) create a Snowflake Lakehouse Federation connection and foreign catalog, then (2) build a query-based ingestion pipeline against it, using a cursor column (a single monotonically increasing column like updated_at or row_id) for incremental loads. That gets you serverless, UC-governed ingestion with no gateway and no staging volume.
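
To make the cursor idea concrete, here is a hand-rolled sketch of what incremental selection on a cursor column amounts to. It is not the Lakeflow Connect API or its implementation; the managed pipeline tracks the cursor for you. The foreign catalog table name (snowflake_cat.sales.orders) and both helper functions are hypothetical.

```python
def incremental_select(source_table, cursor_column, last_cursor=None):
    """Return SQL that reads only rows past the stored cursor value.

    Illustration only: identifiers and values are assumed to be trusted,
    so no escaping is done here.
    """
    base = f"SELECT * FROM {source_table}"
    if last_cursor is None:
        # First run: nothing has been loaded yet, so read everything.
        return base
    if isinstance(last_cursor, (int, float)):
        literal = str(last_cursor)        # numeric cursor such as row_id
    else:
        literal = f"'{last_cursor}'"      # timestamp-like cursor such as updated_at
    return f"{base} WHERE {cursor_column} > {literal}"


def next_cursor(rows, cursor_column, previous=None):
    """Advance the cursor to the largest value seen in the loaded rows."""
    values = [row[cursor_column] for row in rows]
    return max(values) if values else previous


# Hypothetical three-level name: <foreign catalog>.<schema>.<table>.
first_run = incremental_select("snowflake_cat.sales.orders", "updated_at")
later_run = incremental_select(
    "snowflake_cat.sales.orders", "updated_at", "2024-01-01 00:00:00"
)
```

This is why the cursor column has to increase monotonically: a row written with a value at or below the stored cursor would never be picked up by a later run.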

Two honest caveats. Query-based connectors are in Preview, so confirm the current release status before you put a production SLA on them. And for very large one-shot bulk extracts, the Spark connector's parallel stage unload can still beat Federation's JDBC reads. So benchmark if raw full-load throughput is your bottleneck, but weigh that against losing Databricks support and UC governance.

Takeaway: the Spark connector still works, but treat it as tactical only. Build on Lakehouse Federation now, and use Lakeflow Connect query-based ingestion through the Snowflake foreign catalog for your scheduled incremental loads. Don't anchor a new long-term architecture to the archived connector.

Cheers, Lou.

Sources I referenced for my research:

Read and write data from Snowflake, archived and Experimental (the page with the disclaimer): https://docs.databricks.com/aws/en/archive/connectors/snowflake
What is Lakehouse Federation?: https://docs.databricks.com/aws/en/query-federation/
What is query federation?: https://docs.databricks.com/aws/en/query-federation/database-federation
Run federated queries on Snowflake: https://docs.databricks.com/aws/en/query-federation/snowflake
What is Lakeflow Connect? (Federation versus Lakeflow decision rule): https://docs.databricks.com/aws/en/ingestion/overview
Managed connectors in Lakeflow Connect: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/
Query-based connectors overview: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/query-based-overview
Create a query-based ingestion pipeline: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/query-based-pipeline
Configuring Snowflake for Spark in Databricks (Snowflake docs): https://docs.snowflake.com/en/user-guide/spark-connector-databricks

aharisaibabu · 3 weeks ago

Hi Louis,

Thanks for the detailed explanation and research.

I have a follow-up question regarding Lakehouse Federation. When I review the documentation, Databricks describes Query Federation as being intended for scenarios such as on-demand reporting, proof-of-concept work, exploratory ETL, and incremental migrations. The documentation also explicitly states that Query Federation is meant for use cases where "you don't want to ingest data into Databricks."

Given that guidance, I'm trying to reconcile it with the recommendation to use Lakehouse Federation as part of a long-term ingestion strategy from Snowflake to Databricks.

For a production use case where the objective is to ingest large volumes of data from Snowflake into Delta tables using custom SQL queries and scheduled pipelines, would you still recommend the Lakehouse Federation + Lakeflow Connect approach over the Snowflake Spark Connector?

My concern is that Federation appears to be positioned primarily as a query/access layer rather than a high-volume ingestion mechanism. Is the expectation that Lakeflow Connect effectively addresses that gap and therefore becomes the preferred long-term architecture, or are there scenarios where the Snowflake Spark Connector remains the better choice for large-scale ingestion workloads?

I'd appreciate your perspective on how Databricks intends these technologies to be used together for enterprise-scale ingestion patterns.

References:
1. https://docs.databricks.com/aws/en/query-federation/database-federation

souryabarnwal · 3 weeks ago

Thanks for raising this question. I recently evaluated similar options for Snowflake-to-Databricks ingestion and would like to share my perspective.

From my understanding, the choice depends on whether your primary focus is performance, ease of management, or long-term architecture.

1. Is the Snowflake Spark Connector still supported and recommended?

Yes, the Snowflake Spark Connector is still actively supported by Snowflake and remains one of the most commonly used approaches for moving data between Snowflake and Spark-based platforms, including Databricks.

I believe the Databricks documentation you referenced is primarily referring to legacy query federation guidance rather than the Snowflake Spark Connector itself.

Snowflake continues to maintain and document the connector:

Snowflake Connector for Spark:
https://docs.snowflake.com/en/user-guide/spark-connector

Databricks also provides guidance on reading and writing Snowflake data:
https://docs.databricks.com/aws/en/archive/connectors/snowflake

2. Snowflake Spark Connector vs JDBC

For large-scale ingestion workloads, I would generally prefer the Snowflake Spark Connector over JDBC because:

Better parallelism for data movement
Query pushdown support
Higher throughput for larger datasets
Optimized integration with Spark workloads

JDBC works well for:

Smaller datasets
Metadata queries
Administrative operations
Simpler integrations where performance is not a primary concern

For custom-query-based ingestion specifically, the Snowflake Spark Connector supports executing custom SQL while still benefiting from connector optimizations.
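
For reference, a minimal sketch of a custom-query read with the connector, using the option names from the Snowflake documentation (sfURL, sfUser, sfPassword, sfDatabase, sfSchema, sfWarehouse, plus query for the custom SQL). The helper name and the placeholder values are hypothetical. As noted earlier in the thread, the config syntax differs between Databricks runtime versions, so verify the option names against the runtime you are on.

```python
def snowflake_connector_options(url, user, password, database,
                                schema, warehouse, custom_sql):
    """Assemble Snowflake Spark Connector options for a custom-query read.

    Hypothetical helper for illustration; it only builds the option map.
    """
    return {
        "sfURL": url,
        "sfUser": user,
        # In practice, read this from a secret scope instead of a literal.
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
        # `query` carries the custom SQL; use `dbtable` instead to read a
        # whole table.
        "query": custom_sql,
    }


# Example values are placeholders, not a real account or table.
sf_opts = snowflake_connector_options(
    url="<account>.snowflakecomputing.com",
    user="<user>",
    password="<password-from-secret-scope>",
    database="SALES_DB",
    schema="PUBLIC",
    warehouse="LOAD_WH",
    custom_sql="SELECT id, amount FROM orders WHERE region = 'EU'",
)
```

On Databricks the map would be passed as spark.read.format("snowflake").options(**sf_opts).load(), and the resulting DataFrame written to a Delta table.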

3. Any limitations with the Snowflake Spark Connector?

A few areas to consider:

Connector version compatibility with Spark/Scala runtime versions
Dependency management
Network connectivity and security configuration between Snowflake and Databricks
Additional operational overhead compared to fully managed ingestion solutions

That said, many production implementations continue to use the connector successfully for high-volume ingestion.

4. What would I recommend for the long term?

If I were designing a new solution today, I would consider:

Short to Medium Term
Snowflake → Snowflake Spark Connector → Databricks Delta Tables

Long Term
Evaluate Lakeflow Connect as it matures and reaches GA, since Databricks is clearly investing in managed ingestion and replication capabilities.

Lakeflow Connect documentation:
https://docs.databricks.com/en/ingestion/lakeflow-connect/index.html

My Personal Take

For a production-grade ingestion framework that requires custom SQL extraction and high-volume data movement, I would currently choose the Snowflake Spark Connector over JDBC.

JDBC is certainly a viable option, but in most cases I would view it as a simpler connectivity mechanism rather than the preferred approach for large-scale ingestion workloads.

Would be interested to hear from others who have recently compared Snowflake Spark Connector, JDBC, and Lakeflow Connect in production environments.