Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks SQL Warehouse fails when streaming ~53M rows via Python (token/session expiry)

Rose_15
New Contributor II

Hello Team,

I am facing a consistent issue when streaming a large table (~53 million rows) from a Databricks SQL Warehouse using Python (databricks-sql-connector) with OAuth authentication.

I execute a single long-running query and fetch data in batches (50,000 rows) using cursor.fetchmany(), loading the data into an external database (CockroachDB). The job runs successfully for some time but always fails after ~55–65 minutes.

Typical errors:

Token exchange failed, using external token: 'access_token'
ThriftBackend.attempt_request: Exception
databricks.sql.exc.RequestError: Error during request to server

In some runs, I also see:

CloudFetch download slower than threshold: 0.08 MB/s (threshold: 0.1 MB/s)

I have already tried:

  • Proactive OAuth token refresh and caching

  • Smaller batch sizes

  • Reusing database connections

  • Multiple retries

However, once the failure occurs, the active cursor/result set becomes invalid and the query cannot continue. Refreshing the token does not help.

This appears to be related to SQL Warehouse session lifetime and long-running result set streaming, possibly exacerbated by CloudFetch download time.

Is streaming very large result sets (10M+ rows) via Databricks SQL Warehouse supported?
Is the recommended approach to use COPY INTO / UNLOAD to external storage instead of Python streaming?

Any clarification or official guidance would be appreciated.

Thank you.

4 REPLIES

Sanjeeb2024
Valued Contributor

Hi @Rose_15, 53M rows is a big number for streaming. What is the use case? If the use case really is streaming, you possibly need to find a way to extract small batches (for example, incrementally) and load the data. This use case looks more like a batch load.

Please provide more details on the use case so that we can explore optimal solution.

Sanjeeb Mohapatra

Rose_15
New Contributor II

Hello Sanjeeb,

Thank you for your response.

This is indeed a batch data movement use case, not real-time streaming. The requirement is to perform a one-time / periodic bulk load of a large table (~53M rows) from Databricks into an external OLTP database (CockroachDB) using Python.

There is no requirement for low-latency or continuous streaming. The Python-based approach was chosen only as an extraction mechanism, not for real-time consumption.

We attempted:

  • Batch fetching (fetchmany) with small batch sizes

  • OAuth token refresh

  • Connection reuse

However, the job consistently fails after ~1 hour, indicating limitations with long-running cursors in Databricks SQL Warehouses.

Could you please confirm:

  1. Whether COPY INTO/UNLOAD to cloud storage is the recommended approach for large batch exports from Databricks?

  2. If Python-based extraction is required, is range/partition-based querying the only supported workaround?

  3. Are there any documented limits or best practices regarding maximum result set sizes or cursor lifetimes for SQL Warehouses?

Your guidance on the officially supported pattern for this use case would be very helpful.

Thank you,

Sanjeeb2024
Valued Contributor

Hi @Rose_15 - Thanks for the details. It is better to plan this out first: the number of tables, their sizes, and record counts. Then extract the files to cloud storage and reload the data from there using any mechanism. Once your extraction is complete, you no longer depend on the source server and can reload the files whenever needed. If possible, explore extracting to .parquet files, which will reduce the size as well.

 

Sanjeeb Mohapatra

SteveOstrowski
Databricks Employee
Hi @Rose_15,

The behavior you are seeing is expected for a result set of this size fetched over a single long-lived cursor. Here is what is happening and several approaches to resolve it.


WHY THE FAILURE OCCURS

The Databricks SQL Connector for Python uses a Thrift-based session to the SQL warehouse. When you call cursor.execute(), the warehouse runs the query and stages the result set. You then pull rows incrementally with fetchmany(). Two things work against you at the 53M-row scale:

1. OAuth token lifetime: The default OAuth access token expires after roughly 60 minutes. Even with proactive refresh, the connector's internal Thrift session may not seamlessly pick up the new token mid-fetch. Once the session is invalidated, the cursor and its associated result set become unrecoverable.

2. CloudFetch presigned URL expiry: When CloudFetch is enabled (the default), the warehouse returns presigned cloud storage URLs for result chunks. These URLs have their own expiration window. If your client cannot download a chunk before its URL expires, you get the "CloudFetch download slower than threshold" warning followed by a hard failure.

The combination of these two factors is why you consistently hit a wall at the 55 to 65 minute mark, and why retrying on the same cursor does not help.
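Because the dead cursor cannot be revived, any retry logic has to restart from a new query rather than from the failed result set. One way to make that restart cheap is to track the highest key already loaded and re-issue a keyed query on a fresh connection. A minimal sketch of that idea (the `fetch_page` callable and the `id` key column are assumptions for illustration, not part of the connector API):

```python
def extract_resumable(fetch_page, start_key=0, page_size=50000):
    """Pull rows in keyed pages so a failure can resume from the last key.

    fetch_page(after_key, limit) is expected to run something like
      SELECT * FROM t WHERE id > after_key ORDER BY id LIMIT limit
    on a *fresh* connection and return a list of row tuples whose
    first element is the key column.
    """
    last_key = start_key
    while True:
        rows = fetch_page(last_key, page_size)
        if not rows:
            break
        yield rows
        # The resume point survives a dead session: on failure, restart
        # the loop with start_key=last_key instead of re-reading everything.
        last_key = rows[-1][0]
```

On a token or CloudFetch failure, you catch the exception, rebuild the connection, and call this again with the last persisted key.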


RECOMMENDED APPROACH: PARTITION THE EXTRACT ON THE SERVER SIDE

Rather than pulling 53M rows through a single cursor, break the work into smaller, independently resumable queries. This avoids both the token-expiry and URL-expiry windows entirely.

Option A: Range-based partitioning using a key column

If your table has an integer or date column you can partition on (for example, an id or created_date column), issue multiple queries with WHERE clauses:

from databricks import sql
import os

def get_connection():
    return sql.connect(
        server_hostname=os.getenv("DATABRICKS_HOST"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        auth_type="databricks-oauth"
    )

partitions = [
    (0, 999999),
    (1000000, 1999999),
    (2000000, 2999999),
    # ... generate ranges to cover the full key space
]

for low, high in partitions:
    conn = get_connection()
    cursor = conn.cursor()
    cursor.execute(
        f"SELECT * FROM my_table WHERE id BETWEEN {low} AND {high}"
    )

    while True:
        batch = cursor.fetchmany(50000)
        if not batch:
            break
        write_to_cockroachdb(batch)

    cursor.close()
    conn.close()

Each partition gets a fresh connection and token, so no single query runs longer than a few minutes.
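Hand-writing the partition list does not scale to the full key space. A small helper can derive the ranges from the table's MIN(id) and MAX(id), which you would fetch up front with one cheap query. A sketch, with the chunk size of 1M ids as an assumption to tune for your data:

```python
def make_ranges(min_id, max_id, chunk=1_000_000):
    """Split [min_id, max_id] into inclusive (low, high) ranges of `chunk` ids."""
    ranges = []
    low = min_id
    while low <= max_id:
        high = min(low + chunk - 1, max_id)
        ranges.append((low, high))
        low = high + 1
    return ranges
```

Note that with a sparse key column some ranges may return far fewer rows than `chunk`; that only costs a few extra cheap queries.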

Option B: Export to cloud storage, then load externally

This is the most robust pattern for very large extracts. Write the data to cloud storage as Parquet files, then load from there into CockroachDB:

Step 1, inside a Databricks notebook or SQL warehouse:

CREATE TABLE my_catalog.my_schema.export_table
USING PARQUET
LOCATION 's3://my-bucket/exports/my_table/'
AS SELECT * FROM my_table;

Or, if you prefer to keep it as a simple write:

df = spark.table("my_table")
df.write.mode("overwrite").parquet("s3://my-bucket/exports/my_table/")

Step 2, from your Python application, read the Parquet files from S3 (using pyarrow, pandas, or DuckDB) and load into CockroachDB. This completely decouples the Databricks session from the downstream insert workload.


ADDITIONAL TUNING IF YOU STAY WITH THE CONNECTOR

If you prefer to keep using the connector directly (for smaller tables or after partitioning), these settings help:

1. Use fetchmany_arrow() instead of fetchmany(). The Arrow-based path is significantly faster for large result sets and reduces serialization overhead:

cursor.execute("SELECT * FROM my_table WHERE id BETWEEN 0 AND 999999")
while True:
    batch = cursor.fetchmany_arrow(50000)
    if batch.num_rows == 0:
        break
    # Convert to pandas or process directly
    df = batch.to_pandas()
    write_to_cockroachdb(df)

2. Increase arraysize on the cursor to improve internal buffering:

cursor.arraysize = 100000

3. If CloudFetch is causing download-speed warnings on your network, you can disable it and fall back to inline result transfer:

conn = sql.connect(
    server_hostname=os.getenv("DATABRICKS_HOST"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    auth_type="databricks-oauth",
    use_cloud_fetch=False
)

This trades off peak throughput for more predictable behavior on constrained networks.


DOCUMENTATION REFERENCES

- Databricks SQL Connector for Python: https://docs.databricks.com/aws/en/dev-tools/python-sql-connector.html
- SQL warehouse configuration (Auto Stop, sizing): https://docs.databricks.com/aws/en/sql/admin/sql-endpoints.html
- COPY INTO for loading data: https://docs.databricks.com/aws/en/sql/language-manual/delta-copy-into.html


To directly answer your closing questions: the connector can handle large result sets, but a single cursor fetch over 53M rows will likely exceed the OAuth/session lifetime. Partitioning the query into smaller ranges (Option A) or staging to cloud storage (Option B) are the two production-proven patterns for this scale.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.