Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks SQL Warehouse fails when streaming ~53M rows via Python (token/session expiry)

Rose_15
New Contributor II

Hello Team,

I am facing a consistent issue when streaming a large table (~53 million rows) from a Databricks SQL Warehouse using Python (databricks-sql-connector) with OAuth authentication.

I execute a single long-running query and fetch data in batches of 50,000 rows using cursor.fetchmany(), loading the data into an external database (CockroachDB). The job runs successfully for some time but always fails after ~55–65 minutes.
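The fetch loop described above can be sketched as follows. The batching helper works against any DB-API 2.0 cursor; here sqlite3 stands in for the SQL Warehouse connection purely so the sketch is self-contained — in the real job the cursor would come from databricks.sql.connect(...).cursor(), and the table name and loader are placeholders:

```python
import sqlite3

def iter_batches(cursor, batch_size):
    """Yield successive fetchmany() batches until the result set is drained.
    Works with any DB-API 2.0 cursor, including databricks-sql-connector's."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows

# Demo with sqlite3 standing in for the warehouse connection; in the real
# job this would be databricks.sql.connect(server_hostname=..., http_path=...,
# access_token=...) and each batch would be inserted into CockroachDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(7)])
cur = conn.cursor()
cur.execute("SELECT id FROM t ORDER BY id")
batches = list(iter_batches(cur, batch_size=3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Because the single cursor stays open for the entire ~53M-row drain, this pattern is exposed to whatever session/token lifetime the warehouse enforces, which matches the ~1-hour failure described.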

Typical errors:

Token exchange failed, using external token: 'access_token'
ThriftBackend.attempt_request: Exception
databricks.sql.exc.RequestError: Error during request to server

In some runs, I also see:

CloudFetch download slower than threshold: 0.08 MB/s (threshold: 0.1 MB/s)

I have already tried:

  • Proactive OAuth token refresh and caching

  • Smaller batch sizes

  • Reusing database connections

  • Multiple retries

However, once the failure occurs, the active cursor/result set becomes invalid and the query cannot continue. Refreshing the token does not help.

This appears to be related to SQL Warehouse session lifetime and long-running result set streaming, possibly exacerbated by CloudFetch download time.

Is streaming very large result sets (10M+ rows) via Databricks SQL Warehouse supported?
Is the recommended approach to use COPY INTO / UNLOAD to external storage instead of Python streaming?

Any clarification or official guidance would be appreciated.

Thank you.

3 REPLIES

Sanjeeb2024
Contributor III

Hi @Rose_15, 53M rows via streaming is a big number. What is the use case? If the use case really is streaming, you may need to find a way to extract small batches (e.g., incrementally) and load the data. This looks more like a batch-load use case.

Please provide more details on the use case so that we can explore an optimal solution.

Sanjeeb Mohapatra

Rose_15
New Contributor II

Hello Sanjeeb,

Thank you for your response.

This is indeed a batch data movement use case, not real-time streaming. The requirement is to perform a one-time / periodic bulk load of a large table (~53M rows) from Databricks into an external OLTP database (CockroachDB) using Python.

There is no requirement for low-latency or continuous streaming. The Python-based approach was chosen only as an extraction mechanism, not for real-time consumption.

We attempted:

  • Batch fetching (fetchmany) with small batch sizes

  • OAuth token refresh

  • Connection reuse

However, the job consistently fails after ~1 hour, indicating limitations with long-running cursors in Databricks SQL Warehouses.

Could you please confirm:

  1. Whether COPY INTO/UNLOAD to cloud storage is the recommended approach for large batch exports from Databricks?

  2. If Python-based extraction is required, is range/partition-based querying the only supported workaround?

  3. Are there any documented limits or best practices regarding maximum result set sizes or cursor lifetimes for SQL Warehouses?
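The range/partition-based workaround asked about in question 2 can be sketched in pure Python: split the table's key space into bounded WHERE clauses so each query is short-lived and finishes well inside the session lifetime. Table name, key column, and chunk size below are hypothetical:

```python
def id_ranges(min_id, max_id, chunk):
    """Split [min_id, max_id] into consecutive half-open [lo, hi) chunks."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk, max_id + 1)
        yield lo, hi
        lo = hi

def range_queries(table, key, min_id, max_id, chunk):
    """Build one short-lived query per key range instead of holding a
    single 53M-row cursor open for the whole extraction."""
    return [
        f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} < {hi}"
        for lo, hi in id_ranges(min_id, max_id, chunk)
    ]

# Hypothetical 53M-row table split into 5M-row slices -> 11 queries.
queries = range_queries("big_table", "id", 0, 52_999_999, 5_000_000)
print(len(queries))  # 11
```

Each slice can be fetched, loaded, and checkpointed independently, so a token or session expiry only costs the current slice rather than invalidating the whole result set.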

Your guidance on the officially supported pattern for this use case would be very helpful.

Thank you,

Sanjeeb2024
Contributor III

Hi @Rose_15 - Thanks for the details. It is better to start with some planning: the number of tables, their size, and the number of records. Then extract the files to cloud storage and reload the data using any mechanism. Once your extraction is complete, you will no longer depend on the source server and can always reload the files into Databricks tables. If possible, explore extracting to .parquet files, which will reduce the size as well.
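The storage-first pattern suggested above can be a single statement that the warehouse executes once, so no long-lived client cursor is involved. A minimal sketch that builds a CTAS writing Parquet files to an external location (catalog, table, and bucket path are placeholders, and the target path must be an accessible external location):

```python
def export_statement(source_table, export_table, target_path):
    """Build a one-shot CREATE TABLE ... AS SELECT that lands the source
    table as Parquet files in cloud storage; the warehouse does the write."""
    return (
        f"CREATE TABLE {export_table} "
        f"USING PARQUET LOCATION '{target_path}' "
        f"AS SELECT * FROM {source_table}"
    )

stmt = export_statement(
    "main.default.big_table",            # hypothetical source
    "main.default.big_table_export",     # hypothetical export table
    "s3://my-bucket/exports/big_table/", # hypothetical external location
)
# Run `stmt` once via the connector; then bulk-load CockroachDB from the
# Parquet files with its own import tooling instead of row-by-row inserts.
```

This moves the 53M-row transfer off the Python client entirely: the slow path (result-set streaming through CloudFetch) is replaced by a server-side write plus a file-based bulk load.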


Sanjeeb Mohapatra