Data Engineering

Unable to retrieve all rows of a Delta table using the SQL endpoint of an interactive cluster

fly_high_five
New Contributor II

Hi,

I am trying to query a table through the JDBC endpoint of an interactive cluster, connecting with DBeaver. When I export a small subset of the data (2,000-8,000 rows), it works fine and exports the data. However, when I try to export all rows of the table, it fails with a request timeout. The table has almost 1 million rows.

I tried the same thing from a Python application, exporting the data to CSV, and got the same result: it succeeds for a small number of rows but fails for the entire table.

Can someone try this and tell me whether they see the same issue, and help me understand why it happens? Note: I'm using non-UC ADB (hive_metastore).

 

import csv
from databricks import sql

with sql.connect(server_hostname="databricks_hostname",
                 http_path="sql/protocolv1/o/databricks_details",
                 auth_type="databricks-oauth") as connection:

    # Open a cursor
    with connection.cursor() as cursor:
        # Execute query to fetch all rows (LIMIT is set above the ~1M row count)
        cursor.execute("SELECT * FROM schema.table_name LIMIT 2000000")
        result = cursor.fetchall()

        # Save results to a CSV file
        with open("output.csv", mode="w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
           
            # Write header (column names)
            column_names = [desc[0] for desc in cursor.description]
            writer.writerow(column_names)
           
            # Write data rows
            writer.writerows(result)

print("Data saved to output.csv")

5 REPLIES

fly_high_five
New Contributor II

Hi,

I am using the latest JDBC driver, 2.7.3 (https://www.databricks.com/spark/jdbc-drivers-archive).

My JDBC URL comes from the JDBC endpoint of the interactive cluster:

jdbc:databricks://adb-{workspace_id}.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/{workspace_id}/{cluster_id};AuthMech=11;Auth_Flow=2;TokenCachePassPhrase=U2MToken;EnableTokenCache=0

szymon_dybczak
Esteemed Contributor III
(Accepted solution)

If this is related to the following topic: Exposing Data for Consumers in non-UC ADB - Databricks Community - 130954, then:

"Looking at the error I noticed the weird name of the storage account. So now I'm thinking that your JDBC connection is using Cloud Fetch."

From the Databricks documentation on Cloud Fetch, "a capability that fetches query results through the cloud storage that is set up in your Azure Databricks deployment":

"Query results are uploaded to an internal DBFS storage location as Arrow-serialized files of up to 20 MB. When the driver sends fetch requests after query completion, Azure Databricks generates and returns shared access signatures to the uploaded files. The JDBC driver then uses the URLs to download the results directly from DBFS."


Can you add the following option to your JDBC URL? It will disable Cloud Fetch. Once you set this option, try again.

EnableQueryResultDownload=0
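
If the Python connector runs into the same timeout, it may be using Cloud Fetch under the hood as well. A minimal sketch of disabling it there too, assuming your installed version of the databricks-sql-connector supports the use_cloud_fetch argument on sql.connect() (this argument is not mentioned in the thread, so verify it against your connector version first):

from databricks import sql

# Sketch only: use_cloud_fetch is an assumption to verify against the
# installed databricks-sql-connector version.
with sql.connect(server_hostname="databricks_hostname",
                 http_path="sql/protocolv1/o/databricks_details",
                 auth_type="databricks-oauth",
                 use_cloud_fetch=False) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM schema.table_name LIMIT 2000000")
        rows = cursor.fetchall()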

Hi @szymon_dybczak,
Your recommendation worked. I modified the JDBC URL to:

jdbc:databricks://adb-{workspace_id}.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/{workspace_id}/{cluster_id};AuthMech=11;Auth_Flow=2;TokenCachePassPhrase=U2MToken;EnableTokenCache=0;EnableQueryResultDownload=0

I also tried another solution, which is essentially the same as what @szymon_dybczak recommended. The solution is attached.

szymon_dybczak
Esteemed Contributor III

Great that it worked, @fly_high_five. I'm happy I could help.

WiliamRosa
New Contributor III

Hi @fly_high_five,

I found these references for this situation; see if they help: increase the SocketTimeout in the JDBC connection (Databricks KB, "Best practices when using JDBC with Databricks SQL": https://kb.databricks.com/dbsql/job-timeout-when-connecting-to-a-sql-endpoint-over-jdbc), and replace fetchall() with fetchmany() while adjusting cursor.arraysize (Databricks SQL Connector for Python docs: https://docs.databricks.com/en/dev-tools/python-sql-connector.html). A sketch of the fetchmany() approach follows below.
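
A minimal sketch of that batched export, reusing the placeholder connection details from the original post (the batch size is illustrative):

import csv
from databricks import sql

with sql.connect(server_hostname="databricks_hostname",
                 http_path="sql/protocolv1/o/databricks_details",
                 auth_type="databricks-oauth") as connection:
    with connection.cursor() as cursor:
        cursor.arraysize = 10000  # rows per fetch round trip (tunable)
        cursor.execute("SELECT * FROM schema.table_name")

        with open("output.csv", mode="w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            # Write header (column names)
            writer.writerow([desc[0] for desc in cursor.description])

            # Stream rows in batches instead of holding ~1M rows in memory
            while True:
                batch = cursor.fetchmany(cursor.arraysize)
                if not batch:
                    break
                writer.writerows(batch)

print("Data saved to output.csv")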

 

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa
