Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Improve download speed or see download progress - Python-Databricks SQL

alejandrofm
Valued Contributor

Hi! I'm using the code from here to execute a query on Databricks, and it runs flawlessly; I can follow it from the Spark UI, etc. The problem is what happens afterwards: at the moment it seems to be downloading the result (Spark is idle, there is a green check in the query history window, the total time isn't shown, and I don't have the data locally yet).

The issue is that I don't know how to improve this, nor how much data is being downloaded, what percentage I've received, what's remaining, etc. I also can't find Ganglia to look at network performance and understand what is happening.

Any idea how to improve the download rate (or how to diagnose it a little so I can improve it)?

Another thing is that the endpoint keeps running and charging me while this happens, so I'm thinking of downloading the result to S3 and getting it from there... but I want to solve the monitoring issue first so I can understand how to improve this.

Thanks!

7 REPLIES

AmanSehgal
Honored Contributor III

Could you please add the link from where you got the code? I think you forgot to add it.

Also could you please add some screenshots of what you can see and what you're trying to achieve?

Sure! I'm using this example:

https://docs.databricks.com/dev-tools/python-sql-connector.html#language-SQL%C2%A0warehouse

Attached is an image of what I'm looking at. The query was canceled by timeout, but I can't find a way to know whether my local connection is slow, Databricks is slow, or the node is slow. I need a little more information to improve these queries.

thanks!!!

AmanSehgal
Honored Contributor III

Icons in query history explained:

  • Blue dotted circle means the query is running
  • Hourglass icon means the query is queued
  • Red square means the query failed
  • Green tick means the query execution was successful

When a query execution finishes, it'll either have a red square or a green tick. Click on your query to see the metrics panel that opens up on the right side of your screen. In the IO section you'll see:

  • Rows returned
  • Rows read
  • Bytes read
  • Bytes read from cache
  • Bytes written

In your Python code, when you call `cursor.fetchall()`, you can check how many rows you received and map that to the query history output.

result = cursor.fetchall()
print(len(result))
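If the worry is that `fetchall()` blocks with no feedback until every row has arrived, one option is to pull rows in batches with `fetchmany()` and log a running count. Here's a minimal sketch; `fetch_with_progress` is a hypothetical helper, and the demo uses an in-memory SQLite cursor as a stand-in, since the Databricks SQL connector's cursor exposes the same DB-API `fetchmany()` interface:

```python
import sqlite3  # stand-in for a DB-API connection; databricks.sql cursors also support fetchmany()

def fetch_with_progress(cursor, batch_size=10_000):
    """Pull rows in batches, reporting a running count instead of blocking silently in fetchall()."""
    rows = []
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)
        print(f"fetched {len(rows):,} rows so far")
    return rows

# demo against an in-memory SQLite table
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(25)])
result = fetch_with_progress(cur.execute("SELECT x FROM t"), batch_size=10)
```

This doesn't make the transfer faster, but it at least shows whether rows are still arriving or the connection has stalled.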

Also, once your query execution finishes, the warehouse will shut down after its configured auto-stop period of inactivity, or you can shut it down manually.

Yes, the query was successful, but I still don't have the data available (maybe it's still downloading). I need to know the size of the data being downloaded, or the progress of that download.

I just tried increasing the capacity of the cluster x2 and x4, and the "download" speed decreases, but I don't understand why. It makes no sense that, with my home connection, the size of the cluster affects the download speed (once the results are available).

Thanks!

AmanSehgal
Honored Contributor III

Did you check the metrics on DB SQL as advised? How much data, or how many rows, are you expecting?

Sorry, I didn't see the "expand" button. In that list of fields, should I look at "bytes written"?

And with print(len(result)) I'll know the size, but only once it's downloaded. I need to know how it's going as it happens: how many bytes I've downloaded and how many remain... something like that, so the user knows whether to cancel because it's trying to download 3 TB, or it's only 200 MB over a slow connection.
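As far as I know, the connector doesn't expose a bytes-remaining counter, but if you fetch in batches you can at least estimate the local size of what has arrived so far. A rough sketch (`approx_batch_bytes` is a hypothetical helper; the numbers approximate Python object size in memory, not the compressed network payload, so treat them as an order-of-magnitude indicator only):

```python
import sys

def approx_batch_bytes(batch):
    """Rough in-memory size of a batch of fetched rows (tuples of Python values).
    NOTE: measures local object overhead, not bytes on the wire."""
    return sum(
        sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
        for row in batch
    )

batch = [(1, "alice"), (2, "bob")]
print(f"~{approx_batch_bytes(batch)} bytes in this batch")
```

Combined with the "Bytes read" figure from the query history IO panel, this gives a crude sense of how far along the transfer is.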

Thanks!

Vidula
Honored Contributor

Hi @Alejandro Martinez​ 

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
