08-31-2022 02:12 PM
Hi! I'm using the code from here to execute a query on Databricks. The query itself runs flawlessly and I can follow it in the Spark UI, but the problem is what seems to be the download of the result: Spark is idle, there's a green check in the Query History window, no total time is shown, and I still don't have the data locally.
The issue is that I don't know how to speed this up, nor how much data is being downloaded, what percentage of it I've received, what's remaining, etc. I can't even find Ganglia to look at the network metrics and understand what's happening.
Any idea how to improve the download rate (or how to diagnose it a little so I can improve it)?
Another thing: the endpoint keeps running and billing me while this happens, so I'm thinking of writing the result to S3 and getting it from there... but I'd like to solve the monitoring issue first so I can understand how to improve this.
Thanks!
08-31-2022 05:15 PM
Could you please add the link to where you got the code? I think you forgot to include it.
Also, could you please add some screenshots of what you see and what you're trying to achieve?
09-01-2022 06:02 AM
Sure! I'm using this example:
https://docs.databricks.com/dev-tools/python-sql-connector.html#language-SQL%C2%A0warehouse
Attached is an image of what I'm looking at. The query was canceled by timeout, but I can't find a way to tell whether my local connection is slow, Databricks is slow, or the node is slow. I need a little more information to improve these queries.
thanks!!!
09-01-2022 06:49 AM
Icons in query history explained:
When a query execution finishes, it'll show either a red square or a green tick. Click on your query and a metrics panel opens on the right side of your screen; the IO section lists the data-transfer figures for that query.
In your Python code, after you call `cursor.fetchall()`, you can check how many rows you received and map that to the Query History output:
result = cursor.fetchall()
print(len(result))  # number of rows fetched locally
Also, once your query execution finishes, the warehouse will shut down after its configured minutes of inactivity, or you can shut it down manually.
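If you want some visibility into progress while the rows come down, one option is to fetch in batches with `fetchmany()` instead of a single `fetchall()`, printing a running row count. Here's a minimal sketch of that loop. It's shown against an in-memory `sqlite3` database purely so it's self-contained; since both `sqlite3` and the Databricks SQL connector expose the DB-API 2.0 cursor interface, the same loop works if you swap in a `databricks.sql.connect(...)` connection (the table and batch size here are made up for illustration):

```python
import sqlite3

# Stand-in connection for the sketch. With the Databricks connector you'd use:
#   from databricks import sql
#   conn = sql.connect(server_hostname=..., http_path=..., access_token=...)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

cur.execute("SELECT x FROM t")

BATCH = 1_000  # rows per round trip; tune to your row size
rows = []
while True:
    batch = cur.fetchmany(BATCH)
    if not batch:
        break  # cursor exhausted: all rows received
    rows.extend(batch)
    print(f"fetched {len(rows)} rows so far")

print(f"done: {len(rows)} rows total")
conn.close()
```

Compared with `fetchall()`, this also gives you a natural place to bail out early if the result turns out to be far larger than expected.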
09-01-2022 07:47 AM
Yes, the query was successful, but I still don't have the data available (maybe it's downloading). I need to know the size of the data being downloaded, or the progress of that download.
I just tried increasing the capacity of the cluster x2 and x4, and the speed of the "download" decreased, but I don't understand why. It makes no sense that, once the results are available, the size of the cluster would affect the download speed over my home connection.
Thanks!
09-01-2022 10:10 AM
Did you check the metrics on DB SQL as advised? How much data, or how many rows, are you expecting?
09-01-2022 10:37 AM
Sorry, I didn't see the "expand" button. In that list of fields, should I look at "bytes written"?
And with print(len(result)) I'll know the size, but only once the data is downloaded. I need to know how the download is going: how many bytes I've received and how many remain, something like that, so the user can tell whether to cancel because it's trying to download 3 TB, or whether it's only 200 MB and the connection is just slow.
Thanks!
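One rough way to answer "is this 3 TB or 200 MB?" before committing to the full download: fetch a small sample batch, estimate the in-memory bytes per row, and multiply by the total row count reported in the Query History panel. A sketch of that idea (the sample rows and the row count below are hypothetical placeholders; in practice the sample would come from `cursor.fetchmany()` and the row count from the UI):

```python
import sys

def estimate_download_bytes(sample_rows, total_rows):
    """Project an approximate total transfer size from a sample of rows.

    This measures Python object sizes, not wire bytes, so treat the
    result as an order-of-magnitude estimate only.
    """
    if not sample_rows:
        return 0
    sample_bytes = sum(
        sum(sys.getsizeof(field) for field in row) for row in sample_rows
    )
    per_row = sample_bytes / len(sample_rows)
    return int(per_row * total_rows)

# Hypothetical sample, shaped like rows a DB-API cursor.fetchmany() returns.
sample = [(i, "some text value") for i in range(100)]
total_rows_from_query_history = 2_000_000  # made-up figure; read yours from the UI
print(f"~{estimate_download_bytes(sample, total_rows_from_query_history):,} bytes")
```

It's crude, but it gives a go/no-go signal before the long wait starts, which is exactly the decision being described above.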
09-17-2022 12:53 AM
Hi @Alejandro Martinez
Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!