08-31-2022 02:12 PM
Hi! I'm using the code from here to execute a query on Databricks. The query itself runs flawlessly and I can follow it in the Spark UI, but the problem is what seems to be the download of the result: Spark is idle, there's a green check in the Query History window, no total time is shown, and I still don't have the data locally.
The issue is that I don't know how to speed this up, nor how much data is being downloaded, what percentage of it I've received, what's remaining, etc. I can't even find Ganglia to look at the network metrics and understand what's happening.
Any idea how to improve the download rate (or how to diagnose it a little so I can improve it)?
Another thing: the endpoint keeps running and billing me while this happens, so I'm thinking of writing the result to S3 and getting it from there... but I'd like to solve the monitoring issue first so I can understand how to improve this.
Thanks!
08-31-2022 05:15 PM
Could you please add the link to where you got the code? I think you forgot to include it.
Also, could you please add some screenshots of what you see and what you're trying to achieve?
09-01-2022 06:02 AM
Sure! I'm using this example:
https://docs.databricks.com/dev-tools/python-sql-connector.html#language-SQL%C2%A0warehouse
Attached is an image of what I'm looking at. The query was canceled by timeout, but I can't find a way to tell whether my local connection is slow, Databricks is slow, or the node is slow. I need a little more information to improve these queries.
thanks!!!
09-01-2022 06:49 AM
Icons in query history explained:
When a query execution finishes, it'll show either a red square or a green tick. Click on your query and a metrics panel opens on the right side of your screen; the IO section lists the data-transfer figures for that query.
In your Python code, after you call `cursor.fetchall()`, you can check how many rows you received and map that to the Query History output:
result = cursor.fetchall()
print(len(result))  # number of rows fetched locally
Also, once your query execution finishes, the warehouse will shut down after its configured minutes of inactivity, or you can shut it down manually.
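If you want some visibility into progress while the rows come down, one option is to fetch in batches with `fetchmany()` instead of a single `fetchall()`, printing a running row count. Here's a minimal sketch of that loop. It's shown against an in-memory `sqlite3` database purely so it's self-contained; since both `sqlite3` and the Databricks SQL connector expose the DB-API 2.0 cursor interface, the same loop works if you swap in a `databricks.sql.connect(...)` connection (the table and batch size here are made up for illustration):

```python
import sqlite3

# Stand-in connection for the sketch. With the Databricks connector you'd use:
#   from databricks import sql
#   conn = sql.connect(server_hostname=..., http_path=..., access_token=...)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

cur.execute("SELECT x FROM t")

BATCH = 1_000  # rows per round trip; tune to your row size
rows = []
while True:
    batch = cur.fetchmany(BATCH)
    if not batch:
        break  # cursor exhausted: all rows received
    rows.extend(batch)
    print(f"fetched {len(rows)} rows so far")

print(f"done: {len(rows)} rows total")
conn.close()
```

Compared with `fetchall()`, this also gives you a natural place to bail out early if the result turns out to be far larger than expected.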
09-01-2022 07:47 AM
Yes, the query was successful, but I still don't have the data available (maybe it's downloading). I need to know the size of the data being downloaded, or the progress of that download.
I just tried increasing the capacity of the cluster x2 and x4, and the speed of the "download" decreased, but I don't understand why. It makes no sense that, once the results are available, the size of the cluster would affect the download speed over my home connection.
Thanks!
09-01-2022 10:10 AM
Did you check the metrics on DB SQL as advised? How much data, or how many rows, are you expecting?
09-01-2022 10:37 AM
Sorry, I didn't see the "expand" button. In that list of fields, should I look at "bytes written"?
And with print(len(result)) I'll know the size, but only once the data is downloaded. I need to know how the download is going: how many bytes I've received and how many remain, something like that, so the user can tell whether to cancel because it's trying to download 3 TB, or whether it's only 200 MB and the connection is just slow.
Thanks!
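One rough way to answer "is this 3 TB or 200 MB?" before committing to the full download: fetch a small sample batch, estimate the in-memory bytes per row, and multiply by the total row count reported in the Query History panel. A sketch of that idea (the sample rows and the row count below are hypothetical placeholders; in practice the sample would come from `cursor.fetchmany()` and the row count from the UI):

```python
import sys

def estimate_download_bytes(sample_rows, total_rows):
    """Project an approximate total transfer size from a sample of rows.

    This measures Python object sizes, not wire bytes, so treat the
    result as an order-of-magnitude estimate only.
    """
    if not sample_rows:
        return 0
    sample_bytes = sum(
        sum(sys.getsizeof(field) for field in row) for row in sample_rows
    )
    per_row = sample_bytes / len(sample_rows)
    return int(per_row * total_rows)

# Hypothetical sample, shaped like rows a DB-API cursor.fetchmany() returns.
sample = [(i, "some text value") for i in range(100)]
total_rows_from_query_history = 2_000_000  # made-up figure; read yours from the UI
print(f"~{estimate_download_bytes(sample, total_rows_from_query_history):,} bytes")
```

It's crude, but it gives a go/no-go signal before the long wait starts, which is exactly the decision being described above.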
09-17-2022 12:53 AM
Hi @Alejandro Martinez
Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!