cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How can I view the query history, duration, etc for all users

alejandrofm
Valued Contributor

Hi! I have some jobs that stay idle for some time when getting data from a S3 mount on DBFS, this are all SQL queries on Delta, how can I know where is the bottle neck, duration, cue? to diagnose the slow spark performance that I think is on the process of getting the data.

And also to send recomendations to users publishing jobs with SQL bad practices.

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

alejandrofm
Valued Contributor

We found out we were regeneratig the symlink manifest for all the partitions on this case. And for some reason it was executed twice, at start and end of the job.

delta_table.generate('symlink_format_manifest')

We configured the table with:

ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)

and removed the symlink creation on the job and all worked great from 2hs to 30 min.

Thanks!

View solution in original post

11 REPLIES 11

AmanSehgal
Honored Contributor III

Could you please give an example of your issue? As it's not clear where exactly are you facing issue.

Is the issue with loading files to S3 and processing them or you've bunch of queries that take longer than usual?

To view query history on databricks, go to query history tab in your workspace.

alejandrofm
Valued Contributor

The issue appears to be when I query the Delta Lake from a Databricks spark job, I say appears to be because in not so complex processes when the first step is an SQL query the cluster stays idle for 10+ minutes (no data in or out, almost no CPU, no ram changes).

The S3/AWS reference was to tell you on which cloud service we are working.

When I go into the Query History tab don't see anything related to other users, maybe I'm the only one executing queries via Databricks at this point, but the Spark job is making the queries, so I think there should be a log of those, Am I wrong?

Thanks!

AmanSehgal
Honored Contributor III

Could you share a screenshot?

When you say you're querying delta lake from Databricks spark job, do you mean to say that you're using Databricks Jobs to run a notebook with an SQL in it?

alejandrofm
Valued Contributor

I have a spark job on PySpark it runs a query like this but much complex:

data = spark.sql(  "SELECT * FROM schema.table INNER JOIN...WHERE...etc etc ec")

agains the Delta Lake, do some heavy process on the result, and the store it on another table.

I named a lot of things :P, a screenshot of what process will be useful? Thanks

AmanSehgal
Honored Contributor III

Probably a screesnhot of any logs or maybe something that shows cluster stays idle for 10+ minutes (no data in or out, almost no CPU, no ram changes) when you submit the query

alejandrofm
Valued Contributor

Sure! this is an example from Ganglia, first full idle, no readings, nothing, then it started to read/process data (images are not in order)

imageimage

AmanSehgal
Honored Contributor III

If you're running the spark job from within the notebook, then you won't see any SQL queries in Query History tab.

The Query History tab shows queries that were executed using SQL Endpoints and not via clusters.

Next, the issue that you're facing could be due to small cluster size. Say, you start a cluster with 2 nodes and give 8 nodes as upper limit. When you run a query, the cluster might undergo resizing to add more nodes for serving multiple JOINS in your query. You can check that from Event log tab of a cluster. If that's the case, then probably start your cluster with 4 nodes at least. Also check the worker type and driver type of cluster that you're using.

Could you please do following:

Run the query for first time. It might take 10 minutes as you say. The moment you get the output, re-run the query immediately and check how much time it takes. Is it less as compared to previous run?

alejandrofm
Valued Contributor

As you can see on the second screenshot after 40 minutes with 2 nodes it start receiving data and then it scales to 8 very fast. I don't think that this is slow scaling problem (either way I just configure the minimum to 4 instead of two to see if there are any changes)

Will try the second part of your answer tomorrow,

thank you!

Atanu
Esteemed Contributor

@Alejandro Martinez​ was following up. Are you able to get the desired scenario after your test?

alejandrofm
Valued Contributor

We found out we were regeneratig the symlink manifest for all the partitions on this case. And for some reason it was executed twice, at start and end of the job.

delta_table.generate('symlink_format_manifest')

We configured the table with:

ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)

and removed the symlink creation on the job and all worked great from 2hs to 30 min.

Thanks!

Hi @Alejandro Martinez​ ,

Thank you for providing the solution to this issue. It would help if you mark it a best response.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group