06-24-2022 11:18 AM
Hello, I have a Databricks question. A DataFrame job that writes to an S3 bucket usually takes 8 minutes to finish, but now it takes 8 to 9 hours to complete. Does anybody have any clues about this behavior?
The DataFrame is small, about 300 to 400 records.
It is a simple query on a Delta table:
val results = spark
  .table("table")
  .filter(by_date)                     // keep only rows in the target date range
  .drop(some_columns: _*)              // some_columns: Seq[String] of columns to discard
  .select(a_struct_field)
  .withColumn("image", image)

listofString.foreach { mystring =>
  println(s"start writing .json to S3 for ${mystring}")
  results
    .filter($"struct.field.result" === mystring)
    .coalesce(1)                       // force a single output file per value
    .write
    .mode(SaveMode.Overwrite)
    .json(s"${filePath}/temp_${mystring}")
  println(s"complete writing .json to S3 for ${mystring}")
}
Thanks in advance
06-29-2022 08:27 AM
Hello, I was able to reduce the time significantly. I ran the OPTIMIZE command on the Delta table before starting processing.
Thanks!
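A minimal sketch of that step, assuming the job runs on a Databricks cluster and "table" is the Delta table read above (OPTIMIZE compacts the table's many small files, so the repeated per-value filters in the loop scan far less data):

// Run once before the read/write loop; "table" matches the Delta table name above.
spark.sql("OPTIMIZE table")

// Optionally, co-locating data on the filtered column can cut file reads further.
// The column name here is illustrative and would need to match the real schema:
// spark.sql("OPTIMIZE table ZORDER BY (result)")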
06-27-2022 06:00 AM
Hi @Raymond Garcia, here are the top 5 things we see that can significantly impact the performance customers get from Databricks. Please have a read and let us know how it helps.
06-28-2022 02:35 PM
Hi, thanks! I will check them out and let you know. 🙂