06-24-2022 11:18 AM
Hello, I have a Databricks question. A DataFrame job that writes to an S3 bucket usually takes 8 minutes to finish, but now it takes 8 to 9 hours to complete. Does anybody have any clues about this behavior?
The DataFrame is small, about 300 to 400 records.
It is a simple query on a Delta table:
val results = spark
  .table("table")
  .filter(by_date)                     // keep only rows in the target date range
  .drop(some_columns: _*)              // some_columns: Seq[String] of columns to discard
  .select(a_struct_field)
  .withColumn("image", image)

listofString.foreach { mystring =>
  println(s"start writing .json to S3 for ${mystring}")
  results
    .filter($"struct.field.result" === mystring)
    .coalesce(1)                       // force a single output file per value
    .write
    .mode(SaveMode.Overwrite)
    .json(s"${filePath}/temp_${mystring}")
  println(s"complete writing .json to S3 for ${mystring}")
}
Thanks in advance
06-29-2022 08:27 AM
Hello, I was able to reduce the time significantly. I ran the OPTIMIZE command on the Delta table before starting processing.
Thanks!
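A minimal sketch of that step, assuming the job runs on a Databricks cluster and "table" is the Delta table read above (OPTIMIZE compacts the table's many small files, so the repeated per-value filters in the loop scan far less data):

// Run once before the read/write loop; "table" matches the Delta table name above.
spark.sql("OPTIMIZE table")

// Optionally, co-locating data on the filtered column can cut file reads further.
// The column name here is illustrative and would need to match the real schema:
// spark.sql("OPTIMIZE table ZORDER BY (result)")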
06-27-2022 06:00 AM
Hi @Raymond Garcia, here are the top 5 things we see that can significantly impact the performance customers get from Databricks. Please have a read and let us know how it helps.
06-28-2022 02:35 PM
Hi, thanks! I will check them out and let you know. 🙂