02-11-2023 08:57 AM
Hi,
I am trying to write the contents of a dataframe into a parquet table using the command below.
df.write.mode("overwrite").format("parquet").saveAsTable("sample_parquet_table")
The dataframe contains an extract from one of our source systems, which happens to be a Postgres database, and was prepared using a SQL statement. The data count would approximately be around 0.3M records. The target table is parquet and I have tried writing in overwrite mode.
The problem is, this statement keeps on running with no progress and automatically gets timed out after hours. As part of our requirement, we can afford a maximum of ~10mins to get this written into the target.
Is there a way to improve the performance? Or atleast understand where the problem lies? The target can be changed to a "delta" and can also be partitioned if needed.
02-23-2023 10:28 AM
I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta
02-12-2023 06:21 AM
I think you can create partitioned and store it as delta table and optimize table using Zorder,
May i know your cluster configurations as well ?
02-23-2023 10:28 AM
I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta
03-12-2024 05:28 AM
With regard to point below that has been accepted as a solution
"I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta"
The fundamental considerations for optimizing write operations, especially those involving shuffling and partitioning, remain similar to Parquet for Delta. When you use partitionedBy during a write operation in Spark, it involves a shuffle operation to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because they both rely on Spark engine for data processing. My inclination would be to observe from Spark UI (4040), and the staging tab where insert (job) are taking longest . Evaluate the metrics related to tasks, such as input/output data size, shuffle read/write, and CPU time. Both SQL and executors tabs will also help in pinpointing the issue. You can also use compression like SNAPPY to reduce volume of writes.
HTH
02-12-2023 11:15 PM
Hi @Souradipta Sen
Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.
We'd love to hear from you.
Thanks!
10-21-2023 05:36 PM
Thanks for telling me, I didn't even know about it.
10-23-2023 07:45 PM
Many students are faced with the fact that they do not know where to turn for help with writing an essay. Sometimes it takes most of the day and is really tedious. I can only emphasize from myself that now all this is a solvable task. For example, you can hire professionals who write my essay for me https://ukwritings.com/write-my-essay and get an amazing result very quickly and easily. I hope you will find this useful.
03-10-2024 11:33 AM
Imho this issue may cause by SQL query which generate your DF - Queries are laze operation and starts when you need data - in this case - when you write DF to table(0.3M rows is nothing for Spark). So it's not write cause this issue but query - rewrite it for performance and all will work fast.
Have a nice day!
03-10-2024 01:31 PM
Hi,
I agree with the reply around the benefits of Delta tables, specifically Delta brings additional features,
such as ACID transactions and schema evolution. However, I am not sure whether the problem below and I quote "The problem is, this statement keeps on running with no progress and automatically gets timed out after hours. As part of our requirement, we can afford a maximum of ~10mins to get this written into the target." is going to be reduced etc. The fundamental considerations for optimizing write operations,
especially those involving shuffling and partitioning, remain similar to Parquet for Delta. When you use partitionedBy during a write operation in Spark, it involves a shuffle operation to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because they both rely on Spark engine for data processing. My inclination would be to observe from Spark UI (4040), and the staging tab where insert (job) are taking longest . Evaluate the metrics related to tasks, such as input/output data size, shuffle read/write, and CPU time. Both SQL and executors tabs will also help in pinpointing the issue. You can also use compression like SNAPPY to reduce volume of writes.
df.write.option("compression", "snappy").mode("overwrite").format("parquet").saveAsTable("table_name")
HTH
"
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group