<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance enhancement while writing dataframes into Parquet tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9541#M4880</link>
    <description>&lt;P&gt;Hi @Souradipta Sen,&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Mon, 13 Feb 2023 07:15:40 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2023-02-13T07:15:40Z</dc:date>
    <item>
      <title>Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9539#M4878</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am trying to write the contents of a dataframe into a Parquet table using the command below.&lt;/P&gt;&lt;P&gt;&lt;I&gt;df.write.mode("overwrite").format("parquet").saveAsTable("sample_parquet_table")&lt;/I&gt;&lt;/P&gt;&lt;P&gt;The dataframe contains an extract from one of our source systems, which happens to be a Postgres database, and was prepared using a SQL statement. The data volume is approximately 0.3M records. The target table is Parquet, and I have tried writing in overwrite mode.&lt;/P&gt;&lt;P&gt;The problem is, this statement keeps running with no progress and automatically times out after hours. As part of our requirement, we can afford a maximum of ~10 minutes to get this written into the target.&lt;/P&gt;&lt;P&gt;Is there a way to improve the performance, or at least understand where the problem lies? The target can be changed to Delta and can also be partitioned if needed.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2023 16:57:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9539#M4878</guid>
      <dc:creator>Sen</dc:creator>
      <dc:date>2023-02-11T16:57:24Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9540#M4879</link>
      <description>&lt;P&gt;I think you can partition the data and store it as a Delta table, then optimize the table using Z-Ordering.&lt;/P&gt;&lt;P&gt;May I know your cluster configuration as well?&lt;/P&gt;</description>
      <pubDate>Sun, 12 Feb 2023 14:21:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9540#M4879</guid>
      <dc:creator>mk1987c</dc:creator>
      <dc:date>2023-02-12T14:21:12Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9541#M4880</link>
      <description>&lt;P&gt;Hi @Souradipta Sen,&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2023 07:15:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9541#M4880</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-02-13T07:15:40Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9542#M4881</link>
      <description>&lt;P&gt;I would highly recommend saving your data as Delta instead of Parquet; Delta has many extra benefits.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Feb 2023 18:28:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9542#M4881</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2023-02-23T18:28:33Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63170#M32190</link>
      <description>&lt;P&gt;IMHO, this issue may be caused by the SQL query that generates your DataFrame. Queries are lazy operations and only start when the data is needed, in this case when you write the DataFrame to the table (0.3M rows is nothing for Spark). So it is not the write that causes this issue but the query: rewrite it for performance and everything will run fast.&lt;/P&gt;&lt;P&gt;Have a nice day!&lt;/P&gt;</description>
      <pubDate>Sun, 10 Mar 2024 18:33:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63170#M32190</guid>
      <dc:creator>Kyrylo_Ozz</dc:creator>
      <dc:date>2024-03-10T18:33:34Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63197#M32191</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I agree with the reply about the benefits of Delta tables; specifically, Delta brings additional features such as ACID transactions and schema evolution. However, I am not sure the problem quoted below would be reduced: "The problem is, this statement keeps running with no progress and automatically times out after hours. As part of our requirement, we can afford a maximum of ~10 minutes to get this written into the target." The fundamental considerations for optimizing write operations, especially those involving shuffling and partitioning, are similar for Parquet and Delta. When you use partitionBy during a write operation in Spark, it involves a shuffle to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because both rely on the Spark engine for data processing. My inclination would be to observe from the Spark UI (port 4040) which insert jobs are taking longest in the Stages tab. Evaluate the task-level metrics, such as input/output data size, shuffle read/write, and CPU time. The SQL and Executors tabs will also help in pinpointing the issue. You can also use compression such as Snappy to reduce the volume of writes:&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier" size="2"&gt;df.write.option("compression", "snappy").mode("overwrite").format("parquet").saveAsTable("table_name")&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
      <pubDate>Sun, 10 Mar 2024 20:31:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63197#M32191</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-10T20:31:14Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63377#M32221</link>
      <description>&lt;P&gt;&lt;SPAN&gt;With regard to the point below, which has been accepted as a solution:&lt;BR /&gt;"I would highly recommend saving your data as Delta instead of Parquet; Delta has many extra benefits."&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The fundamental considerations for optimizing write operations, especially those involving shuffling and partitioning, are similar for Parquet and Delta. When you use partitionBy during a write operation in Spark, it involves a shuffle to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because both rely on the Spark engine for data processing. My inclination would be to observe from the Spark UI (port 4040) which insert jobs are taking longest in the Stages tab. Evaluate the task-level metrics, such as input/output data size, shuffle read/write, and CPU time. The SQL and Executors tabs will also help in pinpointing the issue. You can also use compression such as Snappy to reduce the volume of writes.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;HTH&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 12:28:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63377#M32221</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-12T12:28:32Z</dc:date>
    </item>
  </channel>
</rss>

