cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Performance enhancement while writing dataframes into Parquet tables

Sen
New Contributor

Hi,

I am trying to write the contents of a dataframe into a parquet table using the command below.

df.write.mode("overwrite").format("parquet").saveAsTable("sample_parquet_table")

The dataframe contains an extract from one of our source systems, which happens to be a Postgres database, and was prepared using a SQL statement. The data count would approximately be around 0.3M records. The target table is parquet and I have tried writing in overwrite mode.

The problem is, this statement keeps on running with no progress and automatically gets timed out after hours. As part of our requirement, we can afford a maximum of ~10mins to get this written into the target.

Is there a way to improve the performance? Or atleast understand where the problem lies? The target can be changed to a "delta" and can also be partitioned if needed.

1 ACCEPTED SOLUTION

Accepted Solutions

I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta

View solution in original post

9 REPLIES 9

mk1987c
New Contributor III

I think you can create partitioned and store it as delta table and optimize table using Zorder,

May i know your cluster configurations as well ?

I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta

With regard to point below that has been accepted as a solution
"I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta"

The fundamental considerations for optimizing write operations, especially those involving shuffling and partitioning, remain similar to Parquet for Delta. When you use partitionedBy during a write operation in Spark, it involves a shuffle operation to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because they both rely on Spark engine for data processing. My inclination would be to observe from Spark UI (4040), and the staging tab where insert (job) are taking longest . Evaluate the metrics related to tasks, such as input/output data size, shuffle read/write, and CPU time. Both SQL and executors tabs will also help in pinpointing the issue. You can also use compression like SNAPPY to reduce volume of writes.

HTH

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

view my Linkedin profile



https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".

Anonymous
Not applicable

Hi @Souradipta Sen​ 

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. 

We'd love to hear from you.

Thanks!

Raluka
New Contributor III

Thanks for telling me, I didn't even know about it.

Raluka
New Contributor III

Many students are faced with the fact that they do not know where to turn for help with writing an essay. Sometimes it takes most of the day and is really tedious. I can only emphasize from myself that now all this is a solvable task. For example, you can hire professionals who write my essay for me https://ukwritings.com/write-my-essay  and get an amazing result very quickly and easily. I hope you will find this useful.

Kyrylo_Ozz
New Contributor II

Imho this issue may cause by SQL query which generate your DF - Queries are laze operation and starts when you need data - in this case - when you write DF to table(0.3M rows is nothing for Spark). So it's not write cause this issue but query - rewrite it for performance and all will work fast. 

Have a nice day! 

MichTalebzadeh
Valued Contributor

Hi,

I agree with the reply around the benefits of Delta tables, specifically Delta brings additional features,
such as ACID transactions and schema evolution. However, I am not sure whether the problem below and I quote "The problem is, this statement keeps on running with no progress and automatically gets timed out after hours. As part of our requirement, we can afford a maximum of ~10mins to get this written into the target." is going to be reduced etc. The fundamental considerations for optimizing write operations,
especially those involving shuffling and partitioning, remain similar to Parquet for Delta. When you use partitionedBy during a write operation in Spark, it involves a shuffle operation to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because they both rely on Spark engine for data processing. My inclination would be to observe from Spark UI (4040), and the staging tab where insert (job) are taking longest . Evaluate the metrics related to tasks, such as input/output data size, shuffle read/write, and CPU time. Both SQL and executors tabs will also help in pinpointing the issue. You can also use compression like SNAPPY to reduce volume of writes.

df.write.option("compression", "snappy").mode("overwrite").format("parquet").saveAsTable("table_name")


HTH

 

 

"

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

view my Linkedin profile



https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".

jhoon
New Contributor

Great discussion on performance optimization! Managing technical projects like these alongside academic work can be demanding. If you need expert academic support to free up time for your professional pursuits, Dissertation Help Services is here to assist. Balance your workload effectively and keep achieving excellence!

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group