Databricks

Sen · ‎02-11-2023

Hi,

I am trying to write the contents of a dataframe into a parquet table using the command below.

df.write.mode("overwrite").format("parquet").saveAsTable("sample_parquet_table")

The dataframe contains an extract from one of our source systems, which happens to be a Postgres database, and was prepared using a SQL statement. The data count would approximately be around 0.3M records. The target table is parquet and I have tried writing in overwrite mode.

The problem is, this statement keeps on running with no progress and automatically gets timed out after hours. As part of our requirement, we can afford a maximum of ~10mins to get this written into the target.

Is there a way to improve the performance? Or atleast understand where the problem lies? The target can be changed to a "delta" and can also be partitioned if needed.

jose_gonzalez · ‎02-23-2023

I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta

View solution in original post

mk1987c · ‎02-12-2023

I think you can create partitioned and store it as delta table and optimize table using Zorder,

May i know your cluster configurations as well ?

jose_gonzalez · ‎02-23-2023

I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta

MichTalebzadeh · ‎03-12-2024

With regard to point below that has been accepted as a solution
"I will highly recommend to save your data as Delta instead of parquet. There are many extra benefits in Delta"

The fundamental considerations for optimizing write operations, especially those involving shuffling and partitioning, remain similar to Parquet for Delta. When you use partitionedBy during a write operation in Spark, it involves a shuffle operation to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because they both rely on Spark engine for data processing. My inclination would be to observe from Spark UI (4040), and the staging tab where insert (job) are taking longest . Evaluate the metrics related to tasks, such as input/output data size, shuffle read/write, and CPU time. Both SQL and executors tabs will also help in pinpointing the issue. You can also use compression like SNAPPY to reduce volume of writes.

HTH

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".

Anonymous · ‎02-12-2023

Hi @Souradipta Sen

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.

We'd love to hear from you.

Thanks!

Raluka · ‎10-21-2023

Thanks for telling me, I didn't even know about it.

Raluka · ‎10-23-2023

Many students are faced with the fact that they do not know where to turn for help with writing an essay. Sometimes it takes most of the day and is really tedious. I can only emphasize from myself that now all this is a solvable task. For example, you can hire professionals who write my essay for me https://ukwritings.com/write-my-essay and get an amazing result very quickly and easily. I hope you will find this useful.

Kyrylo_Ozz · ‎03-10-2024

Imho this issue may cause by SQL query which generate your DF - Queries are laze operation and starts when you need data - in this case - when you write DF to table(0.3M rows is nothing for Spark). So it's not write cause this issue but query - rewrite it for performance and all will work fast.

Have a nice day!

MichTalebzadeh · ‎03-10-2024

Hi,

I agree with the reply around the benefits of Delta tables, specifically Delta brings additional features,
such as ACID transactions and schema evolution. However, I am not sure whether the problem below and I quote "The problem is, this statement keeps on running with no progress and automatically gets timed out after hours. As part of our requirement, we can afford a maximum of ~10mins to get this written into the target." is going to be reduced etc. The fundamental considerations for optimizing write operations,
especially those involving shuffling and partitioning, remain similar to Parquet for Delta. When you use partitionedBy during a write operation in Spark, it involves a shuffle operation to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because they both rely on Spark engine for data processing. My inclination would be to observe from Spark UI (4040), and the staging tab where insert (job) are taking longest . Evaluate the metrics related to tasks, such as input/output data size, shuffle read/write, and CPU time. Both SQL and executors tabs will also help in pinpointing the issue. You can also use compression like SNAPPY to reduce volume of writes.

df.write.option("compression", "snappy").mode("overwrite").format("parquet").saveAsTable("table_name")

HTH

"

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".

Databricks

Performance enhancement while writing dataframes into Parquet tables

How to successfully build GenAI applications

Registration now open! Databricks Data + AI Summit 2024

Meet DBRX, the New Standard for High-Quality LLMs

Data Warehousing in the Era of AI