Data Engineering

spark sql update really slow

gideont
New Contributor III

I try to use Spark as much as possible, but I'm seeing some performance regressions. I'm hoping to get some direction on how to use it correctly.

I've created a Databricks table using spark.sql:

spark.sql('select * from example_view ') \
    .write \
    .mode('overwrite') \
    .saveAsTable('example_table')

and then I need to patch some values:

%sql 
 
update example_table set create_date = '2022-02-16' where id = '123';
update example_table set create_date = '2022-02-17' where id = '124';
update example_table set create_date = '2022-02-18' where id = '125';
update example_table set create_date = '2022-02-19' where id = '126';

However, I found this awfully slow since it created hundreds of Spark jobs:

(screenshot: image.png)

Why is Spark doing this, and are there any suggestions on how to improve my code? The last thing I want to do is convert it back to Pandas and update the cell values individually. Any suggestion is appreciated.

Accepted Solutions

Pat
Honored Contributor III

Hi @Vincent Doe,

Updates are available in Delta tables, but under the hood you are updating Parquet files. This means that each update needs to find the files where the matching records are stored, rewrite those files as a new version, and make the new files the current version.
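To make the cost concrete, here is a toy model in plain Python. It is only a sketch of the copy-on-write idea described above, not how Delta is actually implemented: each "file" is a list of rows, and the hypothetical helper `apply_update` rewrites every file that contains at least one matching row. The file layout and the extra id `127` are made up for illustration.

```python
def apply_update(files, patches):
    """Apply one copy-on-write update; return (new_files, rewritten_count)."""
    new_files, rewritten = [], 0
    for rows in files:
        if any(row["id"] in patches for row in rows):
            # A file with at least one matching row is rewritten in full.
            rows = [
                {**row, "create_date": patches.get(row["id"], row["create_date"])}
                for row in rows
            ]
            rewritten += 1
        new_files.append(rows)
    return new_files, rewritten


# Hypothetical layout: five rows spread over three files.
files = [
    [{"id": "123", "create_date": None}, {"id": "124", "create_date": None}],
    [{"id": "125", "create_date": None}, {"id": "126", "create_date": None}],
    [{"id": "127", "create_date": None}],
]
patches = {
    "123": "2022-02-16",
    "124": "2022-02-17",
    "125": "2022-02-18",
    "126": "2022-02-19",
}

# One statement per id: every statement rewrites the file that holds its row.
one_by_one_files, one_by_one_rewrites = files, 0
for key, value in patches.items():
    one_by_one_files, count = apply_update(one_by_one_files, {key: value})
    one_by_one_rewrites += count

# All patches in one statement: each touched file is rewritten once.
batched_files, batched_rewrites = apply_update(files, patches)

print(one_by_one_rewrites, batched_rewrites)
```

In this toy layout the one-by-one approach rewrites files four times and the batched approach twice, with the same final result. A real table also pays for a scan and a commit on every statement, which the model leaves out.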

In your case maybe you should try something like this:

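spark.sql("""
    select
        col1,
        col2,
        col3,
        case
            when id = '123' then '2022-02-16'
            when id = '124' then '2022-02-17'
            when id = '125' then '2022-02-18'
            when id = '126' then '2022-02-19'
            else create_date  -- assumes example_view has create_date; keeps other rows unchanged
        end as create_date
        ...
    from example_view
""") \
    .write \
    .mode('overwrite') \
    .saveAsTable('example_table')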
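If rewriting the whole table is not convenient, another option might be to fold all the patches into one UPDATE statement, so the table is scanned and committed once instead of once per id. The helper below is a hypothetical sketch (`build_patch_update` and `sql_literal` are names made up here, not Databricks APIs). It only builds the SQL text, and it rejects values containing quotes or backslashes rather than trying to escape them.

```python
def sql_literal(value):
    """Wrap a plain string in single quotes; refuse anything needing escaping."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in value: {value!r}")
    return f"'{value}'"


def build_patch_update(table, key_col, target_col, patches):
    """Build one UPDATE statement that applies every key -> value patch."""
    if not patches:
        raise ValueError("patches must not be empty")
    whens = "\n".join(
        f"    WHEN {sql_literal(key)} THEN {sql_literal(value)}"
        for key, value in patches.items()
    )
    keys = ", ".join(sql_literal(key) for key in patches)
    return (
        f"UPDATE {table}\n"
        f"SET {target_col} = CASE {key_col}\n"
        f"{whens}\n"
        f"    ELSE {target_col}\n"
        f"  END\n"
        f"WHERE {key_col} IN ({keys})"
    )


# The four patches from the question, keyed by id.
patches = {
    "123": "2022-02-16",
    "124": "2022-02-17",
    "125": "2022-02-18",
    "126": "2022-02-19",
}
statement = build_patch_update("example_table", "id", "create_date", patches)
print(statement)
```

In a notebook the resulting string could then be run with `spark.sql(statement)`. The WHERE clause limits the update to the listed ids, so only files containing those rows should need rewriting.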

3 REPLIES

gideont
New Contributor III

@Pat Sienkiewicz, those are good tips. Thanks.

Kaniz
Community Manager

Hi @Vincent Doe, it would mean a lot if you could select the "Best Answer" to help others find the correct answer faster.

This makes that answer appear right after the question, so it's easier to find within a thread.

It also helps us mark the question as answered so we can have more eyes helping others with unanswered questions.

Can I count on you?
