02-09-2023 05:25 AM
I need to read/query table A, manipulate/modify the data and insert the new data into Table A again.
I considered using :
Cur_Actual = spark.sql("Select * from Table A")
currAct_Rows = Cur_Actual.rdd.collect()
for row in currAct_Rows:
    do_something(row)
But that doesn't allow me to change the data, for example:
row.DATE = date_add(row.DATE, 1)
And then I don't understand how I would insert the new data into TABLE A.
Any advice would be appreciated.
02-09-2023 05:29 AM
Hard to tell without some context. I suppose Table A is a Hive table based on Delta or Parquet?
If so, this can easily be achieved with a withColumn statement and an overwrite of the data (or write a merge statement, or even an update for Delta Lake).
02-09-2023 05:40 AM
Table A is a Delta table. I've got the write part:
Cur_Actual.write.format('delta').mode('append').save('/location/Table A')
But as I understand it, one cannot loop over a DataFrame, hence converting the data to a collection with .collect().
This data needs to be modified and written back - but how?
02-09-2023 05:44 AM
OK.
Basically you should never loop over a dataframe because that renders the distributed capacity of Spark useless.
What you should do is apply the transformation to the dataframe as a whole (e.g. with withColumn) and then write the result back.
There are some interesting tutorials on the Databricks website which give an introduction to Spark/Databricks.
02-09-2023 11:06 AM
You can use withColumn() for the transformations and then write the data back; the write mode can be append, overwrite, or merge.