02-09-2023 05:25 AM
I need to read/query table A, manipulate/modify the data and insert the new data into Table A again.
I considered using :
Cur_Actual = spark.sql("Select * from Table A")
currAct_Rows = Cur_Actual.rdd.collect()
for row in currAct_Rows:
    do_something(row)
But that doesn't allow me to change the data, for example:
row.DATE = date_add(row.DATE, 1)
And then I don't understand how I would insert the new data into TABLE A.
Any advice would be appreciated.
02-09-2023 05:29 AM
Hard to tell without some context. I suppose Table A is a hive table based on delta or parquet?
If so, this can easily be achieved with a withColumn statement and overwrite of the data (or write a merge statement, or even a update for delta lake).
02-09-2023 05:40 AM
Table A is a Delta table. I get this:
Cur_Actual.write.format('delta').mode('append').save('/location/Table A')
But as I understand it, one cannot loop over a DataFrame, which is why I convert it to a collection with .collect().
That data then needs to be modified and written back - but how?
02-09-2023 05:44 AM
OK.
Basically you should never loop over a dataframe, because that throws away Spark's distributed processing.
What you should do instead is read Table A into a dataframe, express your changes as column transformations (e.g. withColumn), and write the result back.
There are some interesting tutorials on the databricks website which give an introduction to spark/databricks.
02-09-2023 11:06 AM
You can use withColumn() for the transformations and then write the data back; the write can be an append, an overwrite, or a merge.