Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Manipulating Data - using Notebooks

StevenW
New Contributor III

I need to read/query table A, manipulate/modify the data and insert the new data into Table A again.

I considered using:

    Cur_Actual = spark.sql("SELECT * FROM `Table A`")
    currAct_Rows = Cur_Actual.rdd.collect()

    for row in currAct_Rows:
        do_something(row)

But that doesn't allow me to change the data, for example:

    row.DATE = date_add(row.DATE, 1)

And then I don't understand how I would insert the new data into TABLE A.

Any advice would be appreciated.

1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III

OK.

Basically, you should never loop over a DataFrame, because that defeats the distributed processing Spark is built for.

What you should do is:

  1. Read the Delta table into a DataFrame with spark.read.table(table).
  2. Do your transformations. Updating a column is done with withColumn(); there are plenty of other functions as well.
  3. Finally, write the data back. This can be in append mode (as you did), merge (upsert), or overwrite (replace all).

There are some useful tutorials on the Databricks website that give an introduction to Spark and Databricks.


4 REPLIES

-werners-
Esteemed Contributor III

Hard to tell without some context. I assume Table A is a Hive table backed by Delta or Parquet?

If so, this can easily be achieved with a withColumn statement and an overwrite of the data (or write a merge statement, or even an UPDATE for Delta Lake).

StevenW
New Contributor III

Table A is a Delta table. I get this far:

    Cur_Actual.write.format('delta').mode('append').save('/location/Table A')

But as I understand it, one cannot loop over a DataFrame, hence the data is converted to a local collection with .collect().

This data needs to be modified and written back - but how?


Manoj12421
Valued Contributor II

You can use withColumn() for the transformations and then write the data back in append, overwrite, or merge mode.
