<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Manipulating Data - using Notebooks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9672#M4999</link>
    <description>&lt;P&gt;You can use withColumn() for the transformations and then write the data. The write can be an append, an overwrite, or a merge.&lt;/P&gt;</description>
    <pubDate>Thu, 09 Feb 2023 19:06:52 GMT</pubDate>
    <dc:creator>Manoj12421</dc:creator>
    <dc:date>2023-02-09T19:06:52Z</dc:date>
    <item>
      <title>Manipulating Data - using Notebooks</title>
      <link>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9668#M4995</link>
      <description>&lt;P&gt;I need to read/query Table A, manipulate/modify the data and insert the new data into Table A again.&lt;/P&gt;&lt;P&gt;I considered using:&lt;/P&gt;&lt;P&gt;Cur_Actual = spark.sql("Select * from Table A")&lt;/P&gt;&lt;P&gt;currAct_Rows = Cur_Actual.rdd.collect()&lt;/P&gt;&lt;P&gt;for row in currAct_Rows:&lt;/P&gt;&lt;P&gt;    do_something(row)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But that doesn't allow me to change the data, for example:&lt;/P&gt;&lt;P&gt;   row.DATE = date_add(row.DATE, 1)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And then I don't understand how I would insert the new data into Table A.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any advice would be appreciated.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 13:25:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9668#M4995</guid>
      <dc:creator>StevenW</dc:creator>
      <dc:date>2023-02-09T13:25:07Z</dc:date>
    </item>
    <item>
      <title>Re: Manipulating Data - using Notebooks</title>
      <link>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9669#M4996</link>
      <description>&lt;P&gt;Hard to tell without some context. I suppose Table A is a Hive table based on Delta or Parquet?&lt;/P&gt;&lt;P&gt;If so, this can easily be achieved with a withColumn statement and an overwrite of the data (or a merge statement, or even an update for Delta Lake).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 13:29:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9669#M4996</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-02-09T13:29:48Z</dc:date>
    </item>
    <item>
      <title>Re: Manipulating Data - using Notebooks</title>
      <link>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9670#M4997</link>
      <description>&lt;P&gt;Table A is a Delta table. I get this:&lt;/P&gt;&lt;P&gt;Cur_Actual.write.format('delta').mode('append').save('/location/Table A')&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But as I understand it, one cannot loop over a DF, and hence the data is converted to a collection with the .collect() function.&lt;/P&gt;&lt;P&gt;This data needs to be modified and written back, but how?&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 13:40:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9670#M4997</guid>
      <dc:creator>StevenW</dc:creator>
      <dc:date>2023-02-09T13:40:27Z</dc:date>
    </item>
    <item>
      <title>Re: Manipulating Data - using Notebooks</title>
      <link>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9671#M4998</link>
      <description>&lt;P&gt;OK.&lt;/P&gt;&lt;P&gt;Basically you should never loop over a dataframe, because that renders the distributed capacity of Spark useless.&lt;/P&gt;&lt;P&gt;What you should do is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Read the Delta table into a dataframe with spark.read.table(table).&lt;/LI&gt;&lt;LI&gt;Then do your transformations. Updating a column is done with the withColumn() statement. There are tons of other functions of course.&lt;/LI&gt;&lt;LI&gt;Finally, write the data. This can be either an append (as you did), a merge (upsert) or an overwrite (replace all).&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;There are some interesting tutorials on the Databricks website which give an introduction to Spark/Databricks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 13:44:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9671#M4998</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-02-09T13:44:04Z</dc:date>
    </item>
    <item>
      <title>Re: Manipulating Data - using Notebooks</title>
      <link>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9672#M4999</link>
      <description>&lt;P&gt;You can use withColumn() for the transformations and then write the data. The write can be an append, an overwrite, or a merge.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 19:06:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/manipulating-data-using-notebooks/m-p/9672#M4999</guid>
      <dc:creator>Manoj12421</dc:creator>
      <dc:date>2023-02-09T19:06:52Z</dc:date>
    </item>
  </channel>
</rss>

