<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19408#M12993</link>
    <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my case there is no common path; the file path column contains different paths within a storage container.&lt;/P&gt;&lt;P&gt;Is there any other way?&lt;/P&gt;</description>
    <pubDate>Thu, 01 Dec 2022 13:44:04 GMT</pubDate>
    <dc:creator>Ancil</dc:creator>
    <dc:date>2022-12-01T13:44:04Z</dc:date>
    <item>
      <title>Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19402#M12987</link>
      <description>&lt;P&gt;Scenario: I have a dataframe with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write a file to the file path, with the data from the result column.&lt;/P&gt;&lt;P&gt;What is the easiest and most time-effective way to do this?&lt;/P&gt;&lt;P&gt;I tried collect() and it is taking a long time.&lt;/P&gt;&lt;P&gt;I also tried UDF methods, but I am getting the error below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1070i0FA54B9743651792/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 12:59:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19402#M12987</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T12:59:35Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19403#M12988</link>
      <description>&lt;P&gt;Hi @Ancil P A&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is the data in your result column a JSON value, or how is it structured?&lt;/P&gt;&lt;P&gt;From your question, I understood that you have two columns in your df: one column is the file path and the other is the data.&lt;/P&gt;&lt;P&gt;Also, please post the UDF you are trying to build, so that if your approach is workable, it can be fixed.&lt;/P&gt;&lt;P&gt;Cheers.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:06:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19403#M12988</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-12-01T13:06:50Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19404#M12989</link>
      <description>&lt;P&gt;Is it an option to write it as a single parquet file, but partitioned?&lt;/P&gt;&lt;P&gt;That way the physical paths of the partitions are different, but they all belong to the same parquet file.&lt;/P&gt;&lt;P&gt;The key is to avoid loops.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:06:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19404#M12989</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-01T13:06:56Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19405#M12990</link>
      <description>&lt;P&gt;Hi @Uma Maheswara Rao Desula&amp;nbsp;&lt;/P&gt;&lt;P&gt;The result column holds JSON data, but the column type is string.&lt;/P&gt;&lt;P&gt;Please find below a screenshot of the UDF.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1061iFCA8251CC8D5BE35/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;Once I call the line below, I get the error shown:&lt;/P&gt;&lt;P&gt;input_data_df = input_data_df.withColumn("is_file_created", write_files_udf(col("file_path"), col("data_after_grammar_correction")))&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1083i6BCAC130E1DAC180/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:25:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19405#M12990</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T13:25:47Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19406#M12991</link>
      <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;My use case is to write one text file per row of the dataframe.&lt;/P&gt;&lt;P&gt;For example, if I have 100 rows, then I need to write 100 files to the specified locations.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:28:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19406#M12991</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T13:28:43Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19407#M12992</link>
      <description>&lt;P&gt;Yes, exactly; that is what partitioning does.&lt;/P&gt;&lt;P&gt;All you need is a common path where you will write all those files, and you partition on the part that is not common.&lt;/P&gt;&lt;P&gt;E.g.&lt;/P&gt;&lt;P&gt;/path/to/file1|&amp;lt;data&amp;gt;&lt;/P&gt;&lt;P&gt;/path/to/file2|&amp;lt;data&amp;gt;&lt;/P&gt;&lt;P&gt;The common part (/path/to) you use as the target location.&lt;/P&gt;&lt;P&gt;The changing part (file1, file2) you use as the partition column.&lt;/P&gt;&lt;P&gt;So it becomes:&lt;/P&gt;&lt;P&gt;df.write.partitionBy(&amp;lt;fileCol&amp;gt;).parquet(&amp;lt;commonPath&amp;gt;)&lt;/P&gt;&lt;P&gt;Spark will write a file (or even more than one) per partition.&lt;/P&gt;&lt;P&gt;If you want only a single file per partition, you also have to repartition by fileCol.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:37:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19407#M12992</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-01T13:37:50Z</dc:date>
    </item>
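    <!-- The partitioning advice above hinges on splitting each full path into a shared base plus a per-row partition value. A minimal pure-Python sketch of that split (the helper name is hypothetical; in Spark the resulting base and partition column would then feed df.write.partitionBy(partCol).parquet(base)):

```python
import os.path

def split_for_partitioning(paths):
    """Split full file paths into a common base path plus a per-row
    remainder, the shape df.write.partitionBy(...) expects."""
    base = os.path.commonpath(paths)             # shared prefix, e.g. /path/to
    # the remainder of each path becomes the partition-column value
    parts = [os.path.relpath(p, base) for p in paths]
    return base, parts

base, parts = split_for_partitioning(["/path/to/file1", "/path/to/file2"])
# base == "/path/to", parts == ["file1", "file2"]
```

With a partition column built this way, Spark writes one directory per distinct value under the common base, which is the closest built-in analogue to "one output location per row" without an explicit loop. -->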
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19408#M12993</link>
      <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my case there is no common path; the file path column contains different paths within a storage container.&lt;/P&gt;&lt;P&gt;Is there any other way?&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:44:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19408#M12993</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T13:44:04Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19409#M12994</link>
      <description>&lt;P&gt;AFAIK partitioning is the only way to write to multiple locations in parallel.&lt;/P&gt;&lt;P&gt;This &lt;A href="https://stackoverflow.com/questions/73409103/can-i-write-multiple-dataframes-in-parallel-in-spark" alt="https://stackoverflow.com/questions/73409103/can-i-write-multiple-dataframes-in-parallel-in-spark" target="_blank"&gt;SO thread&lt;/A&gt; perhaps has a way.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:51:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19409#M12994</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-01T13:51:37Z</dc:date>
    </item>
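    <!-- When the paths really share no common base, one approach sometimes used (not from this thread; a sketch under stated assumptions) is to collect the rows to the driver and fan the writes out over a thread pool. All names here are illustrative, and the rows are faked under a temp directory instead of coming from df.collect():

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_one(row):
    """Write a single (path, data) pair; returns the path on success."""
    path, data = row
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(data)
    return path

# In Spark this list would come from df.collect() or df.toLocalIterator();
# here we fabricate a few rows under a temp directory for illustration.
root = tempfile.mkdtemp()
rows = [(os.path.join(root, f"dir{i}", f"file{i}.txt"), f"data {i}")
        for i in range(3)]

with ThreadPoolExecutor(max_workers=8) as pool:
    written = list(pool.map(write_one, rows))
```

This keeps the I/O concurrent even though the loop runs on the driver; it is only sensible when the row count is modest, since everything is collected to one machine first. -->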
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19410#M12995</link>
      <description>&lt;P&gt;Thanks a lot, let me check.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 14:02:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19410#M12995</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T14:02:37Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19411#M12996</link>
      <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;Even after partitioning, I am getting the error below. Do you have any idea about this error?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1075i25ED6F06A4B2A6C7/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 16:36:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19411#M12996</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T16:36:19Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19412#M12997</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I agree with Werners: try to avoid looping over a PySpark DataFrame.&lt;/P&gt;&lt;P&gt;If your dataframe is small (as you said, only about 1000 rows), you may consider using pandas.&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2022 03:28:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19412#M12997</guid>
      <dc:creator>NhatHoang</dc:creator>
      <dc:date>2022-12-02T03:28:07Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19413#M12998</link>
      <description>&lt;P&gt;Hi @Nhat Hoang&amp;nbsp;&lt;/P&gt;&lt;P&gt;The size may vary; it may be up to 1 lakh (100,000) rows. I will check with pandas.&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2022 04:34:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19413#M12998</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-02T04:34:42Z</dc:date>
    </item>
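    <!-- The pandas route suggested above might look like the sketch below. The toPandas() call and the column names (taken from the UDF snippet earlier in the thread) are assumptions; here a stand-in frame is built under a temp directory so the example is self-contained:

```python
import os
import tempfile
import pandas as pd

# pdf = input_data_df.toPandas()  # on Databricks; below is a stand-in frame
root = tempfile.mkdtemp()
pdf = pd.DataFrame({
    "file_path": [os.path.join(root, f"f{i}.txt") for i in range(2)],
    "data_after_grammar_correction": ["hello", "world"],
})

# itertuples is an efficient way to loop over a pandas DataFrame row by row
for row in pdf.itertuples(index=False):
    with open(row.file_path, "w") as f:
        f.write(row.data_after_grammar_correction)
```

A plain driver-side loop like this sidesteps the UDF error entirely, at the cost of losing Spark's parallelism, so it only fits the small end of the 1000-to-100,000-row range discussed above. -->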
  </channel>
</rss>

