<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Writing Spark data frame to ADLS is taking Huge time when Data Frame is of Text data. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31475#M22919</link>
    <description>&lt;P&gt;When a Spark DataFrame holds text data in a struct-typed schema, Spark takes far too long to write, save, or push the data to ADLS or a SQL database, or to download it as CSV.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2160i3320FD9DDACA7787/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 18 Jan 2022 11:07:25 GMT</pubDate>
    <dc:creator>Santosh09</dc:creator>
    <dc:date>2022-01-18T11:07:25Z</dc:date>
    <item>
      <title>Writing Spark data frame to ADLS is taking Huge time when Data Frame is of Text data.</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31475#M22919</link>
      <description>&lt;P&gt;When a Spark DataFrame holds text data in a struct-typed schema, Spark takes far too long to write, save, or push the data to ADLS or a SQL database, or to download it as CSV.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2160i3320FD9DDACA7787/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jan 2022 11:07:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31475#M22919</guid>
      <dc:creator>Santosh09</dc:creator>
      <dc:date>2022-01-18T11:07:25Z</dc:date>
    </item>
    <item>
      <title>Re: Writing Spark data frame to ADLS is taking Huge time when Data Frame is of Text data.</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31477#M22921</link>
      <description>&lt;P&gt;Can you share your code and provide more details, such as the size of the dataset and the cluster configuration? I also don't understand what "Text data" means here, as it appears to be a more complex data type.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jan 2022 20:51:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31477#M22921</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-01-18T20:51:51Z</dc:date>
    </item>
    <item>
      <title>Re: Writing Spark data frame to ADLS is taking Huge time when Data Frame is of Text data.</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31478#M22922</link>
      <description>&lt;P&gt;I’m using YakeKeywordExtraction from Spark NLP to extract keywords, and I’m facing an issue saving the result (a Spark DataFrame) to Delta tables on ADLS Gen1 from Azure Databricks. The DataFrame consists of strings in a struct schema, and I convert the struct schema to a flat format by exploding it and extracting the required data. The DataFrame has at most 20 rows and 7 columns. The computation in this notebook takes about 10 minutes, but once the final DataFrame is ready, saving the extracted data to any of the targets (ADLS, a database, toPandas, or CSV) takes close to 55 hours. I have tried to reduce this time with the optimization techniques suggested in various forums and communities, such as enabling execution.arrow.pyspark and using RDDs, but nothing worked.&lt;/P&gt;&lt;P&gt;Code to explode the results:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;scores = result \
    .selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples") \
    .selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Code to write to ADLS:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;scores.write.format("delta").save("path/to/adls/folder/result")&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 19 Jan 2022 05:54:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31478#M22922</guid>
      <dc:creator>Santosh09</dc:creator>
      <dc:date>2022-01-19T05:54:58Z</dc:date>
    </item>
    <item>
      <title>Re: Writing Spark data frame to ADLS is taking Huge time when Data Frame is of Text data.</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31479#M22923</link>
      <description>&lt;P&gt;It's still hard to figure out exactly what's wrong, but my guess is that the explode is creating a huge DataFrame that doesn't fit into memory. It largely depends on how many rows you have and the size of the struct. If you have 100 rows and each struct has 100 elements, you get 100 x 100 = 10,000 rows.&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jan 2022 00:21:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31479#M22923</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-20T00:21:47Z</dc:date>
    </item>
    <item>
      <title>Re: Writing Spark data frame to ADLS is taking Huge time when Data Frame is of Text data.</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31480#M22924</link>
      <description>&lt;P&gt;@shiva Santosh&lt;/P&gt;&lt;P&gt;Have you checked the row count of the DataFrame that you are trying to save to ADLS?&lt;/P&gt;&lt;P&gt;As @Joseph Kambourakis mentioned, the explode can turn each input row into many rows, so it is worth checking the DataFrame count and whether Spark runs out of memory (OOM) in the workspace.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Mar 2022 15:27:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-spark-data-frame-to-adls-is-taking-huge-time-when-data/m-p/31480#M22924</guid>
      <dc:creator>User16764241763</dc:creator>
      <dc:date>2022-03-14T15:27:45Z</dc:date>
    </item>
  </channel>
</rss>

