<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Fail to write large dataframe in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55422#M2007</link>
    <description>&lt;P&gt;The error is most likely caused by the row-by-row processing (due to the use of spaCy).&lt;BR /&gt;That way you bypass the parallel processing capabilities of Spark.&lt;BR /&gt;A solution would be to not use a loop.&amp;nbsp; But in your case, using spaCy, that does not seem possible (I looked online, but everybody seems to use a UDF and an iterator, so that won't solve your issue).&lt;BR /&gt;Is it an option to use another NLP library that can run on PySpark, like SparkNLP?&lt;/P&gt;</description>
    <pubDate>Mon, 18 Dec 2023 10:41:02 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2023-12-18T10:41:02Z</dc:date>
    <item>
      <title>Fail to write large dataframe</title>
      <link>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55318#M1999</link>
      <description>&lt;P&gt;Hi all, we have an issue while trying to write a fairly large DataFrame, close to 35 million records. We tried writing it both as Parquet and as a table, and neither works. Writing a small chunk (10k records) does work. Basically, we have some text to which we apply a spaCy model, and afterwards we create a new DataFrame.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def process_partition(iterator):
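    # Import inside the function so the modules are loaded on the executors.
    import spacy
    from pyspark.sql import Row
    from scispacy.umls_linking import UmlsEntityLinker
    # Load the model with the listed components disabled.
    core_sci = spacy.load("en_core_sci_sm", disable=['tok2vec','tagger','parser', 'senter', 'attribute_ruler', 'lemmatizer'])
    # component_name and process_canonical_names are defined elsewhere (not shown in this post).
    if component_name not in core_sci.pipe_names:
        print("add pipe")
        core_sci.add_pipe(component_name, config={"resolve_abbreviations": True, "linker_name": "umls"})
    for row in iterator:
        result = process_canonical_names(row.abstract, core_sci)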
        yield Row(pmid=row.pmid, canonical_names=result["cn"], concept_ids=result["concept_id"])&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But when we run it on all the data, we get the following errors:&lt;/P&gt;&lt;P&gt;df.write.mode("overwrite").parquet(mount_dir)&lt;/P&gt;&lt;P&gt;&lt;FONT size="2"&gt;&lt;SPAN class=""&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 77) (10.80.246.21 executor 7): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to dbfs:/mnt/&lt;/SPAN&gt;...&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;And when writing as a table: df.coalesce(20).write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(table_name)&lt;/P&gt;&lt;P&gt;&lt;FONT size="2"&gt;&lt;SPAN class=""&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 2.0 failed 4 times, most recent failure: Lost task 11.3 in stage 2.0 (TID 49) (10.80.233.136 executor 8): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to dbfs:/user/hive/warehouse/&lt;/SPAN&gt;...&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;We are using Databricks &lt;SPAN class=""&gt;13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12) with node type r5d.xlarge and up to 10 workers. We also have a metastore and Unity Catalog enabled.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Cosmin_2-1702640369404.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/5607i444C6D3FDDE06064/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Cosmin_2-1702640369404.png" alt="Cosmin_2-1702640369404.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;We are wondering whether we are doing something wrong, using the wrong resources, or whether it is something else. Any suggestions would help.
Thank you!&lt;/P&gt;</description>
      <pubDate>Fri, 15 Dec 2023 11:40:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55318#M1999</guid>
      <dc:creator>Cosmin</dc:creator>
      <dc:date>2023-12-15T11:40:57Z</dc:date>
    </item>
    <item>
      <title>Re: Fail to write large dataframe</title>
      <link>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55422#M2007</link>
      <description>&lt;P&gt;The error is most likely caused by the row-by-row processing (due to the use of spaCy).&lt;BR /&gt;That way you bypass the parallel processing capabilities of Spark.&lt;BR /&gt;A solution would be to not use a loop.&amp;nbsp; But in your case, using spaCy, that does not seem possible (I looked online, but everybody seems to use a UDF and an iterator, so that won't solve your issue).&lt;BR /&gt;Is it an option to use another NLP library that can run on PySpark, like SparkNLP?&lt;/P&gt;</description>
      <pubDate>Mon, 18 Dec 2023 10:41:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55422#M2007</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-12-18T10:41:02Z</dc:date>
    </item>
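If spaCy has to stay, one thing that might be worth trying is to keep the per-partition iterator but hand the texts to the model in batches instead of one row at a time. Below is a minimal sketch of that idea, not a confirmed fix for the TASK_WRITE_FAILED error. The names `batched`, `process_partition_batched` and `annotate_batch` are made up for illustration, and rows are assumed to be (pmid, abstract) pairs.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def batched(rows: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` rows from an iterator."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_partition_batched(
    rows: Iterable[Tuple[str, Any]],
    annotate_batch: Callable[[List[str]], List[Any]],
    batch_size: int = 256,
) -> Iterator[Tuple[str, Any]]:
    """Annotate (pmid, abstract) pairs one batch at a time.

    Only `batch_size` rows are held in memory at once, so the whole
    partition is never materialized.
    """
    for chunk in batched(rows, batch_size):
        # Replace missing abstracts with an empty string so the model
        # always receives text.
        texts = [abstract or "" for _, abstract in chunk]
        results = annotate_batch(texts)
        for (pmid, _), result in zip(chunk, results):
            yield pmid, result
```

In a Spark job, `annotate_batch` could wrap spaCy's `Language.pipe`, for example `lambda texts: list(core_sci.pipe(texts))`, with the model loaded once per partition as in the original `process_partition`. Whether this removes the write failure is something you would have to test on your data.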
    <item>
      <title>Re: Fail to write large dataframe</title>
      <link>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55426#M2008</link>
      <description>&lt;P&gt;Unfortunately, we have to use the spaCy models. Another approach we are considering is to deploy the models (the methods that we use) as an API and make HTTP requests from Spark. Could this approach work?&lt;/P&gt;</description>
      <pubDate>Mon, 18 Dec 2023 11:07:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55426#M2008</guid>
      <dc:creator>Cosmin</dc:creator>
      <dc:date>2023-12-18T11:07:24Z</dc:date>
    </item>
    <item>
      <title>Re: Fail to write large dataframe</title>
      <link>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55432#M2009</link>
      <description>&lt;P&gt;That could work, but you will have to create a UDF.&lt;BR /&gt;Check &lt;A href="https://stackoverflow.com/questions/67204599/parallel-rest-api-request-using-sparkdatabricks" target="_self"&gt;this SO topic for more info&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Dec 2023 12:28:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/fail-to-write-large-dataframe/m-p/55432#M2009</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-12-18T12:28:14Z</dc:date>
    </item>
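To make the API idea a bit more concrete, here is a rough sketch of a per-partition function that sends texts to a service in batches, so each HTTP call carries many rows instead of one. Everything specific here is an assumption: the endpoint URL, the `{"texts": [...]}` request body, and the `{"results": [...]}` response shape are invented for illustration and do not come from this thread.

```python
import json
import urllib.request
from typing import Any, Callable, Iterable, Iterator, List, Tuple

# Hypothetical endpoint, not taken from the thread.
API_URL = "http://nlp-service.example/annotate"


def post_json(url: str, payload: dict, timeout: float = 30.0) -> Any:
    """POST a JSON payload and decode the JSON reply (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def _flush(batch: List[Tuple[str, str]], send: Callable, url: str) -> Iterator[Tuple[str, Any]]:
    """Send one batch and pair each reply with its pmid (assumed reply shape)."""
    reply = send(url, {"texts": [text for _, text in batch]})
    for (pmid, _), result in zip(batch, reply["results"]):
        yield pmid, result


def annotate_partition(
    rows: Iterable[Tuple[str, str]],
    send: Callable = post_json,
    url: str = API_URL,
    batch_size: int = 100,
) -> Iterator[Tuple[str, Any]]:
    """Stream (pmid, text) rows to the service, one request per batch."""
    batch: List[Tuple[str, str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from _flush(batch, send, url)
            batch = []
    if batch:
        # Send the final, possibly smaller, batch.
        yield from _flush(batch, send, url)
```

On the Spark side, something like `df.select("pmid", "abstract").rdd.mapPartitions(annotate_partition)` should feed it, since Spark `Row` objects unpack like tuples. Retries, timeouts and rate limiting on the service are left out of this sketch and would need attention at 35 million records.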
  </channel>
</rss>

