<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Random failure in the loop in pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102408#M41097</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/89478"&gt;@MuthuLakshmi&lt;/a&gt;&amp;nbsp;thank you for your response. No, I don't use df.cache() anywhere in the code. Still, I tried uncaching the intermediate table that is read and updated within the loop, but it didn't help:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;spark.catalog.uncacheTable("IM_AWL_Products")&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I don't want to disable cluster-level caching because an entire team runs their code on the same cluster, so I'd prefer to solve this within the code.&lt;/P&gt;</description>
    <pubDate>Tue, 17 Dec 2024 15:42:37 GMT</pubDate>
    <dc:creator>bcsalay</dc:creator>
    <dc:date>2024-12-17T15:42:37Z</dc:date>
    <item>
      <title>Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101582#M40732</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm encountering an issue in PySpark code that calculates certain information monthly in a loop. The flow is roughly:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Read the input and create/read intermediate Parquet files,&lt;/LI&gt;&lt;LI&gt;Upsert records in the intermediate Parquet files with the monthly information from the input,&lt;/LI&gt;&lt;LI&gt;Append records to an output Parquet file,&lt;/LI&gt;&lt;LI&gt;Proceed to the next month in the loop.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The run fails at some random month about 75% of the time and succeeds the other 25%; I cannot figure out what causes the failures. When it succeeds, the run takes around 40 minutes and processes around 180 months.&lt;/P&gt;&lt;P&gt;Here's the error log:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Py4JJavaError: An error occurred while calling o71002.parquet.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7856.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7856.0 (TID 77757) (10.2.222.135 executor 12): org.apache.spark.SparkFileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, &lt;A href="https://xxx.dfs.core.windows.net/teamdata/Dev/Zones/Product/IM_AWL_Products/part-00001-tid-5889256371779762713-986e97fc-c887-4700-b842-6d4bd41c45d4-76340-1.c000.snappy.parquet?timeout=90" target="_blank" rel="noopener"&gt;https://xxx.dfs.core.windows.net/teamdata/Dev/Zones/Product/IM_AWL_Products/part-00001-tid-5889256371779762713-986e97fc-c887-4700-b842-6d4bd41c45d4-76340-1.c000.snappy.parquet?timeout=90&lt;/A&gt;, PathNotFound,&amp;nbsp; "The specified path does not exist. RequestId:ac6649bc-201f-000b-47f0-4a9296000000 Time:2024-12-10T10:47:46.7936012Z". [DEFAULT_FILE_NOT_FOUND] It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If disk cache is stale or the underlying files have been removed, you can invalidate disk cache manually by restarting the cluster. SQLSTATE: 42K03&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 10:53:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101582#M40732</guid>
      <dc:creator>bcsalay</dc:creator>
      <dc:date>2024-12-10T10:53:09Z</dc:date>
    </item>
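The 404/PathNotFound above is the classic symptom of overwriting a Parquet directory that a lazily evaluated read still references: the plan is built against the old part files, the overwrite deletes them, and a later action fetches paths that no longer exist. One common code-level workaround is to never rewrite a directory in place, but to write the new version to a staging path and swap it in only once the write has finished. A minimal pure-Python analogue of that stage-and-swap pattern (the helper name and file layout are illustrative, not the poster's code):

```python
import os
import shutil
import tempfile

def stage_and_swap(target_dir, new_contents):
    """Write new_contents to a staging dir, then swap it in for target_dir.

    The old version of target_dir disappears only at the final rename, so a
    reader is never left pointing at a half-deleted directory mid-write.
    new_contents maps file names to file text.
    """
    parent = os.path.dirname(os.path.abspath(target_dir))
    staging = tempfile.mkdtemp(prefix="staging-", dir=parent)
    try:
        # 1. Materialize the new version completely, off to the side.
        for name, text in new_contents.items():
            with open(os.path.join(staging, name), "w") as f:
                f.write(text)
        # 2. Move the old version aside (if any), then swap in the new one.
        old = target_dir + ".old"
        if os.path.exists(target_dir):
            os.rename(target_dir, old)
        else:
            old = None
        os.rename(staging, target_dir)
        # 3. Only now is it safe to drop the old files.
        if old:
            shutil.rmtree(old)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

In Spark terms the equivalent is writing each iteration's output to a fresh path (or materializing the read before overwriting) rather than overwriting a path the current DataFrame was read from.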
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101586#M40734</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135726"&gt;@bcsalay&lt;/a&gt;&amp;nbsp;This could possibly be a cache issue.&lt;BR /&gt;Are you calling dataframe.cache() anywhere in your code? If so, please follow it with unpersist().&lt;BR /&gt;&lt;BR /&gt;Also, can you try the following config at the cluster level:&lt;BR /&gt;&lt;SPAN&gt;spark.databricks.io.cache.enabled false&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 12:19:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101586#M40734</guid>
      <dc:creator>MuthuLakshmi</dc:creator>
      <dc:date>2024-12-10T12:19:30Z</dc:date>
    </item>
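The suggestion above targets two separate cache layers: DataFrame-level caching (cache()/unpersist()) and the Databricks disk cache, which keeps local copies of remote Parquet files and can serve stale data after the underlying files are replaced, which is consistent with the REFRESH TABLE hint in the error message. The staleness mechanism itself is easy to sketch in pure Python; this toy reader is an illustrative stand-in, not Spark internals:

```python
class CachedReader:
    """Toy analogue of a file cache that goes stale after an overwrite."""

    def __init__(self):
        self._cache = {}

    def read(self, path):
        # The first read populates the cache; later reads hit the cache even
        # if the file on disk has been replaced in the meantime.
        if path not in self._cache:
            with open(path) as f:
                self._cache[path] = f.read()
        return self._cache[path]

    def invalidate(self, path):
        # Analogue of REFRESH TABLE / spark.catalog.uncacheTable: drop the
        # cached copy so the next read goes back to storage.
        self._cache.pop(path, None)
```

Overwrite the file between two read() calls and the second call still returns the old contents until invalidate() is called, which is exactly the shape of the stale-cache hypothesis in this thread.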
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101653#M40762</link>
      <description>&lt;P&gt;Can you show some code to give us the gist of what it does? Are the parquet files accessed as a catalog table? Could it be that some other job makes changes to the input tables?&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 19:27:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101653#M40762</guid>
      <dc:creator>JacekLaskowski</dc:creator>
      <dc:date>2024-12-10T19:27:21Z</dc:date>
    </item>
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102408#M41097</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/89478"&gt;@MuthuLakshmi&lt;/a&gt;&amp;nbsp;thank you for your response. No, I don't use df.cache() anywhere in the code. Still, I tried uncaching the intermediate table that is read and updated within the loop, but it didn't help:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;spark.catalog.uncacheTable("IM_AWL_Products")&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I don't want to disable cluster-level caching because an entire team runs their code on the same cluster, so I'd prefer to solve this within the code.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Dec 2024 15:42:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102408#M41097</guid>
      <dc:creator>bcsalay</dc:creator>
      <dc:date>2024-12-17T15:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102410#M41098</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/7083"&gt;@JacekLaskowski&lt;/a&gt;&amp;nbsp;thank you for your response. No, it is not a catalog table, and it is not accessed or used by another job. I tried to describe above what the code does operationally; to give some more context: it is development code that processes historical data to implement a certain business logic, and the outputs are used to define a flag during statistical modelling. It is not deployed anywhere yet and is not production code, just triggered manually by me or my team to create outputs.&lt;/P&gt;&lt;P&gt;I needed to write it with a loop because that will be more convenient once the code runs in production: the business logic looks backward indefinitely in history, so a flag created in 2015 can impact next month's decision. The code therefore aggregates all historical information into a row per product and reads/updates it each month.&lt;/P&gt;&lt;P&gt;Hope this gives more clarity.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Dec 2024 16:02:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102410#M41098</guid>
      <dc:creator>bcsalay</dc:creator>
      <dc:date>2024-12-17T16:02:35Z</dc:date>
    </item>
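Given that the intermediate state is both read and rewritten inside the monthly loop, one code-only restructuring (which avoids touching the shared cluster's cache config) is to version the state: each iteration reads the previous iteration's path and writes a new one, so no path is ever overwritten while a reader may still reference it. A minimal pure-Python sketch of that versioned-loop shape (the month list, merge rule, and file names are hypothetical, not the poster's code):

```python
import json
import os

def run_monthly_loop(base_dir, monthly_rows):
    """Upsert each month's rows into versioned state files, a row per product.

    Iteration k reads state-(k-1).json and writes state-k.json, so a file
    that may still be referenced is never rewritten in place. Returns the
    path of the final state file.
    """
    state = {}
    version = 0
    path = os.path.join(base_dir, f"state-{version}.json")
    with open(path, "w") as f:
        json.dump(state, f)
    for month, rows in monthly_rows:
        with open(path) as f:
            state = json.load(f)   # read the previous version
        state.update(rows)         # upsert this month's rows per product
        version = version + 1
        path = os.path.join(base_dir, f"state-{version}.json")
        with open(path, "w") as f:
            json.dump(state, f)    # write a new version, never in place
    return path
```

In the PySpark version of this pattern the state path would alternate or increment between iterations, with old versions cleaned up only after the loop (or after the next version is safely written).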
  </channel>
</rss>

