<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Read/Write concurrency issue in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32257#M23521</link>
    <description>&lt;P&gt;Hi.&lt;/P&gt;&lt;P&gt;I assume this may be a concurrency issue (a read from Databricks and a write from another system happening at the same time).&lt;/P&gt;&lt;P&gt;From the start:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;B&gt;I read 12-16 CSV files (approximately 250 MB each) into a DataFrame.&lt;/B&gt; df = spark.read.option("header", "False").option("delimiter", ',').option('quote','"').option("multiLine","true").option("escape", "\"").option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'").schema(schema).csv(partition_list)&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Print the row count. &lt;/B&gt;print(df.count())&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Save the DataFrame as a Delta table. &lt;/B&gt;df.write.format('delta').mode('overwrite').option("overwriteSchema","true").saveAsTable(f"{db_name}.{table_name}")&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This process runs once a day.&lt;/P&gt;&lt;P&gt;Sometimes I receive this error: "An error occurred while calling oXXXX.saveAsTable" (the first two steps always complete correctly).&lt;/P&gt;&lt;P&gt;One important detail: while I read these files from ADLS, some of them can be overwritten by another system (according to the files' LastModified dates in storage).&lt;/P&gt;&lt;P&gt;I will attach the error output.&lt;/P&gt;&lt;P&gt;Do you know what can cause this error and how it can be solved?&lt;/P&gt;</description>
    <pubDate>Thu, 08 Sep 2022 15:16:22 GMT</pubDate>
    <dc:creator>APol</dc:creator>
    <dc:date>2022-09-08T15:16:22Z</dc:date>
    <item>
      <title>Read/Write concurrency issue</title>
      <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32257#M23521</link>
      <description>&lt;P&gt;Hi.&lt;/P&gt;&lt;P&gt;I assume this may be a concurrency issue (a read from Databricks and a write from another system happening at the same time).&lt;/P&gt;&lt;P&gt;From the start:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;B&gt;I read 12-16 CSV files (approximately 250 MB each) into a DataFrame.&lt;/B&gt; df = spark.read.option("header", "False").option("delimiter", ',').option('quote','"').option("multiLine","true").option("escape", "\"").option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'").schema(schema).csv(partition_list)&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Print the row count. &lt;/B&gt;print(df.count())&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Save the DataFrame as a Delta table. &lt;/B&gt;df.write.format('delta').mode('overwrite').option("overwriteSchema","true").saveAsTable(f"{db_name}.{table_name}")&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This process runs once a day.&lt;/P&gt;&lt;P&gt;Sometimes I receive this error: "An error occurred while calling oXXXX.saveAsTable" (the first two steps always complete correctly).&lt;/P&gt;&lt;P&gt;One important detail: while I read these files from ADLS, some of them can be overwritten by another system (according to the files' LastModified dates in storage).&lt;/P&gt;&lt;P&gt;I will attach the error output.&lt;/P&gt;&lt;P&gt;Do you know what can cause this error and how it can be solved?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Sep 2022 15:16:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32257#M23521</guid>
      <dc:creator>APol</dc:creator>
      <dc:date>2022-09-08T15:16:22Z</dc:date>
    </item>
    <item>
      <title>Re: Read/Write concurrency issue</title>
      <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32258#M23522</link>
      <description>&lt;P&gt;The error message shows:&lt;/P&gt;&lt;P&gt;Caused by: java.lang.IllegalStateException: Error reading from input&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:84)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:203)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.AbstractCharInputReader.nextChar(AbstractCharInputReader.java:280)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.DefaultCharAppender.appendUntil(DefaultCharAppender.java:292)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.ExpandingCharAppender.appendUntil(ExpandingCharAppender.java:177)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.csv.CsvParser.parseSingleDelimiterRecord(CsvParser.java:194)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:109)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:581)&lt;/P&gt;&lt;P&gt;	... 34 more&lt;/P&gt;&lt;P&gt;Caused by: java.io.IOException: java.io.IOException: Operation failed: "The condition specified using HTTP conditional header(s) is not met.", 412, GET, &lt;A href="https://ACCOUNT_NAME.dfs.core.windows.net/CONTAINER_NAME/INSTANCE_NAME/Tables/Custom/FOLDER_NAME/file_00002.csv?timeout=90" target="test_blank"&gt;https://ACCOUNT_NAME.dfs.core.windows.net/CONTAINER_NAME/INSTANCE_NAME/Tables/Custom/FOLDER_NAME/file_00002.csv?timeout=90&lt;/A&gt;, ConditionNotMet, "The condition specified using HTTP conditional header(s) is not met. RequestId:d4a3e6af-701f-003e-3590-b7b51a000000 Time:2022-08-24T08:03:57.9309350Z"&lt;/P&gt;&lt;P&gt;	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.ReadBufferWorker.run(ReadBufferWorker.java:77)&lt;/P&gt;&lt;P&gt;	... 1 more&lt;/P&gt;&lt;P&gt;This is an HTTP 412 (Precondition Failed) error: the conditional GET was rejected, which usually means the file changed in storage (its ETag no longer matched) while Spark was reading it. Could you open a support ticket and share the full error message? The Storage team should be able to get the logs and provide more information on why this is happening.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Oct 2022 17:06:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32258#M23522</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-10-31T17:06:08Z</dc:date>
    </item>
    <item>
      <title>Re: Read/Write concurrency issue</title>
      <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32259#M23523</link>
      <description>&lt;P&gt;Hi @Anastasiia Polianska,&lt;/P&gt;&lt;P&gt;I agree, it looks like a concurrency issue. It is very likely caused by a failed ETag precondition in the HTTP call to the Azure Storage API (https://azure.microsoft.com/de-de/blog/managing-concurrency-in-microsoft-azure-storage-2/).&lt;/P&gt;&lt;P&gt;The concurrency behavior can be configured according to the hadoop-azure library documentation; this is the library used to access ADLS (abfss):&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Server_Options" target="test_blank"&gt;https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Server_Options&lt;/A&gt;&lt;/P&gt;&lt;P&gt;These links should help you understand and solve the problem:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html" target="test_blank"&gt;https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Hadoop-ABFS"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1523i6A13D676ECE6972C/image-size/large?v=v2&amp;amp;px=999" role="button" title="Hadoop-ABFS" alt="Hadoop-ABFS" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;GitHub: &lt;A href="https://github.com/apache/hadoop/tree/rel/release-3.3.4/hadoop-tools/hadoop-azure" target="test_blank"&gt;https://github.com/apache/hadoop/tree/rel/release-3.3.4/hadoop-tools/hadoop-azure&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;Fernando Arribas.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jan 2023 22:02:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32259#M23523</guid>
      <dc:creator>FerArribas</dc:creator>
      <dc:date>2023-01-02T22:02:49Z</dc:date>
    </item>
  </channel>
</rss>

