<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Read/Write concurrency issue in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32257#M23521</link>
    <description>&lt;P&gt;Hi.&lt;/P&gt;&lt;P&gt;I assume this may be a concurrency issue (a read from Databricks and a write from another system happening at the same time).&lt;/P&gt;&lt;P&gt;From the start:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;B&gt;I read 12-16 CSV files (approximately 250 MB each) into a DataFrame.&lt;/B&gt; df = spark.read.option("header", "False").option("delimiter", ',').option('quote','"').option("multiLine","true").option("escape", "\"").option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'").schema(schema).csv(partition_list)&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Print the row count. &lt;/B&gt;print(df.count())&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Save the DataFrame as a Delta table. &lt;/B&gt;df.write.format('delta').mode('overwrite').option("overwriteSchema","true").saveAsTable(f"{db_name}.{table_name}")&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This process runs once a day.&lt;/P&gt;&lt;P&gt;Sometimes I receive this error: "An error occurred while calling oXXXX.saveAsTable" (the first two steps always complete correctly).&lt;/P&gt;&lt;P&gt;One important detail: while I read these files from ADLS, some of them can be overwritten by another system (according to the files' LastModified dates in storage).&lt;/P&gt;&lt;P&gt;I will attach the error output.&lt;/P&gt;&lt;P&gt;Do you know what can cause this error and how it can be solved?&lt;/P&gt;</description>
    <pubDate>Thu, 08 Sep 2022 15:16:22 GMT</pubDate>
    <dc:creator>APol</dc:creator>
    <dc:date>2022-09-08T15:16:22Z</dc:date>
    <item>
      <title>Read/Write concurrency issue</title>
      <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32257#M23521</link>
      <description>&lt;P&gt;Hi.&lt;/P&gt;&lt;P&gt;I assume this may be a concurrency issue (a read from Databricks and a write from another system happening at the same time).&lt;/P&gt;&lt;P&gt;From the start:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;B&gt;I read 12-16 CSV files (approximately 250 MB each) into a DataFrame.&lt;/B&gt; df = spark.read.option("header", "False").option("delimiter", ',').option('quote','"').option("multiLine","true").option("escape", "\"").option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'").schema(schema).csv(partition_list)&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Print the row count. &lt;/B&gt;print(df.count())&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Save the DataFrame as a Delta table. &lt;/B&gt;df.write.format('delta').mode('overwrite').option("overwriteSchema","true").saveAsTable(f"{db_name}.{table_name}")&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This process runs once a day.&lt;/P&gt;&lt;P&gt;Sometimes I receive this error: "An error occurred while calling oXXXX.saveAsTable" (the first two steps always complete correctly).&lt;/P&gt;&lt;P&gt;One important detail: while I read these files from ADLS, some of them can be overwritten by another system (according to the files' LastModified dates in storage).&lt;/P&gt;&lt;P&gt;I will attach the error output.&lt;/P&gt;&lt;P&gt;Do you know what can cause this error and how it can be solved?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Sep 2022 15:16:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32257#M23521</guid>
      <dc:creator>APol</dc:creator>
      <dc:date>2022-09-08T15:16:22Z</dc:date>
    </item>
    <item>
      <title>Re: Read/Write concurrency issue</title>
      <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32258#M23522</link>
      <description>&lt;P&gt;The error message shows:&lt;/P&gt;&lt;P&gt;Caused by: java.lang.IllegalStateException: Error reading from input&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:84)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:203)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.AbstractCharInputReader.nextChar(AbstractCharInputReader.java:280)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.DefaultCharAppender.appendUntil(DefaultCharAppender.java:292)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.input.ExpandingCharAppender.appendUntil(ExpandingCharAppender.java:177)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.csv.CsvParser.parseSingleDelimiterRecord(CsvParser.java:194)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:109)&lt;/P&gt;&lt;P&gt;	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:581)&lt;/P&gt;&lt;P&gt;	... 34 more&lt;/P&gt;&lt;P&gt;Caused by: java.io.IOException: java.io.IOException: Operation failed: "The condition specified using HTTP conditional header(s) is not met.", 412, GET, &lt;A href="https://ACCOUNT_NAME.dfs.core.windows.net/CONTAINER_NAME/INSTANCE_NAME/Tables/Custom/FOLDER_NAME/file_00002.csv?timeout=90" target="test_blank"&gt;https://ACCOUNT_NAME.dfs.core.windows.net/CONTAINER_NAME/INSTANCE_NAME/Tables/Custom/FOLDER_NAME/file_00002.csv?timeout=90&lt;/A&gt;, ConditionNotMet, "The condition specified using HTTP conditional header(s) is not met. RequestId:d4a3e6af-701f-003e-3590-b7b51a000000 Time:2022-08-24T08:03:57.9309350Z"&lt;/P&gt;&lt;P&gt;	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.ReadBufferWorker.run(ReadBufferWorker.java:77)&lt;/P&gt;&lt;P&gt;	... 1 more&lt;/P&gt;&lt;P&gt;This is an HTTP 412 (Precondition Failed) error: the conditional GET was rejected, which usually means the file changed in storage (its ETag no longer matched) while Spark was reading it. Could you open a support ticket and share the full error message? The Storage team should be able to get the logs and provide more information on why this is happening.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Oct 2022 17:06:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32258#M23522</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-10-31T17:06:08Z</dc:date>
    </item>
    <item>
      <title>Re: Read/Write concurrency issue</title>
      <link>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32259#M23523</link>
      <description>&lt;P&gt;Hi @Anastasiia Polianska,&lt;/P&gt;&lt;P&gt;I agree, it looks like a concurrency issue. It is very likely caused by a failed ETag precondition in the HTTP call to the Azure Storage API (https://azure.microsoft.com/de-de/blog/managing-concurrency-in-microsoft-azure-storage-2/).&lt;/P&gt;&lt;P&gt;The concurrency behavior can be configured according to the hadoop-azure library documentation; this is the library used to access ADLS (abfss):&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Server_Options" target="test_blank"&gt;https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Server_Options&lt;/A&gt;&lt;/P&gt;&lt;P&gt;These links should help you understand and solve the problem:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html" target="test_blank"&gt;https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Hadoop-ABFS"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1523i6A13D676ECE6972C/image-size/large?v=v2&amp;amp;px=999" role="button" title="Hadoop-ABFS" alt="Hadoop-ABFS" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;GitHub: &lt;A href="https://github.com/apache/hadoop/tree/rel/release-3.3.4/hadoop-tools/hadoop-azure" target="test_blank"&gt;https://github.com/apache/hadoop/tree/rel/release-3.3.4/hadoop-tools/hadoop-azure&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;Fernando Arribas.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jan 2023 22:02:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-write-concurrency-issue/m-p/32259#M23523</guid>
      <dc:creator>FerArribas</dc:creator>
      <dc:date>2023-01-02T22:02:49Z</dc:date>
    </item>
  </channel>
</rss>

