<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks fails writing after writing ~30 files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35123#M25796</link>
    <description>&lt;P&gt;So it seems. Too bad corporate bureaucracy strictly forbids mounting anything.&lt;/P&gt;&lt;P&gt;Well, I suppose I'll have to bring this up with the policy board.&lt;/P&gt;</description>
    <pubDate>Tue, 16 Nov 2021 06:13:21 GMT</pubDate>
    <dc:creator>pine</dc:creator>
    <dc:date>2021-11-16T06:13:21Z</dc:date>
    <item>
      <title>Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35119#M25792</link>
      <description>&lt;P&gt;Good day, &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Copy of &lt;A href="https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails" alt="https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails" target="_blank"&gt;https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have 100 CSV files on an ADLS Gen1 store. I want to do some processing on them and save the results to the same store, in a different directory.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import time
import pandas as pd
&amp;nbsp;
def lookup_csv(CR_nro, hlo_lista=[], output=my_output_dir):
  base_lib = 'adl://azuredatalakestore.net/&amp;lt;address&amp;gt;'
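  # List the input files and the files already written to the output directory,
  # then drop anything already processed as well as the header file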
  all_files = pd.DataFrame(dbutils.fs.ls(base_lib + f'CR{CR_nro}'), columns = ['full', 'name', 'size'])
  done = pd.DataFrame(dbutils.fs.ls(output), columns = ['full', 'name', 'size'])
  all_files = all_files[~all_files['name'].isin(done['name'].str.replace('/', ''))]
  all_files = all_files[~all_files['name'].str.contains('header')]
&amp;nbsp;
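  # Read the column schema from the shared header file and keep the fixed VCF columns
  # plus any requested extra columns that actually exist in that schema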
  my_schema = spark.read.csv(base_lib + f'CR{CR_nro}/header.csv', sep='\t', header=True, maxColumns=1000000).schema
  tmp_lst = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT'] + [i for i in hlo_lista if i in my_schema.fieldNames()]
&amp;nbsp;
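  # Read each remaining file with the shared schema and write it out under the output directory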
  for my_file in all_files.iterrows(): 
    print(my_file[1]['name'], time.ctime(time.time()))
    data = spark.read.option('comment', '#').option('maxColumns', 1000000).schema(my_schema).csv(my_file[1]['full'], sep='\t').select(tmp_lst)
    data.write.csv(output + my_file[1]['name'], header=True, sep='\t')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This works for ~30 files, then fails with the error: &lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Py4JJavaError: An error occurred while calling o70690.csv. Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 154.0 failed 4 times, most recent failure: Lost task 0.3 in stage 154.0 (TID 1435, 10.11.64.46, executor 7): com.microsoft.azure.datalake.store.ADLException: Error creating file &amp;lt;my_output_dir&amp;gt;CR03_pt29.vcf.gz/_started_1438828951154916601 Operation CREATE failed with HTTP401 : null Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Apparently my credentials to access the ADLS drive fail? Or does the time allowed for a single command expire? Any idea why this would happen? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I queue the commands: &lt;/P&gt;&lt;P&gt;&amp;nbsp;lookup_csv(&amp;lt;inputs&amp;gt;)&lt;/P&gt;&lt;P&gt;&amp;lt;new cell&amp;gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;lookup_csv(&amp;lt;inputs&amp;gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;it works, fails, and works again in the next cell. But if I try to loop the commands: &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for i in range(10): &lt;/P&gt;&lt;P&gt;&amp;nbsp;lookup_csv(&amp;lt;inputs&amp;gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;it fails and keeps failing until the end of time. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Maybe I need to refresh the credentials every 10 files or something? &lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 12:39:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35119#M25792</guid>
      <dc:creator>pine</dc:creator>
      <dc:date>2021-11-15T12:39:32Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35120#M25793</link>
      <description>&lt;P&gt;Was anything actually created by the script in the directory &amp;lt;my_output_dir&amp;gt;?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The best option would be to permanently mount the ADLS storage and use an Azure app (service principal) for that.&lt;/P&gt;&lt;P&gt;In Azure, go to App registrations and register an app, for example "databricks_mount". Then add the IAM role "Storage Blob Data Contributor" for that app on your data lake storage account.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",&amp;nbsp;
          "fs.azure.account.oauth2.client.id": "&amp;lt;your-client-id&amp;gt;",
          "fs.azure.account.oauth2.client.secret": "&amp;lt;your-secret&amp;gt;",
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/&amp;lt;your-endpoint&amp;gt;/oauth2/token"}
&amp;nbsp;
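# The secret is hard-coded above only for illustration; in practice it is safer to read it
# from a Databricks secret scope (the scope and key names below are placeholders), e.g.
# configs["fs.azure.account.oauth2.client.secret"] = dbutils.secrets.get(scope="&amp;lt;scope&amp;gt;", key="&amp;lt;key&amp;gt;")
# The mount only needs to be created once; afterwards the data is available under /mnt/delta.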
dbutils.fs.mount(
 source = "abfss://delta@yourdatalake.dfs.core.windows.net/",
 mount_point = "/mnt/delta",
 extra_configs = configs)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 13:06:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35120#M25793</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-11-15T13:06:48Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35121#M25794</link>
      <description>&lt;P&gt;Yes. The ~29 files are created just fine.&lt;/P&gt;&lt;P&gt;On SO it's pointed out that AD passthrough credentials expire after 60 minutes for a single cell/command. That strikes me as a plausible explanation.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 13:22:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35121#M25794</guid>
      <dc:creator>pine</dc:creator>
      <dc:date>2021-11-15T13:22:31Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35122#M25795</link>
      <description>&lt;P&gt;Hi @pine,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It seems like the error is coming from your storage layer. Try following @Hubert Dudek's recommendation to create a mount point.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 18:58:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35122#M25795</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-11-15T18:58:34Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35123#M25796</link>
      <description>&lt;P&gt;So it seems. Too bad corporate bureaucracy strictly forbids mounting anything.&lt;/P&gt;&lt;P&gt;Well, I suppose I'll have to bring this up with the policy board.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Nov 2021 06:13:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35123#M25796</guid>
      <dc:creator>pine</dc:creator>
      <dc:date>2021-11-16T06:13:21Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35124#M25797</link>
      <description>&lt;P&gt;You can access ADLS without a mount, but you still need to register an app and apply the config via Spark settings in your notebook. The access should then last for the whole session thanks to the Azure app:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"), 
spark.conf.set("fs.azure.account.oauth2.client.id", "&amp;lt;your-client-id&amp;gt;")
spark.conf.set("fs.azure.account.oauth2.client.secret", "&amp;lt;your-secret&amp;gt;")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/&amp;lt;your-endpoint&amp;gt;/oauth2/token")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This explanation is the best &lt;A href="https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly" alt="https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly" target="_blank"&gt;https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly&lt;/A&gt; although I remember that first times I had also problems with that. On that page is also explained how to register an app. Maybe it will be ok for your company policies.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Nov 2021 11:08:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35124#M25797</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-11-16T11:08:51Z</dc:date>
    </item>
  </channel>
</rss>

