
Databricks fails writing after writing ~30 files

pine
New Contributor III

Good day,

Copy of https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails

I've got 100 files of CSV data on an ADLS Gen1 store. I want to do some processing on them and save the results to the same drive, in a different directory.

import time
import pandas as pd

def lookup_csv(CR_nro, hlo_lista=[], output=my_output_dir):   # my_output_dir is defined elsewhere in the notebook
    base_lib = 'adl://azuredatalakestore.net/<address>'
    # List the input files and drop the ones that already exist in the output directory
    all_files = pd.DataFrame(dbutils.fs.ls(base_lib + f'CR{CR_nro}'), columns=['full', 'name', 'size'])
    done = pd.DataFrame(dbutils.fs.ls(output), columns=['full', 'name', 'size'])
    all_files = all_files[~all_files['name'].isin(done['name'].str.replace('/', ''))]
    all_files = all_files[~all_files['name'].str.contains('header')]

    # Take the schema from the shared header file and keep only the requested columns
    my_scema = spark.read.csv(base_lib + f'CR{CR_nro}/header.csv', sep='\t', header=True, maxColumns=1000000).schema
    tmp_lst = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT'] + [i for i in hlo_lista if i in my_scema.fieldNames()]

    # Read each remaining file with the shared schema and write it to the output directory
    for my_file in all_files.iterrows():
        print(my_file[1]['name'], time.ctime(time.time()))
        data = spark.read.option('comment', '#').option('maxColumns', 1000000).schema(my_scema).csv(my_file[1]['full'], sep='\t').select(tmp_lst)
        data.write.csv(output + my_file[1]['name'], header=True, sep='\t')

This works for ~30 files and then fails with the error:

Py4JJavaError: An error occurred while calling o70690.csv. Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 154.0 failed 4 times, most recent failure: Lost task 0.3 in stage 154.0 (TID 1435, 10.11.64.46, executor 7): com.microsoft.azure.datalake.store.ADLException: Error creating file <my_output_dir>CR03_pt29.vcf.gz/_started_1438828951154916601 Operation CREATE failed with HTTP401 : null Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]

Apparently my credentials to access the ADLS drive fail? Or does the time limit for a single command expire? Any idea why this would happen?

If I queue the commands in separate cells:

 lookup_csv(<inputs>)

<new cell>

 lookup_csv(<inputs>)

it works, fails, and then works again in the next cell. But if I try to loop the commands:

for i in range(10):
    lookup_csv(<inputs>)

it fails and keeps failing till the end of time.

Maybe I need to refresh the credentials every 10 files or something?


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Was anything actually created by the script in the directory <my_output_dir>?

The best option would be to permanently mount the ADLS storage and use an Azure app registration for that.

In Azure, go to App registrations and register an app named, for example, "databricks_mount". Then add the IAM role "Storage Blob Data Contributor" for that app on your data lake storage account.

# Service principal (app registration) credentials used for OAuth access to ADLS Gen2
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<your-client-id>",
           "fs.azure.account.oauth2.client.secret": "<your-secret>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<your-endpoint>/oauth2/token"}

# Mount the container once; it then stays available under /mnt/delta for every cluster
dbutils.fs.mount(
    source = "abfss://delta@yourdatalake.dfs.core.windows.net/",
    mount_point = "/mnt/delta",
    extra_configs = configs)
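
Once the mount exists, the loop from the question can read and write through the mount path instead of the adl:// URI. A minimal sketch, assuming the /mnt/delta mount point from the snippet above; the CR03 and processed subdirectories are hypothetical:

# Hypothetical paths under the /mnt/delta mount point
files = dbutils.fs.ls('/mnt/delta/CR03')                                       # list input files via the mount
df = spark.read.csv('/mnt/delta/CR03/CR03_pt01.vcf.gz', sep='\t')              # read one file
df.write.csv('/mnt/delta/processed/CR03_pt01.vcf.gz', header=True, sep='\t')   # write results back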

pine
New Contributor III

Yes. The ~29 files are created just fine.

On SO it's pointed out that AD passthrough credentials expire after 60 minutes for a single command cell. That strikes me as a plausible explanation.

jose_gonzalez
Databricks Employee

Hi @pine,

It seems like the error is coming from your storage layer. Try to follow @Hubert Dudek's recommendation to create a mount point.

pine
New Contributor III

So it seems. Too bad corporate bureaucracy strictly forbids mounting anything.

Well, I suppose I've got to bring this up to the policy board.

Hubert-Dudek
Esteemed Contributor III (Accepted Solution)

You can access ADLS without a mount, but you still need to register an app and apply the config via Spark settings in your notebook. Thanks to the Azure app, the access should hold for the whole session:

spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"), 
spark.conf.set("fs.azure.account.oauth2.client.id", "<your-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<your-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<your-endpoint>/oauth2/token")

This explanation is the best: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#acc... although I remember that I also had problems with it the first few times. That page also explains how to register an app. Maybe it will be OK with your company policies.
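
With those spark.conf settings applied for the session, reads and writes can go straight to an abfss:// URI with no mount involved. A minimal sketch, assuming a hypothetical storage account and container:

# Hypothetical account/container names; adjust to your ADLS Gen2 account
base = "abfss://delta@yourdatalake.dfs.core.windows.net/"
df = spark.read.csv(base + "CR03/CR03_pt01.vcf.gz", sep="\t")              # read directly over abfss
df.write.csv(base + "processed/CR03_pt01.vcf.gz", header=True, sep="\t")   # write results back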
