<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks fails writing after writing ~30 files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35123#M25796</link>
    <description>&lt;P&gt;So it seems. Too bad corporate bureaucracy strictly forbids mounting anything.&lt;/P&gt;&lt;P&gt;Well, I suppose I'll have to bring this up with the policy board.&lt;/P&gt;</description>
    <pubDate>Tue, 16 Nov 2021 06:13:21 GMT</pubDate>
    <dc:creator>pine</dc:creator>
    <dc:date>2021-11-16T06:13:21Z</dc:date>
    <item>
      <title>Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35119#M25792</link>
      <description>&lt;P&gt;Good day, &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Copy of &lt;A href="https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails" alt="https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails" target="_blank"&gt;https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have 100 CSV files on an ADLS Gen1 store. I want to do some processing on them and save the results to the same store, in a different directory.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import time
import pandas as pd
&amp;nbsp;
def lookup_csv(CR_nro, hlo_lista=[], output=my_output_dir):
  base_lib = 'adl://azuredatalakestore.net/&amp;lt;address&amp;gt;'
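  # List the input files and the files already written to the output directory,
  # then drop anything already processed as well as the header file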
  all_files = pd.DataFrame(dbutils.fs.ls(base_lib + f'CR{CR_nro}'), columns = ['full', 'name', 'size'])
  done = pd.DataFrame(dbutils.fs.ls(output), columns = ['full', 'name', 'size'])
  all_files = all_files[~all_files['name'].isin(done['name'].str.replace('/', ''))]
  all_files = all_files[~all_files['name'].str.contains('header')]
&amp;nbsp;
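  # Read the column schema from the shared header file and keep the fixed VCF columns
  # plus any requested extra columns that actually exist in that schema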
  my_schema = spark.read.csv(base_lib + f'CR{CR_nro}/header.csv', sep='\t', header=True, maxColumns=1000000).schema
  tmp_lst = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT'] + [i for i in hlo_lista if i in my_schema.fieldNames()]
&amp;nbsp;
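  # Read each remaining file with the shared schema and write it out under the output directory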
  for my_file in all_files.iterrows(): 
    print(my_file[1]['name'], time.ctime(time.time()))
    data = spark.read.option('comment', '#').option('maxColumns', 1000000).schema(my_schema).csv(my_file[1]['full'], sep='\t').select(tmp_lst)
    data.write.csv(output + my_file[1]['name'], header=True, sep='\t')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This works for ~30 files, then fails with the error: &lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Py4JJavaError: An error occurred while calling o70690.csv. Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 154.0 failed 4 times, most recent failure: Lost task 0.3 in stage 154.0 (TID 1435, 10.11.64.46, executor 7): com.microsoft.azure.datalake.store.ADLException: Error creating file &amp;lt;my_output_dir&amp;gt;CR03_pt29.vcf.gz/_started_1438828951154916601 Operation CREATE failed with HTTP401 : null Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Apparently my credentials to access the ADLS drive fail? Or does the time allowed for a single command expire? Any idea why this would happen? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I queue the commands: &lt;/P&gt;&lt;P&gt;&amp;nbsp;lookup_csv(&amp;lt;inputs&amp;gt;)&lt;/P&gt;&lt;P&gt;&amp;lt;new cell&amp;gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;lookup_csv(&amp;lt;inputs&amp;gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;it works, fails, and works again in the next cell. But if I try to loop the commands: &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for i in range(10): &lt;/P&gt;&lt;P&gt;&amp;nbsp;lookup_csv(&amp;lt;inputs&amp;gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;it fails and keeps failing until the end of time. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Maybe I need to refresh the credentials every 10 files or something? &lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 12:39:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35119#M25792</guid>
      <dc:creator>pine</dc:creator>
      <dc:date>2021-11-15T12:39:32Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35120#M25793</link>
      <description>&lt;P&gt;Was anything actually created by the script in the directory &amp;lt;my_output_dir&amp;gt;?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The best option would be to permanently mount the ADLS storage and use an Azure app (service principal) for that.&lt;/P&gt;&lt;P&gt;In Azure, go to App registrations and register an app, for example "databricks_mount". Then add the IAM role "Storage Blob Data Contributor" for that app on your data lake storage account.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",&amp;nbsp;
          "fs.azure.account.oauth2.client.id": "&amp;lt;your-client-id&amp;gt;",
          "fs.azure.account.oauth2.client.secret": "&amp;lt;your-secret&amp;gt;",
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/&amp;lt;your-endpoint&amp;gt;/oauth2/token"}
&amp;nbsp;
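# The secret is hard-coded above only for illustration; in practice it is safer to read it
# from a Databricks secret scope (the scope and key names below are placeholders), e.g.
# configs["fs.azure.account.oauth2.client.secret"] = dbutils.secrets.get(scope="&amp;lt;scope&amp;gt;", key="&amp;lt;key&amp;gt;")
# The mount only needs to be created once; afterwards the data is available under /mnt/delta.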
dbutils.fs.mount(
 source = "abfss://delta@yourdatalake.dfs.core.windows.net/",
 mount_point = "/mnt/delta",
 extra_configs = configs)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 13:06:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35120#M25793</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-11-15T13:06:48Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35121#M25794</link>
      <description>&lt;P&gt;Yes. The ~29 files are created just fine.&lt;/P&gt;&lt;P&gt;On SO it's pointed out that AD passthrough credentials expire after 60 minutes for a single cell/command. That strikes me as a plausible explanation.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 13:22:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35121#M25794</guid>
      <dc:creator>pine</dc:creator>
      <dc:date>2021-11-15T13:22:31Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35122#M25795</link>
      <description>&lt;P&gt;Hi @pine,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It seems like the error is coming from your storage layer. Try following @Hubert Dudek's recommendation to create a mount point.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 18:58:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35122#M25795</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-11-15T18:58:34Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35123#M25796</link>
      <description>&lt;P&gt;So it seems. Too bad corporate bureaucracy strictly forbids mounting anything.&lt;/P&gt;&lt;P&gt;Well, I suppose I'll have to bring this up with the policy board.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Nov 2021 06:13:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35123#M25796</guid>
      <dc:creator>pine</dc:creator>
      <dc:date>2021-11-16T06:13:21Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks fails writing after writing ~30 files</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35124#M25797</link>
      <description>&lt;P&gt;You can access ADLS without a mount, but you still need to register an app and apply the config via Spark settings in your notebook. The access should then last for the whole session thanks to the Azure app:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"), 
spark.conf.set("fs.azure.account.oauth2.client.id", "&amp;lt;your-client-id&amp;gt;")
spark.conf.set("fs.azure.account.oauth2.client.secret", "&amp;lt;your-secret&amp;gt;")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/&amp;lt;your-endpoint&amp;gt;/oauth2/token")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This explanation is the best &lt;A href="https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly" alt="https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly" target="_blank"&gt;https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly&lt;/A&gt; although I remember that first times I had also problems with that. On that page is also explained how to register an app. Maybe it will be ok for your company policies.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Nov 2021 11:08:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-fails-writing-after-writing-30-files/m-p/35124#M25797</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-11-16T11:08:51Z</dc:date>
    </item>
  </channel>
</rss>

