<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Autoloader works on compute cluster, but does not work within a task in workflows in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4172#M971</link>
    <description>&lt;P&gt;@Vidula Khanna I see you have responded to previous Auto Loader questions. Can you help me?&lt;/P&gt;</description>
    <pubDate>Fri, 19 May 2023 10:42:29 GMT</pubDate>
    <dc:creator>96286</dc:creator>
    <dc:date>2023-05-19T10:42:29Z</dc:date>
    <item>
      <title>Autoloader works on compute cluster, but does not work within a task in workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4171#M970</link>
      <description>&lt;P&gt;I feel like I am going crazy with this. I have tested a data pipeline on my standard compute cluster: it loads new files as a batch from a Google Cloud Storage bucket, and Auto Loader works exactly as expected from my notebook on that cluster. I then used the same notebook as the first task in a workflow running on a new job cluster. To test the pipeline as a workflow, I first removed all checkpoint files and directories before starting the run with this command:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dbutils.fs.rm(checkpoint_path, True)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For some reason the code works perfectly when testing, but in Workflows I get "streaming stopped" and no data from Auto Loader. Here is my Auto Loader configuration:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;from pyspark.sql.functions import input_file_name&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;file_path = "gs://raw_zone_twitter"&lt;/P&gt;&lt;P&gt;table_name = "twitter_data_autoloader"&lt;/P&gt;&lt;P&gt;checkpoint_path = "/tmp/_checkpoint/twitter_checkpoint"&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;spark.sql(f"DROP TABLE IF EXISTS {table_name}")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;query = (spark.readStream&lt;/P&gt;&lt;P&gt;  .format("cloudFiles")&lt;/P&gt;&lt;P&gt;  .option("cloudFiles.format", "text")&lt;/P&gt;&lt;P&gt;  .option("cloudFiles.schemaLocation", checkpoint_path)&lt;/P&gt;&lt;P&gt;  .load(file_path)&lt;/P&gt;&lt;P&gt;  .withColumn("filePath", input_file_name())&lt;/P&gt;&lt;P&gt;  .writeStream&lt;/P&gt;&lt;P&gt;  .option("checkpointLocation", checkpoint_path)&lt;/P&gt;&lt;P&gt;  .trigger(once=True)&lt;/P&gt;&lt;P&gt;  .toTable(table_name))&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When running this as a workflow I see that the checkpoint directory is created, but there is no data inside.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The code is exactly the same between testing on my compute cluster and the task in my workflow (same notebook), so I really have no idea why Auto Loader is not working within the workflow...&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2023 07:49:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4171#M970</guid>
      <dc:creator>96286</dc:creator>
      <dc:date>2023-05-19T07:49:05Z</dc:date>
    </item>
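For reference, the pipeline from the post above can be written as one self-contained cell. This is a sketch using the post's own names (file_path, table_name, checkpoint_path); the cloudFiles source is Databricks-only, so it runs only on a Databricks cluster where spark is provided by the runtime. The final awaitTermination() call is standard Structured Streaming practice for scheduled jobs, not necessarily the fix the thread eventually found.

```
# Sketch only: requires a Databricks cluster; "cloudFiles" is not
# available in plain open-source PySpark. Paths come from the post.
from pyspark.sql.functions import input_file_name

file_path = "gs://raw_zone_twitter"
table_name = "twitter_data_autoloader"
checkpoint_path = "/tmp/_checkpoint/twitter_checkpoint"

spark.sql(f"DROP TABLE IF EXISTS {table_name}")

query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(file_path)
    .withColumn("filePath", input_file_name())
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # batch-style trigger; trigger(once=True) also works
    .toTable(table_name))

# Streaming writes run asynchronously. In an interactive notebook the
# cluster stays up regardless, but a job task can finish (and tear the
# stream down) as soon as the last cell returns, so block explicitly:
query.awaitTermination()
```

The trigger(availableNow=True) form is the newer replacement for trigger(once=True); either gives the "process what is there, then stop" behavior the post relies on.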
    <item>
      <title>Re: Autoloader works on compute cluster, but does not work within a task in workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4172#M971</link>
      <description>&lt;P&gt;@Vidula Khanna I see you have responded to previous Auto Loader questions. Can you help me?&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2023 10:42:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4172#M971</guid>
      <dc:creator>96286</dc:creator>
      <dc:date>2023-05-19T10:42:29Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader works on compute cluster, but does not work within a task in workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4173#M972</link>
      <description>&lt;P&gt;Still no progress on this. To confirm: my cluster configurations are identical between the notebook running on my general-purpose compute cluster and my job cluster, and I am using the same GCP service account in both cases. On my compute cluster Auto Loader works exactly as expected. Here is the code being used for Auto Loader (this works on the compute cluster).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-05-22 at 17.43.40"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/189i3388B05BD596D793/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-05-22 at 17.43.40" alt="Screenshot 2023-05-22 at 17.43.40" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;However, when I run this exact same code (from the same notebook) as a job, Auto Loader stops the stream (it seems at .writeStream) and I simply see "stream stopped" with no real clue as to why, as seen below.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-05-22 at 17.45.53"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/180i76586DBC043B4E12/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-05-22 at 17.45.53" alt="Screenshot 2023-05-22 at 17.45.53" /&gt;&lt;/span&gt;If I go to cloud storage I see that my checkpoint location was created, but the commits folder is empty, meaning Auto Loader was never able to commit the stream.
&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-05-22 at 17.50.55"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/191i1157187D2909DB79/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-05-22 at 17.50.55" alt="Screenshot 2023-05-22 at 17.50.55" /&gt;&lt;/span&gt;If I run the notebook outside of Workflows, the commits folder gets populated, and if I remove the dbutils.fs.rm(checkpoint_path, True) command, Auto Loader correctly writes nothing new until new files arrive in the source bucket.&lt;/P&gt;</description>
      <pubDate>Mon, 22 May 2023 15:54:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4173#M972</guid>
      <dc:creator>96286</dc:creator>
      <dc:date>2023-05-22T15:54:19Z</dc:date>
    </item>
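The empty-commits observation above can be checked mechanically: for each micro-batch N, Structured Streaming writes offsets/N under the checkpoint when the batch starts and commits/N only when it finishes, so an offsets entry without a matching commits entry means the batch began but never completed, which is exactly the symptom described. Below is a small hypothetical helper (plain Python, not a Databricks API) that reads a file listing the same way the post reads the folder by hand.

```python
# Hypothetical helper: given file paths under a Structured Streaming
# checkpoint, report how far the stream got.

def checkpoint_progress(paths):
    """Return (last_offset_batch, last_committed_batch).

    A micro-batch N first writes offsets/N, then commits/N on success,
    so an offsets entry with no matching commits entry means the batch
    started but never finished.
    """
    def last_batch(folder):
        ids = []
        for p in paths:
            parts = p.strip("/").split("/")
            if len(parts) >= 2 and parts[-2] == folder and parts[-1].isdigit():
                ids.append(int(parts[-1]))
        return max(ids) if ids else None

    return last_batch("offsets"), last_batch("commits")

# Example mirroring the post: offsets/0 exists but commits/ is empty,
# so batch 0 started and never committed.
listing = ["_checkpoint/metadata", "_checkpoint/offsets/0"]
print(checkpoint_progress(listing))  # (0, None)
```

On Databricks the listing would come from something like dbutils.fs.ls over the checkpoint path; the helper itself is deliberately storage-agnostic.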
    <item>
      <title>Re: Autoloader works on compute cluster, but does not work within a task in workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4174#M973</link>
      <description>&lt;P&gt;Just to be clear, here are the configurations of my job cluster.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-05-22 at 18.16.53"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/197i0A2357C0E948C8B2/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-05-22 at 18.16.53" alt="Screenshot 2023-05-22 at 18.16.53" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 May 2023 16:17:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4174#M973</guid>
      <dc:creator>96286</dc:creator>
      <dc:date>2023-05-22T16:17:36Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader works on compute cluster, but does not work within a task in workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4175#M974</link>
      <description>&lt;P&gt;I found the issue. I describe the solution in the following SO post: &lt;A href="https://stackoverflow.com/questions/76287095/databricks-autoloader-works-on-compute-cluster-but-does-not-work-within-a-task/76313794#76313794" target="_blank"&gt;https://stackoverflow.com/questions/76287095/databricks-autoloader-works-on-compute-cluster-but-does-not-work-within-a-task/76313794#76313794&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 May 2023 14:09:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-works-on-compute-cluster-but-does-not-work-within-a/m-p/4175#M974</guid>
      <dc:creator>96286</dc:creator>
      <dc:date>2023-05-23T14:09:22Z</dc:date>
    </item>
  </channel>
</rss>

