<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Where are default temporary checkpoint locations created for streaming queries with display command? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/where-are-default-temporary-checkpoint-locations-created-for/m-p/113006#M44386</link>
    <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;I created a streaming query using Auto Loader to read data from S3 and used the display command to check whether the query was working. Initially,&amp;nbsp;&lt;SPAN&gt;cloudFiles.includeExistingFiles was set to True, but since we have data in Glacier that needs to be retrieved before it can be read, the command failed.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;We did not provide a custom checkpoint location with the display command, and&amp;nbsp;spark.sql.streaming.checkpointLocation was set to None.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;Below is the code snippet:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;source_s3_path = spark.conf.get("pipeline_dlt.source_s3_path")
schema_checkpoint_location = spark.conf.get("pipeline_dlt.schema_checkpoint_location")

reader_options = {
    'cloudFiles.format': 'parquet',
    'cloudFiles.backfillInterval': '1 day',
    'cloudFiles.schemaLocation': schema_checkpoint_location,
    'mergeSchema': True,
    'maxFilesPerTrigger': 1,
}

df = spark.readStream.format(
    'cloudFiles'
).options(
    **reader_options
).load(
    source_s3_path
)

display(df)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;We then set cloudFiles.includeExistingFiles to False and re-ran the query. Below is the updated code:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;source_s3_path = spark.conf.get("pipeline_dlt.source_s3_path")
schema_checkpoint_location = spark.conf.get("pipeline_dlt.schema_checkpoint_location")

reader_options = {
    'cloudFiles.format': 'parquet',
    'cloudFiles.includeExistingFiles': False,
    'cloudFiles.backfillInterval': '1 day',
    'cloudFiles.schemaLocation': schema_checkpoint_location,
    'mergeSchema': True,
    'maxFilesPerTrigger': 1,
}

df = spark.readStream.format(
    'cloudFiles'
).options(
    **reader_options
).load(
    source_s3_path
)

display(df)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Even then, the command failed. It keeps looking for older files that cannot be downloaded and fails. So we surmised that the new code snippet is picking up the temporary checkpoint that Databricks creates by default.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;So my question is, where does Databricks create the temporary checkpoint location by default, when none is provided? I would like to find this location and clean it up so I can run my code.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Wed, 19 Mar 2025 05:20:13 GMT</pubDate>
    <dc:creator>shavya</dc:creator>
    <dc:date>2025-03-19T05:20:13Z</dc:date>
    <item>
      <title>Where are default temporary checkpoint locations created for streaming queries with display command?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-are-default-temporary-checkpoint-locations-created-for/m-p/113006#M44386</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;I created a streaming query using Auto Loader to read data from S3 and used the display command to check whether the query was working. Initially,&amp;nbsp;&lt;SPAN&gt;cloudFiles.includeExistingFiles was set to True, but since we have data in Glacier that needs to be retrieved before it can be read, the command failed.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;We did not provide a custom checkpoint location with the display command, and&amp;nbsp;spark.sql.streaming.checkpointLocation was set to None.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;Below is the code snippet:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;source_s3_path = spark.conf.get("pipeline_dlt.source_s3_path")
schema_checkpoint_location = spark.conf.get("pipeline_dlt.schema_checkpoint_location")

reader_options = {
    'cloudFiles.format': 'parquet',
    'cloudFiles.backfillInterval': '1 day',
    'cloudFiles.schemaLocation': schema_checkpoint_location,
    'mergeSchema': True,
    'maxFilesPerTrigger': 1,
}

df = spark.readStream.format(
    'cloudFiles'
).options(
    **reader_options
).load(
    source_s3_path
)

display(df)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;We then set cloudFiles.includeExistingFiles to False and re-ran the query. Below is the updated code:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;source_s3_path = spark.conf.get("pipeline_dlt.source_s3_path")
schema_checkpoint_location = spark.conf.get("pipeline_dlt.schema_checkpoint_location")

reader_options = {
    'cloudFiles.format': 'parquet',
    'cloudFiles.includeExistingFiles': False,
    'cloudFiles.backfillInterval': '1 day',
    'cloudFiles.schemaLocation': schema_checkpoint_location,
    'mergeSchema': True,
    'maxFilesPerTrigger': 1,
}

df = spark.readStream.format(
    'cloudFiles'
).options(
    **reader_options
).load(
    source_s3_path
)

display(df)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Even then, the command failed. It keeps looking for older files that cannot be downloaded and fails. So we surmised that the new code snippet is picking up the temporary checkpoint that Databricks creates by default.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;So my question is, where does Databricks create the temporary checkpoint location by default, when none is provided? I would like to find this location and clean it up so I can run my code.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 19 Mar 2025 05:20:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-are-default-temporary-checkpoint-locations-created-for/m-p/113006#M44386</guid>
      <dc:creator>shavya</dc:creator>
      <dc:date>2025-03-19T05:20:13Z</dc:date>
    </item>
    <item>
      <title>Re: Where are default temporary checkpoint locations created for streaming queries with display comm</title>
      <link>https://community.databricks.com/t5/data-engineering/where-are-default-temporary-checkpoint-locations-created-for/m-p/121514#M46470</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/154141"&gt;@shavya&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Good day!!&lt;/P&gt;
&lt;P&gt;When you do &lt;STRONG&gt;not&lt;/STRONG&gt; specify a &lt;CODE&gt;checkpointLocation&lt;/CODE&gt; for a streaming query in &lt;STRONG&gt;Databricks&lt;/STRONG&gt;, Spark uses a temporary system directory such as:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="contain-inline-size rounded-2xl border-[0.5px] border-token-border-medium relative bg-token-sidebar-surface-primary"&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;&lt;CODE class="whitespace-pre!"&gt;&lt;SPAN&gt;dbfs:/local_disk0/tmp/temporary-&amp;lt;random_uuid&amp;gt;&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;To have the temporary checkpoint removed, set the following configuration to true:&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;&lt;CODE&gt;spark.sql.streaming.forceDeleteTempCheckpointLocation true&lt;/CODE&gt;&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;When &lt;STRONG&gt;set to &lt;CODE&gt;true&lt;/CODE&gt;&lt;/STRONG&gt;:&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Spark &lt;STRONG&gt;automatically deletes&lt;/STRONG&gt; the &lt;STRONG&gt;temporary checkpoint directory&lt;/STRONG&gt; after the streaming query is stopped or completed.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;This is useful to avoid cluttering &lt;CODE&gt;/tmp&lt;/CODE&gt; or the Spark local directories with leftover checkpoint files.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
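&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;As a rough sketch of applying the setting above from a notebook, assuming a Databricks notebook where spark, df, and display already exist: the helper names enable_temp_checkpoint_cleanup and fresh_checkpoint_path and the dbfs:/tmp/checkpoints base path are hypothetical and not part of any Databricks API. The second helper illustrates an alternative, which is to give display an explicit checkpoint location that you control, so nothing depends on the temporary directory. The checkpointLocation argument to display is an assumption here; please verify it against the documentation for your runtime.&lt;/DIV&gt;

```python
import uuid


def enable_temp_checkpoint_cleanup(spark):
    """Ask Spark to force-delete temporary checkpoint directories.

    `spark` is expected to be the SparkSession a Databricks notebook provides.
    """
    spark.conf.set(
        "spark.sql.streaming.forceDeleteTempCheckpointLocation", "true"
    )


def fresh_checkpoint_path(base_dir):
    """Return a new, unique checkpoint path under base_dir.

    The "display-" naming is a hypothetical layout chosen for this sketch.
    """
    return base_dir.rstrip("/") + "/display-" + uuid.uuid4().hex


# In a notebook (not executed here; the base path is a placeholder):
#   enable_temp_checkpoint_cleanup(spark)
#   display(df, checkpointLocation=fresh_checkpoint_path("dbfs:/tmp/checkpoints"))
```

&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;Using a new path on each run means a re-run starts without state from an earlier failed attempt, and a path you chose yourself is easy to find and delete afterwards.&lt;/DIV&gt;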
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;&lt;STRONG&gt;Reference doc:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://spark.apache.org/docs/latest/configuration.html#:~:text=spark.sql.streaming.forceDeleteTempCheckpointLocation" target="_blank"&gt;https://spark.apache.org/docs/latest/configuration.html#:~:text=spark.sql.streaming.forceDeleteTempCheckpointLocation&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="overflow-y-auto p-4" dir="ltr"&gt;Kindly let me know if you have any questions on this.&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Wed, 11 Jun 2025 17:26:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-are-default-temporary-checkpoint-locations-created-for/m-p/121514#M46470</guid>
      <dc:creator>Saritha_S</dc:creator>
      <dc:date>2025-06-11T17:26:26Z</dc:date>
    </item>
  </channel>
</rss>

