<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Managed File Events: Are reads from the file events cache independent per pipeline? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/managed-file-events-are-reads-from-the-file-events-cache/m-p/149429#M53095</link>
    <description>&lt;P class="p1"&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/216652"&gt;@raimundovidal&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;You’re safe to run both staging and production Lakeflow Spark Declarative Pipelines with cloudFiles.useManagedFileEvents = "true" against the same external location (same S3 path) and same Unity Catalog metastore, as long as each pipeline uses its own checkpoint location.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;A few key points:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;When a stream first runs with managed file events, Auto Loader does a full directory listing of the load path to get current with the file events cache and then stores the read position in that stream’s checkpoint. Subsequent runs read from the cache using the stored read position rather than S3 directly.&lt;/LI&gt;
&lt;LI class="p1"&gt;The checkpoint also records a unique stream ID, and the checkpointLocation is explicitly per stream.&lt;/LI&gt;
&lt;LI class="p1"&gt;External locations (and their file events setup) are metastore-level objects and are, by default, accessible from all workspaces attached to that metastore. So two workspaces attached to the same metastore behave just like two streams in one workspace from the file-events perspective.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;Putting that together...&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;1. Is the “stored read position” scoped per pipeline/stream or shared?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;&amp;nbsp;The stored read position is scoped per stream, not shared across all consumers of the external location. Each&amp;nbsp; Auto Loader stream (including each DLT pipeline) keeps its own position in the file events cache inside its own checkpoint. That position is obtained during the initial run and then reused on subsequent runs of that specific stream.&lt;/P&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;So in your scenario:&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;If the staging pipeline runs first and advances its own read position, that does not move the read position for the production pipeline.&lt;/LI&gt;
&lt;LI class="p1"&gt;When the production pipeline runs, it uses its own checkpoint and its own stored position in the cache, and will still see all new files (subject to the usual rules: e.g., it must run frequently enough that its stored position remains valid).&lt;/LI&gt;
&lt;LI class="p1"&gt;There is no single “global cursor” for the external location that all streams share.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;&lt;STRONG&gt;2&lt;/STRONG&gt;. &lt;STRONG&gt;Can multiple independent consumers read the same file events without interfering (Kafka‑like semantics)?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;Yes. That’s the intent.&lt;/P&gt;
&lt;P class="p1"&gt;A few implementation details that are relevant to your concerns:&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;For an external location with file events enabled, Databricks configures one underlying queue per external location (not per stream).&lt;/LI&gt;
&lt;LI class="p1"&gt;A control-plane service (CSMS) consumes from that queue and caches file metadata. Auto Loader streams do not read the queue directly; instead, they query this service using an internal listObjects‑style API, which returns objects plus a continuation token.&lt;/LI&gt;
&lt;LI class="p1"&gt;Each stream keeps its own continuation token / read position (per external location) in its checkpoint. As a result, multiple streams can safely read from the same cache independently, very similar to distinct Kafka consumer groups each tracking their own offsets.&lt;/LI&gt;
&lt;/UL&gt;
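The consumer-group analogy in the bullets above can be sketched as a toy model. FileEventsCache, list_objects, and Stream are illustrative names only; the real service is the control-plane CSMS, which this merely approximates. The key property it demonstrates: reading is non-destructive, and each stream holds its own continuation token.

```python
# Toy model, with hypothetical names, of a shared append-only event cache
# plus one private cursor per stream (like Kafka consumer-group offsets).

class FileEventsCache:
    """Shared, append-only cache of file events for one external location."""
    def __init__(self):
        self._events = []

    def append(self, path):
        self._events.append(path)

    def list_objects(self, continuation_token):
        """Return events after the token plus a new token; reading is
        non-destructive, so no consumer can 'steal' events from another."""
        return self._events[continuation_token:], len(self._events)


class Stream:
    """One Auto Loader stream; its checkpoint holds only its own token."""
    def __init__(self, cache):
        self._cache = cache
        self._token = 0  # stored read position, private to this stream

    def run(self):
        files, self._token = self._cache.list_objects(self._token)
        return files


cache = FileEventsCache()
cache.append("a.json")
cache.append("b.json")

staging = Stream(cache)
production = Stream(cache)

staging_first = staging.run()        # advances only staging's cursor
production_first = production.run()  # still sees both files

cache.append("c.json")
staging_second = staging.run()       # each stream picks up only new files
production_second = production.run()
```

Running this, both streams see a.json and b.json on their first run, and both independently pick up only c.json on the second, which is the "no shared cursor" behavior described above.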
&lt;P class="p1"&gt;This design is explicitly used and supported when:&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;Multiple Auto Loader streams ingest different subpaths under the same external location with managed file events.&lt;/LI&gt;
&lt;LI class="p1"&gt;Customers run many concurrent streams per external location; guidance focuses on performance tuning (for example, using volumes for each subfolder), not on avoiding missed files.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;So from a correctness perspective, having both a staging and production pipeline consume from the same external location and path is supported and will not cause one to steal events from the other.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Caveats to be aware of &lt;/STRONG&gt;(These don’t change the “no shared cursor” answer, but are worth calling out:)&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Run frequency:&lt;/STRONG&gt; If a given stream isn’t run for more than about 7 days, its stored read position in the cache can expire, forcing a full directory listing on the next run, but it should still not &lt;I&gt;miss&lt;/I&gt; files because of that.&lt;/LI&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Source cleanup options:&lt;/STRONG&gt; Options like cloudFiles.cleanSource that delete source files after processing are not recommended when multiple streams read the same source, because the faster consumer can delete files before the slower one sees them. This is orthogonal to managed file events, but relevant for your staging vs production pattern.&lt;/LI&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Normal streaming guarantees:&lt;/STRONG&gt; As with any streaming job, keeping distinct checkpointLocation values per pipeline is critical. You’re already doing that, which is the right setup.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Direct answers to your questions&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Stored read position scope:&lt;/STRONG&gt;&amp;nbsp;It is per stream/pipeline, stored in that stream’s checkpoint. It is not a single shared cursor for all consumers of the external location.&lt;/LI&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Multiple independent consumers:&amp;nbsp;&lt;/STRONG&gt;Yes. The managed file events cache is designed to support multiple independent streams (even across workspaces) reading the same file events without interfering with each other. The physical queue is shared per external location, but the logical read positions are per stream, much like Kafka consumer groups each having their own offsets.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;Your current understanding -&amp;nbsp;&lt;I&gt;“each pipeline maintains its own read position via its checkpoint, making this safe even across workspaces”&lt;/I&gt;&amp;nbsp;- is correct.&lt;/P&gt;
&lt;P class="p1"&gt;Hope this helps!&lt;/P&gt;
&lt;P class="p1"&gt;&lt;FONT size="2" color="#FF6600"&gt;&lt;STRONG&gt;&lt;I&gt;If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.&lt;/I&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;I&gt;&lt;/I&gt;&lt;/P&gt;
&lt;P class="p1"&gt;Regards,&lt;/P&gt;</description>
    <pubDate>Thu, 26 Feb 2026 22:01:46 GMT</pubDate>
    <dc:creator>Ashwin_DSA</dc:creator>
    <dc:date>2026-02-26T22:01:46Z</dc:date>
    <item>
      <title>Managed File Events: Are reads from the file events cache independent per pipeline?</title>
      <link>https://community.databricks.com/t5/data-engineering/managed-file-events-are-reads-from-the-file-events-cache/m-p/148693#M52948</link>
      <description>&lt;P&gt;We have two Databricks workspaces (staging and production) attached to the same Unity Catalog metastore. Both workspaces run DLT pipelines that use Auto Loader with cloudFiles.useManagedFileEvents = "true" to ingest from the same&lt;BR /&gt;external location (same S3 path).&lt;/P&gt;&lt;P&gt;Each pipeline has its own separate checkpoint location.&lt;/P&gt;&lt;P&gt;The documentation states that managed file events uses "a single file notification queue for all streams that process files from a given external location" and that streams discover new files by "reading directly from cache using&lt;BR /&gt;stored read position."&lt;/P&gt;&lt;P&gt;Our concern: If the staging pipeline runs first and reads new files from the file events cache, will the production pipeline still see those same files when it runs later? Or does one pipeline's read advance a shared cursor that&lt;BR /&gt;causes the other to miss files?&lt;/P&gt;&lt;P&gt;Specifically, we'd like to clarify:&lt;/P&gt;&lt;P&gt;1. Is the "stored read position" scoped per pipeline/stream (each pipeline independently tracks its own position in the cache) or is it shared across all consumers of the external location?&lt;BR /&gt;2. Is the file events cache designed to support multiple independent consumers reading the same file events without interference — similar to how Kafka consumer groups each maintain their own offset?&lt;/P&gt;&lt;P&gt;Our current understanding is that each pipeline maintains its own read position via its checkpoint, making this safe. But we couldn't find explicit documentation confirming this for cross-workspace, same-metastore scenarios.&lt;/P&gt;&lt;P&gt;Any clarification would be appreciated. Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 18 Feb 2026 13:00:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/managed-file-events-are-reads-from-the-file-events-cache/m-p/148693#M52948</guid>
      <dc:creator>raimundovidal</dc:creator>
      <dc:date>2026-02-18T13:00:10Z</dc:date>
    </item>
    <item>
      <title>Re: Managed File Events: Are reads from the file events cache independent per pipeline?</title>
      <link>https://community.databricks.com/t5/data-engineering/managed-file-events-are-reads-from-the-file-events-cache/m-p/149429#M53095</link>
      <description>&lt;P class="p1"&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/216652"&gt;@raimundovidal&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;You’re safe to run both staging and production Lakeflow Spark Declarative Pipelines with cloudFiles.useManagedFileEvents = "true" against the same external location (same S3 path) and same Unity Catalog metastore, as long as each pipeline uses its own checkpoint location.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;A few key points:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;When a stream first runs with managed file events, Auto Loader does a full directory listing of the load path to get current with the file events cache and then stores the read position in that stream’s checkpoint. Subsequent runs read from the cache using the stored read position rather than S3 directly.&lt;/LI&gt;
&lt;LI class="p1"&gt;The checkpoint also records a unique stream ID, and the checkpointLocation is explicitly per stream.&lt;/LI&gt;
&lt;LI class="p1"&gt;External locations (and their file events setup) are metastore-level objects and are, by default, accessible from all workspaces attached to that metastore. So two workspaces attached to the same metastore behave just like two streams in one workspace from the file-events perspective.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;Putting that together...&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;1. Is the “stored read position” scoped per pipeline/stream or shared?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;&amp;nbsp;The stored read position is scoped per stream, not shared across all consumers of the external location. Each&amp;nbsp; Auto Loader stream (including each DLT pipeline) keeps its own position in the file events cache inside its own checkpoint. That position is obtained during the initial run and then reused on subsequent runs of that specific stream.&lt;/P&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;So in your scenario:&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;If the staging pipeline runs first and advances its own read position, that does not move the read position for the production pipeline.&lt;/LI&gt;
&lt;LI class="p1"&gt;When the production pipeline runs, it uses its own checkpoint and its own stored position in the cache, and will still see all new files (subject to the usual rules: e.g., it must run frequently enough that its stored position remains valid).&lt;/LI&gt;
&lt;LI class="p1"&gt;There is no single “global cursor” for the external location that all streams share.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;&lt;STRONG&gt;2&lt;/STRONG&gt;. &lt;STRONG&gt;Can multiple independent consumers read the same file events without interfering (Kafka‑like semantics)?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1 lia-indent-padding-left-30px"&gt;Yes. That’s the intent.&lt;/P&gt;
&lt;P class="p1"&gt;A few implementation details that are relevant to your concerns:&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;For an external location with file events enabled, Databricks configures one underlying queue per external location (not per stream).&lt;/LI&gt;
&lt;LI class="p1"&gt;A control-plane service (CSMS) consumes from that queue and caches file metadata. Auto Loader streams do not read the queue directly; instead, they query this service using an internal listObjects‑style API, which returns objects plus a continuation token.&lt;/LI&gt;
&lt;LI class="p1"&gt;Each stream keeps its own continuation token / read position (per external location) in its checkpoint. As a result, multiple streams can safely read from the same cache independently, very similar to distinct Kafka consumer groups each tracking their own offsets.&lt;/LI&gt;
&lt;/UL&gt;
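The consumer-group analogy in the bullets above can be sketched as a toy model. FileEventsCache, list_objects, and Stream are illustrative names only; the real service is the control-plane CSMS, which this merely approximates. The key property it demonstrates: reading is non-destructive, and each stream holds its own continuation token.

```python
# Toy model, with hypothetical names, of a shared append-only event cache
# plus one private cursor per stream (like Kafka consumer-group offsets).

class FileEventsCache:
    """Shared, append-only cache of file events for one external location."""
    def __init__(self):
        self._events = []

    def append(self, path):
        self._events.append(path)

    def list_objects(self, continuation_token):
        """Return events after the token plus a new token; reading is
        non-destructive, so no consumer can 'steal' events from another."""
        return self._events[continuation_token:], len(self._events)


class Stream:
    """One Auto Loader stream; its checkpoint holds only its own token."""
    def __init__(self, cache):
        self._cache = cache
        self._token = 0  # stored read position, private to this stream

    def run(self):
        files, self._token = self._cache.list_objects(self._token)
        return files


cache = FileEventsCache()
cache.append("a.json")
cache.append("b.json")

staging = Stream(cache)
production = Stream(cache)

staging_first = staging.run()        # advances only staging's cursor
production_first = production.run()  # still sees both files

cache.append("c.json")
staging_second = staging.run()       # each stream picks up only new files
production_second = production.run()
```

Running this, both streams see a.json and b.json on their first run, and both independently pick up only c.json on the second, which is the "no shared cursor" behavior described above.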
&lt;P class="p1"&gt;This design is explicitly used and supported when:&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;Multiple Auto Loader streams ingest different subpaths under the same external location with managed file events.&lt;/LI&gt;
&lt;LI class="p1"&gt;Customers run many concurrent streams per external location; guidance focuses on performance tuning (for example, using volumes for each subfolder), not on avoiding missed files.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;So from a correctness perspective, having both a staging and production pipeline consume from the same external location and path is supported and will not cause one to steal events from the other.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Caveats to be aware of &lt;/STRONG&gt;(These don’t change the “no shared cursor” answer, but are worth calling out:)&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Run frequency:&lt;/STRONG&gt; If a given stream isn’t run for more than about 7 days, its stored read position in the cache can expire, forcing a full directory listing on the next run, but it should still not &lt;I&gt;miss&lt;/I&gt; files because of that.&lt;/LI&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Source cleanup options:&lt;/STRONG&gt; Options like cloudFiles.cleanSource that delete source files after processing are not recommended when multiple streams read the same source, because the faster consumer can delete files before the slower one sees them. This is orthogonal to managed file events, but relevant for your staging vs production pattern.&lt;/LI&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Normal streaming guarantees:&lt;/STRONG&gt; As with any streaming job, keeping distinct checkpointLocation values per pipeline is critical. You’re already doing that, which is the right setup.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Direct answers to your questions&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Stored read position scope:&lt;/STRONG&gt;&amp;nbsp;It is per stream/pipeline, stored in that stream’s checkpoint. It is not a single shared cursor for all consumers of the external location.&lt;/LI&gt;
&lt;LI class="p1"&gt;&lt;STRONG&gt;Multiple independent consumers:&amp;nbsp;&lt;/STRONG&gt;Yes. The managed file events cache is designed to support multiple independent streams (even across workspaces) reading the same file events without interfering with each other. The physical queue is shared per external location, but the logical read positions are per stream, much like Kafka consumer groups each having their own offsets.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;Your current understanding -&amp;nbsp;&lt;I&gt;“each pipeline maintains its own read position via its checkpoint, making this safe even across workspaces”&lt;/I&gt;&amp;nbsp;- is correct.&lt;/P&gt;
&lt;P class="p1"&gt;Hope this helps!&lt;/P&gt;
&lt;P class="p1"&gt;&lt;FONT size="2" color="#FF6600"&gt;&lt;STRONG&gt;&lt;I&gt;If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.&lt;/I&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;I&gt;&lt;/I&gt;&lt;/P&gt;
&lt;P class="p1"&gt;Regards,&lt;/P&gt;</description>
      <pubDate>Thu, 26 Feb 2026 22:01:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/managed-file-events-are-reads-from-the-file-events-cache/m-p/149429#M53095</guid>
      <dc:creator>Ashwin_DSA</dc:creator>
      <dc:date>2026-02-26T22:01:46Z</dc:date>
    </item>
  </channel>
</rss>

