Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Akash_Varuna
by New Contributor II
  • 355 Views
  • 1 replies
  • 0 kudos

Streaming Table data leakage to historical permanent table

Data Leakage in Historical Table from Streaming Table. Platform: Azure Databricks + Azure Event Hubs. Streaming framework: Spark Structured Streaming. Storage: Delta Lake. Pipeline: Event Hubs → stream_messages (live 24hr rolling window) → message...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Akash_Varuna, The count discrepancies you are seeing between stream_messages and messages are almost certainly caused by the 24-hour rolling window on your stream_messages table expiring data while the load_messages job is paused during your main...
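The failure mode the reply describes can be shown with a minimal sketch (plain Python, not the poster's actual pipeline): rows older than the 24-hour window are dropped from the live table, so a downstream copy job that stays paused past the window permanently misses them.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=24)  # the rolling window on stream_messages

def expire(rows, now):
    """Keep only rows still inside the rolling retention window."""
    return [r for r in rows if now - r["ts"] <= RETENTION]

def copy_new(live_rows, historical, last_copied_ts):
    """Append rows newer than the last checkpoint to the historical table."""
    historical.extend(r for r in live_rows if r["ts"] > last_copied_ts)

now = datetime(2025, 1, 2, 12)
rows = [{"ts": now - timedelta(hours=h)} for h in (30, 20, 1)]  # 3 rows arrived
live = expire(rows, now)  # the 30h-old row has already expired from the window
historical = []
# the copy job was paused for 48h, so its checkpoint predates the window
copy_new(live, historical, last_copied_ts=now - timedelta(hours=48))
# only 2 of the 3 rows reach the historical table: a count discrepancy
```

The row that expired while the copy job was paused is unrecoverable from the live table, which is why the counts diverge rather than eventually reconciling.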

senkii
by Databricks Partner
  • 2062 Views
  • 2 replies
  • 1 kudos

Resolved! How to stop task retry

I would like to stop automatic retries, but the max retries configuration does not seem to work. Could you please tell me how to disable retries? I would also like to understand why the task retries automatically. I did not set any scheduler. I created...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @senkii, There are two separate retry mechanisms in Databricks that can cause tasks to run again, and distinguishing between them is important for your situation. 1. TASK-LEVEL RETRIES (Workflows setting) This is the "Retries" setting you configur...

1 More Replies
developer3535
by New Contributor II
  • 746 Views
  • 2 replies
  • 0 kudos

Resolved! Zerobus Kafka-compatible API

Hi Team, I went through a recording where it was mentioned that a Kafka‑compatible API is planned for a Beta release in Q1. Do we have any rough timeline on when this feature might be available? We already have Kafka producer topics, and we would like ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @developer3535, I see @stbjelcevic already confirmed the Q1 2026 timeline for the Kafka-compatible API Beta. I wanted to add some context on what you can do in the meantime and where to look for updates. CURRENT ZEROBUS INGEST INTERFACES While wai...

1 More Replies
samuelperezh
by New Contributor
  • 1348 Views
  • 2 replies
  • 2 kudos

Architecture Advice: DLT Strategy for Daily Snapshots to SCD2 with "Grace Period" Deletes

Hi everyone, I’m looking for architectural advice on building a Silver layer in DLT. I am dealing with inventory data and need to handle historical tracking, "sold" logic based on disappearance, and storage cost optimization. Here's how the situation l...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @samuelperezh, Building on @aleksandra_ch's reply, I wanted to add some additional detail around each of your three questions, especially around the grace period implementation and the backfill strategy. 1. GRACE PERIOD PATTERN As aleksandra_ch no...
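The grace-period idea the reply refers to can be sketched in a few lines of plain Python (the 3-day grace period and the status names are hypothetical, not from the thread): an item missing from the daily snapshot is only marked sold once its absence exceeds the grace period, so a brief data glitch does not close its SCD2 record.

```python
from datetime import date, timedelta

GRACE_DAYS = 3  # hypothetical grace period

def classify(last_seen, snapshot_date, grace_days=GRACE_DAYS):
    """Treat an item absent for more than `grace_days` as sold;
    a shorter absence is assumed to be a snapshot glitch."""
    absence = (snapshot_date - last_seen).days
    if absence == 0:
        return "active"
    return "sold" if absence > grace_days else "pending"

today = date(2025, 1, 10)
status_recent = classify(today - timedelta(days=2), today)  # inside grace period
status_gone = classify(today - timedelta(days=5), today)    # past grace period
```

In a real pipeline this classification would feed the SCD2 end-dating logic; the point of the sketch is only the absence-counting rule.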

1 More Replies
aranjan99
by Contributor
  • 918 Views
  • 2 replies
  • 1 kudos

How does job cluster autoscaling work?

Can you share the metrics Databricks uses during job cluster autoscaling? Is Databricks looking at queued tasks, slot utilization, etc., or just at CPU utilization? The autoscaling document https://docs.databricks.com/aws/en/compute/configure?u...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @aranjan99, The autoscaling behavior on job clusters depends on your workspace pricing tier. Here is a breakdown of the metrics and mechanics involved. WHAT METRICS DRIVE SCALING DECISIONS Job cluster autoscaling uses Spark scheduler signals, not ...
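As a simplified mental model of the reply's point (this is an illustration, not Databricks' actual algorithm), a backlog-driven autoscaler reacts to pending Spark tasks and idle slots rather than raw CPU utilization:

```python
def scale_decision(pending_tasks, idle_slots, current, lo, hi):
    """Toy backlog-based autoscaler: scale up while tasks are queued,
    scale down when capacity sits idle, stay within [lo, hi] workers."""
    if pending_tasks > 0 and current < hi:
        return min(hi, current + 1)   # scheduler backlog -> add a worker
    if pending_tasks == 0 and idle_slots > 0 and current > lo:
        return max(lo, current - 1)   # idle capacity -> remove a worker
    return current

step_up = scale_decision(pending_tasks=10, idle_slots=0, current=2, lo=1, hi=8)
step_down = scale_decision(pending_tasks=0, idle_slots=4, current=4, lo=1, hi=8)
```

A CPU-based policy could keep a cluster large while executors merely spin, which is why scheduler signals are the more useful trigger.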

1 More Replies
Ashley1
by Contributor
  • 2847 Views
  • 4 replies
  • 1 kudos

Resolved! Turn off AI assistance in notebooks

Hi, has anyone found a way to turn off the AI assistant in notebooks? I would be happy to keep code introspection, but I find I'm more often hitting escape than accepting the AI's suggestions (or removing the code it has suggested when I ac...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Ashley1, There are a few different levels where you can control the AI assistance behavior in notebooks. Here is a breakdown: USER-LEVEL: DISABLE AI AUTOCOMPLETE (INLINE SUGGESTIONS) This is the setting that controls the "ghost text" inline code ...

3 More Replies
yit337
by Contributor
  • 745 Views
  • 2 replies
  • 1 kudos

Resolved! Identity column has null values

I want to update a dimension table in the gold model from a silver table by using create_auto_cdc_from_snapshot_flow and SCD2. In the target table, I have defined an IDENTITY column, which should be populated automatically. The DLT flow runs successf...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @yit337, The reason your identity column values are NULL is that the target table created by create_auto_cdc_from_snapshot_flow is a streaming table, and streaming tables do not support identity columns. This is a documented limitation: https://do...
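One common workaround when identity columns are unavailable (not necessarily what the truncated reply goes on to recommend) is a deterministic surrogate key hashed from the business key, which works fine in a streaming flow because it needs no table-level counter:

```python
import hashlib

def surrogate_key(*business_key_parts):
    """Deterministic 64-bit surrogate key derived from the business key.
    Same inputs always produce the same key, so it is safe under retries
    and reprocessing, unlike a sequence/identity counter."""
    raw = "||".join(str(p) for p in business_key_parts)
    return int.from_bytes(hashlib.sha256(raw.encode()).digest()[:8], "big")

k1 = surrogate_key("CUST-001", "2025-01-01")
k2 = surrogate_key("CUST-001", "2025-01-01")  # identical to k1
```

The trade-off versus an identity column is that the key encodes no insertion order, and hash collisions, while extremely unlikely at 64 bits, are not impossible.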

1 More Replies
Saikumar_Manne
by New Contributor II
  • 1835 Views
  • 4 replies
  • 1 kudos

Resolved! How to use multi-threading and batch inserts for large UPSERT to PostgreSQL from Databricks?

Hi everyone, We have a Databricks (Unity Catalog) pipeline where we process large datasets in Spark and need to load incremental data into a PostgreSQL target table. Our scenario is: initial full load (~300 million rows) to PostgreSQL using bulk COPY is...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Saikumar_Manne, With 190M+ daily rows going into PostgreSQL via INSERT ON CONFLICT DO UPDATE, there are several levers to pull. Here is a breakdown of the approaches and tuning options. APPROACH 1: STAGING TABLE + MERGE (RECOMMENDED FOR THIS VOLU...
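The staging-table approach the reply recommends boils down to one set-based statement: bulk-load the batch into a staging table, then upsert it into the target in a single `INSERT ... ON CONFLICT DO UPDATE`. A small sketch that builds that SQL (table and column names are made up for illustration):

```python
def build_merge_sql(target, staging, key_cols, update_cols):
    """Build the PostgreSQL upsert that merges a staging table into the
    target: insert new keys, update existing ones from EXCLUDED values."""
    cols = key_cols + update_cols
    col_list = ", ".join(cols)
    conflict = ", ".join(key_cols)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in update_cols)
    return (
        f"INSERT INTO {target} ({col_list}) "
        f"SELECT {col_list} FROM {staging} "
        f"ON CONFLICT ({conflict}) DO UPDATE SET {updates}"
    )

sql = build_merge_sql("inventory", "inventory_stage",
                      key_cols=["sku"], update_cols=["qty", "updated_at"])
```

This keeps the per-row conflict resolution inside PostgreSQL, where it is one server-side scan instead of 190M individual statements from Spark; `ON CONFLICT` requires a unique index on the key columns.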

3 More Replies
ChrisLawford_n1
by Contributor II
  • 942 Views
  • 3 replies
  • 1 kudos

Resolved! DeltaFileOperations: Listing improvement?

Hello, I am using Databricks Auto Loader with managed file events turned on and include existing files. I want to understand if there is a way of increasing the speed of the initial listing of the files for Auto Loader. I thought that the idea behind the m...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @ChrisLawford_n1, You are correct that managed file events (cloudFiles.useManagedFileEvents = true) works by having Databricks maintain a record of file events on the external location, so when you start a new Auto Loader stream, it can replay tho...

2 More Replies
arushigulati
by Databricks Partner
  • 969 Views
  • 2 replies
  • 0 kudos

Lakebridge transpile to translate from oracle to databricks sql

Hi Community, I am currently working on a PoC to migrate data from Oracle to Databricks. As part of this, we are attempting to automate the DDL conversion process. We are leveraging Databricks Labs Lakebridge for transpilation, but it is failing to con...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @arushigulati, Lakebridge (the Databricks Labs project formerly known as Remorph) does support Oracle as a source dialect for transpilation, but the DDL handling, particularly around constraints like PRIMARY KEY, has some gaps depending on the ver...

1 More Replies
echol
by New Contributor II
  • 1311 Views
  • 6 replies
  • 1 kudos

Redeploy Databricks Asset Bundle created by others

Hi everyone, Our team is using Databricks Asset Bundles (DAB) with a customized template to develop data pipelines. We have a core team that maintains the shared infrastructure and templates, and multiple product teams that use this template to develo...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @echol, This is a common scenario when multiple team members work with Databricks Asset Bundles, and there are a few approaches to solve it cleanly. THE ROOT CAUSE When Staff A deploys a bundle, the jobs and other resources are created with Staff ...

5 More Replies
RIDBX
by Contributor
  • 594 Views
  • 3 replies
  • 1 kudos

Robust/complex scheduling with dependency within Databricks?

Thanks for reviewing my threads. I'd like to explore robust/complex scheduling with dependencies within Databricks. We know traditional scheduling frameworks allow ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @RIDBX, Databricks Lakeflow Jobs has several features that let you build exactly this kind of tiered, dependency-driven orchestration natively. Here is how I would approach your HR (Tier 1) and Finance (Tier 2) scenario. OPTION 1: SINGLE ORCHESTRA...

2 More Replies
YuriS
by New Contributor III
  • 790 Views
  • 6 replies
  • 3 kudos

Resolved! StreamingQueryListener metrics strange behaviour (inputRowsPerSecond metric is set to 0)

After implementing StreamingQueryListener to enable integration with our monitoring solution we have noticed some strange metrics for our DeltaSource streams (based on https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/stream-mon...

Latest Reply
SteveOstrowski
Databricks Employee
  • 3 kudos

Hi @YuriS, There are a few things going on here, and I will walk through each one. INPUTROWSPERSECOND SHOWING 0 The inputRowsPerSecond metric is not calculated from the current batch. It is the rate of data arriving between the end of the previous tr...
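The definition the reply leans on can be reconstructed in a couple of lines (a simplified sketch, not Spark's exact implementation): inputRowsPerSecond divides the batch's input rows by the time elapsed since the previous trigger, so if no data arrived in that gap, or there is no meaningful previous trigger, the metric reads 0 even though the batch itself processed rows.

```python
def input_rows_per_second(num_input_rows, seconds_since_previous_trigger):
    """Arrival rate between triggers, as distinct from
    processedRowsPerSecond (the rate at which the batch was processed)."""
    if seconds_since_previous_trigger <= 0:
        return 0.0  # no usable previous-trigger gap -> metric reports 0
    return num_input_rows / seconds_since_previous_trigger

steady = input_rows_per_second(1000, 10)   # 100 rows/s arriving
stalled = input_rows_per_second(0, 10)     # nothing arrived in the gap
```

So a 0 here does not mean the stream is broken; it means no new data arrived between the two trigger timestamps used for the calculation.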

5 More Replies
_its_akshaye
by New Contributor
  • 464 Views
  • 2 replies
  • 1 kudos

How to Track Hourly or Daily # of Upsert/Delete Metrics in a DLT Streaming Pipeline

We created a Delta Live Tables (DLT) streaming pipeline to ingest data from the Bronze layer to the Silver layer with Change Data Feed (CDF) enabled. The stream runs continuously and shows # of upserted and deleted rows at an aggregate level from the...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @_its_akshaye, The pipeline event log captures exactly what you need. Every Lakeflow Spark Declarative Pipeline (SDP, formerly known as DLT) records flow_progress events that include per-flow metrics with num_upserted_rows and num_deleted_rows fie...
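The metric fields the reply names can be pulled out of a flow_progress event payload along these lines; the JSON shape below is a simplified stand-in for the event log's details column, not its exact schema (in practice you would query the pipeline's event log table and parse the details field):

```python
import json

def flow_metrics(event_json):
    """Extract upsert/delete row counts from a flow_progress-style
    details blob (simplified shape for illustration)."""
    details = json.loads(event_json)
    metrics = details.get("flow_progress", {}).get("metrics", {})
    return (metrics.get("num_upserted_rows", 0),
            metrics.get("num_deleted_rows", 0))

sample = json.dumps({"flow_progress": {"metrics":
    {"num_upserted_rows": 42, "num_deleted_rows": 7}}})
upserted, deleted = flow_metrics(sample)
```

Aggregating these per-event counts by the event timestamp truncated to the hour or day gives exactly the hourly/daily upsert and delete metrics the question asks for.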

1 More Replies
aranjan99
by Contributor
  • 814 Views
  • 6 replies
  • 1 kudos

System table missing primary keys?

This simple query takes 50 seconds for me on an X-Small warehouse: select * from SYSTEM.access.workspaces_latest where workspace_id = '442224551661121'. Can the team comment on why querying system tables takes so long? I also don't see any primary keys ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @aranjan99, There are two separate topics here, so let me address each one. WHY THE QUERY IS SLOW (~50 SECONDS ON X-SMALL) System tables are served via Delta Sharing from a Databricks-hosted storage account in the same region as your Unity Catalog...

5 More Replies