Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

RGSLCA
by New Contributor II
  • 94 Views
  • 1 replies
  • 0 kudos

Selective overwrite on Partition and Liquid clustered tables

Hi, I have created 2 identical tables, but one is partitioned and the other is Liquid Clustered with Auto Clustering. I inserted 30M rows x 2 (60M) for two dates, date 1 = 2026-06-01 and date 2 = 2026-06-02, then I overwrote the date 2026-06-02 with a s...

Latest Reply
balajij8
Contributor III
  • 0 kudos

Hi, the current approach is not optimal. You can follow the steps below. The INSERT query ran with mostly 43 tasks, creating 43 output files. Since the Liquid Clustered table has no organization (clusterBy "[]"), dates are randomly scattered across files. Partition table...

  • 0 kudos
Sainath368
by Contributor
  • 310 Views
  • 1 replies
  • 2 kudos

Resolved! DESCRIBE HISTORY Performance Issue for Large Scale Tables (22K Tables)

Hi everyone, I’m working with around 22,000 Unity Catalog external Delta tables, and my requirement is to execute DESCRIBE HISTORY table_name LIMIT 1 for each table and append the latest record into a single consolidated table. I’ve already tried mul...

Latest Reply
ShamenParis
New Contributor III
  • 2 kudos

Hi, the reason your performance degrades so badly (4 mins for 2k tables, but 50 mins for 12k) is the Spark driver. When you run spark.sql("DESCRIBE HISTORY...") inside a ThreadPoolExecutor, every single one of those 22,000 queries has to be...

  • 2 kudos
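For reference, the pattern under discussion in this thread (one DESCRIBE HISTORY call per table, fanned out over a thread pool) can be sketched roughly as below. This is a minimal sketch, not the fix recommended in the truncated reply: it only bounds the concurrency and keeps one failing table from stopping the sweep, so every query still goes through the driver. The names `collect_latest_history` and `fetch_latest` are hypothetical helpers introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def collect_latest_history(
    table_names: Iterable[str],
    fetch_latest: Callable[[str], Dict],
    max_workers: int = 8,
) -> Tuple[List[Dict], Dict[str, str]]:
    """Run fetch_latest for every table on a bounded thread pool.

    Returns (rows, errors): one row per table that succeeded, and a map of
    table name -> error message for tables that failed.
    """
    rows: List[Dict] = []
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_latest, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                row = dict(future.result())
                row["table_name"] = name  # tag the row so it can be appended to one table
                rows.append(row)
            except Exception as exc:  # record the failure and keep going
                errors[name] = str(exc)
    return rows, errors


# Hypothetical notebook wiring (assumes `spark` is the active session; not run here):
#   def fetch_latest(name):
#       return spark.sql(f"DESCRIBE HISTORY {name} LIMIT 1").first().asDict()
#   rows, errors = collect_latest_history(table_names, fetch_latest)
```

Passing the per-table query in as a callable keeps the fan-out logic testable without a cluster; the collected rows can then be written to the consolidated table in a single append rather than one write per table.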
naveenayalla
by New Contributor II
  • 320 Views
  • 1 replies
  • 3 kudos

Why We Moved Our Operational Database Into Databricks — And Stopped Managing Two Stacks

Lakebase just went GA. Here's what a production migration actually looks like. For most of the last decade, our data infrastructure lived in two separate worlds. On one side, a transactional database handling operational workloads: the writes, the loo...

Data Engineering
Architecture
Community articles
Database
DIAS2026
lakebase
Latest Reply
Mailendiran
New Contributor III
  • 3 kudos

Great write-up, and it felt useful. Thanks for sharing the real experience!

  • 3 kudos
naveenayalla
by New Contributor II
  • 303 Views
  • 1 replies
  • 0 kudos

From RAG Demo to Production on Databricks: 7 Things Teams Should Validate First

By Naveen Ayalla. Many teams can build a RAG demo quickly: upload documents, create embeddings, connect a model, ask a question, and show an answer. But production is differen...

Latest Reply
naveenayalla
New Contributor II
  • 0 kudos

Thanks for reading. I’m especially interested in hearing from people who have worked on real RAG or GenAI workflows. Which one has been the biggest challenge for your team?
1. Choosing the right source data
2. Access control and governance
3. Improving r...

  • 0 kudos
RGSLCA
by New Contributor II
  • 318 Views
  • 1 replies
  • 1 kudos

Resolved! How to execute SET spark.sql.sources.partitionOverwriteMode = dynamic; in SQL stored procedures

Hi, I am able to execute the INSERT OVERWRITE TABLE <tables> PARTITION command in a notebook cell:
SET spark.sql.sources.partitionOverwriteMode=dynamic;
DECLARE OR REPLACE VARIABLE v_load_date DATE;
SET VAR v_load_date = DATE '2026-05-03';
INSERT OVERW...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Hello @RGSLCA, the short answer is that you can't make SET spark.sql.sources.partitionOverwriteMode=dynamic work from a stored procedure running on a SQL warehouse or serverless. That dynamic partition overwrite path is legacy, and it's SQL-support...

  • 1 kudos
maikel
by Contributor III
  • 580 Views
  • 3 replies
  • 1 kudos

Resolved! Job tasks monitoring

Hello Community, we have a case in our project that we would like to solve in an elegant and scalable manner. As always, I would really appreciate your suggestions and experience. In short: we have a multi-step job consisting of 4 stages. In one of the ...

Latest Reply
maikel
Contributor III
  • 1 kudos

@MoJaMa thanks a lot for these suggestions!

  • 1 kudos
2 More Replies
Rahul_Dhankhar
by New Contributor II
  • 263 Views
  • 1 replies
  • 2 kudos

Seeking Volunteers with Lakehouse, Fabric, Databricks, or Snowflake Experience

Hello everyone, I am a doctoral researcher at the University of the Cumberlands and am seeking 2–3 volunteers for a 20–25-minute field test for my dissertation research on Lakehouse platform adoption. The field test will be conducted over Zoom or Microsof...

Latest Reply
sameer_yasser
New Contributor III
  • 2 kudos

I am interested. Let me know. 

  • 2 kudos
Raj_DB
by Contributor
  • 289 Views
  • 1 replies
  • 1 kudos

Resolved! Automating Job Permission Updates in Databricks Using a Notebook

Hi everyone, I am looking to create a notebook that, when executed by a user, performs the following actions:
- Retrieves all Databricks jobs created by the current user
- Checks whether a specific role already has permissions on those jobs
- Automatically add...

Latest Reply
ziafazal
Databricks Partner
  • 1 kudos

Hi @Raj_DB, you can use the Databricks SDK to retrieve all jobs and filter them down to those where the owner is the current user, something like this:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Specify the user email/username you want...

  • 1 kudos
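The three steps in the question (find the current user's jobs, check for an existing grant, add it where missing) can be sketched as below. This is an illustration under stated assumptions, not the code from the truncated reply: `jobs_needing_grant` is a hypothetical helper, "created by" is read as the job's `creator_user_name`, and the target role is assumed to be a workspace group. The SDK calls appear only in the trailing comments and should be checked against the installed `databricks-sdk` version.

```python
from typing import Any, Callable, Iterable, List, Optional


def jobs_needing_grant(
    jobs: Iterable[Any],
    current_user: str,
    group_name: str,
    get_acl: Callable[[Any], Optional[Iterable[Any]]],
) -> List[Any]:
    """Return ids of jobs created by current_user that have no ACL entry for group_name."""
    missing: List[Any] = []
    for job in jobs:
        if job.creator_user_name != current_user:
            continue  # only touch jobs the current user created
        acl = get_acl(job.job_id) or []
        has_entry = any(getattr(entry, "group_name", None) == group_name for entry in acl)
        if not has_entry:
            missing.append(job.job_id)
    return missing


# Hypothetical wiring with the Databricks SDK (assumed call names, not run here):
#   from databricks.sdk import WorkspaceClient
#   from databricks.sdk.service.jobs import JobAccessControlRequest, JobPermissionLevel
#   w = WorkspaceClient()
#   me = w.current_user.me().user_name
#   get_acl = lambda job_id: w.jobs.get_permissions(job_id=str(job_id)).access_control_list
#   for job_id in jobs_needing_grant(w.jobs.list(), me, "my-group", get_acl):
#       w.jobs.update_permissions(
#           job_id=str(job_id),
#           access_control_list=[JobAccessControlRequest(
#               group_name="my-group", permission_level=JobPermissionLevel.CAN_VIEW)],
#       )
```

Keeping the check separate from the update makes a dry run easy: print the returned ids first, then apply the grant once the list looks right.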
lrm_data
by New Contributor III
  • 640 Views
  • 3 replies
  • 2 kudos

Resolved! Lakeflow Connect SQL Server: Snapshots Firing Outside Configured Full Refresh Window?

Has anyone else seen full refresh snapshots trigger outside of their configured refresh window in Lakeflow Connect? Here's our situation:
- We have a full refresh window configured to restrict snapshot operations to off-hours
- On at least one occasion,...

Latest Reply
lrm_data
New Contributor III
  • 2 kudos

Hello @Sumit_7, I have tested a few scenarios, logged a ticket with Databricks, and discovered the following:
Common Misconception
The start_window setting does not define a bounded time window during which full refreshes are contained. It is simply a...

  • 2 kudos
2 More Replies
lrm_data
by New Contributor III
  • 915 Views
  • 4 replies
  • 0 kudos

Resolved! Lakeflow Connect - SQL Server - Issues restarting after failure

Has anyone else run into a situation where a breaking schema change on a SQL Server source table leaves their Lakeflow Connect pipeline in a state it can't recover from, even after destroying and recreating the pipeline? Here's what happened to us:
- ...

Latest Reply
lrm_data
New Contributor III
  • 0 kudos

Hey all, following up. I was able to recover. The one step I was missing was resetting CDC on the source side. After that, I was able to destroy and recreate the bundle and successfully refresh all tables. Thanks!

  • 0 kudos
3 More Replies
Avinash_Narala
by Databricks Partner
  • 794 Views
  • 2 replies
  • 2 kudos

Resolved! Data Loss in Incremental Batch Jobs Due to Latency in delta file write to blob

Hi everyone, I am facing a data consistency issue in my Databricks incremental pipeline where records are being skipped because of a time gap between when a record is processed and when the physical file is finalized in Azure Blob Storage (ABFS). Our A...

Latest Reply
balajij8
Contributor III
  • 2 kudos

You can handle it as below:
Fix the Bronze Write - The 20+ minute commit gap suggests metadata contention or a small-file problem in the bronze Delta tables. You can optimize tables manually or enable Optimized Write and Auto Optimize if feasible. This...

  • 2 kudos
1 More Replies
harisrinivasay
by New Contributor II
  • 669 Views
  • 4 replies
  • 1 kudos

Resolved! Unable to View Tables While Setting Up PostgreSQL CDC via Lakeflow Connect

Dear Experts, I have a requirement to implement PostgreSQL CDC using Databricks Lakeflow Connect. While setting up the tables, I am unable to see the list of available tables, even though the connection settings appear to be correct. Could you please s...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 1 kudos

Hi @harisrinivasay, @szymon_dybczak is correct. You must enter the database name. Lakeflow Connect can only connect to and query that database, and list the schemas and tables if you provide the correct name. If the name is incorrect or if you don’t ...

  • 1 kudos
3 More Replies
Raj_DB
by Contributor
  • 1668 Views
  • 7 replies
  • 11 kudos

Resolved! Designing Reliable Data Versioning Strategies in Databricks

Hi everyone, I’m working on a use case where I need to retain 30 days of historical data in a Delta table and use it to build trend reports. I’m looking for the best approach to reliably maintain this historical data while also making it suitable for r...

Latest Reply
DivyaandData
Databricks Employee
  • 11 kudos

Hey @Raj_DB, the TL;DR is that time travel is great for short-term ops and debugging, but brittle as your primary reporting history, and its cost profile is harder to control and reason about than a purpose-built history table. Docs 1,2 explicitly say De...

  • 11 kudos
6 More Replies
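To make the "purpose-built history table" idea concrete, here is one possible shape for it: a daily job appends a dated snapshot and prunes rows older than the retention period. This is a sketch under assumptions, not the design from the truncated reply. The function and table names are hypothetical, and the history table is assumed to have `snapshot_date` as its first column followed by the source table's columns.

```python
from typing import List


def daily_snapshot_statements(source: str, history: str, retention_days: int = 30) -> List[str]:
    """Build the SQL for one daily run: append today's snapshot, then prune old ones."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    # Assumes the history table's columns are (snapshot_date, <source columns...>).
    append = (
        f"INSERT INTO {history} "
        f"SELECT current_date() AS snapshot_date, * FROM {source}"
    )
    # Keeps today's snapshot plus the previous retention_days days.
    prune = (
        f"DELETE FROM {history} "
        f"WHERE snapshot_date < date_sub(current_date(), {retention_days})"
    )
    return [append, prune]


# Hypothetical notebook usage (assumes `spark` is the active session; not run here):
#   for stmt in daily_snapshot_statements("main.sales.orders", "main.sales.orders_history"):
#       spark.sql(stmt)
```

Trend reports then filter on `snapshot_date` directly, so the retained history does not depend on the table's time travel or VACUUM settings.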