Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

dbernstein_tp
by New Contributor III
  • 70 Views
  • 2 replies
  • 1 kudos

Lakeflow Connect CDC error, broken links

I get this error regarding database validation when setting up a Lakeflow Connect CDC pipeline (see screenshot). The two links mentioned in the message are broken; they return a "404 - Content Not Found" when I try to open them.

(screenshot attached)
Latest Reply
Advika
Databricks Employee
  • 1 kudos

Sharing a doc that should help: https://learn.microsoft.com/en-us/azure/databricks/ingestion/lakeflow-connect/sql-server-utility

1 More Replies
hobrob
by New Contributor
  • 52 Views
  • 2 replies
  • 0 kudos

UDFs for working with date ranges

Hi bricklayers, Originally from a Teradata background and relatively new to Databricks, I was in need of brushing up on my Python and GitHub CI/CD skills, so I've spun up a repo for a project I'm calling Terabricks. The aim is to provide a space for mak...

Latest Reply
Raman_Unifeye
Contributor III
  • 0 kudos

Fantastic initiative @hobrob. I used Teradata for a good 5+ years, but pre-2014/15, so I will be following this closely and am very happy to contribute to it. Thanks.

1 More Replies
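For anyone curious what such a date-range helper might look like, below is a minimal PySpark sketch (my own illustration, not code from the Terabricks repo) that expands a start/end date pair into one row per calendar day, roughly what Teradata's EXPAND ON does; the column names are assumptions.

```python
# Illustrative only: expand each [start_date, end_date] pair into one row per day
# using standard Spark SQL functions (sequence + explode). Column names are assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", "2024-01-01", "2024-01-05")],
    ["id", "start_date", "end_date"],
).select(
    "id",
    F.to_date("start_date").alias("start_date"),
    F.to_date("end_date").alias("end_date"),
)

expanded = df.withColumn(
    "calendar_date",
    F.explode(F.sequence("start_date", "end_date", F.expr("interval 1 day"))),
)

expanded.show()  # one row per id per day in the range
```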
bruce17
by New Contributor II
  • 234 Views
  • 4 replies
  • 2 kudos

Support Request: Issue Running Multiple Ingestion Gateway Concurrently

Hi, we are ingesting data using the Databricks Lakeflow SQL connector from two different SQL Server databases hosted on separate servers. As part of the setup:
  • We created two separate ingestion gateways.
  • We created two separate ingestion pipelines.
  • Both pi...

Latest Reply
HarishPrasath25
New Contributor
  • 2 kudos

Hi @Louis_Frolio, I've successfully ingested one SQL database using the Lakeflow SQL connector. As part of the setup, I created an ingestion pipeline along with a gateway, and it is working as expected: when I run or re-run the pipeline, it picks u...

3 More Replies
Sainath368
by Contributor
  • 204 Views
  • 4 replies
  • 4 kudos

Resolved! Autoloader Managed File events

Hi all, We are in the process of migrating from directory listing to managed file events in Azure Databricks. Our data is stored in an Azure Data Lake container with the following folder structure: To enable file events in Unity Catalog (UC), I created...

(screenshot of the folder structure attached)
Latest Reply
Raman_Unifeye
Contributor III
  • 4 kudos

Recommended approach to continue your existing pattern:
  • Keep the External Location enabled for file events at the high-level path (/Landing).
  • Run a separate Structured Streaming job for each table, specifying the full sub-path in the .load() function (...

3 More Replies
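As a rough sketch of the pattern described above (one Structured Streaming job per table, each with the full sub-path in .load()), assuming Auto Loader with Parquet landing files; the storage paths, target tables, and the managed-file-events option name are assumptions to verify against the documentation for your runtime.

```python
# Hedged sketch: one Auto Loader stream per table under the /Landing external location.
# Paths, table names, and the file-events option name below are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def start_table_stream(sub_path: str, target_table: str):
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        # Assumed option name for managed file events; confirm in the Auto Loader docs.
        .option("cloudFiles.useManagedFileEvents", "true")
        .load(f"abfss://landing@mystorage.dfs.core.windows.net/Landing/{sub_path}")
        .writeStream
        .option("checkpointLocation", f"/Volumes/main/bronze/_checkpoints/{target_table}")
        .toTable(f"main.bronze.{target_table}")
    )

# One independent stream (and checkpoint) per table.
start_table_stream("orders", "orders_raw")
start_table_stream("customers", "customers_raw")
```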
Techtic_kush
by New Contributor II
  • 168 Views
  • 2 replies
  • 2 kudos

Resolved! Can’t save results to target table – out-of-memory error

Hi team, I’m processing ~5,000 EMR notes with a Databricks notebook. The job reads from `crc_lakehouse.bronze.emr_notes`, runs SciSpaCy UMLS entity extraction plus a fine-tuned BERT sentiment model per partition, and builds a DataFrame (`df_entities`...

Latest Reply
bianca_unifeye
New Contributor III
  • 2 kudos

You’re right that the behaviour is weird at first glance (“5k rows on a 64 GB cluster and I blow up on write”), but your stack trace is actually very revealing: this isn’t a classic Delta write / shuffle OOM – it’s SciSpaCy/UMLS falling over when loa...

1 More Replies
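To make that diagnosis concrete, here is a hedged sketch of the usual fix for this failure mode: initialise the SciSpaCy pipeline once per partition instead of per row. The model name, note-text column, and output schema are assumptions for illustration.

```python
# Illustration only: load the NLP model once per partition, not once per row/UDF call.
# Model name, note-text column, and output schema are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("note_id", StringType()),
    StructField("entity", StringType()),
    StructField("label", StringType()),
])

def extract_entities(rows):
    import spacy  # imported on the executor
    nlp = spacy.load("en_core_sci_sm")  # loaded once per partition
    for row in rows:
        for ent in nlp(row["note_text"]).ents:
            yield (row["note_id"], ent.text, ent.label_)

notes = spark.table("crc_lakehouse.bronze.emr_notes")
df_entities = spark.createDataFrame(notes.rdd.mapPartitions(extract_entities), schema)
df_entities.write.mode("overwrite").saveAsTable("crc_lakehouse.silver.emr_note_entities")
```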
DarioB
by New Contributor III
  • 123 Views
  • 1 reply
  • 1 kudos

Resolved! Issues recreating Tables with enableRowTracking and DBR16.4 and below

We are running a Deep Clone script to copy catalogs between environments; this script is run through a job (run by a SP) with DBR 16.4.12. Some tables are deep cloned and others are dropped and recreated to load partial data. The ones dropped are re...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Happy Monday @DarioB, I did some digging and would like to provide you with some helpful hints and tips. Thanks for the detailed context; this is a known rough edge in DBR 16.x when recreating tables that have row tracking materialized. What's happening ...

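One possible direction (my assumption only, since the reply above is truncated and may describe a different fix) is to recreate the affected tables with row tracking explicitly disabled via the delta.enableRowTracking table property; the names below are placeholders.

```python
# Hypothetical workaround sketch, not necessarily the fix described in the reply:
# recreate the table with row tracking disabled so DBR 16.4 does not carry over
# materialised row-tracking metadata. Catalog/schema/table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DROP TABLE IF EXISTS target_catalog.target_schema.my_table")
spark.sql("""
    CREATE TABLE target_catalog.target_schema.my_table
    TBLPROPERTIES ('delta.enableRowTracking' = 'false')
    AS SELECT * FROM source_catalog.source_schema.my_table
    WHERE load_date >= '2025-01-01'   -- placeholder partial-load filter
""")
```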
Suheb
by New Contributor III
  • 99 Views
  • 1 reply
  • 0 kudos

What are best practices for designing a large-scale data engineering pipeline on Databricks for real

How do you design a scalable, reliable pipeline that handles both fast/continuous data and slower bulk data in the same system?

Latest Reply
Coffee77
Contributor III
  • 0 kudos

Very generic question! Here are general rules and best practices related to the Databricks well-architected framework: https://docs.databricks.com/aws/en/lakehouse-architecture/well-architected Take a deeper look at operational excellence, reliability an...

intelliconnectq
by New Contributor II
  • 178 Views
  • 2 replies
  • 0 kudos

Resolved! Loading CSV from private S3 bucket

Trying to load a CSV file from a private S3 bucket. Please clarify the requirements to do this: Can I do it in Community Edition (if yes, then how)? How do I do it in the premium version? I have an IAM role, and I also have an access key & secret.

Latest Reply
Coffee77
Contributor III
  • 0 kudos

Assuming you have these prerequisites:
  • A private S3 bucket (e.g., s3://my-private-bucket/data/file.csv)
  • An IAM user or role with access (list/get) to that bucket
  • The AWS Access Key ID and Secret Access Key (client and secret)
The most straightforward w...

1 More Replies
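For reference, a minimal sketch of the access-key approach on a premium workspace; the bucket and secret-scope values are placeholders, and an instance profile or Unity Catalog external location is generally the more robust setup.

```python
# Quick-test sketch only: read a CSV from a private bucket over s3a using access keys.
# Bucket name and secret scope/keys are placeholders; dbutils is the Databricks
# notebook built-in, so keys never appear as literals in the notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

access_key = dbutils.secrets.get("my-scope", "aws-access-key-id")
secret_key = dbutils.secrets.get("my-scope", "aws-secret-access-key")

spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)

df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3a://my-private-bucket/data/file.csv")
)
df.show(5)
```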
Brahmareddy
by Esteemed Contributor
  • 249 Views
  • 2 replies
  • 7 kudos

Future of Movie Discovery: How I Built an AI Movie Recommendation Agent on Databricks Free Edition

As a data engineer deeply passionate about how data and AI can come together to create real-world impact, I’m excited to share my project for the Databricks Free Edition Hackathon 2025 — Future of Movie Discovery (FMD). Built entirely on Databricks F...

Latest Reply
hasnat_unifeye
New Contributor II
  • 7 kudos

Hi @Brahmareddy, Really enjoyed your hackathon demo. You've set a high bar for NLP-focused projects. I picked up a lot from your approach and it's definitely given me ideas to try out. For my hackathon entry, I took a similar direction using pyspark.m...

1 More Replies
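Purely as an illustration of a pyspark.ml-based recommender (my assumption of a typical approach, not necessarily what either hackathon entry used), an ALS collaborative-filtering sketch could look like this; the table and column names are placeholders.

```python
# Illustrative ALS recommender; table and column names are placeholders, and
# userId/movieId are assumed to be integer IDs as ALS requires.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.getOrCreate()

ratings = spark.table("main.movies.ratings")  # columns: userId, movieId, rating

als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
)
model = als.fit(ratings)

# Top-10 movie recommendations per user.
model.recommendForAllUsers(10).show(truncate=False)
```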
eyalholzmann
by New Contributor II
  • 202 Views
  • 3 replies
  • 2 kudos

Does VACUUM on Delta Lake also clean Iceberg metadata when using Iceberg Uniform feature?

I'm working with Delta tables using the Iceberg Uniform feature to enable Iceberg-compatible reads. I'm trying to understand how metadata cleanup works in this setup. Specifically, does the VACUUM operation—which removes old Delta Lake metadata based ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

Here's how to approach cleaning and maintaining Apache Iceberg metadata on Databricks, and how it differs from Delta workflows. First, know your table type: for Unity Catalog–managed Iceberg tables, Databricks runs table maintenance for you (predicti...

2 More Replies
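For reference on the Delta side of the question, a VACUUM call looks like the sketch below (table name and retention are placeholders); it removes data files no longer referenced by the Delta log and older than the retention threshold, while how the Uniform-generated Iceberg metadata gets cleaned up is exactly what the reply above covers.

```python
# Delta-side cleanup only; table name and retention are placeholders. The default
# retention is 7 days (168 hours), and lowering it requires extra safety checks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("VACUUM main.analytics.events RETAIN 168 HOURS")
```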
Naveenkumar1811
by New Contributor II
  • 181 Views
  • 2 replies
  • 0 kudos

What is the best practice for maintaining Delta tables loaded via streaming?

Hi Team, We have our Bronze (append), Silver (append), and Gold (merge) tables loaded using Spark streaming continuously with a processing-time trigger (3 secs). We also run maintenance jobs on the tables, like OPTIMIZE and VACUUM, and we perform DELETE for som...

Latest Reply
Naveenkumar1811
New Contributor II
  • 0 kudos

Hi Mark, But the real problem is that our streaming job runs 24*7, 365 days a year, and we can't afford any further latency in the data flowing to the gold layer. We don't have any window to pause or slow our streaming, and we continuously get the data feed actually s...

1 More Replies
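One common pattern (my assumption, not something confirmed in this thread) is to run OPTIMIZE and VACUUM from a separate scheduled job while the 24*7 streams keep writing, since these maintenance commands do not require stopping the writers; a minimal sketch with placeholder table names follows, noting that DELETEs against tables read by downstream streams may still need reader options such as skipChangeCommits.

```python
# Sketch of a standalone maintenance job run on its own schedule; table names are
# placeholders. The streaming writers are left running.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for table in ["lake.bronze.events", "lake.silver.events", "lake.gold.events"]:
    spark.sql(f"OPTIMIZE {table}")
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
```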
liquibricks
by New Contributor III
  • 208 Views
  • 3 replies
  • 2 kudos

Resolved! Moving tables between pipelines in production

We are testing an ingestion from Kafka to Databricks using a streaming table. The streaming table was created by a DAB deployed to "production", which runs as a service principal. This means the service principal is the "owner" of the table. We now wan...

Latest Reply
nayan_wylde
Esteemed Contributor
  • 2 kudos

You've hit two limitations:
  • Streaming tables don't allow SET OWNER – ownership cannot be changed.
  • Lakeflow pipeline ID changes require pipeline-level permissions – if you're not the pipeline owner, you can't run ALTER STREAMING TABLE ... SET PIPELINE_I...

2 More Replies
cdn_yyz_yul
by New Contributor III
  • 263 Views
  • 4 replies
  • 1 kudos

Delta as streaming source: can the reader read only newly appended rows?

Hello everyone, In our implementation of the Medallion Architecture, we want to stream changes with Spark Structured Streaming. I would like some advice on how to use a Delta table as a source correctly, and whether there is a performance (memory usage) concern in t...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

In your scenario using Medallion Architecture with Delta tables as both streaming source and sink, it is important to understand Spark Structured Streaming behavior and performance characteristics, especially with joins and memory usage. Here is a di...

3 More Replies
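As a minimal sketch of the behaviour discussed above: a Delta streaming source processes the table's initial snapshot once and from then on only newly appended data. The skipChangeCommits option shown is an assumption for tolerating upstream updates/deletes from maintenance jobs; table and checkpoint names are placeholders.

```python
# Minimal Delta-as-source sketch; table and checkpoint names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream.format("delta")
    # Ignore commits that rewrite existing rows (updates/deletes from maintenance jobs).
    .option("skipChangeCommits", "true")
    .table("lake.bronze.events")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/lake/silver/_checkpoints/events")
    .toTable("lake.silver.events")
)
```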
GANAPATI_HEGDE
by New Contributor III
  • 138 Views
  • 2 replies
  • 0 kudos

Unable to configure custom compute for DLT pipeline

I am trying to configure the compute for a pipeline as shown above. However, DLT keeps using the small cluster as usual. How do I resolve this?

(pipeline configuration screenshots attached)
Latest Reply
GANAPATI_HEGDE
New Contributor III
  • 0 kudos

I updated my CLI and deployed the job, but I still don't see the cluster updates in the pipeline.

1 More Replies