Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I am in the process of designing a Medallion architecture where the data sources include REST API calls, JSON files, SQL Server, and Azure Event Hubs. For the Silver and Gold layers, I plan to leverage Delta Live Tables (DLT). However, I am seeking gu...
I get this error regarding database validation when setting up a Lakeflow Connect CDC pipeline (see screenshot). The two links mentioned in the message are broken; they give me a "404 - Content Not Found" when I try to open them.
Hi bricklayers, originally from a Teradata background and relatively new to Databricks, I was in need of brushing up on my Python and GitHub CI/CD skills, so I’ve spun up a repo for a project I’m calling Terabricks. The aim is to provide a space for mak...
Fantastic initiative, @hobrob. I have used Teradata for a good 5+ years, though pre-2014/15. So I will be closely following it and am very happy to contribute to it. Thanks.
Hi, we are ingesting data using the Databricks Lakeflow SQL connector from two different SQL Server databases hosted on separate servers. As part of the setup: we created two separate ingestion gateways, and we created two separate ingestion pipelines. Both pi...
Hi @Louis_Frolio , I’ve successfully ingested one SQL database using the Lakeflow SQL connector. As part of the setup, I created an ingestion pipeline along with a gateway, and it is working as expected - when I run or re-run the pipeline, it picks u...
Hi all, we are in the process of migrating from directory listing to managed file events in Azure Databricks. Our data is stored in an Azure Data Lake container with the following folder structure: To enable file events in Unity Catalog (UC), I created...
Recommended approach to continue your existing pattern:
- Keep the external location enabled for file events at the high-level path (/Landing).
- Run a separate Structured Streaming job for each table, specifying the full sub-path in the .load() function (...
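A minimal sketch of what one such per-table stream could look like, assuming Auto Loader (cloudFiles) with managed file events enabled on the external location; the storage account, file format, sub-path, checkpoint location, and target table names are all illustrative assumptions:

```python
# Hypothetical per-table stream: Auto Loader reading one sub-path under /Landing.
# Paths, format, table names, and checkpoint location are assumptions.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useManagedFileEvents", "true")  # assumes UC managed file events
    .load("abfss://landing@mystorageacct.dfs.core.windows.net/Landing/table_a")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/table_a")
    .trigger(availableNow=True)
    .toTable("main.bronze.table_a")
)
```

Each table gets its own checkpoint, so the streams can be scheduled and backfilled independently while sharing the single file-events-enabled external location.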
Hi team, I’m processing ~5,000 EMR notes with a Databricks notebook. The job reads from `crc_lakehouse.bronze.emr_notes`, runs SciSpaCy UMLS entity extraction plus a fine-tuned BERT sentiment model per partition, and builds a DataFrame (`df_entities`...
You’re right that the behaviour is weird at first glance (“5k rows on a 64 GB cluster and I blow up on write”), but your stack trace is actually very revealing: this isn’t a classic Delta write / shuffle OOM – it’s SciSpaCy/UMLS falling over when loa...
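A common mitigation for this failure mode (a sketch, not necessarily the poster's exact fix): load the heavy SciSpaCy pipeline once per partition inside mapPartitions instead of once per record, so each task holds a single copy of the model. The model name, column names, and partition count below are assumptions:

```python
# Sketch: load the SciSpaCy model once per partition rather than per record.
# "en_core_sci_sm" and the note_id/note_text columns are illustrative assumptions.
def extract_entities(rows):
    import spacy
    nlp = spacy.load("en_core_sci_sm")  # loaded once for the whole partition
    for row in rows:
        doc = nlp(row.note_text)
        for ent in doc.ents:
            yield (row.note_id, ent.text, ent.label_)

df_entities = (
    spark.table("crc_lakehouse.bronze.emr_notes")
    .repartition(32)  # keep partitions small enough for the model's memory footprint
    .rdd.mapPartitions(extract_entities)
    .toDF(["note_id", "entity", "label"])
)
```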
We are running a Deep Clone script to copy catalogs between environments; this script is run through a job (run by a service principal) on DBR 16.4.12. Some tables are deep cloned and others are dropped and recreated to load partial data. The ones dropped are re...
Happy Monday @DarioB, I did some digging and would like to provide you with some helpful hints/tips.
Thanks for the detailed context—this is a known rough edge in DBR 16.x when recreating tables that have row tracking materialized.
What’s happening ...
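One possible workaround, sketched under the assumption (per the reply above) that materialized row tracking on the recreated tables is the culprit: recreate the partial-load table with row tracking explicitly disabled via the delta.enableRowTracking table property. The catalog, schema, table, and filter below are hypothetical:

```python
# Sketch: recreate the partial-load table with row tracking disabled up front,
# so the drop-and-recreate path does not carry materialized row tracking metadata.
# All names and the load_date filter are illustrative assumptions.
spark.sql("DROP TABLE IF EXISTS target_catalog.schema.partial_table")
spark.sql("""
    CREATE TABLE target_catalog.schema.partial_table
    TBLPROPERTIES ('delta.enableRowTracking' = 'false')
    AS SELECT * FROM source_catalog.schema.partial_table
    WHERE load_date >= '2024-01-01'
""")
```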
A very generic question. Here are general rules and best practices related to the Databricks well-architected framework: https://docs.databricks.com/aws/en/lakehouse-architecture/well-architected Take a deeper look at operational excellence, reliability an...
Trying to load a CSV file from a private S3 bucket. Please clarify the requirements to do this: Can I do it in Community Edition (if yes, then how)? How do I do it in the premium version? I have an IAM role, and I also have an access key & secret.
Assuming you have these prerequisites:
- A private S3 bucket (e.g., s3://my-private-bucket/data/file.csv)
- An IAM user or role with access (list/get) to that bucket
- The AWS Access Key ID and Secret Access Key (client and secret)
The most straightforward w...
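A minimal sketch of the access-key approach, assuming the bucket and path from the thread and a hypothetical secret scope named "aws" (prefer instance profiles or Unity Catalog external locations in production; keys are shown here only for illustration):

```python
# Sketch: read a CSV from a private S3 bucket using access keys pulled from a
# secret scope. The scope/key names and bucket path are assumptions.
spark.conf.set("fs.s3a.access.key", dbutils.secrets.get("aws", "access_key"))
spark.conf.set("fs.s3a.secret.key", dbutils.secrets.get("aws", "secret_key"))

df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://my-private-bucket/data/file.csv")
)
display(df)
```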
As a data engineer deeply passionate about how data and AI can come together to create real-world impact, I’m excited to share my project for the Databricks Free Edition Hackathon 2025 — Future of Movie Discovery (FMD). Built entirely on Databricks F...
Hi @Brahmareddy, really enjoyed your hackathon demo. You’ve set a high bar for NLP-focused projects. I picked up a lot from your approach and it’s definitely given me ideas to try out. For my hackathon entry, I took a similar direction using pyspark.m...
I'm working with Delta tables using the Iceberg UniForm feature to enable Iceberg-compatible reads. I’m trying to understand how metadata cleanup works in this setup. Specifically, does the VACUUM operation—which removes old Delta Lake metadata based ...
Here’s how to approach cleaning and maintaining Apache Iceberg metadata on Databricks, and how it differs from Delta workflows.
First, know your table type
For Unity Catalog–managed Iceberg tables, Databricks runs table maintenance for you (predicti...
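For context on the Delta side of a UniForm-enabled table, the familiar maintenance commands still apply to the Delta data files and transaction log; the Iceberg metadata that UniForm generates is handled separately, as the reply above outlines. A minimal sketch, with the table name and retention window as assumptions:

```python
# Sketch: routine Delta-side maintenance on a UniForm-enabled table.
# VACUUM removes unreferenced data files past the retention window; it does not
# govern the Iceberg metadata snapshots that UniForm produces.
spark.sql("OPTIMIZE main.analytics.orders_uniform")
spark.sql("VACUUM main.analytics.orders_uniform RETAIN 168 HOURS")
```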
Hi team, we have our Bronze (append), Silver (append), and Gold (merge) tables loaded using Spark Structured Streaming continuously, with a processing-time trigger (3 secs). We also run maintenance jobs on the tables, like OPTIMIZE and VACUUM, and we perform DELETE for som...
Hi Mark, but the real problem is that our streaming job runs 24x7, 365 days a year, and we can't afford any further latency for our data flowing to the Gold layer. We don't have any window to pause or slow our streaming, and we continuously get the data feed actually s...
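One relevant knob for this situation (a sketch, assuming the downstream stream reads a Delta table that also receives DELETEs from the maintenance job): Structured Streaming can skip the data-changing commits produced by deletes/updates rather than failing, which lets maintenance run without pausing the stream. Table names, checkpoint path, and the append-style sink are assumptions:

```python
# Sketch: stream from a Delta table that also receives DELETE/UPDATE commits.
# skipChangeCommits tells the reader to ignore data-changing commits so the
# stream keeps running while maintenance jobs modify the source table.
(
    spark.readStream.format("delta")
    .option("skipChangeCommits", "true")
    .table("main.silver.events")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/gold_events")
    .trigger(processingTime="3 seconds")
    .toTable("main.gold.events")
)
```

Note that skipped commits never reach the sink, so this fits pipelines where the deletes are retention/cleanup rather than corrections that must propagate.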
We are testing an ingestion from Kafka to Databricks using a streaming table. The streaming table was created by a DAB deployed to "production", which runs as a service principal. This means the service principal is the "owner" of the table. We now wan...
You’ve hit two limitations:
- Streaming tables don’t allow SET OWNER – ownership cannot be changed.
- Lakeflow pipeline ID changes require pipeline-level permissions – if you’re not the pipeline owner, you can’t run ALTER STREAMING TABLE ... SET PIPELINE_I...
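A sketch of the usual workaround: instead of changing ownership, grant your team CAN_MANAGE on the underlying pipeline, for example via the Databricks SDK for Python. The pipeline ID and group name below are illustrative assumptions:

```python
# Sketch: grant a group CAN_MANAGE on the Lakeflow pipeline so non-owners can
# manage it without changing table ownership. IDs/names are assumptions.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.iam import AccessControlRequest, PermissionLevel

w = WorkspaceClient()
w.permissions.update(
    request_object_type="pipelines",
    request_object_id="1234-abcd-pipeline-id",
    access_control_list=[
        AccessControlRequest(
            group_name="data-engineers",
            permission_level=PermissionLevel.CAN_MANAGE,
        )
    ],
)
```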
Hello everyone, in our implementation of the Medallion Architecture, we want to stream changes with Spark Structured Streaming. I would like some advice on how to use a Delta table as a source correctly, and whether there is a performance (memory usage) concern in t...
In your scenario using Medallion Architecture with Delta tables as both streaming source and sink, it is important to understand Spark Structured Streaming behavior and performance characteristics, especially with joins and memory usage. Here is a di...
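A minimal sketch of the pattern under discussion, with all table and checkpoint names assumed: reading a Delta table as a stream and joining it to a static dimension table. A stream-static join is stateless, since the static side is re-read each micro-batch, so memory pressure stays bounded compared with a stream-stream join:

```python
# Sketch: Delta table as streaming source, stream-static join into a Silver table.
# The static side (customers) is re-read per micro-batch; no join state is kept.
orders_stream = spark.readStream.format("delta").table("main.bronze.orders")
customers = spark.table("main.silver.customers")  # static dimension side

(
    orders_stream.join(customers, "customer_id", "left")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/silver_orders")
    .toTable("main.silver.orders_enriched")
)
```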