Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Charansai
by New Contributor III
  • 695 Views
  • 1 reply
  • 0 kudos

How to use serverless clusters in DAB deployments with Unity Catalog in private network?

Hi everyone, I’m deploying Jobs and Pipelines using Databricks Asset Bundles (DAB) in an Azure Databricks workspace configured with private networking. I’m trying to use serverless compute for some workloads, but I’m running into issues when Unity Cat...

Latest Reply
Coffee77
Honored Contributor II
  • 0 kudos

A lot of questions! Concerning usage of serverless clusters in databricks.yml, and assuming you're using those clusters in jobs: you must define them in the job definition. Take a look here: https://github.com/databricks/bundle-examples/tree/main/know...
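A minimal sketch of what such a serverless job definition might look like in a bundle's databricks.yml (the resource name, task key, and notebook path below are invented for illustration; see the linked bundle-examples repo for authoritative samples):

```yaml
# Hypothetical databricks.yml fragment: a job task runs on serverless
# compute when no job_cluster_key / new_cluster is specified for it.
resources:
  jobs:
    my_serverless_job:
      name: my-serverless-job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main_notebook.ipynb
          # no cluster settings here -> task runs on serverless compute
```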

intelliconnectq
by New Contributor III
  • 408 Views
  • 2 replies
  • 0 kudos

Resolved! Loading CSV from private S3 bucket

Trying to load a CSV file from a private S3 bucket. Please clarify the requirements to do this: can I do it in Community Edition (if yes, then how)? How do I do it in the premium version? I have an IAM role, and I also have an access key & secret.

Latest Reply
Coffee77
Honored Contributor II
  • 0 kudos

Assuming you have these pre-requisites:
  • A private S3 bucket (e.g., s3://my-private-bucket/data/file.csv)
  • An IAM user or role with access (list/get) to that bucket
  • The AWS Access Key ID and Secret Access Key (client and secret)
The most straightforward w...
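A minimal sketch of the key/secret approach (not the poster's exact code; the bucket name and path are the example ones from the reply, and the applying-on-a-cluster part is commented out because it needs a live SparkSession):

```python
def s3a_conf(access_key: str, secret_key: str) -> dict:
    """Hadoop s3a settings for key/secret auth to a private S3 bucket."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }

# On a Databricks cluster you would apply it roughly like this:
# for k, v in s3a_conf(ACCESS_KEY, SECRET_KEY).items():
#     spark.conf.set(k, v)
# df = spark.read.csv("s3a://my-private-bucket/data/file.csv", header=True)
```

For production, prefer an instance profile or Unity Catalog external location over hard-coded keys.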

1 More Replies
Hubert-Dudek
by Databricks MVP
  • 28237 Views
  • 14 replies
  • 12 kudos

Resolved! dbutils or other magic way to get notebook name or cell title inside notebook cell

Not sure it exists, but maybe there is some trick to get directly from Python code: NotebookName, CellTitle. Just working on some logger script shared between notebooks, and it could make my life a bit easier.

Latest Reply
rtullis
New Contributor II
  • 12 kudos

I got the solution to work in terms of printing the notebook that I was running; however, what if you have notebook A that calls a function that prints the notebook name, and you run notebook B that %runs notebook A? I get notebook B's name when...
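A small sketch of the pattern being discussed. The notebook-context accessor shown in the comment is the commonly cited one but is an internal API and may vary by runtime (an assumption); the helper itself is plain string handling:

```python
# On Databricks the notebook path is typically obtained from the context:
# path = (dbutils.notebook.entry_point.getDbutils().notebook()
#         .getContext().notebookPath().get())
# Under %run, the context reports the *calling* notebook (B), which matches
# the behaviour described in the reply above.

def notebook_name(path: str) -> str:
    """Extract the notebook name from a workspace path."""
    return path.rstrip("/").rsplit("/", 1)[-1]
```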

13 More Replies
kahrees
by New Contributor II
  • 913 Views
  • 3 replies
  • 4 kudos

Resolved! DATA_SOURCE_NOT_FOUND Error with MongoDB (Suggestions in other similar posts have not worked)

I am trying to load data from MongoDB into Spark. I am using the Community/Free version of Databricks, so my Jupyter notebook is in a Chrome browser. Here is my code: from pyspark.sql import SparkSession spark = SparkSession.builder \ .config("spar...

Latest Reply
K_Anudeep
Databricks Employee
  • 4 kudos

Hey @kahrees, good day! I tested this internally, and I was able to reproduce the issue. Screenshot below. You’re getting [DATA_SOURCE_NOT_FOUND] ... mongodb because the MongoDB Spark connector jar isn’t actually on your cluster’s classpath. On D...
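A hedged sketch of the usual fix: install the MongoDB Spark connector as a Maven library on the cluster. The version below is illustrative only; match it to your Spark/Scala version, and the connection options are placeholders:

```python
# Illustrative Maven coordinates for the v10.x connector (verify the version
# against your cluster's Spark/Scala version before using).
MONGO_PACKAGE = "org.mongodb.spark:mongo-spark-connector_2.12:10.4.0"

# On Databricks: Cluster > Libraries > Install new > Maven > paste the
# coordinates above. Then reads can use the connector's short name:
# df = (spark.read.format("mongodb")
#       .option("connection.uri", "<your-connection-uri>")
#       .option("database", "<db>")
#       .option("collection", "<coll>")
#       .load())
```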

2 More Replies
eyalholzmann
by Databricks Partner
  • 852 Views
  • 3 replies
  • 2 kudos

Resolved! Does VACUUM on Delta Lake also clean Iceberg metadata when using Iceberg Uniform feature?

I'm working with Delta tables using the Iceberg Uniform feature to enable Iceberg-compatible reads. I’m trying to understand how metadata cleanup works in this setup. Specifically, does the VACUUM operation—which removes old Delta Lake metadata based ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

Here’s how to approach cleaning and maintaining Apache Iceberg metadata on Databricks, and how it differs from Delta workflows. First, know your table type: for Unity Catalog–managed Iceberg tables, Databricks runs table maintenance for you (predicti...

2 More Replies
pooja_bhumandla
by Databricks Partner
  • 862 Views
  • 1 reply
  • 1 kudos

Resolved! Should I enable Liquid Clustering based on table size distribution?

Hi everyone, I’m evaluating whether Liquid Clustering would be beneficial for my tables based on their sizes. Below is the size distribution of tables in my environment:
  • Large (> 1 TB): 3
  • Medium (10 GB – 1 TB): 284
  • Small (< 10 GB): 17,26...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @pooja_bhumandla! Based on your size distribution, enabling Liquid Clustering can provide meaningful gains—but you’ll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where q...

bidek56
by Contributor
  • 859 Views
  • 5 replies
  • 1 kudos

Resolved! Location of spark.scheduler.allocation.file

In DBR 16.4 LTS, I am trying to add the following Spark config: spark.scheduler.allocation.file: file:/Workspace/init/fairscheduler.xml. But the all-purpose cluster is throwing this error: Spark error: Driver down cause: com.databricks.backend.daemon.dri...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Here are some solutions without using DBFS. Yes, there are options for using the Spark scheduler allocation file on Databricks without DBFS, but they are limited and depend on your environment and access controls. Alternatives to DBFS for Schedu...
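For reference, a minimal fair-scheduler pool file in the standard Apache Spark format, which is what spark.scheduler.allocation.file must point at (pool name, weight, and minShare below are example values only):

```xml
<?xml version="1.0"?>
<!-- Minimal fairscheduler.xml; one pool with FAIR scheduling. -->
<allocations>
  <pool name="etl">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
```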

4 More Replies
Yuki
by Contributor
  • 769 Views
  • 4 replies
  • 1 kudos

Resolved! Is there any way to run jobs from GitHub Actions and catch the results?

Hi all, is there any way to run jobs from GitHub Actions and catch the results? Of course, I can do this if I use the API or CLI. But I found the actions for notebook: https://github.com/marketplace/actions/run-databricks-notebook Compared to this, wri...

Latest Reply
Yuki
Contributor
  • 1 kudos

OK, thank you for your advice. I will consider using asset bundles for this.

3 More Replies
Naveenkumar1811
by New Contributor III
  • 500 Views
  • 2 replies
  • 0 kudos

What is the best practice for maintaining a Delta table loaded via streaming?

Hi Team, we have our Bronze (append), Silver (append) and Gold (merge) tables loaded using Spark streaming continuously with trigger as processing time (3 secs). We also run our maintenance jobs on the tables, like OPTIMIZE and VACUUM, and we perform DELETE for som...

Latest Reply
Naveenkumar1811
New Contributor III
  • 0 kudos

Hi Mark, but the real problem is our streaming job runs 24x7, 365 days a year, and we can't afford any further latency to our data flowing to the gold layer. We don't have any window to pause or slow our streaming, and we continuously get the data feed actually s...

1 More Replies
hidden
by New Contributor II
  • 255 Views
  • 1 reply
  • 0 kudos

DLT parameterization from job parameters

I have created a DLT pipeline notebook which creates tables based on a config file that has the configuration of the tables that need to be created. Now what I want is to run my pipeline every 30 min for 4 tables from the config and every 3 hours...

Latest Reply
Coffee77
Honored Contributor II
  • 0 kudos

Define "parameters" in the job as usual and then try to capture them in DLT using code similar to this: dlt.conf.get("PARAMETER_NAME", "PARAMETER_DEFAULT_VALUE"). It should get the parameter value from the job if one exists; otherwise it'll set the defau...
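The fallback pattern the reply describes can be sketched with a plain mapping so it runs anywhere; inside a pipeline you would read from the pipeline/Spark conf object instead of a dict (the exact accessor depends on your runtime, so treat this as an assumption):

```python
def get_param(conf, name: str, default: str) -> str:
    """Return the job-supplied parameter value, or the default if unset."""
    return conf.get(name, default)

# As if the job passed this parameter to the pipeline:
job_conf = {"schedule_group": "every_30_min"}
get_param(job_conf, "schedule_group", "hourly")   # -> "every_30_min"
get_param(job_conf, "missing_param", "hourly")    # -> "hourly"
```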

santosh_bhosale
by New Contributor
  • 305 Views
  • 2 replies
  • 1 kudos

Issue with Unity Catalog on Azure

When I create a Databricks workspace on Azure and try to log in at https://accounts.azuredatabricks.net/, it redirects to my workspace. Whereas on the Azure subscription I am the owner; I created this Azure subscription and the Databricks workspace is also cr...

Latest Reply
Coffee77
Honored Contributor II
  • 1 kudos

Clearly, you don't have "account admin" permissions. Try to click the workspace drop-down and then check if you can see and click "Manage Account" to confirm, BUT it is very likely you are not allowed to access it. You must be an Azure Global Adm...

1 More Replies
Allen123Maria_1
by New Contributor
  • 2457 Views
  • 2 replies
  • 0 kudos

Resolved! Optimizing Azure Functions for Performance and Cost with Variable Workloads

Hey, everyone!! I use Azure Functions in a project where the workloads change a lot. Sometimes it's quiet, and other times we get a lot of traffic. Azure Functions is very scalable, but I've had some trouble with cold starts and keeping costs down. I'm ...

Latest Reply
susanrobert3
New Contributor II
  • 0 kudos

Hey!!! Cold starts on Azure Functions Premium can still bite if your instances go idle long enough — even with pre-warmed instances. What usually helps is bumping the `preWarmedInstanceCount` to at least 1 per plan (so there’s always a warm worker), an...

1 More Replies
wkgcls
by New Contributor II
  • 741 Views
  • 2 replies
  • 2 kudos

Resolved! DQX usage outside Databricks

Hello, When evaluating data quality frameworks for PySpark pipelines, I came across DQX. I noticed it's available on PyPI (databricks-labs-dqx) and GitHub, which is great for accessibility.However, I'm trying to understand the licensing requirements....

Latest Reply
wkgcls
New Contributor II
  • 2 kudos

Thanks a lot for the quick response, @ManojkMohan! This was very helpful. I'll keep this in mind.

1 More Replies
liquibricks
by Databricks Partner
  • 658 Views
  • 3 replies
  • 2 kudos

Resolved! Moving tables between pipelines in production

We are testing an ingestion from Kafka to Databricks using a streaming table. The streaming table was created by a DAB deployed to "production", which runs as a service principal. This means the service principal is the "owner" of the table. We now wan...

Latest Reply
nayan_wylde
Esteemed Contributor II
  • 2 kudos

You’ve hit two limitations:
  • Streaming tables don’t allow SET OWNER – ownership cannot be changed.
  • Lakeflow pipeline ID changes require pipeline-level permissions – if you’re not the pipeline owner, you can’t run ALTER STREAMING TABLE ... SET PIPELINE_I...

2 More Replies
Suheb
by Contributor
  • 1806 Views
  • 4 replies
  • 3 kudos

When working with large data sets in Databricks, what are best practices to avoid out-of-memory errors?

How can I optimize Databricks to handle large datasets without running into memory or performance problems?

Latest Reply
tarunnagar
Contributor
  • 3 kudos

Hey! Great question — I’ve run into this issue quite a few times while working with large datasets in Databricks, and out-of-memory errors can be a real headache. One of the biggest things that helps is making sure your cluster configuration matches ...

3 More Replies