Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

mlivshutz
by New Contributor II
  • 3398 Views
  • 2 replies
  • 2 kudos

How to configure DAB bundles to run serverless

I am following the guidelines in https://docs.databricks.com/aws/en/dev-tools/bundles/jobs-tutorial to setup the job for serverless. It says to "omit the job_clusters configuration from the bundle configuration file." It sounds like the idea is to si...

Latest Reply
mlivshutz
New Contributor II
  • 2 kudos

Hi @ashraf1395, thank you for looking at my question. My CLI is 0.243, which is current as of today (3/17/25). The task definition within resources/dbx_backfill_emotion_job.yml: tasks: - task_key: dbx_backfill_base_fields_x_1 # job_...
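The pattern the tutorial describes can be sketched as a bundle resource with no `job_clusters` block. This is a minimal sketch, assuming a notebook task; the job, task, and path names below are hypothetical, not taken from the thread:

```yaml
# Hypothetical Databricks Asset Bundle resource for a serverless job.
# The key point from the tutorial: no job_clusters block and no
# job_cluster_key on the task, so the job runs on serverless compute.
resources:
  jobs:
    example_serverless_job:
      name: example_serverless_job
      tasks:
        - task_key: example_task
          notebook_task:
            notebook_path: ../src/example_notebook.ipynb
          # no job_cluster_key here -> serverless compute is used
```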

1 More Replies
noorbasha534
by Valued Contributor II
  • 2191 Views
  • 1 replies
  • 0 kudos

Databricks Jobs API - Throttling

Dear all, I am planning to execute a script that fetches Databricks job statuses every 10 minutes. I have around 500 jobs in my workspace. The APIs I use are listed below: list runs, get all job runs. I was wondering if this could cause throttling as t...

Latest Reply
koji_kawamura
Databricks Employee
  • 0 kudos

Hi @noorbasha534, different rate limits are enforced per API endpoint. "/jobs/runs/list" is limited to 30 requests/second, and the number of concurrent task executions is capped at 2000. These limits work separately, so the job list API ...
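One common way to stay under a per-endpoint rate limit when polling is client-side exponential backoff. This is a minimal sketch, not Databricks-specific: the `RateLimited` exception and the helper name are hypothetical, and you would raise `RateLimited` yourself when your HTTP client sees a 429 response from a call such as `GET /api/2.1/jobs/runs/list`.

```python
import time

class RateLimited(Exception):
    """Raised by the caller when the server answers HTTP 429 (hypothetical)."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait with exponential backoff and retry.

    A sketch of the client-side throttling you would wrap around a
    polling call like listing job runs.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    raise RuntimeError("still rate-limited after %d retries" % max_retries)
```

With 500 jobs polled every 10 minutes, spacing the list calls out (or batching with the API's paging) plus a wrapper like this should keep you well under a 30 requests/second limit.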

brickster_2018
by Databricks Employee
  • 13098 Views
  • 6 replies
  • 3 kudos
Latest Reply
VasuBajaj
New Contributor II
  • 3 kudos

A .CRC file (Cyclic Redundancy Check) is an internal checksum file used by Spark (and Hadoop) to ensure data integrity when reading and writing files. Data Integrity Check – .CRC files store checksums of the actual data files. When reading a file, Spark/H...
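The integrity-check idea can be illustrated with Python's standard-library CRC-32. This is only a sketch of the concept, assuming a plain byte payload; Hadoop's actual .crc sidecar format chunks the file and stores multiple checksums, which this does not reproduce.

```python
import zlib

def crc32_of(data: bytes) -> int:
    """Return the CRC-32 checksum of a byte string, masked to 32 bits."""
    return zlib.crc32(data) & 0xFFFFFFFF

payload = b"example data file contents"
stored = crc32_of(payload)          # what a .crc sidecar would record
assert crc32_of(payload) == stored  # unchanged data verifies

corrupted = payload + b"!"          # any modification changes the checksum
assert crc32_of(corrupted) != stored
```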

5 More Replies
BricksGuy
by New Contributor III
  • 1475 Views
  • 1 replies
  • 0 kudos

DLT Pipeline OOM issue

Hi, I am seeing performance issues in one of my pipelines: it now takes 5 hours to run even with no data, where it previously took 1 hour. It seems that as the volume of the source grows, performance keeps degrading. I have the setup below. Source i...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi BricksGuy, how are you doing today? As per my understanding, it looks like your pipeline is slowing down because it's processing too many small Parquet files (over 10 million), which causes high metadata overhead and memory issues. Since Spark ha...

lmorrissey
by New Contributor II
  • 979 Views
  • 3 replies
  • 0 kudos

Unable to connect to mongodb spark connector on a shared cluster

The connector works without issue if the cluster is made private; does anyone know why this is, or have a workaround (besides spawning a bunch of private clusters)?

Latest Reply
dewman
New Contributor II
  • 0 kudos

Any news on this? I too am having issues where a dedicated cluster can read from MongoDB with no problem, but as soon as I try to run the notebook on a shared cluster, I get a ConflictType error (of class com.mongodb.spark.sql.types.ConflictTypes).

2 More Replies
Soumik
by Databricks Partner
  • 4340 Views
  • 2 replies
  • 1 kudos

#N/A value is coming as null/NaN while using pandas.read_excel

Hi all, I am trying to read an input_file.xlsx file using pandas.read_excel. I am using the option below: import pandas as pd; df = pd.read_excel(input_file, sheetname=sheetname, dtype=str, na_filter=False, keep_default_na=False). Not sure, but the va...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 1 kudos

Hi Soumik, how are you doing today? As per my understanding, it looks like pandas is still treating #N/A as a missing value because Excel considers it a special type of NA. Even though you've set na_filter=False and keep_default_na=False, pandas might...
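For a text source, pandas' NA options do behave as expected; `read_csv` is used below only to keep the example self-contained (it shares the same NA-handling parameters as `read_excel`). Note the caveat, which may explain the thread: in an .xlsx file, #N/A can be a native Excel *error value* rather than a string, and the Excel engine may convert it before pandas' options ever apply.

```python
import io
import pandas as pd

# With keep_default_na=False (and no na_values given), nothing is
# auto-converted to NaN, so the literal string "#N/A" survives.
csv = io.StringIO("code,value\nA1,#N/A\nA2,42\n")
df = pd.read_csv(csv, dtype=str, keep_default_na=False)
print(df["value"].tolist())  # ['#N/A', '42']
```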

1 More Replies
dyusuf
by New Contributor II
  • 1904 Views
  • 2 replies
  • 0 kudos

Data Skewness

I am trying to visualize data skewness through a simple aggregation example by performing a groupby operation on a dataframe. The data is highly skewed for one customer, yet Databricks balances it automatically when I check the Spark UI. Is there a...

Latest Reply
SantoshJoshi
New Contributor III
  • 0 kudos

Hi @dyusuf, it could be because AQE (Adaptive Query Execution) is enabled. AQE dynamically handles skew. Please refer to the link below for more details: https://docs.databricks.com/aws/en/optimizations/aqe. Can you please disable AQE and check if this wor...
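To observe the raw skew in the Spark UI, the suggestion above amounts to a couple of session-level conf toggles. This is a sketch assuming an existing SparkSession named `spark`; remember to re-enable AQE afterwards, since it is usually beneficial:

```python
# Temporarily disable AQE so skewed partitions are not rebalanced
# automatically (assumes a live SparkSession bound to `spark`).
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "false")
```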

1 More Replies
Kutbuddin
by New Contributor III
  • 1515 Views
  • 1 replies
  • 0 kudos

Random failures with serverless compute running dbt jobs

We recently encountered the issue below, where a Databricks job configured to run a dbt task on serverless compute and a warehouse failed due to a Python dependency failure: run failed with error message Library installation failed: Library installation ...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi, how are you doing today? As per my understanding, this kind of random failure is usually due to network issues, temporary package repository problems, or how serverless compute handles dependencies. Since serverless clusters are short-lived and s...

SObiero
by New Contributor
  • 1723 Views
  • 2 replies
  • 0 kudos

Databricks App Error libodbc.so.2: cannot open shared object file: No such file or directory

How do I solve this error in my Databricks Apps when using the pyodbc library? I have used an init script to install the library in my cluster, which has resolved the issue in Notebooks. However, the problem persists in Apps. I have used the followin...

Latest Reply
mourinseoexpart
New Contributor II
  • 0 kudos

Your analysis is spot on! The issue likely stems from environment differences between Notebooks and Apps. Checking cluster consistency, verifying and setting should help. Also, reviewing permissions and alternative installations might be necessary. L...

1 More Replies
NathanE
by New Contributor II
  • 10687 Views
  • 7 replies
  • 10 kudos

Java 21 support with Databricks JDBC driver

Hello, I was wondering if there is any timeline for Java 21 support in the Databricks JDBC driver (current version is 2.34). One of the required changes is updating the Arrow dependency to version 13.0 (current version is 9.0.0). The current worka...

Data Engineering
driver
java21
JDBC
Latest Reply
yunbodeng
Databricks Employee
  • 10 kudos

You can download the latest Databricks JDBC here (https://www.databricks.com/spark/jdbc-drivers-download) in which the latest Arrow version is 17 (https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/simbaspark-drivers/jdbc/2.7.1/docs/release-n...

6 More Replies
jrod123
by New Contributor II
  • 2344 Views
  • 4 replies
  • 1 kudos

Simple append for a DLT

Looking for some help getting unstuck re: appending to DLTs in Databricks. I have successfully extracted data via API endpoint, done some initial data cleaning/processing, and subsequently stored that data in a DLT. Great start. But I noticed that ea...

Latest Reply
tastefulSamurai
New Contributor II
  • 1 kudos

I am likewise struggling with this. All DLT configurations that I've tried (including spark_conf={"pipelines.autoOptimize.appendOnly": "true"}) just yield overwrites of the existing data. Any luck @jrod123 

3 More Replies
AyushModi038
by New Contributor III
  • 14435 Views
  • 8 replies
  • 10 kudos

Library installation in cluster taking a long time

I am trying to install the "pycaret" library on a cluster using a whl file, but it sometimes creates dependency conflicts (not always; sometimes it works too). My questions are: 1 - How to install libraries on a cluster only a single time (maybe from ...

Latest Reply
Spencer_Kent
New Contributor III
  • 10 kudos

@Retired_mod What about question #1, which is what subsequent comments in this thread have been referring to? To recap the question: is it possible for "cluster-installed" libraries to be cached in such a way that they aren't completely reinstalled ev...

7 More Replies
shervinmir
by New Contributor II
  • 6708 Views
  • 4 replies
  • 0 kudos

Using user-assigned managed identity inside notebook

Hi team, I am interested in using a user-assigned managed identity within my notebook. I've come across examples using system-assigned managed identities or leveraging the Access Connector for Azure Databricks via Unity Catalog. However, as I do not h...

Latest Reply
shervinmir
New Contributor II
  • 0 kudos

Hi team, just wondering if anyone has any suggestions. We are still unable to use a user-assigned managed identity inside a notebook in Databricks to connect to external Gen2 storage.

3 More Replies
Prashanth24
by New Contributor III
  • 15453 Views
  • 8 replies
  • 4 kudos

Resolved! Difference between Liquid clustering and Z-ordering

I am trying to understand the difference between liquid clustering and Z-ordering. As per my understanding, both store the clustering information in ZCubes, which are 100 GB in size. Liquid clustering maintains the ZCube id in the transaction log, so when opti...

Latest Reply
canadiandataguy
New Contributor III
  • 4 kudos

I have built a decision tree on how to think about it https://www.canadiandataguy.com/p/optimizing-delta-lake-tables-liquid?triedRedirect=true
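At the DDL level, the practical difference in the question above is where the clustering keys live. This is a sketch with hypothetical table and column names: liquid clustering is declared once on the table and `OPTIMIZE` reclusters incrementally, while Z-order keys must be restated on every `OPTIMIZE`, which rewrites files.

```sql
-- Liquid clustering: keys declared once on the table;
-- a plain OPTIMIZE then reclusters incrementally.
CREATE TABLE sales (order_id BIGINT, region STRING, order_date DATE)
CLUSTER BY (region, order_date);
OPTIMIZE sales;

-- Z-ordering: keys re-specified on each OPTIMIZE run.
OPTIMIZE sales_legacy ZORDER BY (region, order_date);
```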

7 More Replies
alextc77
by New Contributor II
  • 4595 Views
  • 3 replies
  • 2 kudos

Resolved! Unable to Access Unity Catalog from Cluster in Azure Databricks while Serverless Works

I can't access tables and volumes in Unity Catalog using a cluster in Azure Databricks, although it works with serverless. Why is this the case? ※As for the cluster, the summary displayed the "UnityCatalog" UC badge, and the access mode (data_se...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 2 kudos

Thanks for the updates, Alex. Good day.

2 More Replies