Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

mlivshutz
by New Contributor II
  • 3398 Views
  • 2 replies
  • 2 kudos

How to configure DAB bundles to run serverless

I am following the guidelines in https://docs.databricks.com/aws/en/dev-tools/bundles/jobs-tutorial to setup the job for serverless. It says to "omit the job_clusters configuration from the bundle configuration file." It sounds like the idea is to si...

Latest Reply
mlivshutz
New Contributor II
  • 2 kudos

Hi @ashraf1395, thank you for looking at my question. My CLI is 0.243, which is current as of today (3/17/25). The task definition within resources/dbx_backfill_emotion_job.yml: tasks: - task_key: dbx_backfill_base_fields_x_1 # job_...
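The pattern the tutorial describes can be sketched as a bundle resource with no `job_clusters` block. This is a minimal sketch, assuming a notebook task; the job, task, and path names below are hypothetical, not taken from the thread:

```yaml
# Hypothetical Databricks Asset Bundle resource for a serverless job.
# The key point from the tutorial: no job_clusters block and no
# job_cluster_key on the task, so the job runs on serverless compute.
resources:
  jobs:
    example_serverless_job:
      name: example_serverless_job
      tasks:
        - task_key: example_task
          notebook_task:
            notebook_path: ../src/example_notebook.ipynb
          # no job_cluster_key here -> serverless compute is used
```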

1 More Replies
noorbasha534
by Valued Contributor II
  • 2191 Views
  • 1 replies
  • 0 kudos

Databricks Jobs API - Throttling

Dear all, I am planning to execute a script that fetches Databricks job statuses every 10 minutes. I have around 500 jobs in my workspace. The APIs I use are listed below: list runs, get all job runs. I was wondering if this could cause throttling as t...

Latest Reply
koji_kawamura
Databricks Employee
  • 0 kudos

Hi @noorbasha534, different rate limits are enforced per API endpoint. "/jobs/runs/list" is limited to 30 requests/second, and the number of concurrent task executions is capped at 2000. These limits work separately, so the job list API ...
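One common way to stay under a per-endpoint rate limit when polling is client-side exponential backoff. This is a minimal sketch, not Databricks-specific: the `RateLimited` exception and the helper name are hypothetical, and you would raise `RateLimited` yourself when your HTTP client sees a 429 response from a call such as `GET /api/2.1/jobs/runs/list`.

```python
import time

class RateLimited(Exception):
    """Raised by the caller when the server answers HTTP 429 (hypothetical)."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait with exponential backoff and retry.

    A sketch of the client-side throttling you would wrap around a
    polling call like listing job runs.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    raise RuntimeError("still rate-limited after %d retries" % max_retries)
```

With 500 jobs polled every 10 minutes, spacing the list calls out (or batching with the API's paging) plus a wrapper like this should keep you well under a 30 requests/second limit.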

brickster_2018
by Databricks Employee
  • 13098 Views
  • 6 replies
  • 3 kudos
Latest Reply
VasuBajaj
New Contributor II
  • 3 kudos

A .CRC file (Cyclic Redundancy Check) is an internal checksum file used by Spark (and Hadoop) to ensure data integrity when reading and writing files. Data Integrity Check – .CRC files store checksums of the actual data files. When reading a file, Spark/H...
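The integrity-check idea can be illustrated with Python's standard-library CRC-32. This is only a sketch of the concept, assuming a plain byte payload; Hadoop's actual .crc sidecar format chunks the file and stores multiple checksums, which this does not reproduce.

```python
import zlib

def crc32_of(data: bytes) -> int:
    """Return the CRC-32 checksum of a byte string, masked to 32 bits."""
    return zlib.crc32(data) & 0xFFFFFFFF

payload = b"example data file contents"
stored = crc32_of(payload)          # what a .crc sidecar would record
assert crc32_of(payload) == stored  # unchanged data verifies

corrupted = payload + b"!"          # any modification changes the checksum
assert crc32_of(corrupted) != stored
```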

5 More Replies
BricksGuy
by New Contributor III
  • 1475 Views
  • 1 replies
  • 0 kudos

DLT Pipeline OOM issue

Hi, I am seeing performance issues in one of my pipelines: it now takes 5 hours to run even with no data, where it previously took 1 hour. It seems that as the volume of the source grows, performance keeps degrading. I have the setup below. Source i...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi BricksGuy, how are you doing today? As per my understanding, it looks like your pipeline is slowing down because it's processing too many small Parquet files (over 10 million), which causes high metadata overhead and memory issues. Since Spark ha...

lmorrissey
by New Contributor II
  • 979 Views
  • 3 replies
  • 0 kudos

Unable to connect to mongodb spark connector on a shared cluster

The connector works without issue if the cluster is made private; does anyone know why this is, or have a workaround (besides spawning a bunch of private clusters)?

Latest Reply
dewman
New Contributor II
  • 0 kudos

Any news on this? I too am having issues where a dedicated cluster can read from MongoDB with no problem, but as soon as I try to run the notebook on a shared cluster, I get a ConflictType error (of class com.mongodb.spark.sql.types.ConflictTypes).

2 More Replies
Soumik
by Databricks Partner
  • 4340 Views
  • 2 replies
  • 1 kudos

#N/A value is coming as null/NaN while using pandas.read_excel

Hi all, I am trying to read an input_file.xlsx file using pandas.read_excel. I am using the option below: import pandas as pd; df = pd.read_excel(input_file, sheetname=sheetname, dtype=str, na_filter=False, keep_default_na=False). Not sure, but the va...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 1 kudos

Hi Soumik, how are you doing today? As per my understanding, it looks like pandas is still treating #N/A as a missing value because Excel considers it a special type of NA. Even though you've set na_filter=False and keep_default_na=False, pandas might...
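For a text source, pandas' NA options do behave as expected; `read_csv` is used below only to keep the example self-contained (it shares the same NA-handling parameters as `read_excel`). Note the caveat, which may explain the thread: in an .xlsx file, #N/A can be a native Excel *error value* rather than a string, and the Excel engine may convert it before pandas' options ever apply.

```python
import io
import pandas as pd

# With keep_default_na=False (and no na_values given), nothing is
# auto-converted to NaN, so the literal string "#N/A" survives.
csv = io.StringIO("code,value\nA1,#N/A\nA2,42\n")
df = pd.read_csv(csv, dtype=str, keep_default_na=False)
print(df["value"].tolist())  # ['#N/A', '42']
```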

1 More Replies
dyusuf
by New Contributor II
  • 1904 Views
  • 2 replies
  • 0 kudos

Data Skewness

I am trying to visualize data skewness through a simple aggregation example by performing a groupby operation on a dataframe. The data is highly skewed for one customer, yet Databricks balances it automatically when I check the Spark UI. Is there a...

Latest Reply
SantoshJoshi
New Contributor III
  • 0 kudos

Hi @dyusuf, it could be because AQE (Adaptive Query Execution) is enabled. AQE dynamically handles skew. Please refer to the link below for more details: https://docs.databricks.com/aws/en/optimizations/aqe. Can you please disable AQE and check if this wor...
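To observe the raw skew in the Spark UI, the suggestion above amounts to a couple of session-level conf toggles. This is a sketch assuming an existing SparkSession named `spark`; remember to re-enable AQE afterwards, since it is usually beneficial:

```python
# Temporarily disable AQE so skewed partitions are not rebalanced
# automatically (assumes a live SparkSession bound to `spark`).
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "false")
```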

1 More Replies
Kutbuddin
by New Contributor III
  • 1515 Views
  • 1 replies
  • 0 kudos

Random failures with serverless compute running dbt jobs

We recently encountered the issue below, where a Databricks job configured to run a dbt task on serverless compute and a warehouse failed due to a Python dependency failure: run failed with error message Library installation failed: Library installation ...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi, how are you doing today? As per my understanding, this kind of random failure is usually due to network issues, temporary package repository problems, or how serverless compute handles dependencies. Since serverless clusters are short-lived and s...

SObiero
by New Contributor
  • 1723 Views
  • 2 replies
  • 0 kudos

Databricks App Error libodbc.so.2: cannot open shared object file: No such file or directory

How do I solve this error in my Databricks Apps when using the pyodbc library? I have used an init script to install the library in my cluster, which has resolved the issue in Notebooks. However, the problem persists in Apps. I have used the followin...

Latest Reply
mourinseoexpart
New Contributor II
  • 0 kudos

Your analysis is spot on! The issue likely stems from environment differences between Notebooks and Apps. Checking cluster consistency, verifying and setting should help. Also, reviewing permissions and alternative installations might be necessary. L...

1 More Replies
NathanE
by New Contributor II
  • 10687 Views
  • 7 replies
  • 10 kudos

Java 21 support with Databricks JDBC driver

Hello, I was wondering if there is any timeline for Java 21 support in the Databricks JDBC driver (current version is 2.34). One of the required changes is updating the Arrow dependency to version 13.0 (current version is 9.0.0). The current worka...

Data Engineering
driver
java21
JDBC
Latest Reply
yunbodeng
Databricks Employee
  • 10 kudos

You can download the latest Databricks JDBC here (https://www.databricks.com/spark/jdbc-drivers-download) in which the latest Arrow version is 17 (https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/simbaspark-drivers/jdbc/2.7.1/docs/release-n...

6 More Replies
jrod123
by New Contributor II
  • 2344 Views
  • 4 replies
  • 1 kudos

Simple append for a DLT

Looking for some help getting unstuck re: appending to DLTs in Databricks. I have successfully extracted data via API endpoint, done some initial data cleaning/processing, and subsequently stored that data in a DLT. Great start. But I noticed that ea...

Latest Reply
tastefulSamurai
New Contributor II
  • 1 kudos

I am likewise struggling with this. All DLT configurations that I've tried (including spark_conf={"pipelines.autoOptimize.appendOnly": "true"}) just yield overwrites of the existing data. Any luck @jrod123 

3 More Replies
AyushModi038
by New Contributor III
  • 14435 Views
  • 8 replies
  • 10 kudos

Library installation in cluster taking a long time

I am trying to install the "pycaret" library on a cluster using a whl file, but it sometimes creates dependency conflicts (not always; sometimes it works too). My questions are: 1 - How to install libraries on a cluster only a single time (maybe from ...

Latest Reply
Spencer_Kent
New Contributor III
  • 10 kudos

@Retired_mod What about question #1, which is what subsequent comments in this thread have been referring to? To recap the question: is it possible for "cluster-installed" libraries to be cached in such a way that they aren't completely reinstalled ev...

7 More Replies
shervinmir
by New Contributor II
  • 6708 Views
  • 4 replies
  • 0 kudos

Using user-assigned managed identity inside notebook

Hi team, I am interested in using a user-assigned managed identity within my notebook. I've come across examples using system-assigned managed identities or leveraging the Access Connector for Azure Databricks via Unity Catalog. However, as I do not h...

Latest Reply
shervinmir
New Contributor II
  • 0 kudos

Hi team, just wondering if anyone has any suggestions. We are still unable to use a user-assigned managed identity inside a notebook in Databricks to connect to external Gen2 storage.

3 More Replies
Prashanth24
by New Contributor III
  • 15453 Views
  • 8 replies
  • 4 kudos

Resolved! Difference between Liquid clustering and Z-ordering

I am trying to understand the difference between liquid clustering and Z-ordering. As per my understanding, both store the clustering information in ZCubes, which are 100 GB in size. Liquid clustering maintains the ZCube id in the transaction log, so when opti...

Latest Reply
canadiandataguy
New Contributor III
  • 4 kudos

I have built a decision tree on how to think about it https://www.canadiandataguy.com/p/optimizing-delta-lake-tables-liquid?triedRedirect=true
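At the DDL level, the practical difference in the question above is where the clustering keys live. This is a sketch with hypothetical table and column names: liquid clustering is declared once on the table and `OPTIMIZE` reclusters incrementally, while Z-order keys must be restated on every `OPTIMIZE`, which rewrites files.

```sql
-- Liquid clustering: keys declared once on the table;
-- a plain OPTIMIZE then reclusters incrementally.
CREATE TABLE sales (order_id BIGINT, region STRING, order_date DATE)
CLUSTER BY (region, order_date);
OPTIMIZE sales;

-- Z-ordering: keys re-specified on each OPTIMIZE run.
OPTIMIZE sales_legacy ZORDER BY (region, order_date);
```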

7 More Replies
alextc77
by New Contributor II
  • 4595 Views
  • 3 replies
  • 2 kudos

Resolved! Unable to Access Unity Catalog from Cluster in Azure Databricks while Serverless Works

I can't access tables and volumes in Unity Catalog using a cluster in Azure Databricks, although it works with serverless. Why is this the case? ※As for the cluster, the summary displayed the "UnityCatalog" UC badge, and the access mode (data_se...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 2 kudos

Thanks for the updates, Alex. Good day.

2 More Replies