Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

kenmyers-8451
by Contributor II
  • 606 Views
  • 3 replies
  • 1 kudos

Resolved! mode: development not working as expected

Hey, I'm trying to add mode: development to my "Development" target (which is the default), but it does not seem to be working as I expected. Here is what my targets file looks like: I'm deploying with this command: databricks272 bundle deploy -p dev3 -t De...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @kenmyers-8451, Glad you tracked this down. This is a common gotcha with Databricks Asset Bundles (DABs) when splitting configuration across multiple files: if the file containing your target definition (with mode: development) is not listed in th...
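The gotcha described above can be sketched in DAB configuration. This is an illustration only; the bundle, file, and target names are assumed, not taken from the thread:

```yaml
# databricks.yml (sketch; file and bundle names are assumed for illustration)
bundle:
  name: my_bundle

include:
  - resources/*.yml
  - targets.yml   # the file defining the Development target must be listed here,
                  # or its settings (including mode: development) are never loaded

# targets.yml
targets:
  Development:
    default: true
    mode: development
```

If the target-definition file is missing from `include`, the deploy can still appear to succeed while silently using defaults, which matches the symptom in the original post.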

2 More Replies
abhijit007
by Databricks Partner
  • 567 Views
  • 2 replies
  • 2 kudos

Resolved! Databricks App Issue: “socket hang up / ECONNRESET” when API call runs > 30 seconds

Problem Statement:We are running a Data App on Databricks that uses Next.js (frontend) and FastAPI (backend). The backend calls a Databricks Agent (AgentBricks) via a serving endpoint, which typically needs ~1 minute to return a response. However, an...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @abhijit007, Your debugging was thorough and you correctly isolated the issue: the timeout is happening upstream of your application code. Databricks Apps run behind a managed ingress/request router that enforces request-level timeouts (typically ...
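When a gateway enforces a hard request timeout, the usual workaround is to switch the slow endpoint to a "submit then poll" pattern: the initial request returns immediately with a job id, and the frontend polls for the result. A minimal sketch in plain Python (the function and variable names, and the in-memory job store, are illustrative stand-ins, not Databricks APIs; in a real app the slow call would be the serving-endpoint request and the two functions would be FastAPI routes):

```python
import threading
import time
import uuid

jobs = {}  # job_id -> {"status": ..., "result": ...} (in-memory store for the sketch)

def run_agent_call(prompt):
    """Hypothetical stand-in for the ~1 minute serving-endpoint call."""
    time.sleep(0.1)  # shortened so the sketch runs quickly
    return f"answer for: {prompt}"

def submit(prompt):
    """Return a job id immediately; do the slow work in a background thread."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}

    def worker():
        jobs[job_id]["result"] = run_agent_call(prompt)
        jobs[job_id]["status"] = "done"

    threading.Thread(target=worker, daemon=True).start()
    return job_id

def poll(job_id):
    """Cheap status check the frontend can call repeatedly, well under any timeout."""
    return jobs[job_id]

job = submit("summarize sales")
while poll(job)["status"] != "done":
    time.sleep(0.05)
```

Each individual HTTP round trip then stays far below the gateway's limit, regardless of how long the agent takes.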

1 More Replies
neerajaN
by New Contributor II
  • 512 Views
  • 4 replies
  • 2 kudos

Resolved! count function

Hi, as per Spark internals, once the count function is executed on the worker nodes, does one of the worker nodes collect all the record counts and do the summation? Or are the counts from all worker nodes passed to the driver node, with the summation done on the driver side...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @neerajaN, You are correct that the count() operation follows a two-phase aggregation pattern in Spark. Here is how it works in detail: PHASE 1: PARTIAL AGGREGATION (EXECUTORS) Each executor computes a local partial count for the partitions assign...
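The two-phase pattern described in the reply can be illustrated without Spark at all. In this toy sketch, each inner list stands in for the partitions handled by one executor; only the small partial counts travel to the "driver", never the rows themselves:

```python
from functools import reduce

# Rows spread across three hypothetical executors (illustrative data)
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Phase 1: partial aggregation - each executor counts its own partitions locally
partial_counts = [len(p) for p in partitions]

# Phase 2: final aggregation - the driver sums the small per-executor counts
total = reduce(lambda a, b: a + b, partial_counts)
print(total)  # 9
```

This is why count() is cheap on the network: the shuffle to the driver moves one integer per task, not the data.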

3 More Replies
Malthe
by Valued Contributor II
  • 434 Views
  • 3 replies
  • 0 kudos

Resolved! Python segmentation fault in serverless job

We're getting a Python segmentation fault in a serverless job that uses Delta Table merge inside a foreachBatch step in structured streaming (trigger once). /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/streaming/query.py:479: Us...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Malthe, Since you have confirmed this is vanilla PySpark with no external libraries on serverless runtime environment version 5, this narrows things down considerably. Here are some additional observations and recommendations beyond what Louis sh...

2 More Replies
NW1000
by New Contributor III
  • 500 Views
  • 3 replies
  • 1 kudos

Resolved! Unable to access files using a classic cluster

I used the same code with a classic cluster (Runtime 17.3 LTS ML, with Spark config: "spark.databricks.workspace.fileSystem.enabled true"), but I am not able to access files in the workspace with the following Python code: import os # Check if source exists and w...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @NW1000, This behavior comes down to how workspace file access and identity work differently between serverless compute and classic clusters. SERVERLESS COMPUTE Serverless interactive compute runs under your own identity. It inherits your workspac...

2 More Replies
Seunghyun
by Contributor
  • 601 Views
  • 3 replies
  • 2 kudos

Resolved! Conditional Logic in Databricks Asset Bundles using Go Templates

I am defining a job using Databricks Asset Bundles (DABs) as follows:YAML resources: jobs: job_name: ... schedule: {{ if eq ${var.env} "prd" }} pause_status: "UNPAUSED" {{ else }} pause_status: "PAUSE...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @Seunghyun, Go template syntax ({{if}}, {{eq}}, etc.) is only supported in bundle project templates, which are the .tmpl files used during "databricks bundle init" to scaffold new projects. It is not supported inside your regular databricks.yml co...
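The DABs-native alternative to a template conditional is a per-target override, which the reply points toward. A sketch, with the job name, cron expression, and target name assumed for illustration:

```yaml
# Sketch only: job, schedule, and target names are assumed, not from the thread.
resources:
  jobs:
    job_name:
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
        pause_status: "PAUSED"        # default for every non-prd target

targets:
  prd:
    resources:
      jobs:
        job_name:
          schedule:
            pause_status: "UNPAUSED"  # prd override - no Go templating needed
```

Target-level overrides are merged over the base job definition at deploy time, so the conditional logic lives in plain YAML rather than in template syntax that databricks.yml does not evaluate.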

2 More Replies
Seunghyun
by Contributor
  • 556 Views
  • 2 replies
  • 2 kudos

Resolved! Managing dashboard refresh schedules in DABs

I am currently using Databricks Asset Bundles (DABs) to deploy and manage dashboard resources. While I can manually add a schedule to a dashboard via the Databricks console, I would like to reflect this same configuration in the dashboard YAML file. ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @Seunghyun, You are correct that the dashboard resource definition in Databricks Asset Bundles does not currently include schedule-related properties. The dashboard resource supports properties like display_name, file_path, warehouse_id, embed_cre...

1 More Replies
FAHADURREHMAN
by New Contributor III
  • 458 Views
  • 3 replies
  • 2 kudos

Optimizing Large Materialized View to expedite query execution

Hi All, I have a DLT pipeline set up which reads Parquet files from an S3 bucket and creates a materialized view. The created view is quite big, containing billions of records and around a few TB of data. Predictive Optimization is already enabled. automat...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @FAHADURREHMAN, There are several layers to optimizing query performance on a multi-TB materialized view, and the other replies here cover the ingestion/refresh side well. Let me add some guidance on the query-side tuning and help you decide betwe...

2 More Replies
yit337
by Contributor
  • 511 Views
  • 2 replies
  • 1 kudos

Resolved! Stream to static join - late arriving records

I have a stream to static join, but some of the rows in the static table arrive later than the linked rows in the stream. What is the default behaviour if a record in the stream hasn't joined a record in the static table? Is it lost forever? How is thi...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @yit337, This is an important topic to understand, so let me walk through the mechanics in detail. HOW STREAM-STATIC JOINS WORK In a stream-static join, each micro-batch of streaming data is joined against the static DataFrame. The key behavior de...
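The core behavior the question asks about can be simulated in a few lines. In this toy sketch (plain Python, no Spark; the dict plays the role of the static table), each micro-batch joins against the static table's contents at that moment, so a stream row whose match arrives later is dropped in its batch and never retried:

```python
# Mutable "static" table: key -> value
static = {1: "a"}

def process_batch(batch):
    """Inner-join one micro-batch of stream keys against the current static table."""
    return [(k, static[k]) for k in batch if k in static]

out1 = process_batch([1, 2])  # key 2 has no static match yet -> silently dropped
static[2] = "b"               # the matching static row arrives late
out2 = process_batch([3])     # key 2 is NOT revisited in later batches
```

With an inner join this is data loss for late static rows; mitigations typically involve a left join plus downstream reprocessing of the unmatched rows, or making both sides streams so watermarked state handles lateness.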

1 More Replies
Innuendo84
by Databricks Partner
  • 480 Views
  • 3 replies
  • 1 kudos

Resolved! Fatal error: The Python kernel is unresponsive.

I'm having problems running Databricks with cv2. Every time I try to import cv2 I get this error. If I comment it out, the error disappears.

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Innuendo84, Glad you got this resolved. For anyone else who runs into this, here is some additional context on why this happens and how to avoid it. THE ROOT CAUSE The standard opencv-python package includes GUI components (highgui, GTK/Qt bindin...

2 More Replies
FAHADURREHMAN
by New Contributor III
  • 366 Views
  • 3 replies
  • 1 kudos

Resolved! DLT Auto Loader Reading from Parent S3 Folder not Sub Folders

Hi All, I am trying to read CSV files from one folder of an S3 bucket. For this particular use case, I do not intend to read from subfolders. I am using the below code, however it's reading all CSVs in subfolders as well. How can I avoid that? I used many ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @FAHADURREHMAN, This is expected behavior with Auto Loader. By default, when you point it at a directory path like s3://bucket/folder/, it will recursively traverse all subdirectories and pick up matching files. The pathGlobFilter option only filt...
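The distinction the reply draws can be illustrated without Spark. In this sketch (paths are made up for the example), a pathGlobFilter-style check matches only the file name after a recursive listing, so subfolder files still pass; restricting the path itself to one level, which is the effect commonly attributed to putting a glob in the load path such as `s3://bucket/folder/*.csv`, is what actually excludes them:

```python
from fnmatch import fnmatch

base = "s3://bucket/folder/"
files = [
    "s3://bucket/folder/a.csv",
    "s3://bucket/folder/sub/b.csv",   # lives in a subfolder
    "s3://bucket/folder/notes.txt",
]

# pathGlobFilter-style check: applied to the file NAME only, after the
# (recursive-by-default) listing, so the subfolder csv still matches.
glob_filtered = [f for f in files if fnmatch(f.rsplit("/", 1)[-1], "*.csv")]

# Path-level restriction: keep only csv files directly under the base folder.
top_level_only = [
    f for f in files
    if f.endswith(".csv") and "/" not in f[len(base):]
]
```

So the fix is to constrain the path being listed, not just the name filter applied afterwards.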

2 More Replies
dplatform_user
by New Contributor II
  • 373 Views
  • 2 replies
  • 1 kudos

Resolved! DEEP CLONE fails with [UNRESOLVED_ROUTINE] Cannot resolve routine isNotNull on DBR 16.4

Hi Databricks Community, I'm encountering an issue when attempting to DEEP CLONE a Delta table on DBR 16.4 that works fine on DBR 13.3. Error Message: [UNRESOLVED_ROUTINE] Cannot resolve routine `isNotNull` on search path [`system`.`builtin`, `system`...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @dplatform_user, This error occurs because of how NOT NULL constraints are internally represented in Delta table metadata. When a Delta table has NOT NULL columns, the Delta protocol stores these as CHECK constraints using expressions like isNotNu...

1 More Replies
Kirankumarbs
by Contributor
  • 437 Views
  • 2 replies
  • 0 kudos

Resolved! Databricks Spark UI showing -1 Executors

Hi Community, This might be a basic question, but I’m asking for educational purposes. I noticed that in one of my jobs, the Spark UI shows -1 executors. Initially, I thought this might indicate that executors are idle, but that doesn’t seem to explain...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Kirankumarbs, The -1 value you are seeing for executors in the Spark UI depends on which type of compute your job is running on, so let me cover both scenarios. SERVERLESS COMPUTE If your job is running on serverless compute, this is the expected...

1 More Replies
Kirankumarbs
by Contributor
  • 526 Views
  • 3 replies
  • 2 kudos

Python logger.info() not showing inside applyInPandas (but print() works) — why?

Problem: In Databricks, logs from an external binary (via os.system) show up, but Python logger.info() inside groupBy(...).applyInPandas(...) does not. print(..., flush=True) does show up.Why: applyInPandas runs your function as a pandas UDF.  That c...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @Kirankumarbs, You have correctly identified the root cause here. When you use applyInPandas, your function runs inside a separate Python worker process on each executor, not in the driver process. The logging configuration you set up on the drive...
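The standard fix that follows from this is to configure logging inside the function passed to applyInPandas, since the worker process does not inherit the driver's handlers. A minimal sketch (shown as a plain function so it runs anywhere; inside Spark this body would be the pandas UDF, and its stderr lands in the executor logs rather than the notebook output):

```python
import logging
import sys

def process_group(pdf):
    # Per-worker setup: this runs in the worker process, not the driver,
    # so the handler must be attached here. Guard against re-adding it,
    # because worker processes are reused across tasks.
    logger = logging.getLogger("group_worker")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    logger.info("processing %d rows", len(pdf))
    return pdf

result = process_group([1, 2, 3])  # stand-in for a pandas DataFrame group
```

print(..., flush=True) works without any of this because it writes straight to the worker's stdout, which is why it shows up while the driver-configured logger stays silent.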

2 More Replies
smoortema
by Contributor
  • 483 Views
  • 2 replies
  • 2 kudos

Resolved! check statistics of clustering columns per file to see how liquid clustering works

I have a Delta table on which I set up liquid clustering using three columns. I would like to check file statistics to see how the clustering column values are distributed along the files. How can I write a query that shows min and max values, etc. o...

Latest Reply
SteveOstrowski
Databricks Employee
  • 2 kudos

Hi @smoortema, There are several approaches for inspecting per-file column statistics on a liquid-clustered Delta table. Here is a walkthrough from simplest to most detailed. APPROACH 1: CONFIRM CLUSTERING CONFIGURATION First, verify that clustering ...
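For context on what the per-file statistics look like at the lowest level: each "add" action in the Delta transaction log (`_delta_log/*.json`) carries a JSON `stats` string with numRecords plus minValues/maxValues for the leading columns, which is exactly the min/max-per-file information clustering quality is judged by. A sketch parsing one synthetic add action (the file name, column names, and values below are invented for illustration):

```python
import json

# Synthetic add action, shaped like a Delta log entry's "add" payload
add_action = {
    "path": "part-00000.parquet",
    "stats": json.dumps({
        "numRecords": 1500,
        "minValues": {"cluster_col1": "2024-01-01", "cluster_col2": 10},
        "maxValues": {"cluster_col1": "2024-03-31", "cluster_col2": 85},
    }),
}

# The stats field is a JSON string nested inside the JSON action, so it
# needs a second parse.
stats = json.loads(add_action["stats"])
print(add_action["path"], stats["minValues"], stats["maxValues"])
```

Well-clustered files show narrow, largely non-overlapping min/max ranges on the clustering columns; wide overlapping ranges across files mean data skipping will be ineffective.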

1 More Replies