Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

liquibricks
by Contributor
  • 248 Views
  • 7 replies
  • 4 kudos

Declarative Pipeline error: Name 'kdf' is not defined. Did you mean: 'sdf'

We have a Lakeflow Spark Declarative Pipeline using the new PySpark Pipelines API. This was working fine until about 7am (Central European) this morning when the pipeline started failing with a PYTHON.NAME_ERROR: name 'kdf' is not defined. Did you me...

Latest Reply
zkaliszamisza
New Contributor
  • 4 kudos

For us it happened in westeurope around the same time

6 More Replies
fintech_latency
by New Contributor
  • 245 Views
  • 9 replies
  • 2 kudos

How to guarantee “always-warm” serverless compute for low-latency Jobs workloads?

We’re building a low-latency processing pipeline on Databricks and are running into serverless cold-start constraints. We ingest events (calls) continuously via a Spark Structured Streaming listener. For each event, we trigger a serverless compute tha...

Latest Reply
iyashk-DB
Databricks Employee
  • 2 kudos

@fintech_latency  For streaming: refactor to one long‑running Structured Streaming job with a short trigger interval (for example, 1s) and move “assignment” logic into foreachBatch or a transactional task table updated within the micro‑batch. For per...

8 More Replies
Malthe
by Contributor III
  • 186 Views
  • 3 replies
  • 0 kudos

Intermittent task execution issues

We're getting intermittent errors: [ISOLATION_STARTUP_FAILURE.SANDBOX_STARTUP] Failed to start isolated execution environment. Sandbox startup failed. Exception class: INTERNAL. Exception message: INTERNAL: LaunchSandboxRequest create failed - Error e...

Latest Reply
aleksandra_ch
Databricks Employee
  • 0 kudos

Hi @Malthe, please check whether a custom Spark image is used in the jobs. If it is, try removing it and sticking to the default parameters. If not, I highly recommend opening a support ticket (assuming you are on Azure Databricks) via the Azure portal. Best regard...

2 More Replies
Naveenkumar1811
by New Contributor III
  • 450 Views
  • 6 replies
  • 2 kudos

Reduce the Time for First Spark Streaming Run Kick off

Hi Team, currently I have a Silver Delta table (external) loading via streaming and the Gold is on batch. I need to make the Gold Delta streaming as well. In my first run I can see the stream initialization process is taking an hour or so as my Silver ta...

Latest Reply
Naveenkumar1811
New Contributor III
  • 2 kudos

Yes, I understand the OPTIMIZE and VACUUM... But still the silver table is very heavy. It is definitely going to take long. Any other suggestions in a prod scenario where we can perform this without data loss?

5 More Replies
ChrisLawford_n1
by Contributor
  • 211 Views
  • 2 replies
  • 2 kudos

Resolved! Bug Report: SDP (DLT) with autoloader not passing through pipe delimiter/separator

I am noticing a difference between using autoloader in an interactive notebook vs using it in a Spark Declarative Pipeline (DLT Pipeline). This issue seems to be very similar to this other unanswered post from a few years ago. Bug report: the delimit...

Latest Reply
ChrisLawford_n1
Contributor
  • 2 kudos

Hey, okay, thanks @nikhilj0421. I have now solved the issue, but not with a full refresh of the table. I had tried this previously and even deleted the DLT pipeline, hoping that would provide me a clean slate if a lingering schema was the issue, but w...

1 More Replies
smpa01
by Contributor
  • 220 Views
  • 2 replies
  • 2 kudos

Resolved! Python DataSource API utilities/ Import Fails in Spark Declarative Pipeline

TLDR - UDFs work fine when imported from a `utilities/` folder in DLT pipelines, but custom Python DataSource APIs fail with `ModuleNotFoundError: No module named 'utilities'` during serialization. Only inline definitions work. Need reusable DataSource ...

Latest Reply
smpa01
Contributor
  • 2 kudos

@emma_s Thank you for the guidance! The wheel package approach worked perfectly. I also tried putting the .py directly in /Workspace/Libraries/custom_datasource.py but it did not work.

1 More Replies
SparkMan
by New Contributor II
  • 204 Views
  • 2 replies
  • 2 kudos

Resolved! Job Cluster Reuse

Hi, I have a job where a job cluster is reused twice for task A and task C. Between A and C, task B runs for 4 hours on a different interactive cluster. The issue here is that the job cluster doesn't terminate as soon as Task A is completed and sits ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @SparkMan, this is expected behavior with Databricks job cluster reuse unless you change your job/task configuration. Look at the following documentation entry. So with your flow you have something like this: Task A (job cluster) → Task B (interactive c...

1 More Replies
ChristianRRL
by Honored Contributor
  • 242 Views
  • 2 replies
  • 3 kudos

Resolved! Serverless Compute Spark Version Flexibility?

Hi there, I'm wondering what determines the Serverless Compute Spark version? Is it based on the current DBR LTS? And is there a way to modify the Spark version for serverless compute? For example, when I check the Spark version for our serverless com...

Latest Reply
Databricks77
New Contributor II
  • 3 kudos

Serverless compute always runs on the latest runtime version. You cannot choose it like you can with standard compute.

1 More Replies
ChristianRRL
by Honored Contributor
  • 428 Views
  • 4 replies
  • 10 kudos

Resolved! Testing Spark Declarative Pipeline in Docker Container > PySparkRuntimeError

Hi there, I see via an announcement last year that Spark Declarative Pipeline (previously DLT) was getting open sourced into Apache Spark, and I see that this recently became true as of Apache Spark 4.1: Spark Declarative Pipelines Programming Guide. I'm trying ...

Latest Reply
aleksandra_ch
Databricks Employee
  • 10 kudos

Hi @ChristianRRL, in addition to @osingh's answers, check out this old but good blog post about how to structure the pipeline's code to enable a dev and test cycle: https://www.databricks.com/blog/applying-software-development-devops-best-practices-d...

3 More Replies
Dhruv-22
by Contributor II
  • 236 Views
  • 3 replies
  • 0 kudos

Merge with schema evolution fails because of upper case columns

The following is a minimal reproducible example of what I'm facing right now: %sql CREATE OR REPLACE TABLE edw_nprd_aen.bronze.test_table (id INT); INSERT INTO edw_nprd_aen.bronze.test_table VALUES (1); SELECT * FROM edw_nprd_aen.bronze.test_tab...

Latest Reply
css-1029
New Contributor II
  • 0 kudos

Hi @Dhruv-22, it's actually not a bug. Let me explain what's happening. The root cause: the issue stems from how schema evolution works with Delta Lake's MERGE statement, combined with Spark SQL's case-insensitivity settings. Here's the key insight: spark...

2 More Replies
bsr
by New Contributor II
  • 1434 Views
  • 4 replies
  • 5 kudos

Resolved! DBR 17.3.3 introduced unexpected DEBUG logs from ThreadMonitor – how to disable?

After upgrading from DBR 17.3.2 to DBR 17.3.3, we started seeing a flood of DEBUG logs like this in job outputs:```DEBUG:ThreadMonitor:Logging python thread stack frames for MainThread and py4j threads: DEBUG:ThreadMonitor:Logging Thread-8 (run) stac...

Latest Reply
WAHID
New Contributor II
  • 5 kudos

@iyashk-DB We are currently using DBR version 17.3 LTS, and the issue is still occurring. Do you know when the fix is expected to be applied? We need this information to decide whether we should wait for the fix or proceed with the workaround you propo...

3 More Replies
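Until a runtime fix lands, a possible stopgap (an assumption on my part, not an official workaround from the thread) is to silence the logger from Python, using the `ThreadMonitor` logger name visible in the quoted output:

```python
# Possible stopgap: raise the level of the noisy logger so its DEBUG
# records are dropped. The logger name "ThreadMonitor" is taken from the
# log lines quoted in the post above.
import logging

logging.getLogger("ThreadMonitor").setLevel(logging.INFO)

# Alternatively, if other DEBUG output should be kept, attach a filter to
# the existing handlers that drops only ThreadMonitor records:
class DropThreadMonitor(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record.
        return record.name != "ThreadMonitor"

for handler in logging.getLogger().handlers:
    handler.addFilter(DropThreadMonitor())
```

Run this early in the job (e.g. at the top of the notebook or entry script) so it takes effect before the noisy logging starts.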
rijin-thomas
by New Contributor II
  • 360 Views
  • 4 replies
  • 3 kudos

Mongo Db connector - Connection timeout when trying to connect to AWS Document DB

I am on Databricks Runtime LTS 14.3, Spark 3.5.0, Scala 2.12 and mongodb-spark-connector_2.12:10.2.0. Trying to connect to DocumentDB using the connector and all I get is a connection timeout. I tried using PyMongo, which works as expected and I can ...

Latest Reply
Sanjeeb2024
Contributor III
  • 3 kudos

Hi @rijin-thomas - can you please allow the CIDR block for the Databricks account VPC in the AWS DocumentDB security group (executor connectivity, as stated by @bianca_unifeye)?

3 More Replies
SaugatMukherjee
by New Contributor III
  • 498 Views
  • 2 replies
  • 1 kudos

Structured streaming for iceberg tables

According to this https://iceberg.apache.org/docs/latest/spark-structured-streaming/ , we can stream from Iceberg tables. I have ensured that my source table is Iceberg version 3, but no matter what I do, I get "Iceberg does not support streaming reads". Looki...

Latest Reply
SaugatMukherjee
New Contributor III
  • 1 kudos

Hi, Iceberg streaming is possible in Databricks. One does not need to change to Delta Lake. In my previous attempt, I used "load" while reading the source Iceberg table. One should instead use "table". "load" apparently takes a path and not a ta...

1 More Replies
AcrobaticMonkey
by New Contributor III
  • 214 Views
  • 2 replies
  • 2 kudos

Salesforce Connector SCD2 - Get new record with isDeleted = true on deletion

Hi all, I'm using the Databricks Salesforce connector to ingest tables with history tracking enabled (SCD Type 2). When records are deleted in Salesforce, the connector closes the existing record by setting the end date. The isDeleted flag remains fals...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

Greetings @AcrobaticMonkey, I put on my researcher hat and dug into our internal docs. Here is what I found. Short answer: this isn’t configurable today. The connector’s SCD Type 2 behavior “closes” a record by setting __END_AT and does not emit a ...

1 More Replies
pooja_bhumandla
by New Contributor III
  • 230 Views
  • 2 replies
  • 0 kudos

Behavior of Zstd Compression for Delta Tables Across Different Databricks Runtime Versions

Hi all, for ZSTD compression, as per the documentation, any table created with DBR 16.0 or newer (or Apache Spark 3.5+) uses Zstd as the default compression codec instead of Snappy. I explicitly set the table property to Zstd: spark.sql("""ALTER TABLE m...

Latest Reply
JAHNAVI
Databricks Employee
  • 0 kudos

@pooja_bhumandla New files written by DBR 15.4 (or any pre-16.0 runtime) will still use Zstd as long as the table property delta.compression.codec = 'zstd' remains set on the table. When we explicitly run: ALTER TABLE my_table SET TBLPROPERTIES ('delt...

1 More Replies