Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

IGRACH
by New Contributor III
  • 1094 Views
  • 2 replies
  • 1 kudos

Resolved! Specifying output mode and path when using foreachBatch

Since .foreachBatch() is "hijacking" the stream and executing arbitrary code in it, do I need to specify Output mode and Path:(df.writeStream .format("delta") .trigger(availableNow = True) .option("checkpointLocation", "check_point_location") .forea...

Latest Reply
Branislav
New Contributor II
  • 1 kudos

Thanks xD

1 More Replies
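A minimal sketch of the resolved pattern: with `.foreachBatch()`, the outer writer's format and path are not what performs the write, so mode and path live inside the batch function, while the checkpoint location and trigger stay on the outer query. The paths below are hypothetical.

```python
# Sketch (assumptions: hypothetical paths "/mnt/target" and
# "/mnt/checkpoints/target"; `df` is an existing streaming DataFrame).
# With .foreachBatch(), the batch function itself decides where and how
# each micro-batch is written; the outer query still needs a checkpoint.

def write_batch(batch_df, batch_id):
    # Each micro-batch arrives as a normal (non-streaming) DataFrame,
    # so write mode and path are specified here, not on the outer stream.
    (batch_df.write
        .format("delta")
        .mode("append")          # plays the role of the output mode
        .save("/mnt/target"))    # plays the role of the path

def start_query(df):
    # Not executed here; requires a live SparkSession / streaming source.
    return (df.writeStream
        .trigger(availableNow=True)
        .option("checkpointLocation", "/mnt/checkpoints/target")  # still required
        .foreachBatch(write_batch)
        .start())
```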
ClarkElliott
by New Contributor
  • 4346 Views
  • 1 replies
  • 0 kudos

Parquet file for delta streaming live table with pipeline

I am having an issue with parquet files: I'm getting an Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)) error while trying to read a parquet file (generated outside of Databricks). I am using a Delta streaming live table with a pipeline. If I r...

Latest Reply
Saritha_S
Databricks Employee
  • 0 kudos

Hi @ClarkElliott, good day! Cause: open source Apache Spark and Databricks Runtime versions 11.3 LTS and above do not support the TIMESTAMP_NANOS type. If a Parquet file contains fields with the TIMESTAMP_NANOS type, attempts to...

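One hedged workaround, since the file is generated outside Databricks: rewrite it with microsecond timestamps before ingestion, for example with pyarrow. The paths are hypothetical, and depending on your runtime a legacy flag such as `spark.sql.legacy.parquet.nanosAsLong` may also be available as an alternative.

```python
# Sketch of one workaround: rewrite the offending file with microsecond
# timestamps before Databricks reads it. Assumes pyarrow is available;
# src/dst paths are hypothetical.

def downcast_nanos(src_path, dst_path):
    import pyarrow.parquet as pq  # deferred so the sketch stands alone
    table = pq.read_table(src_path)
    # coerce_timestamps="us" rewrites TIMESTAMP(NANOS) columns as
    # microsecond timestamps, which the Spark Parquet reader accepts.
    pq.write_table(table, dst_path,
                   coerce_timestamps="us",
                   allow_truncated_timestamps=True)
```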
shavya
by New Contributor
  • 5118 Views
  • 1 replies
  • 0 kudos

Where are default temporary checkpoint locations created for streaming queries with display command?

Hello! I created a streaming query using Auto Loader to read data from S3 and used the display command to see if the query was working. Initially, cloudFiles.includeExistingFiles was set to True, but since we have data in Glacier that needs to be retrieve...

Latest Reply
Saritha_S
Databricks Employee
  • 0 kudos

Hi @shavya, good day! When you do not specify a checkpointLocation in a streaming query in Databricks, it uses a temporary system directory such as: dbfs:/local_disk0/tmp/temporary-<random_uuid>. To remove the temporary checkpoint, please ...

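A small sketch of the usual remedy: pin the checkpoint to an explicit path so it can be located and cleaned up deliberately instead of relying on the temporary default. The checkpoint path, bucket, and table paths below are hypothetical.

```python
# Sketch: use an explicit checkpoint path instead of the temporary default.
# All paths are hypothetical.

CHECKPOINT = "/mnt/checkpoints/orders"

def start_autoloader(spark):
    # Not executed here; requires a live SparkSession.
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.includeExistingFiles", "false")
          .load("s3://my-bucket/raw/orders"))
    return (df.writeStream
            .format("delta")
            .option("checkpointLocation", CHECKPOINT)  # explicit, not temporary
            .start("/mnt/tables/orders"))

# To discard streaming state and reprocess from scratch, delete the checkpoint:
#   dbutils.fs.rm(CHECKPOINT, recurse=True)
```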
lprevost
by Contributor III
  • 1258 Views
  • 1 replies
  • 0 kudos

Streaming query error - [STREAMING_STATEFUL_OPERATOR_NOT_MATCH_IN_STATE_METADATA]

[STREAM_FAILED] Query [id = 6a821fbc-490b-4ad8-891d-e4cacc2af1d6, runId = e055fede-8012-4369-861b-47183999e91d] terminated with exception: [STREAMING_STATEFUL_OPERATOR_NOT_MATCH_IN_STATE_METADATA] Streaming stateful operator name does not match with ...

Latest Reply
Saritha_S
Databricks Employee
  • 0 kudos

Hi @lprevost, good day! Please find below my analysis of your issue. Error: [STREAM_FAILED] Query [id = 6a821fbc-490b-4ad8-891d-e4cacc2af1d6, runId = e055fede-8012-4369-861b-47183999e91d] terminated with exception: [STREAMING_STATEFUL_OPERATOR_NOT...

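For context, this error typically surfaces when the restarted query's stateful operators (aggregations, dropDuplicates, stream-stream joins) no longer match what the checkpoint's state metadata recorded. A hedged sketch of the common fix, assuming the query shape changed between runs and the hypothetical paths below:

```python
# Sketch (assumption: the query plan changed between runs, e.g. an
# aggregation or dropDuplicates was added/removed). The state under the old
# checkpoint no longer matches, so either revert the query to its original
# shape, or restart from a fresh checkpoint and reprocess. Paths are
# hypothetical.

OLD_CHECKPOINT = "/mnt/checkpoints/events_v1"
NEW_CHECKPOINT = "/mnt/checkpoints/events_v2"  # fresh dir => fresh state

def restart_with_fresh_state(df):
    # Not executed here; requires a streaming DataFrame `df`.
    return (df.writeStream
            .format("delta")
            .option("checkpointLocation", NEW_CHECKPOINT)
            .start("/mnt/tables/events"))
```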
Klusener
by Contributor
  • 1469 Views
  • 1 replies
  • 4 kudos

Resolved! Handling partition overwrite in Liquid Clustering

Hello, currently we have Delta tables in the TBs, partitioned by year, month, day. We perform dynamic partition overwrite, using partitionOverwriteMode set to dynamic, to handle reruns/corrections. With liquid clustering, since explicit partitions are not require...

Latest Reply
Saritha_S
Databricks Employee
  • 4 kudos

Hi @Klusener, good day! Dynamic partition overwrite only supports selective overwrites for partitioned columns, not for liquid clustering or regular columns. If you know the exact predicates, use replaceWhere. Note: This is not possible without knowin...

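The replaceWhere approach from the accepted answer can be sketched as follows; the table path and the year/month/day column names are assumptions carried over from the question's partitioning scheme.

```python
# Sketch: with liquid clustering there are no partitions for dynamic
# partition overwrite to target, so the selective rewrite is expressed as an
# explicit replaceWhere predicate instead. Table path and column names are
# hypothetical.

def rerun_predicate(year, month, day):
    # Build the predicate covering exactly the slice being re-run.
    return f"year = {year} AND month = {month} AND day = {day}"

def overwrite_slice(df, year, month, day):
    # Not executed here; requires a DataFrame containing only rows that
    # satisfy the predicate, otherwise the write fails.
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("replaceWhere", rerun_predicate(year, month, day))
       .save("/mnt/tables/events"))
```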
Malthe
by Valued Contributor II
  • 1257 Views
  • 1 replies
  • 1 kudos

Resolved! Unable to add primary key constraint to nullable identity column

While we can in fact define a primary key during table creation for an identity column that's nullable (i.e., not constrained using NOT NULL), it's not possible to add such a primary key constraint after the table has been created. We get an error mes...

Latest Reply
amuchoudhary
New Contributor III
  • 1 kudos

Creating a table with a nullable IDENTITY column and defining the primary key at creation time works. The database quietly interprets the column as NOT NULL for the purposes of the primary key, even though it's technically defined as nullable (i.e., n...

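A hedged sketch of what adding the constraint after creation typically requires: making the identity column explicitly NOT NULL first, then adding the key. The catalog, table, and constraint names below are hypothetical; the statements would run via `spark.sql(...)` on a Unity Catalog table.

```python
# Sketch: first make the identity column explicitly NOT NULL, then add the
# primary key. Names are hypothetical.

STATEMENTS = [
    "ALTER TABLE main.default.events ALTER COLUMN id SET NOT NULL",
    "ALTER TABLE main.default.events ADD CONSTRAINT events_pk PRIMARY KEY (id)",
]

def add_primary_key(spark):
    # Not executed here; requires a live SparkSession on a UC table.
    for stmt in STATEMENTS:
        spark.sql(stmt)
```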
mohdluqmancse88
by New Contributor
  • 491 Views
  • 1 replies
  • 0 kudos

Databricks on Azure

We are setting up data hubs that interact with each other for Gen AI use cases. I want to prove that catalog sharing works across Azure subscriptions if all UCs are mapped to the same metastore. Can you point me to the right documentation?

Latest Reply
Gopichand_G
Databricks Partner
  • 0 kudos

I believe you need to follow the steps below. 1. Deploy a metastore in one region. 2. Link each workspace (from different Azure subscriptions but same tenant and region) to it. 3. Then validate that metadata objects like catalogs, schemas, and table...

Anand13
by New Contributor II
  • 1467 Views
  • 2 replies
  • 0 kudos

Getting a concurrency issue on a Delta table using liquid clustering

In our project, we are testing liquid clustering using a test table called status_update, where we need to update the status for different market IDs. We are attempting to update the status_update table in parallel using the UPDATE command. ALTER TABL...

Latest Reply
Anand13
New Contributor II
  • 0 kudos

@Walter_C We are using Liquid Clustering as our first strategy. Our Databricks Runtime is 13.3, and we have a table named status_update containing approximately 30 market IDs, each with a single record. In our pipeline, if any market fails, we need t...

1 More Replies
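Parallel UPDATEs on the same Delta table can raise concurrent-transaction conflicts even when each writer touches a different market_id. A common mitigation, sketched here with hypothetical names (the exact conflict exception class varies by Delta/DBR version), is a retry loop with exponential backoff:

```python
import random
import time

# Sketch: retry a per-market UPDATE on transaction conflicts. Assumes each
# writer touches a disjoint market_id; table/column names are hypothetical.

def update_status(spark, market_id, status, max_retries=5):
    # Not executed here; requires a live SparkSession.
    for attempt in range(max_retries):
        try:
            spark.sql(
                "UPDATE status_update SET status = '{}' "
                "WHERE market_id = '{}'".format(status, market_id))
            return True
        except Exception as e:  # e.g. ConcurrentAppendException
            if "Concurrent" not in str(type(e)) + str(e):
                raise  # not a write conflict -- surface it
            time.sleep((2 ** attempt) + random.random())  # backoff + jitter
    return False
```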
amarnadh-gadde
by New Contributor II
  • 2228 Views
  • 6 replies
  • 0 kudos

Default catalog created incorrectly on my workspace

We have provisioned a new Databricks account and workspace on the premium plan. When we built out the workspace using Terraform, we expected to see a default catalog matching the workspace name as per this documentation. However, I don't see it. All I see are the 3 c...

Latest Reply
loic
Contributor
  • 0 kudos

Hello, while trying to get help on Databricks default catalog behavior, I found this topic. If I can give my advice here, one reason I see for the behavior @amarnadh-gadde describes is that you deployed your new workspace in a region where there is already...

5 More Replies
seefoods
by Valued Contributor
  • 3311 Views
  • 6 replies
  • 7 kudos

Resolved! Auto Loader write strategy (APPEND, MERGE, UPDATE, COMPLETE, OVERWRITE)

Hello guys, I want to know whether operations like overwrite, merge, and update in a static write behave the same when using Auto Loader. I'm confused about the behavior of modes like complete, update, and append. After that, I want to know what its the co...

Latest Reply
chanukya-pekala
Contributor III
  • 7 kudos

Thanks for the discussion. I have a tiny suggestion. Based on my experience working with streaming loads, I often find it hard to inspect the checkpoint location for offset information, or to delete that directory for a fresh load of data. Hence I h...

5 More Replies
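To summarize the resolved behavior in code: Auto Loader is a streaming source, so the write side is an ordinary streaming sink with the usual output modes (append for new rows; complete/update only with aggregations), while overwrite/merge semantics go through foreachBatch. A sketch with hypothetical paths:

```python
# Sketch: Auto Loader ingestion with the normal append output mode.
# Paths are hypothetical; overwrite/merge would be done via foreachBatch.

def bronze_ingest(spark):
    # Not executed here; requires a live SparkSession.
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .load("s3://my-bucket/landing/"))
    return (df.writeStream
            .format("delta")
            .outputMode("append")  # the usual mode for Auto Loader ingestion
            .option("checkpointLocation", "/mnt/checkpoints/bronze")
            .trigger(availableNow=True)
            .start("/mnt/tables/bronze"))
```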
SatyaKoduri
by New Contributor II
  • 1669 Views
  • 1 replies
  • 1 kudos

Resolved! YAML file to DataFrame

Hi, I'm trying to read YAML files using pyyaml and convert them into a Spark DataFrame with createDataFrame, without specifying a schema, allowing flexibility for potential YAML schema changes over time. This approach worked as expected on Databricks ...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 1 kudos

Hi @SatyaKoduri This is a known issue with newer Spark versions (3.5+) that came with Databricks Runtime 15.4. The schema inference has become more strict and struggles with deeply nested structures like your YAML's nested maps. Here are a few solution...

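One workaround along the lines of the reply, sketched with assumptions: serialize each parsed YAML document to a JSON string and let `spark.read.json` infer the nested schema, which tolerates deep map structures better than `createDataFrame` on raw dicts. Here `docs` stands in for the output of `yaml.safe_load_all(...)`.

```python
import json

# Sketch: route nested YAML documents through JSON so spark.read.json can
# infer the (nested) schema. `docs` is a list of parsed YAML dicts.

def to_json_lines(docs):
    # Each parsed YAML document (a nested dict) becomes one JSON line.
    return [json.dumps(d) for d in docs]

def yaml_docs_to_df(spark, docs):
    # Not executed here; requires a live SparkSession.
    return spark.read.json(spark.sparkContext.parallelize(to_json_lines(docs)))
```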
tuckera
by New Contributor
  • 517 Views
  • 1 replies
  • 0 kudos

Governance in pipelines

How does everyone track and deploy their pipelines and generated data assets? DABs? Terraform? Manual? Something else entirely?

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @tuckera The data engineering landscape shows a pretty diverse mix of approaches for tracking and deploying pipelines and data assets, often varying by company size, maturity, and specific needs. Infrastructure as Code (IaC) tools like Terraform an...

Edoa
by New Contributor
  • 1544 Views
  • 1 replies
  • 0 kudos

SFTP Connection Timeout on Job Cluster but Works on Serverless Compute

Hi all, I'm experiencing inconsistent behavior when connecting to an SFTP server using Paramiko in Databricks. When I run the code on Serverless Compute, the connection to xxx.yyy.com via SFTP works correctly. When I run the same code on a Job Cluster, ...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @Edoa This is a common networking issue in Databricks, related to the different network configurations between Serverless Compute and Job Clusters. Here are the key differences and potential solutions. Root cause: Serverless Compute runs in Databricks'...

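A quick way to confirm the network-egress diagnosis is a plain TCP reachability check run on each compute type, before involving Paramiko at all. Host and port below are placeholders for the real SFTP endpoint.

```python
import socket

# Sketch: check whether the cluster can reach the SFTP host at all.
# If this times out on the Job Cluster but succeeds on Serverless, the
# issue is network egress (firewall/NAT allow-listing), not Paramiko.

def tcp_reachable(host, port=22, timeout=10):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run `tcp_reachable("xxx.yyy.com")` from a notebook on both compute types and compare the results.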
oeztuerk82
by New Contributor II
  • 1338 Views
  • 2 replies
  • 3 kudos

Deletion of Resource Group on Azure and Impact on Databricks Workspace

Hello everyone, I would like to confirm the data retention and deletion behavior associated with an Azure Databricks workspace, particularly in the context of deleting the Azure resource group that a Databricks workspace sits in. Recently, I deleted an...

Latest Reply
SAKBAR
New Contributor II
  • 3 kudos

Once a resource group is deleted it cannot be recovered, just like ADLS, so it is not possible to restore the workspace or any resource under the resource group. Microsoft support may be able to recover it if you are on a premium plan with them. For the future, it is always bet...

1 More Replies
DarioB
by New Contributor III
  • 1754 Views
  • 1 replies
  • 1 kudos

Resolved! DAB for_each_task - Passing task values

I am trying to deploy a job with a for_each_task using DAB and Terraform, and I am unable to properly pass the task value into the subsequent task. These are my job task definitions in the YAML: tasks: - task_key: FS_batching job_c...

Latest Reply
DarioB
New Contributor III
  • 1 kudos

We have been testing and found the issue (I just realized that my anonymization of the names removed the source of the error). We have tracked it down to the inputs parameter of the for_each_task. It seems that it is unable to reference task names with das...

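A hedged sketch of the YAML shape described in this thread, using underscore-only task keys (e.g. `fs_batching` rather than a dashed name) in line with the fix the poster found. All job, task, and notebook names here are hypothetical.

```yaml
# Sketch (hypothetical job): a for_each_task reading a task value set by an
# earlier task. Underscore-only task keys avoid the dashed-name reference
# problem described above.
tasks:
  - task_key: fs_batching
    notebook_task:
      notebook_path: /Jobs/build_batches
  - task_key: process_batches
    depends_on:
      - task_key: fs_batching
    for_each_task:
      inputs: "{{tasks.fs_batching.values.batches}}"
      task:
        task_key: process_one
        notebook_task:
          notebook_path: /Jobs/process_batch
          base_parameters:
            batch: "{{input}}"
```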