Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

bidek56
by Contributor
  • 126 Views
  • 3 replies
  • 0 kudos

Location of spark.scheduler.allocation.file

In DBR 16.4 LTS, I am trying to add the following Spark config: spark.scheduler.allocation.file: file:/Workspace/init/fairscheduler.xml. But the all-purpose cluster is throwing this error: Spark error: Driver down cause: com.databricks.backend.daemon.dri...
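(For context, the allocation file only defines scheduler pools; a notebook or job still opts into a pool at runtime. A minimal sketch of that part, assuming the file is readable by the driver and defines a hypothetical pool named fair_pool:)

```python
# Sketch only: selecting a fair-scheduler pool defined in fairscheduler.xml.
# The pool name "fair_pool" is a hypothetical example, not taken from this thread.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Work submitted from this thread runs under the chosen pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fair_pool")
spark.range(1_000_000).count()

# Reset to the default pool afterwards.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)
```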

Latest Reply
bidek56
Contributor
  • 0 kudos

@mark_ott Setting WSFS_ENABLE=false does not affect anything. Thx

2 More Replies
LBISWAS
by New Contributor
  • 48 Views
  • 1 reply
  • 0 kudos

Search result shows presence of a text in a notebook, but it's not present in the notebook

Search result shows presence of a text in a notebook, but it's not present in the notebook

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

Ah yes, a classic. The search also looks into hidden/collapsed content which is not visible, e.g. results or metadata.

02CSE33
by New Contributor
  • 99 Views
  • 2 replies
  • 0 kudos

Migrating SQL Server Tables and Views to Databricks using Lakebridge

We have a requirement to migrate a few hundred tables from SQL Server to Databricks Delta tables. We intend to explore Lakebridge's capability by carrying out a PoC for this. We also want to migrate a few historic records, say las...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Migrating several hundred SQL Server tables to Databricks Delta Lake, using Lakebridge for a Proof of Concept (PoC), can be approached with custom pipelines—especially for filtering by a date/time column to migrate only the last two years of data. Of...
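A minimal sketch of one such custom pipeline for a single table, assuming a JDBC connection to SQL Server and a date column to filter on (server, credentials, and table/column names below are placeholders, not details from this thread):

```python
# Sketch: JDBC extract from SQL Server into a Delta table, pushing the
# "last two years" filter down to the source. All names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://sqlserver-host:1433;databaseName=sales_db"

# Subquery alias lets SQL Server apply the date filter before data is transferred.
source_query = """
    (SELECT * FROM dbo.orders
     WHERE order_date >= DATEADD(year, -2, GETDATE())) AS src
"""

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", source_query)
    .option("user", "migration_user")
    .option("password", dbutils.secrets.get("migration", "sql-password"))
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

df.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")
```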

1 More Replies
gudurusreddy99
by New Contributor II
  • 78 Views
  • 1 reply
  • 1 kudos

DLT or DP: How to do full refresh of Delta table from DLT Pipeline to consider all records from Tbl

Requirement: I have a Kafka streaming pipeline that ingests Pixels data. For each incoming record, I need to validate the Pixels key against an existing Delta table (pixel_tracking_data), which contains over 2 billion records accumulated over the past ...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Matching streaming data in real time against a massive, fast-changing Delta table requires careful architectural choices. In your case, latency is high for the most recent records, and the solution only matches against data ≥10 minutes old. This is a...
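As a generic illustration of the trade-off described above (a sketch of the common stream-static join pattern, with hypothetical topic, schema, checkpoint, and table names; not the full recommendation from this reply):

```python
# Sketch: stream-static join between a Kafka stream and a large Delta table.
# The static side is re-read per micro-batch, so very recent writes to the
# Delta table may not be visible immediately. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

pixel_schema = StructType([StructField("pixel_key", StringType())])

stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "pixels")
    .load()
    .select(from_json(col("value").cast("string"), pixel_schema).alias("p"))
    .select("p.pixel_key")
)

lookup_df = spark.read.table("pixel_tracking_data")  # large static Delta table

matched = stream_df.join(lookup_df, on="pixel_key", how="inner")

query = (
    matched.writeStream.format("delta")
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/pixel_match")  # placeholder
    .toTable("main.default.validated_pixels")
)
```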

der
by Contributor II
  • 472 Views
  • 10 replies
  • 0 kudos

Rasterio on shared/standard cluster has no access to proj.db

We are trying to use rasterio on a Databricks shared/standard cluster with DBR 17.1. Rasterio is installed directly on the cluster as a library. Code: import rasterio; rasterio.show_versions() Output: rasterio info: rasterio: 1.4.3, GDAL: 3.9.3, PROJ: 9.4.1, GEOS: 3.11...

Latest Reply
der
Contributor II
  • 0 kudos

Current workaround: if you select the "Photon" engine on a standard/shared cluster, the access rights of /databricks/native/proj-data change and rasterio works fine. The downside: you pay for "Photon" compute to use a Python library that does not use S...
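An untested alternative that may avoid the Photon cost, assuming pyproj is also installed on the cluster: point PROJ at pyproj's bundled data directory before rasterio is first imported. This is an assumption, not a fix confirmed in this thread:

```python
# Untested sketch: reuse the proj.db shipped with pyproj instead of the
# unreadable /databricks/native/proj-data directory. Assumes pyproj is installed.
import os
import pyproj

proj_data = pyproj.datadir.get_data_dir()
os.environ["PROJ_LIB"] = proj_data   # honored by older PROJ versions
os.environ["PROJ_DATA"] = proj_data  # honored by PROJ >= 9.1

import rasterio  # import only after the environment variables are set

rasterio.show_versions()
```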

9 More Replies
jano
by New Contributor III
  • 101 Views
  • 2 replies
  • 2 kudos

Resolved! DABs with multi github sources

I want to deploy a DAB that has dev using a GitHub branch and prod using a GitHub release tag. I can't seem to find a way to make this part dynamic based on the target. Things I've tried: setting the git variable in the databricks.yml, making the g...

Latest Reply
jano
New Contributor III
  • 2 kudos

I ended up finding this discussion, which mostly worked. What was not mentioned is that the first resources block should be in the job.yml and the overwrite parameters mentioned below go in the databricks.yml. You cannot put both in the databric...

1 More Replies
Volker
by Contributor
  • 3292 Views
  • 5 replies
  • 4 kudos

Asset Bundles cannot run job with single node job cluster

Hello community, we are deploying a job using asset bundles and the job should run on a single-node job cluster. Here is the DAB job definition: resources: jobs: example_job: name: example_job tasks: - task_key: main_task ...

Latest Reply
kunalmishra9
Contributor
  • 4 kudos

In case this is now breaking for anyone (as it is for me), there's an update here to follow along with on how to define single node compute! https://github.com/databricks/databricks-sdk-py/issues/881

4 More Replies
hanspetter
by New Contributor III
  • 65498 Views
  • 21 replies
  • 7 kudos

Resolved! Is it possible to get the Job Run ID of a notebook run by dbutils.notebook.run?

When running a notebook using dbutils.notebook.run from a master notebook, a URL to that running notebook is printed, e.g.: Notebook job #223150 Notebook job #223151 Are there any ways to capture that Job Run ID (#223150 or #223151)? We have 50 or ...

Latest Reply
no2
New Contributor II
  • 7 kudos

Thanks for the response @Manoj5 - I had to use this "safeToJson()" option too because all of the previous suggestions in this thread were erroring out for me with a message like "py4j.security.Py4JSecurityException: Method public java.lang.String com...
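For anyone else landing here, a sketch of that safeToJson() approach (the exact layout and tag names in the returned JSON can vary by DBR version, so inspect the parsed dictionary on your own cluster first):

```python
# Sketch: read the notebook context as JSON and look for job/run identifiers in
# its tags. Field names such as "jobId" and "currentRunId" vary by DBR version.
import json

ctx_json = (
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().safeToJson()
)
ctx = json.loads(ctx_json)

tags = ctx.get("attributes", ctx)  # some versions nest fields under "attributes"
print(tags.get("jobId"), tags.get("currentRunId"))
```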

20 More Replies
Richard_547342
by New Contributor III
  • 3486 Views
  • 2 replies
  • 2 kudos

Resolved! Column comments in DLT python notebook

The SQL API specification in the DLT docs shows an option for adding column comments when creating a table. Is there an equivalent way to do this when creating a DLT pipeline with a Python notebook? The Python API specification in the DLT docs does n...
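(One commonly suggested approach, sketched here with hypothetical table and column names and not verified against the poster's pipeline, is to pass a DDL-style schema string with COMMENT clauses to the table decorator:)

```python
# Sketch: column comments declared via a DDL schema string in a DLT Python notebook.
# Table, column, and source names are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(
    comment="Customer dimension",
    schema="""
        customer_id BIGINT COMMENT 'Natural key from the source system',
        customer_name STRING COMMENT 'Display name',
        created_at TIMESTAMP COMMENT 'Record creation timestamp'
    """,
)
def dim_customer():
    return spark.read.table("bronze.customers").select(
        col("id").alias("customer_id"),
        col("name").alias("customer_name"),
        col("created_at"),
    )
```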

Latest Reply
jonathandbyrd
New Contributor
  • 2 kudos

This works in a readStream/writeStream scenario for us, but the exact same code fails when put in a DLT pipeline.

1 More Replies
alex307
by New Contributor
  • 80 Views
  • 1 reply
  • 2 kudos

How to Stop Driver Node from Overloading When Using ThreadPoolExecutor in Databricks

Hi everyone, I'm using a ThreadPoolExecutor in Databricks to run multiple notebooks at the same time. The problem is that it seems like all the processing happens on the driver node, while the executor nodes are idle. This causes the driver to run out...

Latest Reply
mmayorga
Databricks Employee
  • 2 kudos

Greetings @alex307 and thank you for sending your question. When using ThreadPoolExecutor to run multiple notebooks concurrently in Databricks, the workload is being executed on the driver node rather than distributed across Spark executors. This res...
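A minimal sketch of one common mitigation (not necessarily the full recommendation in this reply): bound the concurrency so the driver is not saturated. Notebook paths and the worker count below are placeholders:

```python
# Sketch: cap how many notebooks are orchestrated at once. dbutils.notebook.run
# only orchestrates; Spark work in the child notebooks still runs on executors,
# while pure-Python/pandas work in them runs on the driver.
from concurrent.futures import ThreadPoolExecutor, as_completed

notebook_paths = ["/Workspace/jobs/nb_a", "/Workspace/jobs/nb_b", "/Workspace/jobs/nb_c"]

def run_notebook(path: str) -> str:
    return dbutils.notebook.run(path, timeout_seconds=3600)

# Keep max_workers small relative to the driver's cores and memory.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(run_notebook, p): p for p in notebook_paths}
    for fut in as_completed(futures):
        print(futures[fut], "->", fut.result())
```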

vartyg
by New Contributor
  • 84 Views
  • 2 replies
  • 0 kudos

Scaling Declarative Streaming Pipelines for CDC from On-Prem Database to Lakehouse

We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse. Since LakeFlow Connect currently does not support direct mirroring from on...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Yes, a Databricks Labs project seems perfect for your scenario: https://databrickslabs.github.io/dlt-meta/index.html

1 More Replies
Nis
by New Contributor II
  • 2428 Views
  • 2 replies
  • 2 kudos

Best sequence of using the VACUUM, OPTIMIZE, FSCK REPAIR and REFRESH commands.

I have a Delta table whose size increases gradually; we now have around 1.5 crore (15 million) rows. While running the VACUUM command on that table I am getting the below error. ERROR: Job aborted due to stage failure: Task 7 in stage 491.0 failed 4 times, most...

Latest Reply
alex307
New Contributor
  • 2 kudos

In my opinion the best order is: OPTIMIZE → VACUUM → FSCK REPAIR → REFRESH. Your error is likely a timeout; try more cluster resources or a longer retention period.
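Expressed as runnable commands (a sketch; the table name and retention window are placeholders):

```python
# Sketch of the suggested maintenance order for a Delta table.
table = "main.sales.transactions"  # placeholder

spark.sql(f"OPTIMIZE {table}")                  # compact small files first
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")   # then delete unreferenced files
spark.sql(f"FSCK REPAIR TABLE {table}")         # drop metadata entries for missing files
spark.sql(f"REFRESH TABLE {table}")             # clear cached metadata for readers
```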

1 More Replies
hgm251
by New Contributor
  • 132 Views
  • 3 replies
  • 1 kudos

Online tables to synced tables: why is it creating a different service principal every time?

Hello! We started to move our online tables to synced_tables. We just couldn't figure out why it is creating a new service principal every time we run the same code we use for online tables. try: fe.create_feature_spec(name=feature_spec_name ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @hgm251, here are some things to consider. Things are working as designed: when you create a new Feature Serving or Model Serving endpoint, Databricks automatically provisions a dedicated service principal for that endpoint, and a fresh...

2 More Replies
DaPo
by New Contributor III
  • 3466 Views
  • 2 replies
  • 2 kudos

Resolved! DLT Streaming With Watermark fails, suggesting I should add watermarks

Hi all, I have the following problem: I have two streaming tables containing time-series measurements from different sensor data, each fed by multiple sensors. (Imagine: multiple temperature sensors for the first table, and multiple humidity sensors ...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

To resolve the DLT streaming aggregation error about unsupported output modes and watermarks in Databricks, you need to carefully set watermarks on the original event timestamp rather than on computed columns like "time_window" and carefully consider...
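A minimal sketch of that shape, with hypothetical source table and column names (the watermark goes on the raw event_time column, and the grouping uses a window over it):

```python
# Sketch: watermark the original event-time column, then aggregate by a time
# window, instead of watermarking a derived "time_window" column.
import dlt
from pyspark.sql.functions import window, avg

@dlt.table
def temperature_5min_avg():
    return (
        spark.readStream.table("temperature_raw")    # hypothetical source
        .withWatermark("event_time", "10 minutes")   # watermark on raw event time
        .groupBy(window("event_time", "5 minutes"), "sensor_id")
        .agg(avg("value").alias("avg_value"))
    )
```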

1 More Replies
Dave_Nithio
by Contributor II
  • 3302 Views
  • 1 reply
  • 0 kudos

Transaction Log Failed Integrity Checks

I have started to receive the following error message - that the transaction log has failed integrity checks - when attempting to optimize and run compaction on a table. It also occurs when I attempt to alter this table. This blocks my pipeline from r...

(Screenshot attachment: Dave_Nithio_1-1741718129217.png)
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your issue—encountering "the transaction log has failed integrity checks" in Databricks Delta Lake—indicates metadata corruption or an inconsistency in the Delta transaction log (_delta_log). This commonly disrupts DML operations like OPTIMIZE, DELET...
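Before attempting any repair, a diagnostic-only sketch for narrowing down where the log is inconsistent (table name and storage path are placeholders):

```python
# Diagnostic sketch only: review recent commits and list the newest files in the
# transaction log. Names and paths are placeholders.
table = "main.analytics.events"
table_path = "abfss://container@account.dfs.core.windows.net/tables/events"

spark.sql(f"DESCRIBE HISTORY {table} LIMIT 20").show(truncate=False)

# List the most recent JSON commits and checkpoint files in _delta_log.
log_files = dbutils.fs.ls(f"{table_path}/_delta_log")
for f in sorted(log_files, key=lambda x: x.name)[-10:]:
    print(f.name, f.size)
```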

