Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

hanspetter
by New Contributor III
  • 65276 Views
  • 21 replies
  • 7 kudos

Resolved! Is it possible to get the Job Run ID of a notebook run by dbutils.notebook.run?

When running a notebook using dbutils.notebook.run from a master notebook, a URL to that running notebook is printed, i.e.: Notebook job #223150 Notebook job #223151 Are there any ways to capture that Job Run ID (#223150 or #223151)? We have 50 or ...

Latest Reply
no2
New Contributor II
  • 7 kudos

Thanks for the response @Manoj5 - I had to use this "safeToJson()" option too because all of the previous suggestions in this thread were erroring out for me with a message like "py4j.security.Py4JSecurityException: Method public java.lang.String com...
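
For anyone else hitting the same security error, here is a minimal sketch of the context-based approach, run inside the child notebook; the JSON layout varies by workspace and runtime version, so treat the field names as assumptions and inspect the output yourself:

```python
import json

# Runs inside the child notebook: grab its own run context and hand it back
# to the caller. safeToJson() is the variant that avoids the
# Py4JSecurityException raised by toJson() on some workspace configurations.
ctx = json.loads(
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().safeToJson()
)

# The exact layout of the context JSON varies by platform version, so print it
# once and pick out the run-id field (often nested under "attributes" or "tags").
print(json.dumps(ctx, indent=2))

# Hand whatever identifier you find back to the master notebook.
dbutils.notebook.exit(json.dumps(ctx))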

20 More Replies
Richard_547342
by New Contributor III
  • 3435 Views
  • 2 replies
  • 2 kudos

Resolved! Column comments in a DLT Python notebook

The SQL API specification in the DLT docs shows an option for adding column comments when creating a table. Is there an equivalent way to do this when creating a DLT pipeline with a Python notebook? The Python API specification in the DLT docs does n...

Latest Reply
jonathandbyrd
  • 2 kudos

This works in a readStream/writeStream scenario for us, but the exact same code fails when put into a DLT pipeline.
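
For reference, a hedged sketch of the route the Python API does offer, passing a DDL-style schema string with COMMENT clauses to @dlt.table; the table and column names below are invented for illustration:

```python
import dlt

@dlt.table(
    name="sensor_readings_silver",          # hypothetical table name
    comment="Cleansed sensor readings",
    # A DDL-style schema string can carry per-column COMMENT clauses,
    # which is the closest Python equivalent of the SQL syntax.
    schema="""
        sensor_id STRING COMMENT 'Unique sensor identifier',
        reading   DOUBLE COMMENT 'Measured value',
        event_ts  TIMESTAMP COMMENT 'Time the reading was taken'
    """,
)
def sensor_readings_silver():
    return dlt.read_stream("sensor_readings_bronze").select("sensor_id", "reading", "event_ts")
```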

1 More Replies
alex307
by New Contributor
  • 10 Views
  • 1 reply
  • 0 kudos

How to Stop Driver Node from Overloading When Using ThreadPoolExecutor in Databricks

Hi everyone, I'm using a ThreadPoolExecutor in Databricks to run multiple notebooks at the same time. The problem is that it seems like all the processing happens on the driver node, while the executor nodes are idle. This causes the driver to run out...

Latest Reply
mmayorga
Databricks Employee
  • 0 kudos

Greetings @alex307, and thank you for your question. When using ThreadPoolExecutor to run multiple notebooks concurrently in Databricks, the workload is being executed on the driver node rather than distributed across Spark executors. This res...
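
As a rough illustration of why this happens and one way to keep the driver from being swamped; the notebook paths and worker count are placeholders, not recommended values:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder notebook paths.
notebook_paths = ["/Workspace/jobs/load_orders", "/Workspace/jobs/load_customers"]

def run_notebook(path: str) -> str:
    # dbutils.notebook.run only orchestrates from the driver; Spark work inside
    # each child notebook is still distributed to the executors. Anything
    # non-Spark in those notebooks (pandas, requests, ...) runs on the driver,
    # which is what exhausts its memory when too many run at once.
    return dbutils.notebook.run(path, 3600)

# Bounding max_workers caps how many child notebooks the driver juggles at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_notebook, notebook_paths))

print(results)
```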

vartyg
by Visitor
  • 30 Views
  • 2 replies
  • 0 kudos

Scaling Declarative Streaming Pipelines for CDC from On-Prem Database to Lakehouse

We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse. Since LakeFlow Connect currently does not support direct mirroring from on...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Yes, a Databricks Labs project seems perfect for your scenario: https://databrickslabs.github.io/dlt-meta/index.html

1 More Replies
Nis
by New Contributor II
  • 2388 Views
  • 2 replies
  • 2 kudos

Best sequence for using the VACUUM, OPTIMIZE, FSCK REPAIR, and REFRESH commands.

I have a Delta table whose size increases gradually; we now have around 1.5 crore (15 million) rows. While running the VACUUM command on that table I am getting the below error. ERROR: Job aborted due to stage failure: Task 7 in stage 491.0 failed 4 times, most...

Latest Reply
alex307
New Contributor
  • 2 kudos

In my opinion the best order is: OPTIMIZE → VACUUM → FSCK REPAIR → REFRESH. Your error is likely a timeout; try more cluster resources or a longer retention period.
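
For reference, the same sequence as a hedged sketch; the table name and retention window are placeholders, and the 168-hour retention should be checked against your own time-travel requirements:

```python
table = "main.sales.orders"  # placeholder table name

# 1. Compact small files first so later steps scan fewer files.
spark.sql(f"OPTIMIZE {table}")

# 2. Delete files no longer referenced by the table, keeping 7 days of history.
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")

# 3. Remove transaction-log entries for files that are physically missing.
spark.sql(f"FSCK REPAIR TABLE {table}")

# 4. Invalidate cached metadata so downstream reads see the current state.
spark.sql(f"REFRESH TABLE {table}")
```

Running OPTIMIZE first means the small files it replaces become unreferenced, so a later VACUUM can remove them once they fall outside the retention window.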

1 More Replies
jano
by New Contributor III
  • 30 Views
  • 1 reply
  • 0 kudos

DABs with multiple GitHub sources

I want to deploy a DAB that has dev using a GitHub branch and prod using a GitHub release tag. I can't seem to find a way to make this part dynamic based on the target. Things I've tried: - Setting the git variable in the databricks.yml - Making the g...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

You may want to look into SHA-Based Versioning. For more details, look here: https://towardsdev.com/ci-cd-strategies-for-databricks-asset-bundles-e4aaf921823e  

hgm251
by New Contributor
  • 83 Views
  • 3 replies
  • 1 kudos

Online tables to synced tables: why is it creating a different service principal every time?

Hello! We started to move our online tables to synced_tables. We just couldn't figure out why it is creating a new service principal every time we ran the same code we use for online tables? try: fe.create_feature_spec(name=feature_spec_name ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @hgm251, here are some things to consider. Things are working as designed: when you create a new Feature Serving or Model Serving endpoint, Databricks automatically provisions a dedicated service principal for that endpoint, and a fresh...

2 More Replies
DaPo
by New Contributor III
  • 3382 Views
  • 2 replies
  • 2 kudos

Resolved! DLT Streaming With Watermark fails, suggesting I should add watermarks

Hi all, I have the following problem: I have two streaming tables containing time-series measurements from different sensor data, each fed by multiple sensors. (Imagine: multiple temperature sensors for the first table, and multiple humidity sensors ...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

To resolve the DLT streaming aggregation error about unsupported output modes and watermarks in Databricks, you need to carefully set watermarks on the original event timestamp rather than on computed columns like "time_window" and carefully consider...
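
A minimal sketch of what "watermark the original event timestamp" looks like in a DLT Python notebook; the table and column names are invented for illustration:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="temperature_10min_avg")  # hypothetical table name
def temperature_10min_avg():
    return (
        dlt.read_stream("temperature_bronze")
        # Watermark the raw event-time column itself, not a derived
        # window/struct column, so append-mode aggregation is supported.
        .withWatermark("event_ts", "10 minutes")
        .groupBy(F.window("event_ts", "10 minutes"), "sensor_id")
        .agg(F.avg("temperature").alias("avg_temperature"))
    )
```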

1 More Replies
Dave_Nithio
by Contributor II
  • 3245 Views
  • 1 reply
  • 0 kudos

Transaction Log Failed Integrity Checks

I have started to receive the following error message - that the transaction log has failed integrity checks - when attempting to optimize and run compaction on a table. It also occurs when I attempt to alter this table. This blocks my pipeline from r...

[Attachment: Dave_Nithio_1-1741718129217.png]
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your issue—encountering "the transaction log has failed integrity checks" in Databricks Delta Lake—indicates metadata corruption or an inconsistency in the Delta transaction log (_delta_log). This commonly disrupts DML operations like OPTIMIZE, DELET...

OmarE
by New Contributor II
  • 3624 Views
  • 1 reply
  • 1 kudos

Streamlit Databricks App Compute Scaling

I have a Streamlit Databricks app and I’m looking to increase the compute resources. According to the documentation and the current settings, the app is limited to 2 vCPUs and 6 GB of memory. Is there a way to adjust these limits or add more resource...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

You can increase compute resources for your Streamlit Databricks app, but this requires explicitly configuring the compute size in the Databricks app management UI or via deployment configuration—environment variables like DATABRICKS_CLUSTER_ID alone...

Arunraja
by New Contributor II
  • 3297 Views
  • 1 reply
  • 0 kudos

AI/BI Genie throwing internal error

For any prompt I am getting INTERNAL_ERROR: AI service did not respond with a valid answer

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The "INTERNAL_ERROR: AI service did not respond with a valid answer" in Databricks AI/BI Genie typically means the Genie service failed to process your query, often due to one of a few common issues. This can include problems with the table existence...

turagittech
by Contributor
  • 3438 Views
  • 1 reply
  • 0 kudos

Finding all folder paths in a blob store connected via UC external connection

Hi All, I need to easily find all the paths in a blob store to find the files and load them. I have tried using an Azure Blob Storage connection in Python and I have a solution that works, but it is very slow. I was speaking to a data engineer, and he suggest...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The most efficient way to list all file paths in an Azure Blob Storage container from Databricks, especially when Hierarchical Namespace (HNS) is not enabled, is to use Azure SDKs targeting the blob flat namespace directly rather than filesystem prot...
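
For completeness, a hedged sketch of the SDK-based listing approach described above; the account URL, container name, and prefix are placeholders, and DefaultAzureCredential assumes you have Azure credentials configured for the session:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

# Placeholders: point these at your own storage account and container.
account_url = "https://<storage-account>.blob.core.windows.net"
container_name = "<container>"

container = ContainerClient(
    account_url=account_url,
    container_name=container_name,
    credential=DefaultAzureCredential(),
)

# list_blobs walks the flat namespace server-side, so a single paged call
# returns every blob path without recursing folder by folder.
paths = [blob.name for blob in container.list_blobs(name_starts_with="raw/")]
print(f"Found {len(paths)} blobs")
```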

Sega2
by New Contributor III
  • 3597 Views
  • 2 replies
  • 1 kudos

Debugger freezes when calling spark.sql with dbx connect

I have just created a simple bundle with Databricks, and am using Databricks Connect to debug locally. This is my script: from pyspark.sql import SparkSession, DataFrame def get_taxis(spark: SparkSession) -> DataFrame: return spark.read.table("samp...

[Attachments: Sega2_1-1740135258051.png, Sega2_0-1740135225882.png]
Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

The issue you're experiencing—where your script freezes in VS Code when running spark.sql locally using Databricks Connect, but works correctly when deployed—can result from several common causes related to Databricks Connect configuration, networkin...
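
Before digging into the bundle code, it can help to confirm the local session can reach the cluster at all; a minimal sketch, assuming a named profile in ~/.databrickscfg (the profile name is a placeholder):

```python
from databricks.connect import DatabricksSession

# Builds the session from a named profile in ~/.databrickscfg;
# the profile name here is a placeholder.
spark = DatabricksSession.builder.profile("DEFAULT").getOrCreate()

# A tiny query with an explicit action: if this hangs too, the problem is the
# connection or cluster, not the bundle code being debugged.
print(spark.sql("SELECT current_catalog() AS catalog, current_user() AS user").collect())
```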

1 More Replies
akshaym0056
by New Contributor
  • 3542 Views
  • 2 replies
  • 0 kudos

How to Define Constants at Bundle Level in Databricks Asset Bundles for Use in Notebooks?

I'm working with Databricks Asset Bundles and need to define constants at the bundle level based on the target environment. These constants will be used inside Databricks notebooks. For example, I want a constant gold_catalog to take different values ...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Yes, you can define environment-specific constants at the bundle level in Databricks Asset Bundles and make them accessible inside Databricks notebooks, without relying on task-level parameters. This can be done using environment variables, bundle co...
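
As one hedged illustration of the environment-variable route: the variable name GOLD_CATALOG, its default, and the table below are assumptions, and the bundle target is assumed to export the variable on the job cluster.

```python
import os

# Assumes the bundle target (dev/prod) exports GOLD_CATALOG on the job cluster,
# e.g. via cluster environment variables; the name and default are placeholders.
gold_catalog = os.environ.get("GOLD_CATALOG", "dev_gold")

# Hypothetical table, shown only to illustrate using the constant in a notebook.
df = spark.read.table(f"{gold_catalog}.finance.transactions")
```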

1 More Replies
Databricks36
by New Contributor
  • 3474 Views
  • 1 reply
  • 0 kudos

Accessing Databricks Delta table in ADF using system-defined managed identity

I am using a Lookup activity in ADF which reads the Delta table values from Databricks. Currently I am using the system-defined managed identity of ADF to connect to the Databricks Delta table. I am unable to see my Unity Catalog database names in the look...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are experiencing an issue in Azure Data Factory (ADF) where the Lookup activity does not show your Unity Catalog databases in the configuration dropdown, even though connectivity from ADF to Databricks is successful and you have followed all reco...

