Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by alex307 (New Contributor II)
  • 319 Views
  • 1 reply
  • 2 kudos

Resolved! How to Stop Driver Node from Overloading When Using ThreadPoolExecutor in Databricks

Hi everyone, I'm using a ThreadPoolExecutor in Databricks to run multiple notebooks at the same time. The problem is that it seems like all the processing happens on the driver node, while the executor nodes are idle. This causes the driver to run out...

Latest Reply
mmayorga
Databricks Employee
  • 2 kudos

Greetings @alex307, and thank you for your question. When you use a ThreadPoolExecutor to run multiple notebooks concurrently in Databricks, the workload is executed on the driver node rather than distributed across the Spark executors. This res...
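For illustration, here is a minimal sketch of the orchestration pattern in question (the notebook paths and worker count are hypothetical). Any plain Python work inside the child notebooks still runs on the driver, so bounding max_workers limits how much is in flight at once.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical child notebooks to run concurrently
notebook_paths = ["/Workspace/jobs/load_orders", "/Workspace/jobs/load_customers"]

def run_notebook(path):
    # Blocks the calling thread until the child notebook finishes (1-hour timeout)
    return dbutils.notebook.run(path, 3600)

# A small pool bounds how many notebooks the driver orchestrates at once
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_notebook, notebook_paths))

print(results)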

by vartyg (New Contributor)
  • 266 Views
  • 2 replies
  • 0 kudos

Scaling Declarative Streaming Pipelines for CDC from On-Prem Database to Lakehouse

We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse. Since LakeFlow Connect currently does not support direct mirroring from on...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Yes, a Databricks Labs project (DLT-META) seems perfect for your scenario: https://databrickslabs.github.io/dlt-meta/index.html

1 More Replies
by Nis (New Contributor II)
  • 2558 Views
  • 2 replies
  • 2 kudos

Best sequence for using VACUUM, OPTIMIZE, FSCK REPAIR, and REFRESH commands.

I have a Delta table whose size increases gradually; we now have around 1.5 crore (15 million) rows. While running the VACUUM command on that table I am getting the below error. ERROR: Job aborted due to stage failure: Task 7 in stage 491.0 failed 4 times, most...

Latest Reply
alex307
New Contributor II
  • 2 kudos

In my opinion the best order is: OPTIMIZE → VACUUM → FSCK REPAIR → REFRESH. Your error is likely a timeout; try more cluster resources or a longer retention period.
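A minimal sketch of that order using spark.sql, with a hypothetical table name:

# Sketch of the suggested maintenance order on a hypothetical table
table = "main.sales.events"

spark.sql(f"OPTIMIZE {table}")                  # compact small files first
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")   # then remove unreferenced files (7-day retention shown)
spark.sql(f"FSCK REPAIR TABLE {table}")         # drop log entries pointing at files missing from storage
spark.sql(f"REFRESH TABLE {table}")             # clear cached metadata so readers see the new state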

1 More Replies
by hgm251 (New Contributor II)
  • 367 Views
  • 3 replies
  • 1 kudos

Online tables to synced tables: why is it creating a different service principal every time?

Hello! We started to move our online tables to synced_tables. We just couldn't figure out why it creates a new service principal every time we run the same code we used for online tables. try: fe.create_feature_spec(name=feature_spec_name ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @hgm251, here are some things to consider. Things are working as designed: when you create a new Feature Serving or Model Serving endpoint, Databricks automatically provisions a dedicated service principal for that endpoint, and a fresh...

2 More Replies
by DaPo (New Contributor III)
  • 3857 Views
  • 2 replies
  • 2 kudos

Resolved! DLT Streaming With Watermark fails, suggesting I should add watermarks

Hi all, I have the following problem: I have two streaming tables containing time-series measurements from different sensor data, each fed by multiple sensors. (Imagine multiple temperature sensors for the first table, and multiple humidity sensors ...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

To resolve the DLT streaming aggregation error about unsupported output modes and watermarks in Databricks, you need to set watermarks on the original event timestamp rather than on computed columns like "time_window", and carefully consider...
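As a sketch of what that looks like, with hypothetical table and column names (event_ts carries the raw event time):

import dlt
from pyspark.sql import functions as F

@dlt.table
def temperature_5min_avg():
    return (
        spark.readStream.table("sensors.temperature_raw")       # hypothetical source table
        .withWatermark("event_ts", "10 minutes")                 # watermark on the raw event-time column
        .groupBy(F.window("event_ts", "5 minutes"), "sensor_id")
        .agg(F.avg("value").alias("avg_temp"))
    )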

1 More Replies
by Dave_Nithio (Contributor II)
  • 3640 Views
  • 1 reply
  • 0 kudos

Transaction Log Failed Integrity Checks

I have started to receive the following error message - that the transaction log has failed integrity checks - when attempting to optimize and run compaction on a table. It also occurs when I attempt to alter this table. This blocks my pipeline from r...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your issue—encountering "the transaction log has failed integrity checks" in Databricks Delta Lake—indicates metadata corruption or an inconsistency in the Delta transaction log (_delta_log). This commonly disrupts DML operations like OPTIMIZE, DELET...
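One possible remediation sketch, hedged because the right fix depends on what actually corrupted the log; the table name is hypothetical and DESCRIBE HISTORY is only there to help locate the offending commit:

table = "main.analytics.events"   # hypothetical table

spark.sql(f"FSCK REPAIR TABLE {table}")   # drop log entries that reference files missing from storage
spark.sql(f"REFRESH TABLE {table}")       # invalidate cached metadata
spark.sql(f"DESCRIBE HISTORY {table}").show(truncate=False)   # inspect recent commits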

by OmarE (New Contributor II)
  • 3929 Views
  • 1 reply
  • 2 kudos

Streamlit Databricks App Compute Scaling

I have a Streamlit Databricks app and I'm looking to increase the compute resources. According to the documentation and the current settings, the app is limited to 2 vCPUs and 6 GB of memory. Is there a way to adjust these limits or add more resource...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

You can increase compute resources for your Streamlit Databricks app, but this requires explicitly configuring the compute size in the Databricks app management UI or via deployment configuration—environment variables like DATABRICKS_CLUSTER_ID alone...

by Arunraja (New Contributor II)
  • 3599 Views
  • 1 reply
  • 0 kudos

AI BI Genie throwing internal error

For any prompt I am getting INTERNAL_ERROR: AI service did not respond with a valid answer

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The "INTERNAL_ERROR: AI service did not respond with a valid answer" in Databricks AI/BI Genie typically means the Genie service failed to process your query, often due to one of a few common issues. This can include problems with the table existence...

by turagittech (Contributor)
  • 3797 Views
  • 1 reply
  • 1 kudos

Finding all folder paths in a blob store connected via UC external connection

Hi all, I need to easily find all the paths in a blob store to find the files and load them. I have tried using an Azure Blob Storage connection in Python, and I have a solution that works, but it is very slow. I was speaking to a data engineer, and he suggest...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

The most efficient way to list all file paths in an Azure Blob Storage container from Databricks, especially when Hierarchical Namespace (HNS) is not enabled, is to use Azure SDKs targeting the blob flat namespace directly rather than filesystem prot...
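A minimal sketch of that approach with the Azure SDK (the account, container, prefix, and credential are placeholders):

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("raw-data")   # hypothetical container

# list_blobs enumerates the flat namespace server-side, so no per-"folder" recursion is needed
paths = [b.name for b in container.list_blobs(name_starts_with="landing/")]
print(len(paths), paths[:5])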

by Sega2 (New Contributor III)
  • 3950 Views
  • 2 replies
  • 1 kudos

Debugger freezes when calling spark.sql with dbx connect

I have just created a simple bundle with Databricks, and am using Databricks Connect to debug locally. This is my script: from pyspark.sql import SparkSession, DataFrame def get_taxis(spark: SparkSession) -> DataFrame: return spark.read.table("samp...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

The issue you're experiencing—where your script freezes in VS Code when running spark.sql locally using Databricks Connect, but works correctly when deployed—can result from several common causes related to Databricks Connect configuration, networkin...
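If it helps to rule out configuration, here is a minimal sketch of pinning the local session to an explicit profile with Databricks Connect (the profile name is a placeholder) and running a small round-trip query:

from databricks.connect import DatabricksSession

# Builds the session from a named Databricks CLI profile (host, token, cluster_id)
spark = DatabricksSession.builder.profile("DEFAULT").getOrCreate()

# A tiny query verifies the round trip to the cluster before debugging anything bigger
spark.sql("SELECT current_version()").show(truncate=False)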

1 More Replies
by akshaym0056 (New Contributor)
  • 3921 Views
  • 2 replies
  • 0 kudos

How to Define Constants at Bundle Level in Databricks Asset Bundles for Use in Notebooks?

I'm working with Databricks Asset Bundles and need to define constants at the bundle level based on the target environment. These constants will be used inside Databricks notebooks. For example, I want a constant gold_catalog to take different values ...

  • 3921 Views
  • 2 replies
  • 0 kudos
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Yes, you can define environment-specific constants at the bundle level in Databricks Asset Bundles and make them accessible inside Databricks notebooks, without relying on task-level parameters. This can be done using environment variables, bundle co...
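One common pattern, sketched here with hypothetical names, is to surface the bundle-level value to the notebook as a job parameter and read it with dbutils.widgets:

# Notebook side: read a per-target value injected by the bundle (names are hypothetical;
# the value itself would be defined from a bundle variable in databricks.yml per target)
dbutils.widgets.text("gold_catalog", "dev_gold")      # default for interactive runs
gold_catalog = dbutils.widgets.get("gold_catalog")    # value passed in when run as a bundle job

df = spark.table(f"{gold_catalog}.finance.daily_revenue")   # hypothetical table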

1 More Replies
by Databricks36 (New Contributor)
  • 3872 Views
  • 1 reply
  • 0 kudos

Accessing Databricks Delta table in ADF using system-assigned managed identity

I am using a Lookup activity in ADF to read Delta table values from Databricks. I am currently using the system-assigned managed identity of ADF to connect to the Databricks Delta table. I am unable to see my Unity Catalog database names in the look...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are experiencing an issue in Azure Data Factory (ADF) where the Lookup activity does not show your Unity Catalog databases in the configuration dropdown, even though connectivity from ADF to Databricks is successful and you have followed all reco...

by jordanpinder (New Contributor)
  • 3882 Views
  • 1 reply
  • 0 kudos

Native geometry Parquet support

Hi there! With the recent GeoParquet 2.0 announcements, I'm curious to understand how this impacts storing geospatial data in Databricks and Delta. For reference: the Parquet specification officially adopting geospatial guidance allowing native storage...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

GeoParquet 2.0’s formalization within the Apache Parquet specification is a significant step for native geospatial data storage across the modern data ecosystem, particularly for platforms like Databricks and Delta Lake. In summary, Delta Lake's reli...

by Dave_Nithio (Contributor II)
  • 3276 Views
  • 1 reply
  • 0 kudos

Preset Partner Connect Schema Changes

When using Partner Connect to connect Serverless Databricks to my BI tool Preset, you must manually define the schema that Preset has access to. In my case, I individually selected all databases currently in my hive_metastore. The problem is, once cre...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

No, there is currently no simple, direct way to add new schema access to an existing Serverless Databricks SQL warehouse connection through Partner Connect for Preset—neither through Databricks UI, BI tool configuration, nor the Databricks service pr...

by fscaravelli (New Contributor)
  • 3599 Views
  • 1 reply
  • 0 kudos

Ingest files from GCS with Auto Loader in DLT pipeline running on AWS

I have some DLT pipelines working fine ingesting files from S3. Now I'm trying to build a pipeline to ingest files from GCS using Auto Loader. I'm running Databricks on AWS. The code I have: import dlt import json from pyspark.sql.functions import col ...

  • 3599 Views
  • 1 replies
  • 0 kudos
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your error is due to how Databricks on AWS is trying to access GCS: it's defaulting to using the GCP metadata server (which only exists on Google Cloud VMs), not the service account key you provided. This is a common issue when connecting GCS from no...
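A sketch of the usual workaround, assuming the connector should authenticate with a service-account key rather than the metadata server. The Spark config keys below follow the Hadoop GCS connector naming, the values are placeholders, and in practice the key material belongs in a secret scope:

# Cluster Spark config (shown as comments; set on the cluster, not hard-coded in a notebook):
#   spark.hadoop.google.cloud.auth.service.account.enable true
#   spark.hadoop.fs.gs.project.id <gcp-project>
#   spark.hadoop.fs.gs.auth.service.account.email <sa>@<gcp-project>.iam.gserviceaccount.com
#   spark.hadoop.fs.gs.auth.service.account.private.key.id {{secrets/gcp/private_key_id}}
#   spark.hadoop.fs.gs.auth.service.account.private.key {{secrets/gcp/private_key}}

# With that in place, the Auto Loader source itself is unchanged:
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("gs://<bucket>/landing/")   # hypothetical bucket and prefix
)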

