Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

gudurusreddy99
by New Contributor II
  • 74 Views
  • 1 replies
  • 1 kudos

DLT or DP: How to do full refresh of Delta table from DLT Pipeline to consider all records from Tbl

Requirement: I have a Kafka streaming pipeline that ingests Pixels data. For each incoming record, I need to validate the Pixels key against an existing Delta table (pixel_tracking_data), which contains over 2 billion records accumulated over the past ...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Matching streaming data in real time against a massive, fast-changing Delta table requires careful architectural choices. In your case, latency is high for the most recent records, and the solution only matches against data ≥10 minutes old. This is a...

  • 1 kudos
der
by Contributor II
  • 463 Views
  • 10 replies
  • 0 kudos

Rasterio on shared/standard cluster has no access to proj.db

We are trying to use rasterio on a Databricks shared/standard cluster with DBR 17.1. Rasterio is installed directly on the cluster as a library.
Code:
import rasterio
rasterio.show_versions()
Output:
rasterio info:
rasterio: 1.4.3
GDAL: 3.9.3
PROJ: 9.4.1
GEOS: 3.11...

Latest Reply
der
Contributor II
  • 0 kudos

Current workaround: If you select the "Photon" engine on a Standard/Shared cluster, they change the access rights of /databricks/native/proj-data and rasterio works fine. The downside: you pay for "Photon" compute to use a Python library, which does not use S...

  • 0 kudos
9 More Replies
jano
by New Contributor III
  • 98 Views
  • 2 replies
  • 0 kudos

Resolved! DABs with multi github sources

I want to deploy a DAB that uses a GitHub branch for dev and a GitHub release tag for prod. I can't seem to find a way to make this part dynamic based on the target. Things I've tried:
- Setting the git variable in the databricks.yml
- making the g...

Latest Reply
jano
New Contributor III
  • 0 kudos

I ended up finding this discussion which mostly ended up working. What was not mentioned is the first resources block should be in the job.yml and the overwrite parameters mentioned below are in the databricks.yml. You cannot put both in the databric...

  • 0 kudos
1 More Replies
Volker
by Contributor
  • 3284 Views
  • 5 replies
  • 4 kudos

Asset Bundles cannot run job with single node job cluster

Hello community, we are deploying a job using asset bundles and the job should run on a single node job cluster. Here is the DAB job definition:
resources:
  jobs:
    example_job:
      name: example_job
      tasks:
        - task_key: main_task
          ...

Latest Reply
kunalmishra9
Contributor
  • 4 kudos

In case this is now breaking for anyone (as it is for me), there's an update here to follow along with on how to define single node compute! https://github.com/databricks/databricks-sdk-py/issues/881

  • 4 kudos
4 More Replies
hanspetter
by New Contributor III
  • 65484 Views
  • 21 replies
  • 7 kudos

Resolved! Is it possible to get Job Run ID of notebook run by dbutils.notebook.run?

When running a notebook using dbutils.notebook.run from a master notebook, a URL to that running notebook is printed, e.g.: Notebook job #223150, Notebook job #223151. Is there any way to capture that Job Run ID (#223150 or #223151)? We have 50 or ...

Latest Reply
no2
New Contributor II
  • 7 kudos

Thanks for the response @Manoj5 - I had to use this "safeToJson()" option too because all of the previous suggestions in this thread were erroring out for me with a message like "py4j.security.Py4JSecurityException: Method public java.lang.String com...

  • 7 kudos
20 More Replies
Richard_547342
by New Contributor III
  • 3484 Views
  • 2 replies
  • 2 kudos

Resolved! Column comments in DLT python notebook

The SQL API specification in the DLT docs shows an option for adding column comments when creating a table. Is there an equivalent way to do this when creating a DLT pipeline with a Python notebook? The Python API specification in the DLT docs does n...

Latest Reply
jonathandbyrd
New Contributor
  • 2 kudos

This works in a readStream/writeStream scenario for us, but the exact same code fails when put in a DLT pipeline.

  • 2 kudos
1 More Replies
alex307
by New Contributor
  • 74 Views
  • 1 replies
  • 2 kudos

How to Stop Driver Node from Overloading When Using ThreadPoolExecutor in Databricks

Hi everyone, I'm using a ThreadPoolExecutor in Databricks to run multiple notebooks at the same time. The problem is that it seems like all the processing happens on the driver node, while the executor nodes are idle. This causes the driver to run out...

Latest Reply
mmayorga
Databricks Employee
  • 2 kudos

Greetings @alex307 and thank you for sending your question. When using ThreadPoolExecutor to run multiple notebooks concurrently in Databricks, the workload is being executed on the driver node rather than distributed across Spark executors. This res...

  • 2 kudos
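
Editor's sketch for the thread above: the reply preview is cut off, so this is not necessarily what it goes on to recommend. Since every thread in a ThreadPoolExecutor lives on the driver, one simple mitigation is to cap how many notebook runs are in flight at once and to collect failures per notebook. The names `run_notebook` and `run_all` are hypothetical; `run_notebook` is a stand-in for a call such as `dbutils.notebook.run(path, timeout_seconds)` inside a Databricks notebook.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_notebook(path, timeout_seconds=3600):
    # Hypothetical stand-in. Inside Databricks this would be something like:
    #   return dbutils.notebook.run(path, timeout_seconds)
    return f"finished {path}"


def run_all(paths, max_workers=4, runner=run_notebook):
    """Run notebooks with at most `max_workers` running concurrently.

    Returns (results, errors), both keyed by notebook path, so one failing
    notebook does not hide the outcome of the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the notebook path it belongs to.
        futures = {pool.submit(runner, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = exc
    return results, errors
```

A small `max_workers` limits the number of simultaneous driver-side threads; it does not by itself move any work onto the executors.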
vartyg
by New Contributor
  • 84 Views
  • 2 replies
  • 0 kudos

Scaling Declarative Streaming Pipelines for CDC from On-Prem Database to Lakehouse

We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse. Since LakeFlow Connect currently does not support direct mirroring from on...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Yes, a Databricks Labs project seems perfect for your scenario. https://databrickslabs.github.io/dlt-meta/index.html

  • 0 kudos
1 More Replies
Nis
by New Contributor II
  • 2428 Views
  • 2 replies
  • 2 kudos

Best sequence of using Vacuum, optimize, fsck repair and refresh commands.

I have a Delta table whose size increases gradually; we now have around 1.5 crore rows. While running the VACUUM command on that table, I am getting the error below. ERROR: Job aborted due to stage failure: Task 7 in stage 491.0 failed 4 times, most...

Latest Reply
alex307
New Contributor
  • 2 kudos

In my opinion, the best order is: Optimize → Vacuum → FSCK Repair → Refresh. Your error is likely a timeout; try more cluster resources or a longer retention period.

  • 2 kudos
1 More Replies
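
Editor's sketch of the order suggested in the reply above (Optimize, then Vacuum, then FSCK Repair, then Refresh). It only builds the SQL statements; the function name `maintenance_statements` and the table name in the usage note are hypothetical, and the 168-hour default is an assumption matching the usual 7-day Delta retention.

```python
def maintenance_statements(table, retain_hours=168):
    """Return Delta maintenance statements in the order suggested above.

    `retain_hours` feeds VACUUM ... RETAIN; 168 hours (7 days) is assumed
    here as the default retention window.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    return [
        f"OPTIMIZE {table}",                            # compact small files first
        f"VACUUM {table} RETAIN {retain_hours} HOURS",  # then drop unreferenced files
        f"FSCK REPAIR TABLE {table}",                   # then fix log entries for missing files
        f"REFRESH TABLE {table}",                       # finally invalidate cached metadata
    ]
```

In a notebook where `spark` is available, the statements could be run with `for stmt in maintenance_statements("my_schema.my_table"): spark.sql(stmt)`.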
hgm251
by New Contributor
  • 128 Views
  • 3 replies
  • 1 kudos

Online tables to synced table: why is it creating a different service principal every time?

Hello! We started to move our online tables to synced_tables. We just couldn't figure out why it creates a new service principal every time we run the same code we use for online tables.
try: fe.create_feature_spec(name=feature_spec_name ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @hgm251, here are some things to consider. Things are working as designed: when you create a new Feature Serving or Model Serving endpoint, Databricks automatically provisions a dedicated service principal for that endpoint, and a fresh...

  • 1 kudos
2 More Replies
DaPo
by New Contributor III
  • 3454 Views
  • 2 replies
  • 2 kudos

Resolved! DLT Streaming With Watermark fails, suggesting I should add watermarks

Hi all, I have the following problem: I have two streaming tables containing time-series measurements from different sensor data, each fed by multiple sensors. (Imagine: multiple temperature sensors for the first table, and multiple humidity sensors ...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

To resolve the DLT streaming aggregation error about unsupported output modes and watermarks in Databricks, you need to carefully set watermarks on the original event timestamp rather than on computed columns like "time_window" and carefully consider...

  • 2 kudos
1 More Replies
Dave_Nithio
by Contributor II
  • 3291 Views
  • 1 replies
  • 0 kudos

Transaction Log Failed Integrity Checks

I have started to receive the following error message - that the transaction log has failed integrity checks - when attempting to optimize and run compaction on a table. It also occurs when I attempt to alter this table. This blocks my pipeline from r...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your issue—encountering "the transaction log has failed integrity checks" in Databricks Delta Lake—indicates metadata corruption or an inconsistency in the Delta transaction log (_delta_log). This commonly disrupts DML operations like OPTIMIZE, DELET...

  • 0 kudos
OmarE
by New Contributor II
  • 3669 Views
  • 1 replies
  • 1 kudos

Streamlit Databricks App Compute Scaling

I have a streamlit Databricks app and I’m looking to increase the compute resources. According to the documentation and the current settings, the app is limited to 2 vCPUs and 6 GB of memory. Is there a way to adjust these limits or add more resource...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

You can increase compute resources for your Streamlit Databricks app, but this requires explicitly configuring the compute size in the Databricks app management UI or via deployment configuration—environment variables like DATABRICKS_CLUSTER_ID alone...

  • 1 kudos
Arunraja
by New Contributor II
  • 3339 Views
  • 1 replies
  • 0 kudos

AI BI Genie throwing internal error

For any prompt I am getting INTERNAL_ERROR: AI service did not respond with a valid answer

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The "INTERNAL_ERROR: AI service did not respond with a valid answer" in Databricks AI/BI Genie typically means the Genie service failed to process your query, often due to one of a few common issues. This can include problems with the table existence...

  • 0 kudos
turagittech
by Contributor
  • 3478 Views
  • 1 replies
  • 0 kudos

Finding all folder paths in a blob store connected via UC external connection

Hi All, I need to easily find all the paths in a blob store to find the files and load them. I have tried using the Azure Blob Storage connection in Python, and I have a solution that works but is very slow. I was speaking to a data engineer, and he suggest...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The most efficient way to list all file paths in an Azure Blob Storage container from Databricks, especially when Hierarchical Namespace (HNS) is not enabled, is to use Azure SDKs targeting the blob flat namespace directly rather than filesystem prot...

  • 0 kudos
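
Editor's sketch for the thread above: in a flat blob namespace there are no real directories, so "folder paths" are just the prefixes of blob names. Assuming the blob names have already been listed (for example from `ContainerClient.list_blobs()` in the azure-storage-blob SDK, which is not called here), the folder paths could be derived like this. The function name `folder_paths` and the sample names are hypothetical.

```python
def folder_paths(blob_names):
    """Derive every virtual folder path implied by a list of flat blob names.

    Each blob name such as "raw/2024/01/a.csv" contributes the prefixes
    "raw", "raw/2024" and "raw/2024/01". Blobs at the container root
    contribute nothing.
    """
    folders = set()
    for name in blob_names:
        # Everything except the last segment is a virtual folder.
        parts = name.strip("/").split("/")[:-1]
        for depth in range(1, len(parts) + 1):
            folders.add("/".join(parts[:depth]))
    return sorted(folders)
```

Because this works on a single flat listing, it avoids issuing one request per directory level, which is often where the slowness of a recursive walk comes from.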
