Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

RJTECHY210
by New Contributor II
  • 633 Views
  • 3 replies
  • 1 kudos

Resolved! Azure Databricks Streamlit Application - Doubts

Hi Databricks community, I am currently tasked with creating a Streamlit application using the Databricks Apps feature. I have created a Lakebase instance to sync the Delta table located in Unity Catalog, and I have also...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @RJTECHY210, yes, it's possible. You can use the Python SDK to achieve what you want. Here's some sample code for reference: from databricks.sdk import WorkspaceClient from databricks.sdk.service.database import DatabaseInstance # Initialize the Worksp...

2 More Replies
GANAPATI_HEGDE
by New Contributor III
  • 398 Views
  • 3 replies
  • 0 kudos

Unable to configure custom compute for DLT pipeline

I am trying to configure a cluster for the pipeline as shown in the attached screenshots; however, DLT keeps using the small cluster as usual. How can I resolve this?

Latest Reply
GANAPATI_HEGDE
New Contributor III
  • 0 kudos

I updated my CLI and deployed the job, but I still don't see the cluster updates in the pipeline.

2 More Replies
hgm251
by New Contributor II
  • 1276 Views
  • 3 replies
  • 3 kudos

badrequest: cannot create online table is being deprecated. creating new online table is not allowed

Hello! This seems very sudden; can we really no longer create online tables? Is there a workaround that would let us create online tables temporarily, as we need more time to move to synced tables? #online_tables

Latest Reply
nayan_wylde
Esteemed Contributor II
  • 3 kudos

Yes, the Databricks online tables (legacy) are being deprecated, and after January 15, 2026, you will no longer be able to access or create them: https://docs.databricks.com/aws/en/machine-learning/feature-store/migrate-from-online-tables Here are a few ...

2 More Replies
pooja_bhumandla
by New Contributor III
  • 583 Views
  • 3 replies
  • 1 kudos

Best Practice for Updating Data Skipping Statistics for Additional Columns

Hi Community, I have a scenario where I've already calculated Delta statistics for the first 32 columns after enabling the data skipping property. Now I need to include 10 more frequently used columns that were not part of the original 32. Goal: I want ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @pooja_bhumandla, updating either of the two options below does not automatically recompute statistics for existing data. Rather, it affects how statistics are collected when adding or updating data in the table in the future: - delta.dataSkippingNumInd...

2 More Replies
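The pattern this reply describes can be sketched in SQL. This is a hedged sketch: the table and column names are hypothetical, and the ANALYZE backfill requires a recent Databricks Runtime, so check the Delta statistics documentation for your version.

```sql
-- Opt the additional columns into data skipping statistics
-- (table and column names are hypothetical).
ALTER TABLE main.analytics.events SET TBLPROPERTIES (
  'delta.dataSkippingStatsColumns' = 'event_type,country,load_date'
);

-- The property only affects future writes; backfill statistics
-- for data already in the table:
ANALYZE TABLE main.analytics.events COMPUTE DELTA STATISTICS;
```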
absan
by Contributor
  • 498 Views
  • 4 replies
  • 6 kudos

Resolved! How to integrate a unique PK expectation into an LDP pipeline graph

Hi everyone, I'm working on an LDP pipeline and need help ensuring a downstream table only runs if a primary key uniqueness validation check passes. In something like dbt this is very easy to configure, but with LDP it seems to require creating a separate view. Addi...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 6 kudos

I know your solution is quite popular (I just don't get the SELECT MAX(load_date) part). Another option is to use AUTO CDC even if you don't have CDC, as there is a KEYS option. If MAX(load_date) means that the latest snapshot is the most essential for you, please check...

3 More Replies
hidden
by New Contributor II
  • 736 Views
  • 3 replies
  • 0 kudos

Resolved! Replicate the behaviour of DLT's create auto cdc flow

I want to write custom logic replicating the behaviour of DLT's create auto cdc flow. How can we do it?

Latest Reply
Hubert-Dudek
Databricks MVP
  • 0 kudos

And you need to handle dozens of edge cases, such as late-arriving data, duplicate data, data arriving in the wrong order, etc.

2 More Replies
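The edge cases mentioned in this thread (late-arriving, duplicate, and out-of-order events) are exactly what AUTO CDC handles for you. A minimal plain-Python sketch of the core idea, assuming a hypothetical event shape with a key, a sequence column, and an operation; this is an illustration, not the DLT API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdcEvent:
    key: str                   # primary key of the target row
    seq: int                   # ordering column, e.g. an event timestamp
    op: str                    # "upsert" or "delete"
    value: Optional[dict] = None

def apply_cdc(events):
    """Replicate the heart of an AUTO CDC flow by hand: for each key,
    keep only the event with the highest sequence number, so a
    late-arriving, duplicated, or out-of-order event can never
    overwrite newer state."""
    latest = {}  # key -> winning event so far
    for ev in events:
        cur = latest.get(ev.key)
        if cur is None or ev.seq > cur.seq:  # stale/duplicate events lose
            latest[ev.key] = ev
    # Materialize the target table, dropping keys whose last event is a delete.
    return {k: ev.value for k, ev in latest.items() if ev.op == "upsert"}
```

Because only the highest sequence number per key wins, replaying a duplicate or a stale event is a no-op, which is the property a production CDC flow relies on.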
ismaelhenzel
by Contributor III
  • 699 Views
  • 5 replies
  • 5 kudos

Resolved! delta live tables - collaborative development

I would like to know the best practice for collaborating on a Delta Live Tables pipeline. I was thinking that each developer should have their own DLT pipeline in the development workspace. Currently, each domain has its development catalog, like sal...

Latest Reply
Poorva21
Contributor II
  • 5 kudos

Yes, each developer should have their own DLT pipeline and their own schema; it's the correct paradigm. It keeps DLT ownership clean and prevents pipeline conflicts. Dev naming doesn't need to be pretty; QA/Prod are where structure matters.

4 More Replies
excavator-matt
by Contributor III
  • 451 Views
  • 3 replies
  • 1 kudos

ABAC tag support for Streaming tables (Spark Lakeflow Declarative Pipelines)?

Hi! We're using Spark Lakeflow Declarative Pipelines for ingesting data from various data sources. To achieve compliance with GDPR, we are planning to start using ABAC tagging. However, I don't understand how we are supposed to use th...

Labels: Data Engineering, abac, LakeFlow, Streaming tables, tags
Latest Reply
excavator-matt
Contributor III
  • 1 kudos

Correction: trying this will result in the following error: "ABAC policies are not supported on tables defined within a pipeline. Remove the policies or contact Databricks support." So it isn't supported.

2 More Replies
feliximmanuel
by New Contributor II
  • 2915 Views
  • 2 replies
  • 2 kudos

Error: oidc: fetch .well-known: Get "https://%E2%80%93host/oidc/.well-known/oauth-authorization-serv

I'm trying to authenticate Databricks using WSL but am suddenly getting this error: /databricks-asset-bundle$ databricks auth login –host https://<XXXXXXXXX>.12.azuredatabricks.net Databricks Profile Name: <XXXXXXXXX> Error: oidc: fetch .well-known: Get "ht...

Latest Reply
guptadeepak
New Contributor II
  • 2 kudos

Great, these are amazing resources! I'm using them to test my IAM apps and flow.

1 More Replies
saicharandeepb
by Contributor
  • 398 Views
  • 1 replies
  • 2 kudos

Decision Tree for Selecting the Right VM Types in Databricks – Looking for Feedback & Improvements!

Hi everyone, I've been working on an updated VM selection decision tree for Azure Databricks, designed to help teams quickly identify the most suitable worker types based on workload behavior. I'm sharing the latest version (in this updated version I'...

Latest Reply
Sahil_Kumar
Databricks Employee
  • 2 kudos

Hi saicharandeepb, You can enrich your chart by adding GPU-accelerated VMs. For computationally challenging tasks that demand high performance, like those associated with deep learning, Azure Databricks supports compute resources that are accelerated...

singhanuj2803
by Contributor
  • 600 Views
  • 4 replies
  • 1 kudos

Troubleshooting Azure Databricks Cluster Pools & spot_bid_max_price Validation Error

Hope you're doing well! I'm reaching out for some guidance on an issue I've encountered while setting up Azure Databricks cluster pools to reduce cluster spin-up and scale times for our jobs. Background: to optimize job execution wait times, I've create...

Latest Reply
Poorva21
Contributor II
  • 1 kudos

Possible reasons:
1. Setting spot_bid_max_price = -1 is not accepted by Azure pools. Azure Databricks only accepts:
  • 0 → on-demand only
  • positive numbers → max spot price
-1 is allowed in cluster policies, but not inside pools, so validation never completes....

3 More Replies
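Following the reply's point that pools accept 0 or a positive spot bid (but not -1), a minimal instance-pool payload might look like the fragment below. This is an assumption-laden sketch: the pool name and node type are placeholders, and spot_bid_max_price is expressed as a percentage of the on-demand price; check the Instance Pools API documentation before use.

```json
{
  "instance_pool_name": "jobs-spot-pool",
  "node_type_id": "Standard_D4ds_v5",
  "min_idle_instances": 0,
  "max_capacity": 20,
  "azure_attributes": {
    "availability": "SPOT_AZURE",
    "spot_bid_max_price": 100
  }
}
```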
molopocho
by New Contributor
  • 253 Views
  • 1 replies
  • 0 kudos

Can't create a new ETL because of compute (?)

I just created a Databricks workspace on GCP with the "Use existing cloud account (Storage & compute)" option. I have already added a few clusters for my tasks, but when I try to create an ETL, I always get this error notification. The file is created in the specifi...

Latest Reply
Saritha_S
Databricks Employee
  • 0 kudos

Hi @molopocho, this feature needs to be enabled in the workspace. If you don't see the option, you need to reach out to the accounts team or create a ticket with the Databricks support team to get it enabled at the workspace level.

Poorva21
by Contributor II
  • 1357 Views
  • 1 replies
  • 1 kudos

Resolved! Best Practices for Optimizing Databricks Costs in Production Workloads?

Hi everyone, I'm working on optimizing Databricks costs for a production-grade data pipeline (Spark + Delta Lake) on Azure. I'm looking for practical, field-tested strategies to reduce compute and storage spend without impacting performance. So far, I'...

Latest Reply
K_Anudeep
Databricks Employee
  • 1 kudos

Hello @Poorva21, below are the answers to your questions. Q1. What are the most impactful cost optimisations for production pipelines? I have worked with multiple customers, and based on my knowledge, below are the high-level optimisations one must have: The ...

mordex
by New Contributor III
  • 557 Views
  • 4 replies
  • 1 kudos

Resolved! Why is spark creating 5 jobs and 200 tasks?

I am trying to read 1,000 small CSV files, each 30 KB in size, stored in a Databricks volume. Below is the code I am running: df = spark.read.option("header", True).csv("/path") followed by df.collect(). Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks, while job 4 ha...

Latest Reply
Raman_Unifeye
Honored Contributor III
  • 1 kudos

@mordex, yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism. Run the command below to get its value: spark.c...

3 More Replies
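The capping behaviour described in the reply can be illustrated with a toy helper. This mimics the claim rather than Spark's actual implementation, and the default of 200 is an assumption matching the cluster discussed in the thread:

```python
def listing_tasks(num_files: int, cap: int = 200) -> int:
    """Toy model of Spark's file-listing parallelism: the listing is
    spread over at most `cap` tasks, where `cap` stands in for the value
    of spark.sql.sources.parallelPartitionDiscovery.parallelism."""
    return min(num_files, cap)
```

With 1,000 or 10,000 input files this returns 200 either way, matching the observation in the question.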
crami
by New Contributor III
  • 343 Views
  • 2 replies
  • 0 kudos

Declarative Pipeline Re-Deployment and existing managed tables exception

Hi, I am facing an issue with re-deployment of a declarative pipeline using an asset bundle. On first deployment, I am able to run the pipeline successfully, and on execution the pipeline creates tables as expected. However, when I try to re-deploy the pipeli...

Latest Reply
Poorva21
Contributor II
  • 0 kudos

Managed tables are "owned" by a DLT pipeline. Re-deploying a pipeline that references the same managed tables will fail unless you either:
  • Drop the existing tables first
  • Use external tables that are not owned by DLT
  • Use a separate development schema/pip...

1 More Replies