Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

saicharandeepb
by Contributor
  • 265 Views
  • 1 reply
  • 2 kudos

Decision Tree for Selecting the Right VM Types in Databricks – Looking for Feedback & Improvements!

Hi everyone, I’ve been working on an updated VM selection decision tree for Azure Databricks, designed to help teams quickly identify the most suitable worker types based on workload behavior. I’m sharing the latest version (In this updated version I’...

Latest Reply
Sahil_Kumar
Databricks Employee
  • 2 kudos

Hi saicharandeepb, You can enrich your chart by adding GPU-accelerated VMs. For computationally challenging tasks that demand high performance, like those associated with deep learning, Azure Databricks supports compute resources that are accelerated...

singhanuj2803
by Contributor
  • 360 Views
  • 4 replies
  • 1 kudos

Troubleshooting Azure Databricks Cluster Pools & spot_bid_max_price Validation Error

Hope you’re doing well! I’m reaching out for some guidance on an issue I’ve encountered while setting up Azure Databricks Cluster Pools to reduce cluster spin-up and scale times for our jobs. Background: To optimize job execution wait times, I’ve create...

Latest Reply
Poorva21
New Contributor III
  • 1 kudos

Possible reasons:
1. Setting spot_bid_max_price = -1 is not accepted by Azure pools. Azure Databricks only accepts:
  • 0 → on-demand only
  • positive numbers → max spot price
-1 is allowed in cluster policies, but not inside pools, so validation never completes....

3 More Replies
Eduard
by New Contributor II
  • 118577 Views
  • 6 replies
  • 1 kudos

Cluster xxxxxxx was terminated during the run.

Hello, I have a problem with the autoscaling of a cluster. Every time the autoscaling is activated I get this error. Does anyone have any idea why this could be? "Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication ...

Latest Reply
marykline
New Contributor II
  • 1 kudos

Hello Databricks Community, the driver node was lost, which might occur as a result of network problems or malfunctioning instances, according to the error message. Here are some potential causes and remedies:
  • Instance Instability: Consider switching t...

5 More Replies
molopocho
by New Contributor
  • 221 Views
  • 1 reply
  • 0 kudos

Can't create a new ETL because of compute (?)

I just created a Databricks workspace on GCP with the "Use existing cloud account (Storage & compute)" option. I already added a few clusters for my tasks, but when I try to create an ETL pipeline, I always get this error notification. The file is created on the specifi...

Latest Reply
Saritha_S
Databricks Employee
  • 0 kudos

Hi @molopocho, we need to enable the feature in the workspace. If you don't see the option, you need to reach out to the accounts team or create a ticket with the Databricks support team to get it enabled at the workspace level.

Poorva21
by New Contributor III
  • 630 Views
  • 1 reply
  • 1 kudos

Resolved! Best Practices for Optimizing Databricks Costs in Production Workloads?

Hi everyone, I'm working on optimizing Databricks costs for a production-grade data pipeline (Spark + Delta Lake) on Azure. I’m looking for practical, field-tested strategies to reduce compute and storage spend without impacting performance. So far, I’...

Latest Reply
K_Anudeep
Databricks Employee
  • 1 kudos

Hello @Poorva21, below are the answers to your questions: Q1. What are the most impactful cost optimisations for production pipelines? I have worked with multiple customers and, based on my experience, below are the high-level optimisations one must have: The ...

Jpeterson
by New Contributor III
  • 6599 Views
  • 9 replies
  • 4 kudos

Databricks SQL Warehouse, Tableau and spark.driver.maxResultSize error

I'm attempting to create a Tableau extract on Tableau Server with a connection to a Databricks large SQL warehouse. The extract process fails due to a spark.driver.maxResultSize error. Using a Databricks interactive cluster in the Data Science & Engineer...

Latest Reply
CallumDean
New Contributor II
  • 4 kudos

I ran into a similar issue exporting data from Databricks to a BI tool. What helped was limiting columns, aggregating before export, and splitting large extracts into smaller chunks instead of one massive pull. I also test such tweaks in a safer envi...

8 More Replies
mordex
by New Contributor III
  • 406 Views
  • 4 replies
  • 1 kudos

Resolved! Why is spark creating 5 jobs and 200 tasks?

I am trying to read 1000 small CSV files, each about 30 KB in size, stored in a Databricks volume. Below is the code I am running: df = spark.read.options(header=True).csv('/path'); df.collect(). Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks, 4 ha...

Latest Reply
Raman_Unifeye
Contributor III
  • 1 kudos

@mordex - yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism. Run the command below to get its value. spark.c...

3 More Replies
crami
by New Contributor II
  • 255 Views
  • 2 replies
  • 0 kudos

Declarative Pipeline Re-Deployment and existing managed tables exception

Hi, I am facing an issue with re-deploying a declarative pipeline using an asset bundle. On first deployment, I am able to run the pipeline successfully. On execution, the pipeline creates tables as expected. However, when I try to re-deploy the pipeli...

Latest Reply
Poorva21
New Contributor III
  • 0 kudos

Managed tables are “owned” by a DLT pipeline. Re-deploying a pipeline that references the same managed tables will fail unless you either:
  • Drop the existing tables first
  • Use external tables that are not owned by DLT
  • Use a separate development schema/pip...

1 More Replies
cgrant
by Databricks Employee
  • 20301 Views
  • 5 replies
  • 6 kudos

What is the difference between OPTIMIZE and Auto Optimize?

I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?

Latest Reply
Poorva21
New Contributor III
  • 6 kudos

Auto Optimize = automatically reduces small files during writes. Best for ongoing ETL.
OPTIMIZE = manual compaction + Z-ORDER for improving performance on existing data.
They are complementary, not competing. Most teams use Auto Optimize for daily inge...

4 More Replies
analyticsnerd
by New Contributor III
  • 433 Views
  • 5 replies
  • 3 kudos

Resolved! Row tracking in Delta tables

What exactly is row tracking and why should we use it for our delta tables? Could you explain with an example how it works internally and is it mandatory to use? 

Latest Reply
Poorva21
New Contributor III
  • 3 kudos

Row tracking gives each Delta row a stable internal ID, so Delta can track inserts/updates/deletes across table versions, even when files are rewritten or compacted. Suppose we have a Delta table:
  id | value
  1  | A
  2  | B
When row tracking is enabled, Delta Lake st...

4 More Replies
__Aziz__
by New Contributor II
  • 251 Views
  • 1 reply
  • 1 kudos

Resolved! mongodb connector duplicate writes

Hi everyone, has anyone run into this issue? I’m using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is:
  • overwrite the collection (very fast),
  • then create the indexes.
Occasionally, I’m seeing duplicates...

Latest Reply
bianca_unifeye
Contributor
  • 1 kudos

Hi Aziz, what you’re seeing is an expected behaviour when combining Spark retries with non-idempotent writes. Spark’s write path is task-based and fault-tolerant. If a task fails part-way through writing to MongoDB, Spark will retry that task. From Spar...

abetogi
by New Contributor III
  • 1805 Views
  • 3 replies
  • 2 kudos

AI

At Chevron we actively use Databricks to provide answers to business users. It was extremely interesting to see the LakeHouseIQ initiatives, as they can expedite how fast our users can receive their answers/reports. Is there any documentation that I...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Guys, this thread was created in 2023, and the user who created it was last seen in 2023. I think there’s no point in resurrecting this thread.

2 More Replies
radha_krishna
by New Contributor
  • 420 Views
  • 4 replies
  • 1 kudos

"ai_parse_document()" is not a full OCR engine ? It's not extracting text from high quality image

 I used "ai_parse_document()" to parse a PNG file that contains cat images and text. From the image, I wanted to extract all the cat names, but the response returned nothing. It seems that "ai_parse_document()" does not support rich image extraction....

Latest Reply
Raman_Unifeye
Contributor III
  • 1 kudos

@szymon_dybczak - yes, as it relies on AI models, there is a chance of it missing a few cases due to its non-deterministic nature. I have used it in anger with a vast number of PDFs and it has worked pretty well in all those cases. Have not tried with PNG...

3 More Replies
Michael_Galli
by Contributor III
  • 15081 Views
  • 5 replies
  • 8 kudos

Resolved! Monitoring Azure Databricks in an Azure Log Analytics Workspace

Does anyone have experience with the mspnp/spark-monitoring library? Is this best practice, or are there better ways to monitor a Databricks cluster?

Latest Reply
vr
Valued Contributor
  • 8 kudos

Interesting that Microsoft deleted this project. Was there any announcement as to when, why, and what to do now?

4 More Replies
Ravikumashi
by Contributor
  • 3340 Views
  • 4 replies
  • 1 kudos

Resolved! Issue with Logging Spark Events to LogAnalytics after Upgrading to Databricks 11.3 LTS

We have recently been in the process of upgrading our Databricks clusters to version 11.3 LTS. As part of this upgrade, we have been working on integrating the logging of Spark events to LogAnalytics using the repository available at https://github.c...

Latest Reply
vr
Valued Contributor
  • 1 kudos

Does anyone know why this repository was deleted? https://github.com/mspnp/spark-monitoring

3 More Replies
