Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

analyticsnerd
by New Contributor II
  • 11 Views
  • 2 replies
  • 0 kudos

Row tracking in Delta tables

What exactly is row tracking, and why should we use it for our Delta tables? Could you explain with an example how it works internally? Is it mandatory to use?

Latest Reply
K_Anudeep
Databricks Employee
  • 0 kudos

Hello @analyticsnerd , Enabling row tracking allows us to track row-level lineage in a Delta table across multiple versions. When enabled, Delta creates/exposes two hidden metadata columns which can be accessed as _metadata.row_id and _metadata.row_c...
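
To make this concrete, a minimal sketch of enabling and querying row tracking (the table name is hypothetical; the second hidden column is _metadata.row_commit_version per the Delta documentation):

  # Enable row tracking on an existing Delta table (hypothetical name).
  spark.sql("ALTER TABLE main.default.orders SET TBLPROPERTIES ('delta.enableRowTracking' = 'true')")

  # The hidden columns appear only when selected explicitly.
  spark.sql("""
      SELECT _metadata.row_id, _metadata.row_commit_version, *
      FROM main.default.orders
  """).show()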

1 More Replies
maurya_vish24
by New Contributor
  • 12 Views
  • 1 reply
  • 0 kudos

Workflow scheduling on a particular working day of the month in ADB

Hi, I am looking to schedule a workflow to execute on the 3rd working day of the month. A working day here is Mon-Fri. I could not find any direct crontab solution but have created a watcher file solution for it. The code below will create a watcher file a...

Latest Reply
bianca_unifeye
New Contributor III
  • 0 kudos

Hi Vishal, you’re right: there’s no single Quartz cron expression that says “run on the 3rd working day (Mon–Fri) of every month”. Quartz can handle “Nth weekday of month” (like 3#1 = first Wednesday), but not “Nth business day regardless of weekday”,...
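
One common alternative to a watcher file, sketched below under the assumption that the job is scheduled every weekday and a first "guard" task exits unless today is the 3rd business day (dbutils is available only in Databricks notebooks; the rest is plain Python):

  from datetime import date

  def is_nth_business_day(today: date, n: int = 3) -> bool:
      # Collect the Mon-Fri dates from the 1st of the month up to and including today.
      weekdays = [
          date(today.year, today.month, d)
          for d in range(1, today.day + 1)
          if date(today.year, today.month, d).weekday() < 5
      ]
      return len(weekdays) == n and weekdays[-1] == today

  # Guard task: runs every weekday, but only proceeds on the 3rd working day.
  if not is_nth_business_day(date.today()):
      dbutils.notebook.exit("Not the 3rd working day - skipping run")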

Raman_Unifeye
by Contributor III
  • 51 Views
  • 1 reply
  • 2 kudos

Spark Jobs View on a serverless cluster!! Naah.....

Have you ever noticed (and wondered) that the wonderful Spark Job UI is no longer available in the Databricks notebook if the cell is executed using a 'serverless' cluster? Traditionally, whenever we run Spark code (an action command), we used to see the...

Latest Reply
bianca_unifeye
New Contributor III
  • 2 kudos

Very good observation, Raman! Thank you for bringing this to the community's attention!

__Aziz__
by Visitor
  • 29 Views
  • 1 reply
  • 1 kudos

Resolved! mongodb connector duplicate writes

Hi everyone, has anyone run into this issue? I’m using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is: overwrite the collection (very fast), then create the indexes. Occasionally, I’m seeing duplicates...

Latest Reply
bianca_unifeye
New Contributor III
  • 1 kudos

Hi Aziz, what you’re seeing is an expected behaviour when combining Spark retries with non-idempotent writes. Spark’s write path is task-based and fault-tolerant. If a task fails part-way through writing to MongoDB, Spark will retry that task. From Spar...
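
A sketch of the usual fix, making the write idempotent by keying on a unique business field so retried tasks rewrite the same documents instead of inserting new ones (database, collection, and key names are hypothetical; options assume MongoDB Spark Connector 10.x, and df is the DataFrame being written):

  (df.write
     .format("mongodb")
     .mode("append")
     .option("database", "mydb")
     .option("collection", "orders")
     .option("operationType", "replace")   # replace-by-key rather than blind insert
     .option("idFieldList", "order_id")    # retried tasks rewrite the same _id, so no duplicates
     .save())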

ismaelhenzel
by Contributor II
  • 20 Views
  • 1 reply
  • 0 kudos

Delta Live Tables - collaborative development

I would like to know the best practice for collaborating on a Delta Live Tables pipeline. I was thinking that each developer should have their own DLT pipeline in the development workspace. Currently, each domain has its development catalog, like sal...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @ismaelhenzel, yep, that's true. But you can find quite an interesting workaround/solution for this problem in the excellent blog post below: https://www.advancinganalytics.co.uk/blog/avoid-delta-live-table-conflicts-with-databricks-asset-bundles

abetogi
by New Contributor III
  • 1576 Views
  • 3 replies
  • 0 kudos

AI

At Chevron we actively use Databricks to provide answers to business users. It was extremely interesting to see the LakeHouseIQ initiatives, as they can expedite how fast our users can receive their answers/reports. Is there any documentation that I...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Guys, this thread was created in 2023, and the user who created it was last seen in 2023. I think there’s no point in resurrecting this thread.

2 More Replies
mordex
by Visitor
  • 45 Views
  • 3 replies
  • 1 kudos

Resolved! Why is Spark creating 5 jobs and 200 tasks?

I am trying to read 1,000 small CSV files, each about 30 KB, stored in a Databricks volume. Below is the query I am running: df = spark.read.option("header", True).csv('/path') followed by df.collect(). Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks, 4 ha...

Latest Reply
Raman_Unifeye
Contributor III
  • 1 kudos

@mordex - yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism. Run the command below to get its value: spark.c...
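
For reference, a one-liner to inspect that setting in a session (the default varies by Spark/Databricks runtime version):

  # Read the current file-listing parallelism cap.
  spark.conf.get("spark.sql.sources.parallelPartitionDiscovery.parallelism")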

2 More Replies
cgrant
by Databricks Employee
  • 19352 Views
  • 4 replies
  • 6 kudos

What is the difference between OPTIMIZE and Auto Optimize?

I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?

Latest Reply
basit
New Contributor II
  • 6 kudos

Is this still a valid answer in 2025? https://docs.databricks.com/aws/en/delta/tune-file-size#auto-compaction-for-delta-lake-on-databricks
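
For anyone comparing the two today, a minimal sketch (table name is hypothetical): OPTIMIZE is an on-demand compaction command, while the auto optimize table properties opt a table into optimized writes and auto compaction at write time.

  # On-demand compaction of an existing Delta table.
  spark.sql("OPTIMIZE main.default.events")

  # Opt the same table into write-time optimization instead.
  spark.sql("""
      ALTER TABLE main.default.events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
      )
  """)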

3 More Replies
radha_krishna
by New Contributor
  • 98 Views
  • 4 replies
  • 1 kudos

"ai_parse_document()" is not a full OCR engine ? It's not extracting text from high quality image

 I used "ai_parse_document()" to parse a PNG file that contains cat images and text. From the image, I wanted to extract all the cat names, but the response returned nothing. It seems that "ai_parse_document()" does not support rich image extraction....

Latest Reply
Raman_Unifeye
Contributor III
  • 1 kudos

@szymon_dybczak - yes, as it relies on AI models, there is a chance of missing a few cases due to its non-deterministic nature. I have used it in anger with a vast number of PDFs, and it has worked pretty well in all those cases. I have not tried it with PNG...
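
For context, the usual invocation pattern looks roughly like the sketch below (the volume path is hypothetical; this assumes the documented ai_parse_document SQL function applied to binary file content, and, as the thread notes, extraction from image-heavy PNGs may come back empty):

  parsed = spark.sql("""
      SELECT path, ai_parse_document(content) AS parsed
      FROM READ_FILES('/Volumes/main/default/docs/', format => 'binaryFile')
  """)
  parsed.show(truncate=False)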

3 More Replies
Michael_Galli
by Contributor III
  • 14639 Views
  • 5 replies
  • 8 kudos

Resolved! Monitoring Azure Databricks in an Azure Log Analytics Workspace

Does anyone have experience with the mspnp/spark-monitoring library? Is this best practice, or are there better ways to monitor a Databricks cluster?

Latest Reply
vr
Valued Contributor
  • 8 kudos

Interesting that Microsoft deleted this project. Was there any announcement as to when, why, and what to do now?

4 More Replies
Ravikumashi
by Contributor
  • 3187 Views
  • 4 replies
  • 1 kudos

Resolved! Issue with Logging Spark Events to LogAnalytics after Upgrading to Databricks 11.3 LTS

We have recently been in the process of upgrading our Databricks clusters to version 11.3 LTS. As part of this upgrade, we have been working on integrating the logging of Spark events to LogAnalytics using the repository available at https://github.c...

Latest Reply
vr
Valued Contributor
  • 1 kudos

Does anyone know why this repository was deleted? https://github.com/mspnp/spark-monitoring

3 More Replies
LeoGaller
by New Contributor II
  • 8823 Views
  • 5 replies
  • 5 kudos

Resolved! What are the options for "spark_conf.spark.databricks.cluster.profile"?

Hey guys, I'm trying to find out what options we can pass to spark_conf.spark.databricks.cluster.profile. I know from looking around that some of the available values are singleNode and serverless, but are there others? Where is the documentation for it?...

Latest Reply
LeoGallerDbx
Databricks Employee
  • 5 kudos

Looking internally, I was able to find the following:
  • For single node mode: the config should be set to 'singleNode'
  • For standard mode: the config should NOT be set to 'singleNode'
  • For serverless mode: the config should be set to 'serverless'
So, in ...
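
A sketch of where that setting lands in a cluster definition, here for single node mode (the spark.master value and ResourceClass tag are the commonly documented companions; treat the exact pairing as an assumption to verify against your runtime):

  # Illustrative single-node cluster spec fragment (e.g. for the Clusters API).
  single_node_cluster = {
      "spark_conf": {
          "spark.databricks.cluster.profile": "singleNode",
          "spark.master": "local[*]",  # typically paired with the singleNode profile
      },
      "custom_tags": {"ResourceClass": "SingleNode"},
      "num_workers": 0,  # single node clusters run driver-only
  }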

4 More Replies
crami
by New Contributor II
  • 49 Views
  • 1 reply
  • 0 kudos

Declarative Pipeline Re-Deployment and existing managed tables exception

Hi, I am facing an issue regarding re-deployment of a declarative pipeline using an asset bundle. On first deployment, I am able to run the pipeline successfully. On execution, the pipeline, as expected, creates tables. However, when I try to re-deploy the pipeli...

Latest Reply
bianca_unifeye
New Contributor III
  • 0 kudos

Redeploys with Asset Bundles can safely reuse existing DLT-managed tables. If you see “managed table already exists”, it usually means:
  • You’re using plain CREATE TABLE when OR REFRESH should be part of it
  • The bundle created a new pipeline pointin...
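
A minimal sketch of the declarative form that survives redeploys, in Python pipeline source (table and source names are hypothetical; in SQL source the equivalent fix is CREATE OR REFRESH rather than plain CREATE TABLE):

  import dlt

  @dlt.table(name="orders_clean")  # declared tables are refreshed, not re-created, on each run
  def orders_clean():
      # spark is available implicitly in DLT pipeline source.
      return spark.read.table("samples.tpch.orders")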

deng_dev
by New Contributor III
  • 76 Views
  • 3 replies
  • 3 kudos

Resolved! Databricks AutoLoader IncrementalListing mode changes

Hi everyone! I was investigating how the Databricks AutoLoader IncrementalListing mode changes will impact my current Auto Loader streams. Currently, all of them are set to cloudFiles.useIncrementalListing: auto. So I wanted to check if any of the streams is ac...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 3 kudos

Hi @deng_dev, when cloudFiles.useIncrementalListing is set to auto, Auto Loader automatically detects whether a given directory is applicable for incremental listing by checking and comparing file paths of previously completed directory listings. To e...
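
For reference, a sketch of where that option sits in a stream definition (the path and file format are hypothetical):

  # Auto Loader stream with the listing mode set explicitly.
  stream = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useIncrementalListing", "auto")  # "true"/"false" force a mode
      .load("/Volumes/main/default/landing/"))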

2 More Replies
Pratikmsbsvm
by Contributor
  • 105 Views
  • 2 replies
  • 2 kudos

Resolved! Establishing a Connection between ADLS Gen2, Databricks and ADF In Microsoft Azure

Hello, may someone please help me with establishing a connection between ADLS Gen2, Databricks, and ADF, with full steps if possible? Do I need to route through Key Vault? This is the first time I am doing this in production. May somebody please share detailed step ...

Latest Reply
nayan_wylde
Esteemed Contributor
  • 2 kudos

For a production environment (ADF as orchestrator, ADLS Gen2 as storage, Databricks for PySpark transformations), follow Microsoft-recommended best practices: Databricks → ADLS Gen2: use Unity Catalog with Azure Managed Identity (via Access Connector)...
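
As one concrete piece of that setup, a hedged sketch of registering the storage with Unity Catalog (location, container, and credential names are hypothetical; the storage credential is created beforehand from the Access Connector's managed identity):

  # Register an ADLS Gen2 container as a Unity Catalog external location.
  spark.sql("""
      CREATE EXTERNAL LOCATION IF NOT EXISTS raw_zone
      URL 'abfss://raw@mystorageacct.dfs.core.windows.net/'
      WITH (STORAGE CREDENTIAL access_connector_cred)
  """)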

1 More Replies
