Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

analyticsnerd
by New Contributor III
  • 479 Views
  • 5 replies
  • 3 kudos

Resolved! Row tracking in Delta tables

What exactly is row tracking, and why should we use it for our Delta tables? Could you explain with an example how it works internally, and is it mandatory to use?

Latest Reply
Poorva21
New Contributor III

Row tracking gives each Delta row a stable internal ID, so Delta can track inserts/updates/deletes across table versions, even when files are rewritten or compacted. Suppose we have a Delta table:

id | value
1  | A
2  | B

When row tracking is enabled, Delta Lake st...
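The idea in the reply above can be sketched in plain Python. This is a conceptual illustration only, not Delta's actual internals: the point is that row IDs are assigned once at insert time and survive file rewrites such as compaction. (On a real table, row tracking can be enabled with the delta.enableRowTracking table property.)

```python
# Conceptual sketch (plain Python, NOT Delta internals): each row gets a
# stable ID when first written; compaction rewrites files but keeps the
# IDs, so a row can still be matched across table versions.
import itertools

_next_id = itertools.count(1)

def write_rows(values):
    """Assign a stable row ID to each newly inserted value."""
    return [{"row_id": next(_next_id), "value": v} for v in values]

def compact(files):
    """Rewrite many small files into one; row IDs are preserved."""
    return [row for f in files for row in f]

file1 = write_rows(["A"])
file2 = write_rows(["B"])
compacted = compact([file1, file2])

ids_before = {r["row_id"] for r in file1 + file2}
ids_after = {r["row_id"] for r in compacted}
assert ids_before == ids_after  # rows remain identifiable after the rewrite
```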

4 More Replies
__Aziz__
by New Contributor II
  • 283 Views
  • 1 reply
  • 1 kudos

Resolved! MongoDB connector duplicate writes

Hi everyone, has anyone run into this issue? I'm using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is: overwrite the collection (very fast), then create the indexes. Occasionally, I'm seeing duplicates...

Latest Reply
bianca_unifeye
Contributor

Hi Aziz, what you're seeing is expected behaviour when combining Spark retries with non-idempotent writes. Spark's write path is task-based and fault-tolerant. If a task fails part-way through writing to MongoDB, Spark will retry that task. From Spar...
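The failure mode described here can be reproduced with a toy simulation (plain Python, no Spark or MongoDB involved; all names are illustrative): a task "dies" halfway through, the retry rewrites everything from the start, and only an idempotent upsert keyed on _id comes out clean.

```python
# Toy simulation: why task retries duplicate rows with append-style
# writes but not with idempotent upserts keyed on _id.
def run_with_retry(write_fn, collection, rows):
    """First attempt 'fails' halfway; the retry re-writes all rows,
    mimicking Spark re-running a failed task from the beginning."""
    try:
        write_fn(collection, rows, fail_after=len(rows) // 2)
    except RuntimeError:
        write_fn(collection, rows, fail_after=None)  # the retry

def append_write(coll, rows, fail_after=None):
    """Insert-only write into a list: retries leave duplicates behind."""
    for i, row in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("task died mid-write")
        coll.append(dict(row))

def upsert_write(coll, rows, fail_after=None):
    """Replace-by-_id write into a dict: retries are harmless."""
    for i, row in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("task died mid-write")
        coll[row["_id"]] = dict(row)

rows = [{"_id": i, "v": i * 10} for i in range(4)]
appended, upserted = [], {}
run_with_retry(append_write, appended, rows)
run_with_retry(upsert_write, upserted, rows)
assert len(appended) > len(rows)   # duplicates from the retried task
assert len(upserted) == len(rows)  # idempotent: retry changed nothing
```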

abetogi
by New Contributor III
  • 1829 Views
  • 3 replies
  • 2 kudos

AI

At Chevron we actively use Databricks to provide answers to business users. It was extremely interesting to see the LakehouseIQ initiatives, as they can expedite how fast our users can receive their answers/reports. Is there any documentation that I...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Guys, this thread was created in 2023, and the user who created it was last seen in 2023. I think there's no point in resurrecting this thread.

2 More Replies
radha_krishna
by New Contributor
  • 471 Views
  • 4 replies
  • 1 kudos

"ai_parse_document()" is not a full OCR engine? It's not extracting text from a high-quality image

 I used "ai_parse_document()" to parse a PNG file that contains cat images and text. From the image, I wanted to extract all the cat names, but the response returned nothing. It seems that "ai_parse_document()" does not support rich image extraction....

Latest Reply
Raman_Unifeye
Contributor III

@szymon_dybczak - yes, since it relies on AI models, there is a chance of missing a few cases due to its non-deterministic nature. I have used it in anger with a vast number of PDFs, and it has worked pretty well in all those cases. Have not tried with PNG...

3 More Replies
Michael_Galli
by Contributor III
  • 15191 Views
  • 5 replies
  • 8 kudos

Resolved! Monitoring Azure Databricks in an Azure Log Analytics Workspace

Does anyone have experience with the mspnp/spark-monitoring library? Is this best practice, or are there better ways to monitor a Databricks cluster?

Latest Reply
vr
Valued Contributor

Interesting that Microsoft deleted this project. Was there any announcement as to when, why, and what to do now?

4 More Replies
Ravikumashi
by Contributor
  • 3360 Views
  • 4 replies
  • 1 kudos

Resolved! Issue with Logging Spark Events to LogAnalytics after Upgrading to Databricks 11.3 LTS

We have recently been in the process of upgrading our Databricks clusters to version 11.3 LTS. As part of this upgrade, we have been working on integrating the logging of Spark events to LogAnalytics using the repository available at https://github.c...

Latest Reply
vr
Valued Contributor

Does anyone know why this repository was deleted? https://github.com/mspnp/spark-monitoring

3 More Replies
LeoGaller
by New Contributor II
  • 9322 Views
  • 5 replies
  • 5 kudos

Resolved! What are the options for "spark_conf.spark.databricks.cluster.profile"?

Hey guys, I'm trying to find out what options we can pass to spark_conf.spark.databricks.cluster.profile. Looking around, I know that some of the available configs are singleNode and serverless, but are there others? Where is the documentation for it?...

Latest Reply
LeoGallerDbx
Databricks Employee

Looking internally, I was able to find the following:
  • For single node mode: the config should be set to 'singleNode'
  • For standard mode: the config should NOT be set to 'singleNode'
  • For serverless mode: the config should be set to 'serverless'
So, in ...
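Expressed as a cluster-spec fragment, a single-node cluster pairs that profile with a local master and zero workers. This is a hypothetical payload shown as a Python dict for illustration; field names follow the public Clusters API, and the local[*] master plus ResourceClass tag are what single-node clusters typically pair with the profile.

```python
# Sketch of a single-node cluster spec (illustrative, not authoritative).
single_node_conf = {
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",  # or "serverless"
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

# Standard mode: per the reply above, simply omit the profile key.
standard_conf = {"num_workers": 2, "spark_conf": {}}
```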

4 More Replies
deng_dev
by New Contributor III
  • 1041 Views
  • 3 replies
  • 3 kudos

Resolved! Databricks AutoLoader IncrementalListing mode changes

Hi everyone! I was investigating how Databricks AutoLoader IncrementalListing mode changes will impact my current autoloader streams. Currently all of them are set to cloudFiles.useIncrementalListing: auto. So I wanted to check if any of the streams is ac...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @deng_dev, when cloudFiles.useIncrementalListing is set to auto, Auto Loader automatically detects whether a given directory is applicable for incremental listing by checking and comparing file paths of previously completed directory listings. To e...
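The idea behind incremental listing can be sketched in a few lines of plain Python (conceptual only, not Auto Loader's implementation): when file names arrive in lexicographic order, the stream can checkpoint the last path seen and list only what sorts after it, instead of re-listing the whole directory.

```python
# Conceptual sketch: incremental listing relies on new files sorting
# lexicographically after old ones (e.g. date-based names).
def incremental_list(all_files, last_seen):
    """Return only files strictly after the checkpointed path."""
    return sorted(f for f in all_files if last_seen is None or f > last_seen)

files = ["2024-01-01.json", "2024-01-02.json"]
batch1 = incremental_list(files, None)       # first run: everything
checkpoint = batch1[-1]                      # remember the last path seen

files.append("2024-01-03.json")              # a new file lands
batch2 = incremental_list(files, checkpoint) # only the new file is listed
assert batch2 == ["2024-01-03.json"]
```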

2 More Replies
Pratikmsbsvm
by Contributor
  • 1087 Views
  • 2 replies
  • 2 kudos

Resolved! Establishing a Connection between ADLS Gen2, Databricks and ADF In Microsoft Azure

Hello, could someone please help me with establishing a connection between ADLS Gen2, Databricks, and ADF, with full steps if possible? Do I need to route through Key Vault? This is the first time I am doing this in production. Could somebody please share a detailed step ...

Latest Reply
nayan_wylde
Esteemed Contributor

For a production environment (ADF as orchestrator, ADLS Gen2 as storage, Databricks for PySpark transformations), follow Microsoft-recommended best practices: Databricks → ADLS Gen2: use Unity Catalog with Azure Managed Identity (via Access Connector)...

1 More Replies
fly_high_five
by New Contributor III
  • 237 Views
  • 1 reply
  • 3 kudos

Unable to retrieve catalog, schema, tables using JDBC endpoint of SQL Warehouse

Hi, I am connecting to a SQL Warehouse in UC using its JDBC endpoint via DBeaver. However, it doesn't list any catalogs, schemas, or tables. I checked the permissions of the SQL WH by logging in to the ADB workspace and queried the table (attached a dummy table exa...

Latest Reply
mitchellg-db
Databricks Employee

Hi there, I'm not familiar with DBeaver specifically, but I have experienced DBSQL Warehouses being much stricter when enforcing permissions than All-Purpose Clusters. Warehouses check explicitly if that identity has access to those assets, where All...

sumit2jha
by New Contributor III
  • 7749 Views
  • 7 replies
  • 5 kudos

Resolved! ADE 2.1 Unable to run Classroom-Setup-3.1

After running %run ../Includes/Classroom-Setup-3.1, I get this error message (error screenshot attached): AnalysisException: You are trying to read a Delta table `spark_catalog`.`dbacademy_sumit_s_jha_hk_ey_com_adewd_3_1`.`date_looku...

Latest Reply
sumit2jha
New Contributor III

First run this file before starting; the problem will be solved.

6 More Replies
Mathias_Peters
by Contributor II
  • 310 Views
  • 2 replies
  • 1 kudos

Resolved! Streamed DLT Pipeline using a lookup table

Hi, I need to join three streams/streamed data sets in a DLT pipeline. I am reading a sequence of events per group key from a Kinesis data stream. The logically first of the events per group contains a marker which determines whether that group is re...

Latest Reply
Mathias_Peters
Contributor II

Hi @mark_ott, thank you for your help. I have a follow-up question regarding data completeness and out-of-order processing. I have decided to go with the delta table option, since super low latency is not an issue and since this option has (seemingly...

1 More Replies
hidden
by New Contributor II
  • 242 Views
  • 1 replies
  • 0 kudos

Resolved! Delta live tables upsert logic without apply changes or autocdc logic

I want to create Delta Live Tables which should be streaming, and I want to use manual upsert logic without using the APPLY CHANGES API or Auto CDC API. How can I do it?

Latest Reply
Louis_Frolio
Databricks Employee

Hello @hidden. Creating streaming Delta Live Tables with manual upsert logic: let's dig in. This question comes up a lot when folks want upsert behavior in DLT but aren't using APPLY CHANGES or Auto-CDC. The short version: DLT doesn't let you drop...
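For reference, the upsert semantics usually wanted here look like the following plain-Python sketch. This is conceptual only: in Spark this is typically done with foreachBatch plus a MERGE INTO, and the key/sequence column names below are illustrative.

```python
# Conceptual MERGE semantics: per micro-batch, keep the latest row per
# key, then update-or-insert into the target.
def upsert_batch(target, batch, key="id", seq="ts"):
    latest = {}
    for row in batch:  # dedupe the batch: keep highest sequence per key
        k = row[key]
        if k not in latest or row[seq] > latest[k][seq]:
            latest[k] = row
    for k, row in latest.items():  # MERGE: matched -> update, else insert
        target[k] = dict(row)

target = {}
upsert_batch(target, [{"id": 1, "ts": 1, "v": "a"}, {"id": 1, "ts": 2, "v": "b"}])
upsert_batch(target, [{"id": 2, "ts": 1, "v": "c"}])
assert target[1]["v"] == "b"  # later version of key 1 won
assert len(target) == 2       # key 2 was inserted
```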

dhruvs2
by New Contributor II
  • 937 Views
  • 4 replies
  • 5 kudos

How to trigger a Databricks job only after multiple other jobs have completed

We have a use case where Job C should start only after both Job A and Job B have successfully completed. In Airflow, we achieve this using an ExternalTaskSensor to set dependencies across different DAGs. Is there a way to configure something similar in...

Latest Reply
BS_THE_ANALYST
Esteemed Contributor III

Hi @dhruvs2. A Lakeflow Job consists of tasks, and the tasks can be things like notebooks or other jobs. If you want to orchestrate many jobs, I'd agree that having a job to do this is your best bet. Then you can set up the dependencies as you require. I...
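Sketched as a hypothetical Jobs API payload, the wrapper job runs A and B as run_job_task tasks and makes C depend on both via depends_on; the job IDs 111/222/333 are placeholders. Shown as a Python dict for illustration:

```python
# Hypothetical wrapper-job payload: Job C's task depends on both
# Job A's and Job B's tasks finishing successfully.
orchestrator = {
    "name": "run_a_b_then_c",
    "tasks": [
        {"task_key": "job_a", "run_job_task": {"job_id": 111}},
        {"task_key": "job_b", "run_job_task": {"job_id": 222}},
        {
            "task_key": "job_c",
            "depends_on": [{"task_key": "job_a"}, {"task_key": "job_b"}],
            "run_job_task": {"job_id": 333},
        },
    ],
}

task_c = next(t for t in orchestrator["tasks"] if t["task_key"] == "job_c")
assert {d["task_key"] for d in task_c["depends_on"]} == {"job_a", "job_b"}
```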

3 More Replies
andreacfm
by New Contributor II
  • 268 Views
  • 1 reply
  • 1 kudos

Resolved! Simple append only in DLT

I am facing an issue trying to find a way to insert some computed rows into a table in the context of a DLT pipeline. My use case is extremely simple. Moving from bronze to silver, I update several tables using a mix of streaming and materialized table...

Latest Reply
Louis_Frolio
Databricks Employee

Greetings @andreacfm, you're not missing a thing. What you're seeing is a known limitation in how DLT/Lakeflow pipelines handle append_flow. It really does expect a streaming source, and the once=True flag only fires during the first run of the pipe...
