Data Engineering

Forum Posts

by dsoat • New Contributor

08-08-2025 4:02:03 AM

2202 Views
2 replies
0 kudos

Performance Issue with MinHash + Approx Similarity Join for Fuzzy Duplicate Detection

Hello Community, we have implemented fuzzy matching logic in Databricks using the MinHash algorithm along with the approxSimilarityJoin API to identify duplicate records in a large dataset. While the logic is working correctly, we are facing a signi...

Latest Reply

RheaC
New Contributor II

03-04-2026 12:58:36 AM

0 kudos

On a dataset with millions of rows, approxSimilarityJoin(df, df, …) can become slow because it has to build a large list of candidate pairs (rows that might match) before it can score and filter them. Candidate explosion means your settings produce to...
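
To make the candidate explosion concrete, here is a small pure-Python sketch. It is a simplified model and not Spark's implementation: the helper names (shingles, minhash_signature, candidate_pairs), the sample records, and the rule that two rows become candidates when they share a MinHash value in any hash table are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Split text into a set of overlapping character k-grams."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with the table index."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes):
    """One minimum hash value per hash table."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def candidate_pairs(records, num_hashes):
    """Pairs of record ids that share a MinHash value in at least one table."""
    buckets = defaultdict(list)
    for rec_id, text in records.items():
        signature = minhash_signature(shingles(text), num_hashes)
        for table, value in enumerate(signature):
            buckets[(table, value)].append(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample data; ids 1 and 2 are exact duplicates.
records = {
    1: "Acme Corporation Ltd",
    2: "Acme Corporation Ltd",
    3: "Acme Corp Limited",
    4: "Globex Industries",
    5: "Globex Industry Inc",
    6: "Initech Software",
}

few = candidate_pairs(records, num_hashes=1)
many = candidate_pairs(records, num_hashes=8)
print(len(few), "candidate pairs with 1 hash table")
print(len(many), "candidate pairs with 8 hash tables")
```

In this model, adding hash tables can only grow the candidate set, and every candidate still has to be scored afterwards. On millions of rows that scoring step is where the time goes.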

1 More Replies

by saurabh_aher • New Contributor III

07-30-2025 11:47:50 PM

3748 Views
9 replies
1 kudos

RECURSION_ROW_LIMIT - how to increase beyond 1M?

I have a use case that requires more than 1M rows, but recursion is limited to 1M. How can I increase this limit in a recursive CTE?

Latest Reply

KapilPatil
New Contributor II

03-02-2026 9:21:01 AM

1 kudos

Hi saurabh_aher, I was facing the same issue. I resolved it by using the LIMIT ALL clause where the recursive CTE is used in the SELECT clause. Additionally, the Databricks Runtime (DBR) version must be 17.2 or above.

8 More Replies

by NehaR • New Contributor III

11-18-2024 9:52:56 AM

2535 Views
5 replies
1 kudos

Way to enforce partition column in where clause

Hi All, I want to know if it is possible to enforce that all queries must include a partition filter when the Delta table is partitioned in Databricks. I tried the option below and set the required property, but it doesn't work and I can still query...

databricks delta table

Delta table

partition

Latest Reply

balajij8
Contributor III

03-01-2026 7:54:02 AM

1 kudos

Liquid clustering is flexible and handles most of these issues automatically. You can use liquid clustering instead of forcing teams to use a partition filter.

4 More Replies

by Danish11052000 • Contributor

02-17-2026 8:57:20 PM

636 Views
1 reply
0 kudos

Resolved! How should I correctly extract the full table name from request_params in audit logs?

I’m trying to build a UC usage/refresh tracking table for every workspace. For each workspace, I want to know how many times a UC table was refreshed or accessed each month. To do this, I’m reading the Databricks audit logs and I need to extract only...

Latest Reply

Ashwin_DSA
Databricks Employee

02-27-2026 1:18:30 AM

0 kudos

Hi @Danish11052000, Is there a reason you prefer building your own table for this? I'm asking because there are simpler and more reliable patterns than hand-parsing. If the account has system tables enabled, you can query system.access.audit directly...

by Skcmsa007 • New Contributor

02-03-2026 8:34:08 AM

702 Views
1 reply
0 kudos

Databricks app 504 Upstream request timeout

I have deployed my FastAPI application in Databricks Apps and set the keep-alive timeout to 1200. Issue: From the Databricks Swagger UI I am getting 504 "upstream request timeout" after 2 mins, while my API takes 3 min to respond. But in backend my task got...

Latest Reply

Lu_Wang_ENB_DBX
Databricks Employee

02-26-2026 8:30:23 PM

0 kudos

TLDR: You cannot increase the upstream gateway timeout in Databricks Apps. The best practice and quick solution to handle operations that take longer than the gateway limit is to implement a "status pull" (polling) pattern. Why the Timeout Occurs: Data...
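
A minimal sketch of the polling pattern, using only the Python standard library. The function names (submit_job, get_status, wait_for), the in-memory job store, and slow_task are hypothetical; in a FastAPI app, submit_job and get_status would sit behind two routes (for example one POST to start the work and one GET to check on it), and the store would need to live outside the process if the app runs more than one worker.

```python
import threading
import time
import uuid

# In-memory job store; an assumption that only holds for a single process.
_jobs = {}
_lock = threading.Lock()


def submit_job(work, *args):
    """Start work(*args) in the background and return a job id immediately."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"status": "running", "result": None, "error": None}

    def runner():
        try:
            update = {"status": "done", "result": work(*args), "error": None}
        except Exception as exc:  # report the failure instead of losing it
            update = {"status": "failed", "result": None, "error": str(exc)}
        with _lock:
            _jobs[job_id] = update

    threading.Thread(target=runner, daemon=True).start()
    return job_id


def get_status(job_id):
    """Return a copy of the job record; cheap enough to call on every poll."""
    with _lock:
        job = _jobs.get(job_id)
        return dict(job) if job else {"status": "unknown", "result": None, "error": None}


def wait_for(job_id, interval=0.05, timeout=5.0):
    """Client-side loop: poll until the job leaves the 'running' state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status["status"] != "running":
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")


def slow_task(x):
    """Stand-in for the long-running operation."""
    time.sleep(0.2)
    return x * 2


job_id = submit_job(slow_task, 21)
print(get_status(job_id))  # the submit call itself returns right away
print(wait_for(job_id))
```

Each individual request stays short, so none of them runs into the gateway limit, while the long operation finishes in the background.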

by Hsn • New Contributor II

11-03-2025 1:32:38 AM

925 Views
5 replies
1 kudos

Resolved! Suggest about data engineer

Hey, I'm Hasan Sayyed, currently pursuing SYBCA. I want to become a Data Engineer, but as a beginner, I’ve wasted some time learning other languages and technologies due to a lack of proper knowledge about this field. If someone could guide and teach...

Latest Reply

xandermuchanga
New Contributor II

02-26-2026 2:45:15 PM

1 kudos

4 More Replies

by raimundovidal • New Contributor II

02-18-2026 5:00:10 AM

478 Views
1 reply
0 kudos

Resolved! Managed File Events: Are reads from the file events cache independent per pipeline?

We have two Databricks workspaces (staging and production) attached to the same Unity Catalog metastore. Both workspaces run DLT pipelines that use Auto Loader with cloudFiles.useManagedFileEvents = "true" to ingest from the same external location (sa...

Latest Reply

Ashwin_DSA
Databricks Employee

02-26-2026 2:01:46 PM

0 kudos

Hi @raimundovidal, You’re safe to run both staging and production Lakeflow Spark Declarative Pipelines with cloudFiles.useManagedFileEvents = "true" against the same external location (same S3 path) and same Unity Catalog metastore, as long as each p...

by Eibraao • New Contributor II

02-10-2026 3:48:53 AM

923 Views
6 replies
0 kudos

Disable the dashboard sharing field for dashboard creators

"How can I disable the dashboard sharing field for dashboard creators who are not admins? I tried changing the creator’s permission from 'CAN_MANAGE' to 'CAN_READ', but it had no effect — the creator still retains the 'CAN_MANAGE' permission

Latest Reply

Eibraao
New Contributor II

02-10-2026 3:49:43 AM

0 kudos

5 More Replies

by Datalight • Contributor

11-13-2025 9:58:49 PM

965 Views
2 replies
0 kudos

Resolved! Design Oracle Fusion SCM to Azure Databricks

Hello Techie, I am planning to migrate all modules of Oracle Fusion SCM data to Azure Databricks. Is BICC (Business Intelligence Cloud Connector) the only option, or are other options available? Can anyone please help me with reference architecture...

Latest Reply

Datalight
Contributor

02-26-2026 3:22:19 AM

0 kudos

@mark_ott: Thanks a ton. Sorry for the late reply; the client was not sure about the approach. Your solution helps a lot. Thanks again.

1 More Replies

by rvo19941 • Databricks Partner

11-04-2024 12:03:28 PM

5852 Views
3 replies
0 kudos

Auto Loader File Notification Mode not working with ADLS Gen2 and files written as a stream

Dear, I am working on a real-time use case and am therefore using Auto Loader with file notification to ingest JSON files from a Gen2 Azure Storage Account in real time. Full refreshes of my table work fine, but I noticed Auto Loader was not picking up...

ADLS

Auto Loader

Event Subscription

File Notification

Queue Storage

Latest Reply

mark_ott
Databricks Employee

11-17-2025 4:00:03 AM

0 kudos

Auto Loader file notification in Databricks relies on Azure Event Grid’s BlobCreated event to trigger notifications for newly created files in Azure Data Lake Gen2. The issue you’re experiencing is a known limitation when files are written via certai...

2 More Replies

by janm2 • New Contributor II

07-01-2025 5:27:11 AM

2822 Views
6 replies
1 kudos

Auto Loader cleanSource option does not take any effect

Hello everyone, I was very keen to try out Auto Loader's new cleanSource option so we can clean up our landing folder easily. However, I found out it does not have any effect whatsoever. As I cannot create a support case, I am creating this post. A sim...

Latest Reply

awhorton
New Contributor II

02-25-2026 8:02:02 AM

1 kudos

I had the same issue, which was caused by colons in the filenames. It failed silently in the app, but log4j contained warnings like this: 26/02/20 07:11:07 WARN CleanSourceFileMover: [queryId = f0e53] Unexpected exception when cleaning: /Volumes/prod/...
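
If colons in file names are the suspect, a quick check over a directory listing can confirm it before digging through log4j. This is only a sketch: files_with_colons and safe_name are hypothetical helpers, the sample paths are made up, and the helpers are meant for plain slash-separated paths rather than URIs with a scheme.

```python
from pathlib import PurePosixPath


def files_with_colons(paths):
    """Return the paths whose final component (the file name) contains ':'."""
    return [p for p in paths if ":" in PurePosixPath(p).name]


def safe_name(path, replacement="-"):
    """Suggest a colon-free file name; the directory part is left unchanged."""
    p = PurePosixPath(path)
    return str(p.with_name(p.name.replace(":", replacement)))


# Hypothetical landing-folder listing.
paths = [
    "/Volumes/landing/events/2026-02-20T07:11:07.json",
    "/Volumes/landing/events/2026-02-20T07-12-00.json",
]

for bad in files_with_colons(paths):
    print(bad, "->", safe_name(bad))
```

Renaming at the producer side, so that new files never contain a colon, would avoid having to rename anything in the landing folder afterwards.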

5 More Replies

by Sneeze7432 • New Contributor III

07-09-2025 2:05:34 PM

5932 Views
14 replies
2 kudos

File Trigger Not Triggering Multiple Runs

I have a job with one task, which is to run a notebook. The job is set up with a file arrival trigger with my blob storage as the location. The trigger works and the job starts when a new file arrives in the location, but it does not run for ...

Latest Reply

maddy08
New Contributor II

02-25-2026 4:52:17 AM

2 kudos

@Sneeze7432 did you solve this? The file arrival trigger groups files when it executes; I verified this with the Databricks team. You may encounter a "multiple source matched" error during MERGE operations. To overcome this, it’s better to APPEND only into the raw/bronze layer, ...
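
Because one triggered run can pick up several files, the same key may appear more than once in a single batch, and that is the situation in which a MERGE can report multiple matching source rows. One common way to handle it, not necessarily what this reply goes on to recommend, is to reduce the batch to the latest row per key before merging. The sketch below uses plain Python dictionaries; latest_per_key and the column names id and updated_at are hypothetical.

```python
def latest_per_key(rows, key, order_by):
    """Keep one row per key: the one with the largest order_by value."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


# Hypothetical batch built from two files that arrived before one run.
batch = [
    {"id": 1, "updated_at": "2026-02-25T04:00:00", "value": "a"},
    {"id": 2, "updated_at": "2026-02-25T04:01:00", "value": "b"},
    {"id": 1, "updated_at": "2026-02-25T04:05:00", "value": "a2"},
]

for row in latest_per_key(batch, key="id", order_by="updated_at"):
    print(row)
```

The raw/bronze table can still keep every appended row; the reduction applies only to the batch that feeds the MERGE.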

13 More Replies

by Shimon • New Contributor II

12-14-2025 4:37:00 AM

1400 Views
3 replies
0 kudos

Jackson version conflict

Hi, I am trying to implement the Spark TableProvider API and I am experiencing a JAR conflict (I am using the 17.3 runtime): com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.15.2 requires Jackson Databind version >= 2.15.0 and < 2.1...

Latest Reply

emanuele_m
Databricks Employee

02-25-2026 4:05:06 AM

0 kudos

Hi, this problem occurs if you have dynamic module registration, e.g. new ObjectMapper().findAndRegisterModules(), and the way to solve it is to use something like this instead: val jsonMapper = new ObjectMapper(); jsonMapper.registerModule(DefaultScalaModu...

2 More Replies

by hello_world • Databricks Partner

12-26-2022 5:20:35 PM

4768 Views
2 replies
5 kudos

What is the purpose of the USAGE privilege?

I watched a couple of courses on Databricks Academy, none of which clearly explains or demonstrates the purpose of the USAGE privilege. USAGE: does not give any abilities, but is an additional requirement to perform any action on a schema object. I hav...

Latest Reply

Celebal2
Databricks Partner

02-25-2026 3:58:56 AM

5 kudos

In Databricks (Unity Catalog), USAGE is a basic access privilege that allows a user to access a container object but not read or modify data inside it. Think of it like: “Permission to enter the building, but not open any rooms.”
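
The building analogy can be written down as a toy permission check. This models only the rule described in this thread (USAGE grants nothing by itself, yet is required alongside the object privilege); it is not how Databricks evaluates grants, and can_select, the grants dictionary, and the user and object names are hypothetical.

```python
def can_select(grants, user, schema, table):
    """A read succeeds only with USAGE on the schema AND SELECT on the table."""
    held = grants.get(user, set())
    has_usage = ("USAGE", schema) in held
    has_select = ("SELECT", f"{schema}.{table}") in held
    return has_usage and has_select


# Hypothetical grants: each user holds a set of (privilege, object) pairs.
grants = {
    "alice": {("USAGE", "sales"), ("SELECT", "sales.orders")},
    "bob": {("SELECT", "sales.orders")},  # key to the room, not the building
    "carol": {("USAGE", "sales")},        # in the building, no room key
}

for user in ("alice", "bob", "carol"):
    print(user, can_select(grants, user, "sales", "orders"))
```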

1 More Replies

by dan11 • New Contributor II

03-04-2016 8:46:20 PM

7098 Views
5 replies
1 kudos

sql delete?

Hello Databricks people, I started working with Databricks today. I have a SQL script which I developed with sqlite3 on a laptop. I want to port the script to Databricks. I started with two SQL statements: select count(prop_id) from prop0; del...

Latest Reply

oliverstonez
New Contributor III

02-25-2026 3:05:28 AM

1 kudos

You aren't doing anything wrong logically, but Databricks requires row-level changes to happen on Delta Lake tables. Standard Spark tables (like those backed by raw Parquet) are often immutable. Have a look at the Language Manual for DELETE to ensure...

4 More Replies

Databricks Community
