Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

dsoat
by New Contributor
  • 2202 Views
  • 2 replies
  • 0 kudos

Performance Issue with MinHash + Approx Similarity Join for Fuzzy Duplicate Detection

Hello Community, we have implemented fuzzy matching logic in Databricks using the MinHash algorithm along with the approxSimilarityJoin API to identify duplicate records in a large dataset. While the logic is working correctly, we are facing a signi...

Latest Reply
RheaC
New Contributor II
  • 0 kudos

On a dataset with millions of rows, approxSimilarityJoin(df, df, …) can become slow because it has to build a large list of candidate pairs (rows that might match) before it can score and filter them. Candidate explosion means your settings produce to...
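To make the candidate-pair idea concrete, here is a small pure-Python sketch. It is a simplified model of MinHash with several hash tables, not Spark's implementation: two records become a candidate pair as soon as they share a bucket in any one table. The sample records, the hash family, and all function names are assumptions made for illustration.

```python
import itertools
import random
import zlib
from collections import defaultdict

PRIME = 2_147_483_647  # modulus for the illustrative hash family (a*x + b) mod PRIME


def shingles(text, k=3):
    """Return the set of character k-grams of `text`, each mapped to an int."""
    text = text.lower()
    return {zlib.crc32(text[i:i + k].encode()) for i in range(len(text) - k + 1)}


def make_hash_params(num_tables, seed=42):
    """Create one (a, b) hash function per table from a fixed seed."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_tables)]


def minhash_signature(tokens, params):
    """For each hash function, keep the minimum hash value over all tokens."""
    return [min((a * t + b) % PRIME for t in tokens) for a, b in params]


def candidate_pairs(records, num_tables):
    """Pairs of record indexes that collide in at least one hash table."""
    params = make_hash_params(num_tables)
    buckets = defaultdict(list)
    for idx, text in enumerate(records):
        signature = minhash_signature(shingles(text), params)
        for table, value in enumerate(signature):
            buckets[(table, value)].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(itertools.combinations(members, 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample data, only to show how the candidate count moves.
    records = [
        "Acme Corporation Ltd",
        "ACME Corporation Limited",
        "Acme Corp Ltd",
        "Globex Industries",
        "Globex Industries Inc",
        "Initech Software",
    ]
    for tables in (1, 4, 16):
        print(tables, "tables ->", len(candidate_pairs(records, tables)), "candidate pairs")
```

In this model, adding hash tables can only add candidate pairs, never remove them, which is one way a self-join over millions of rows ends up scoring far more pairs than it keeps.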

  • 0 kudos
1 More Replies
saurabh_aher
by New Contributor III
  • 3748 Views
  • 9 replies
  • 1 kudos

RECURSION_ROW_LIMIT - how to increase more than 1M ?

I have a use case that requires more than 1M rows, but recursion is limited to 1M. How can I increase this limit in a recursive CTE?

Latest Reply
KapilPatil
New Contributor II
  • 1 kudos

Hi saurabh_aher, I was also facing the same issue. I resolved it by adding the LIMIT ALL clause to the SELECT statement that queries the recursive CTE. Additionally, the Databricks Runtime (DBR) version must be 17.2 or above.
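A rough sketch of what that could look like from a notebook follows. The table org_edges(parent_id, child_id), the CTE name, and the helper function are hypothetical, and placing LIMIT ALL on the outer SELECT is an assumption based on the description above, so check it against the documentation for your runtime before relying on it.

```python
def build_descendants_query(root_id, lift_row_limit=True):
    """Build a recursive CTE query over a hypothetical org_edges table.

    When lift_row_limit is True, LIMIT ALL is appended to the outer SELECT
    that reads from the recursive CTE (assumed placement, per the reply above).
    """
    query = f"""
WITH RECURSIVE descendants (node_id, depth) AS (
  -- anchor member: direct children of the root node
  SELECT child_id, 1 FROM org_edges WHERE parent_id = {int(root_id)}
  UNION ALL
  -- recursive member: children of the nodes found so far
  SELECT e.child_id, d.depth + 1
  FROM org_edges AS e
  JOIN descendants AS d ON e.parent_id = d.node_id
)
SELECT node_id, depth FROM descendants
""".strip()
    if lift_row_limit:
        query += "\nLIMIT ALL"
    return query


if __name__ == "__main__":
    print(build_descendants_query(1))
    # In a Databricks notebook, where `spark` is predefined, this could be run as:
    # df = spark.sql(build_descendants_query(1))
```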

  • 1 kudos
8 More Replies
NehaR
by New Contributor III
  • 2535 Views
  • 5 replies
  • 1 kudos

Way to enforce partition column in where clause

Hi All, I want to know whether it is possible to enforce that all queries include a partition filter when the Delta table is partitioned in Databricks. I tried the below option and set the required property, but it doesn't work and I can still query...

Data Engineering
databricks delta table
Delta table
partition
Latest Reply
balajij8
Contributor III
  • 1 kudos

Liquid clustering is flexible and handles most of these issues automatically. You can use liquid clustering instead of forcing teams to use a partition filter.

  • 1 kudos
4 More Replies
Danish11052000
by Contributor
  • 636 Views
  • 1 replies
  • 0 kudos

Resolved! How should I correctly extract the full table name from request_params in audit logs?

I’m trying to build a UC usage/refresh tracking table for every workspace. For each workspace, I want to know how many times a UC table was refreshed or accessed each month. To do this, I’m reading the Databricks audit logs and I need to extract only...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @Danish11052000, Is there a reason you prefer building your own table for this? I'm asking because there are simpler and more reliable patterns than hand-parsing. If the account has system tables enabled, you can query system.access.audit directly...

  • 0 kudos
Skcmsa007
by New Contributor
  • 702 Views
  • 1 replies
  • 0 kudos

Databricks app 504 upstream request timeout

I have deployed my FastAPI application in Databricks Apps with a keep-alive timeout of 1200. Issue: from Databricks Swagger I get a 504 "upstream request timeout" after 2 minutes, while my API takes 3 minutes to respond. But in the backend my task got...

Latest Reply
Lu_Wang_ENB_DBX
Databricks Employee
  • 0 kudos

TLDR: You cannot increase the upstream gateway timeout in Databricks Apps. The best practice and quick solution to handle operations that take longer than the gateway limit is to implement a "status pull" (polling) pattern. Why the Timeout Occurs Data...
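The polling pattern can be sketched in plain Python. This is a minimal illustration, not Databricks Apps or FastAPI code: the in-memory job store and the names submit, get_status, and poll_until_done are hypothetical. In a real app, submit and get_status would sit behind two separate HTTP endpoints, and the job store would need to survive restarts.

```python
import threading
import time
import uuid

_jobs = {}                 # job_id -> {"status": ..., "result": ...}
_lock = threading.Lock()   # guards _jobs across worker threads


def slow_task(seconds):
    """Stand-in for the long-running operation."""
    time.sleep(seconds)
    return f"finished after {seconds}s"


def submit(seconds):
    """Start the task in the background and return a job id immediately."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"status": "running", "result": None}

    def worker():
        try:
            update = {"status": "done", "result": slow_task(seconds)}
        except Exception as exc:  # record the failure instead of losing it
            update = {"status": "failed", "result": str(exc)}
        with _lock:
            _jobs[job_id] = update

    threading.Thread(target=worker, daemon=True).start()
    return job_id


def get_status(job_id):
    """Cheap status lookup; this is what the client calls repeatedly."""
    with _lock:
        return dict(_jobs.get(job_id, {"status": "unknown", "result": None}))


def poll_until_done(job_id, interval=0.05, timeout=5.0):
    """Client side: ask for the status until the job finishes or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_status(job_id)
        if state["status"] in ("done", "failed"):
            return state
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")


if __name__ == "__main__":
    job = submit(0.2)
    print(poll_until_done(job))
```

Each individual request returns quickly, so no single call has to outlive the gateway limit; only the background work runs long.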

  • 0 kudos
Hsn
by New Contributor II
  • 925 Views
  • 5 replies
  • 1 kudos

Resolved! Suggest about data engineer

Hey, I'm Hasan Sayyed, currently pursuing SYBCA. I want to become a Data Engineer, but as a beginner, I’ve wasted some time learning other languages and technologies due to a lack of proper knowledge about this field. If someone could guide and teach...

Latest Reply
xandermuchanga
New Contributor II
  • 1 kudos

2x

  • 1 kudos
4 More Replies
raimundovidal
by New Contributor II
  • 478 Views
  • 1 replies
  • 0 kudos

Resolved! Managed File Events: Are reads from the file events cache independent per pipeline?

We have two Databricks workspaces (staging and production) attached to the same Unity Catalog metastore. Both workspaces run DLT pipelines that use Auto Loader with cloudFiles.useManagedFileEvents = "true" to ingest from the same external location (sa...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @raimundovidal, You’re safe to run both staging and production Lakeflow Spark Declarative Pipelines with cloudFiles.useManagedFileEvents = "true" against the same external location (same S3 path) and same Unity Catalog metastore, as long as each p...

  • 0 kudos
Eibraao
by New Contributor II
  • 923 Views
  • 6 replies
  • 0 kudos

Disable the dashboard sharing field for dashboard creators

How can I disable the dashboard sharing field for dashboard creators who are not admins? I tried changing the creator’s permission from 'CAN_MANAGE' to 'CAN_READ', but it had no effect: the creator still retains the 'CAN_MANAGE' permission.

Latest Reply
Eibraao
New Contributor II
  • 0 kudos

 

  • 0 kudos
5 More Replies
Datalight
by Contributor
  • 965 Views
  • 2 replies
  • 0 kudos

Resolved! Design Oracle Fusion SCM to Azure Databricks

Hello Techie, I am planning to migrate all modules of Oracle Fusion SCM data to Azure Databricks. Is BICC (Business Intelligence Cloud Connector) the only option, or are other options available? Can anyone please help me with reference architecture...

Latest Reply
Datalight
Contributor
  • 0 kudos

@mark_ott: Thanks a ton. Sorry for the late reply; the client was not sure about the approach. Your solution helps a lot. Thanks again.

  • 0 kudos
1 More Replies
rvo19941
by Databricks Partner
  • 5852 Views
  • 3 replies
  • 0 kudos

Auto Loader File Notification Mode not working with ADLS Gen2 and files written as a stream

Dear all, I am working on a real-time use case and am therefore using Auto Loader with file notification to ingest JSON files from a Gen2 Azure Storage Account in real time. Full refreshes of my table work fine, but I noticed Auto Loader was not picking up...

Data Engineering
ADLS
Auto Loader
Event Subscription
File Notification
Queue Storage
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Auto Loader file notification in Databricks relies on Azure Event Grid’s BlobCreated event to trigger notifications for newly created files in Azure Data Lake Gen2. The issue you’re experiencing is a known limitation when files are written via certai...

  • 0 kudos
2 More Replies
janm2
by New Contributor II
  • 2822 Views
  • 6 replies
  • 1 kudos

Auto Loader cleanSource option does not have any effect

Hello everyone, I was very keen to try out Auto Loader's new cleanSource option so we can clean up our landing folder easily. However, I found out it does not have any effect whatsoever. As I cannot create a support case, I am creating this post. A sim...

Latest Reply
awhorton
New Contributor II
  • 1 kudos

I had the same issue, which was caused by colons in the filenames. It failed silently in the app, but log4j contained warnings like this: 26/02/20 07:11:07 WARN CleanSourceFileMover: [queryId = f0e53] Unexpected exception when cleaning: /Volumes/prod/...
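One way to check a landing folder for this is to scan file names for colons before relying on cleanSource. The snippet below is a small sketch that only inspects path strings; the example paths are hypothetical, and whether colons are the cause in any given setup is an assumption to verify against your own logs.

```python
from pathlib import PurePosixPath


def files_with_colons(paths):
    """Return the paths whose final component (the file name) contains a colon."""
    return [p for p in paths if ":" in PurePosixPath(p).name]


if __name__ == "__main__":
    # Hypothetical file listing, e.g. collected from a directory listing call.
    listing = [
        "/Volumes/example/landing/events_2026-02-20T07:11:07.json",
        "/Volumes/example/landing/events_2026-02-20T07-11-07.json",
    ]
    for path in files_with_colons(listing):
        print("colon in file name:", path)
```

Only the file name is checked, so a colon elsewhere in the path, such as in a URI scheme, is not flagged.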

  • 1 kudos
5 More Replies
Sneeze7432
by New Contributor III
  • 5932 Views
  • 14 replies
  • 2 kudos

File Trigger Not Triggering Multiple Runs

I have a job with one task, which is to run a notebook. The job is set up with a File arrival trigger with my blob storage as the location. The trigger works and the job will start when a new file arrives in the location, but it does not run for ...

Latest Reply
maddy08
New Contributor II
  • 2 kudos

@Sneeze7432 did you solve this? The file arrival trigger groups the files when it executes; I verified this with the Databricks team. You may encounter a multiple-source-matched error during MERGE operations. To overcome this, it’s better to APPEND only into the raw/bronze layer, ...

  • 2 kudos
13 More Replies
Shimon
by New Contributor II
  • 1400 Views
  • 3 replies
  • 0 kudos

Jackson version conflict

Hi, I am trying to implement the Spark TableProvider API and I am experiencing a JAR conflict (I am using the 17.3 runtime). com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.15.2 requires Jackson Databind version >= 2.15.0 and < 2.1...

Latest Reply
emanuele_m
Databricks Employee
  • 0 kudos

Hi, this problem occurs if you have dynamic module registration, e.g. new ObjectMapper().findAndRegisterModules(), and the way to solve it is to use something like this instead: val jsonMapper = new ObjectMapper() jsonMapper.registerModule(DefaultScalaModu...

  • 0 kudos
2 More Replies
hello_world
by Databricks Partner
  • 4768 Views
  • 2 replies
  • 5 kudos

What is the purpose of the USAGE privilege?

I watched a couple of courses on Databricks Academy, none of which clearly explains or demonstrates the purpose of the USAGE privilege. USAGE: does not give any abilities, but is an additional requirement to perform any action on a schema object. I hav...

Latest Reply
Celebal2
Databricks Partner
  • 5 kudos

In Databricks (Unity Catalog), USAGE is a basic access privilege that allows a user to access a container object but not read or modify data inside it. Think of it like: “Permission to enter the building, but not open any rooms.”

  • 5 kudos
1 More Replies
dan11
by New Contributor II
  • 7098 Views
  • 5 replies
  • 1 kudos

sql delete?

Hello Databricks people, I started working with Databricks today. I have a SQL script which I developed with sqlite3 on a laptop. I want to port the script to Databricks. I started with two SQL statements: select count(prop_id) from prop0; del...

Latest Reply
oliverstonez
New Contributor III
  • 1 kudos

You aren't doing anything wrong logically, but Databricks requires row-level changes to happen on Delta Lake tables. Standard Spark tables (like those backed by raw Parquet) are often immutable. Have a look at the Language Manual for DELETE to ensure...

  • 1 kudos
4 More Replies