Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

datastrange
by New Contributor
  • 472 Views
  • 1 reply
  • 1 kudos

Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks?

We're building a lakehouse on Azure Databricks with Unity Catalog. Our data lands in Azure Data Lake Storage Gen2 (Hierarchical Namespace enabled) as JSON files. The challenge is multi-tenancy: each tenant's data is written to a separate container in...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @datastrange, Great question -- this is a common architectural challenge in multi-tenant Azure Databricks environments, and you have already identified the key constraint: Auto Loader does not support wildcards in the container portion of the abfs...
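Since the container name cannot be wildcarded, a common workaround is to fan out one Auto Loader stream per tenant container. Below is a minimal sketch of that loop's plumbing; the storage account name, checkpoint root, and container naming scheme are all hypothetical stand-ins, and the actual `spark.readStream` call (which only runs inside Databricks) is shown as comments.

```python
# Sketch: one Auto Loader stream per tenant container.
# Assumptions (hypothetical): storage account "mylake", containers named
# "tenant-<id>", and a Volumes path used as the checkpoint root.

STORAGE_ACCOUNT = "mylake"                          # hypothetical
CHECKPOINT_ROOT = "/Volumes/main/ops/checkpoints"   # hypothetical

def tenant_source_path(container: str) -> str:
    """abfss URI for one tenant container; Auto Loader cannot wildcard this part."""
    return f"abfss://{container}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"

def stream_config(container: str) -> dict:
    """Options you would pass to spark.readStream.format('cloudFiles')."""
    return {
        "cloudFiles.format": "json",
        "path": tenant_source_path(container),
        "checkpointLocation": f"{CHECKPOINT_ROOT}/{container}",
    }

# Inside a Databricks notebook you would then loop over the tenant list:
# for c in containers:
#     cfg = stream_config(c)
#     (spark.readStream.format("cloudFiles")
#          .option("cloudFiles.format", cfg["cloudFiles.format"])
#          .load(cfg["path"])
#          .writeStream.option("checkpointLocation", cfg["checkpointLocation"])
#          .toTable(f"main.bronze.{c.replace('-', '_')}"))
```

Each tenant keeps an isolated checkpoint, so adding or removing a container does not disturb the other streams.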

  • 1 kudos
aonurdemir
by Contributor
  • 921 Views
  • 3 replies
  • 1 kudos

Resolved! Conflict between Predictive Optimization and High Frequency Writes

(Dear moderators, why did you remove this question? It is a genuine question. Please do not.) We have a continuous DLT pipeline with tables that update every minute and are partitioned by a "partition_key" column. The table is 4 TB and has 16k files. Sometimes w...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @aonurdemir, This is a well-known conflict pattern in Delta Lake, and the root cause is clearly documented. Let me break it down and give you the concrete options. ROOT CAUSE The Databricks documentation on isolation levels and write conflicts exp...
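One of the concrete options for this class of conflict is to retry the colliding write with backoff. The sketch below is a generic retry wrapper, not Databricks-specific: the exception type is a stand-in (in a real pipeline you would catch the Delta concurrent-modification exception your runtime raises), and the delays are illustrative.

```python
# Generic retry-with-backoff for transient write conflicts (e.g. an OPTIMIZE
# triggered by predictive optimization colliding with a frequent writer).
# `retriable` defaults to RuntimeError as a stand-in for the real Delta
# conflict exception class.
import time

def write_with_retry(write_fn, max_attempts=5, base_delay=1.0,
                     retriable=(RuntimeError,)):
    """Call write_fn, retrying with exponential backoff on conflict errors."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise                      # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))
```

Retries only mask the contention; the structural fixes (disabling background optimization on the hot table, or separating write paths) remove it.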

2 More Replies
damirg
by New Contributor
  • 484 Views
  • 3 replies
  • 0 kudos

Switching Branches using code in notebooks?

Hi, I’m working on a project in a Databricks notebook and I’m trying to implement the following workflow: (1) create a new branch from Python code; (2) in the next cell, switch the notebook to that newly created branch. I’m able to create the branch without issues...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi, Great question! Yes, you can switch Git branches programmatically in Databricks -- there are a few approaches depending on your use case. OPTION 1: DATABRICKS PYTHON SDK (RECOMMENDED FOR NOTEBOOKS) The simplest approach from within a notebook is ...
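For a dependency-free route, the Repos REST API also covers this: a `PATCH` to `/api/2.0/repos/{repo_id}` with a `branch` field checks the repo out on that branch. The sketch below builds the request with stdlib `urllib`; the host, repo id, and token are placeholders you would supply from your workspace.

```python
# Sketch: switch a Databricks Repo to a branch via the Repos REST API
# (PATCH /api/2.0/repos/{repo_id}). host/repo_id/token are assumptions.
import json
import urllib.request

def branch_switch_request(host: str, repo_id: int, branch: str,
                          token: str) -> urllib.request.Request:
    """Build the PATCH request; pass it to urllib.request.urlopen to execute."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/repos/{repo_id}",
        data=json.dumps({"branch": branch}).encode(),
        method="PATCH",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

# req = branch_switch_request("https://adb-123.azuredatabricks.net", 456,
#                             "feature-x", TOKEN)
# urllib.request.urlopen(req)   # performs the branch switch
```

The Python SDK's `WorkspaceClient().repos.update(...)` wraps this same endpoint, so either route ends up doing the same thing.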

2 More Replies
Vivek_Patil1
by New Contributor
  • 458 Views
  • 1 reply
  • 0 kudos

Config-Driven Data Harmonization Framework in Databricks (Silver → Harmonized_Silver)

Hi Community, we are currently designing a data harmonization framework in Databricks and would appreciate insights from anyone who has implemented something similar at scale. Context: we are ingesting data from multiple source systems where: - Different...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Vivek_Patil1, Great question -- this is a pattern we see frequently in enterprise data platforms, especially in healthcare and financial services where multi-source harmonization is critical. Here is a comprehensive architecture recommendation us...
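The core of a config-driven Silver → Harmonized_Silver layer is usually a per-source mapping of raw column names to a canonical schema. Here is a deliberately tiny sketch of that idea in plain Python; the system names, columns, and config shape are invented for illustration. In Spark, each mapping entry would become a `df.select(col(src).alias(dst) ...)` projection instead of a dict comprehension.

```python
# Hypothetical harmonization config: per source system, map raw column
# names to canonical names. Unmapped columns are dropped.
HARMONIZATION_CONFIG = {
    "system_a": {"cust_id": "customer_id", "fname": "first_name"},
    "system_b": {"CustomerNumber": "customer_id", "GivenName": "first_name"},
}

def harmonize_record(source: str, record: dict) -> dict:
    """Rename a record's keys into the canonical schema for its source."""
    mapping = HARMONIZATION_CONFIG[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}
```

Keeping the config in a table or versioned file (rather than code) is what makes onboarding a new source a data change instead of a code change.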

Datalight
by Contributor
  • 732 Views
  • 2 replies
  • 0 kudos

Data Observability in Databricks

This is a very general question, more on the design side of observability. There are 500+ data pipelines built in the healthcare domain using Azure and AWS Databricks. Could someone please help me design a system to: 1. Continuously track system health and be...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Datalight, Great question, and one that many organizations at your scale face. With 500+ pipelines across both Azure and AWS, you will want a layered observability approach that combines Databricks-native capabilities. Let me walk through a pract...
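One layer that usually exists regardless of cloud is a freshness/SLA check over the tables the pipelines feed. The sketch below is intentionally platform-neutral (plain Python): it compares each table's last-update timestamp against a per-table SLA and returns the breaches. Table names and SLA values are hypothetical; in practice the timestamps would come from Delta table history or system tables.

```python
# Minimal freshness check: flag tables whose last update exceeds their SLA.
from datetime import datetime, timedelta

def freshness_breaches(last_updates: dict, sla_minutes: dict,
                       now: datetime) -> list:
    """Return the (sorted) tables whose data is older than its SLA allows."""
    breaches = []
    for table, updated_at in last_updates.items():
        sla = timedelta(minutes=sla_minutes.get(table, 60))  # default 60 min
        if now - updated_at > sla:
            breaches.append(table)
    return sorted(breaches)
```

A scheduled job running a check like this, publishing to whatever alerting both clouds share, gives a single cross-platform health signal on top of the native tooling.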

1 More Replies
swzzzsw
by Databricks Partner
  • 12846 Views
  • 5 replies
  • 9 kudos

"Run now with different parameters" - different parameters not recognized by jobs involving multiple tasks

I'm running a Databricks job involving multiple tasks and would like to run the job with a different set of task parameters. I can achieve that by editing each task and changing the parameter values. However, it gets very manual when I have a lot of tas...

Latest Reply
Dali1
New Contributor III
  • 9 kudos

Hello, has anyone found a better solution for this?
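One approach that has since become available: job-level parameters can be overridden in a single place when triggering a run via the Jobs API `run-now` endpoint, and `job_parameters` propagates to all tasks (assuming the job defines job-level parameters). The payload below is a sketch; the job id and parameter names are hypothetical.

```python
# Sketch: body for POST /api/2.1/jobs/run-now overriding job-level
# parameters for a single run, instead of editing every task by hand.
import json

def run_now_payload(job_id: int, params: dict) -> bytes:
    """JSON body for a run-now call with overridden job parameters."""
    return json.dumps({"job_id": job_id, "job_parameters": params}).encode()

# Example: run_now_payload(1234, {"env": "staging", "run_date": "2024-06-01"})
```

The same override is available interactively via "Run now with different parameters" once the parameters are defined at the job level rather than per task.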

4 More Replies
manugarri
by New Contributor II
  • 23336 Views
  • 13 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names against an internal database of company names. The client list fits in memory (it's about 10k elements) but the internal dataset is on HDFS and we use Spark ...

Latest Reply
RheaC
New Contributor II
  • 2 kudos

+1 on LLMs. I would check this article on using a similarity API instead of rapidfuzz in 2026, especially for larger/growing datasets: https://medium.com/p/0854593e380a
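For the classic non-LLM baseline, since the client list fits in memory, a broadcast-style approach works: score each internal name against the in-memory list. The sketch below uses stdlib `difflib` (rapidfuzz is a faster drop-in if available); the sample names are hypothetical, and in Spark you would broadcast the list and wrap `best_match` in a UDF.

```python
# Broadcast-style fuzzy match: the small client list stays in memory and
# every internal name is scored against it. difflib is stdlib.
from difflib import SequenceMatcher

CLIENT_NAMES = ["Acme Corp", "Globex LLC"]  # hypothetical broadcast list

def best_match(name: str, candidates=CLIENT_NAMES):
    """Return (best candidate, similarity ratio in [0, 1]) for one name."""
    def score(c):
        return SequenceMatcher(None, name.lower(), c.lower()).ratio()
    best = max(candidates, key=score)
    return best, score(best)

# In Spark: b = sc.broadcast(CLIENT_NAMES); udf over b.value; filter on ratio.
```

Thresholding the returned ratio (e.g. keep matches above 0.8) is what turns this into a join filter.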

12 More Replies
mebinjoy
by Databricks Partner
  • 6329 Views
  • 7 replies
  • 8 kudos

Resolved! Certificate not received.

I completed the Data Engineering Associate V3 certification this morning and have yet to receive my certificate. I received a mail stating that I had passed and that the certificate would be mailed.

Latest Reply
varsha2
New Contributor II
  • 8 kudos

I completed my exam last week and still have not received my certificate. Please help as soon as possible, it's really urgent.

6 More Replies
neerajaN
by New Contributor II
  • 324 Views
  • 1 reply
  • 1 kudos

Resolved! schema check

Hi, I am running the query below in Databricks. First, Job 5 is created with 10 partitions, and then Job 6 starts, where the actual processing happens. Is Job 5 identifying the schema? When will the schema check be done for the new dataset? Is it checked by dr...

schema check.png
Latest Reply
Ashwin_DSA
Databricks Employee
  • 1 kudos

Hi @neerajaN, you are right: Job 5 is the schema-inference job. You can identify Job 5 as a schema/header inference job because it triggers immediately upon spark.read. Since header=True is set without a manual .schema(), Spark must launch a job to look ...
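To make the mechanism concrete, here is a toy illustration in plain Python (not Spark) of why that extra pass exists: with `header=True` and no explicit schema, something has to read the file once just to learn the column names before the real processing can be planned. Supplying an explicit `.schema(...)` removes that extra job.

```python
# Toy header inference: read only the first row to discover column names,
# which is all Spark's header-inference job needs from the file.
import csv
import io

def infer_header(csv_text: str) -> list:
    """Return the column names from the first CSV row."""
    return next(csv.reader(io.StringIO(csv_text)))
```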

FranPérez
by New Contributor III
  • 17798 Views
  • 9 replies
  • 6 kudos

set PYTHONPATH when executing workflows

I set up a workflow using 2 tasks. Just for demo purposes, I'm using an interactive cluster for running the workflow. { "task_key": "prepare", "spark_python_task": { "python_file": "file...

Latest Reply
kenmyers-8451
Contributor II
  • 6 kudos

Just checking in again: has a way to do this appeared in the last few years? As Fran mentioned, `sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")` is not a great "fix" for the reasons already mentioned. I've found that you can do `pip ins...
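One option that avoids per-entry-point `sys.path.append` (an assumption about the setup, not the only fix): set `PYTHONPATH` in the job cluster's `spark_env_vars`, so every task launched on that cluster can import from the repo's `src` directory. The cluster spec below is a sketch; the runtime version and worker count are placeholders, and the path is the one from this thread.

```python
# Sketch of a job cluster spec that sets PYTHONPATH for all tasks via
# spark_env_vars (placeholders for runtime version and sizing).
job_cluster = {
    "spark_version": "15.4.x-scala2.12",   # hypothetical runtime
    "num_workers": 2,
    "spark_env_vars": {
        "PYTHONPATH": "/Workspace/Repos/devops/mlhub-mlops-dev/src",
    },
}
```

Packaging the code as a wheel and attaching it as a library remains the more durable alternative when the project is stable enough to version.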

8 More Replies
dsoat
by New Contributor
  • 1910 Views
  • 2 replies
  • 0 kudos

Performance Issue with MinHash + Approx Similarity Join for Fuzzy Duplicate Detection

Hello Community,We have implemented a fuzzy matching logic in Databricks using the MinHash algorithm along with the approxSimilarityJoin API to identify duplicate records in a large dataset. While the logic is working correctly, we are facing a signi...

Latest Reply
RheaC
New Contributor II
  • 0 kudos

On a dataset with millions of rows, approxSimilarityJoin(df, df, …) can become slow because it has to build a large list of candidate pairs (rows that might match) before it can score and filter them. Candidate explosion means your settings produce to...
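To see the mechanics behind that candidate set, here is a toy MinHash signature in plain Python (not the MLlib implementation): rows whose signatures collide become candidate pairs, so looser thresholds or more hash tables inflate the set that must then be exactly scored, which is where the time goes. Hash seeding via md5 is just for determinism in this sketch.

```python
# Toy MinHash: per seed, keep the minimum hash over the token set.
# Similar sets tend to share minima, which is what makes them candidates.
import hashlib

def minhash_signature(tokens, num_hashes=8) -> tuple:
    """Order-independent signature of a token set."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes)
    )
```

In practice the fixes are on the Spark side: tighten the distance threshold, pre-block on cheap keys, and deduplicate before the self-join so fewer candidates are generated at all.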

1 More Replies
saurabh_aher
by New Contributor III
  • 3104 Views
  • 9 replies
  • 1 kudos

RECURSION_ROW_LIMIT - how to increase more than 1M ?

I have a use case that requires more than 1M rows, but recursion is limited to 1M. How can I increase this limit in a recursive CTE?

saurabh_aher_0-1753944326907.png saurabh_aher_1-1753944347987.png
Latest Reply
KapilPatil
New Contributor II
  • 1 kudos

Hi saurabh_aher, I was also facing the same issue. I resolved it by adding the LIMIT ALL clause to the SELECT that consumes the recursive CTE. Additionally, the Databricks Runtime (DBR) version must be 17.2 or above.
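Spelled out, the workaround looks like the query string below (assumes DBR 17.2+; the counter CTE is a hypothetical example): `LIMIT ALL` on the consuming SELECT lifts the default 1M-row recursion guard.

```python
# Hypothetical recursive CTE generating 2M rows; LIMIT ALL on the outer
# SELECT is what lifts the RECURSION_ROW_LIMIT guard on DBR 17.2+.
RECURSIVE_QUERY = """
WITH RECURSIVE seq(n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 2000000
)
SELECT * FROM seq LIMIT ALL
"""
# In a notebook: spark.sql(RECURSIVE_QUERY)
```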

8 More Replies
NehaR
by New Contributor III
  • 2287 Views
  • 5 replies
  • 1 kudos

Way to enforce partition column in where clause

Hi all, I want to know whether it is possible to enforce that all queries include a partition filter when a Delta table is partitioned in Databricks. I tried the option below and set the required property, but it doesn't work and I can still query...

Data Engineering
databricks delta table
Delta table
partition
Latest Reply
balajij8
Contributor
  • 1 kudos

Liquid clustering is flexible and handles most of these issues automatically. You can use liquid clustering instead of forcing teams to apply partition filters.
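In DDL terms that suggestion looks roughly like the statements below; the table and column names are placeholders, and you should check the liquid clustering docs for the exact migration path from an already-partitioned table.

```python
# Hypothetical DDL: cluster on the old partition column so pruning no
# longer depends on callers writing partition predicates.
ENABLE_CLUSTERING = "ALTER TABLE main.sales.orders CLUSTER BY (order_date)"
RECLUSTER = "OPTIMIZE main.sales.orders"   # incrementally reclusters data
# In a notebook: spark.sql(ENABLE_CLUSTERING); spark.sql(RECLUSTER)
```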

4 More Replies
Danish11052000
by Contributor
  • 426 Views
  • 1 reply
  • 0 kudos

Resolved! How should I correctly extract the full table name from request_params in audit logs?

I’m trying to build a UC usage/refresh tracking table for every workspace. For each workspace, I want to know how many times a UC table was refreshed or accessed each month. To do this, I’m reading the Databricks audit logs and I need to extract only...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @Danish11052000, Is there a reason you prefer building your own table for this? I'm asking because there are simpler and more reliable patterns than hand-parsing. If the account has system tables enabled, you can query system.access.audit directly...
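If hand-parsing is still needed on top of `system.access.audit`, the wrinkle is that different actions put the table identifier under different `request_params` keys. The helper below tries a few and keeps only three-level names; the key names are assumptions based on common event shapes, so verify them against your own logs.

```python
# Hedged helper: pull a catalog.schema.table name out of an audit-log
# request_params map. Key names are assumptions; check your own events.
def full_table_name(request_params: dict):
    for key in ("full_name_arg", "name", "table_full_name"):
        value = request_params.get(key)
        if value and value.count(".") == 2:   # catalog.schema.table
            return value
    return None
```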

Skcmsa007
by New Contributor
  • 524 Views
  • 1 reply
  • 0 kudos

Databricks app 504 upstream request timeout

I have deployed my FastAPI application in Databricks Apps and set the keep-alive timeout to 1200. Issue: from the Databricks Swagger UI I am getting a 504 "upstream request timeout" after 2 minutes, while my API takes 3 minutes to respond. But in the backend my task got...

Latest Reply
Lu_Wang_ENB_DBX
Databricks Employee
  • 0 kudos

TL;DR: you cannot increase the upstream gateway timeout in Databricks Apps. The best practice and quick solution for operations that take longer than the gateway limit is to implement a "status poll" (polling) pattern. Why the timeout occurs: Data...
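A framework-free sketch of that polling pattern: the client starts the long task, immediately receives a `job_id`, and polls for status instead of holding one HTTP request open past the gateway limit. Function and field names here are illustrative; in FastAPI you would expose `start_job` behind a POST endpoint and `job_status` behind a GET endpoint (a production app would also persist job state rather than keep it in a process-local dict).

```python
# Minimal status-polling pattern: background work plus a pollable registry.
import threading
import uuid

JOBS = {}  # job_id -> {"status": ..., "result": ...}

def start_job(work, *args) -> str:
    """Kick off `work` in the background; return a job_id to poll."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "result": None}
    def run():
        JOBS[job_id]["result"] = work(*args)
        JOBS[job_id]["status"] = "done"
    threading.Thread(target=run, daemon=True).start()
    return job_id

def job_status(job_id: str) -> dict:
    """What a GET /status/{job_id} endpoint would return."""
    return JOBS.get(job_id, {"status": "unknown", "result": None})
```

The client loop is then: POST to start, then GET the status every few seconds until it reads "done" and pick up the result, each request well under the gateway timeout.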
