Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

DoredlaCharan
by New Contributor III
  • 631 Views
  • 5 replies
  • 1 kudos

MongoDB to databricks driver killed and compute re-attached

I started reading data from MongoDB using the Spark read, which uses the mongo-spark-connector. By default the sample size is 1000, meaning only 1000 documents in the collection are examined to derive the columns of the DataFrame, so I increa...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @DoredlaCharan, The root cause here is straightforward: setting sampleSize to 100,000 forces the MongoDB Spark Connector to pull 100K documents onto your driver node just for schema inference. With 100+ keys per document and mergeSchema enabled, t...

4 More Replies
DylanStout
by Contributor
  • 1187 Views
  • 1 reply
  • 0 kudos

ODBC driver installation - help needed

Hello, I’m trying to use pyodbc inside Databricks to connect to a SQL Server database, but I’m working in a restricted, offline Databricks workspace (no outbound internet). What I’ve learned so far: Databricks clusters do not include Microsoft’s ODBC D...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @DylanStout, This is worth walking through carefully. It sounds like you have already done solid research on the constraints. Let me walk you through the most likely reason your init script is hanging and provide a complete working approach for of...

ravipal-global
by New Contributor II
  • 1179 Views
  • 4 replies
  • 0 kudos

delete and reload append only delta live tables with autoloader

We have a set of streaming DLT pipelines following a medallion pattern: S3 bucket -> Auto Loader -> bronze Delta tables -> silver Delta tables -> gold Delta tables. All Delta tables are in Unity Catalog under separate schemas. We need a solutio...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @ravipal-global, I have seen this pattern before. The behavior you are seeing is expected. Let me explain why it happens and then walk through several approaches that can help you achieve delete-and-reload without requiring a full refresh of your ...

3 More Replies
Pratikmsbsvm
by Contributor
  • 2908 Views
  • 2 replies
  • 1 kudos

Data Migration from SAP S/4HANA to Databricks

Could someone please help me design the migration of SAP S/4HANA to Databricks? How should this be designed, and what do we need to consider for the LLD? 1. How should the data be extracted, and with which tool? Near-real-time replication is required. 2. Each layer for Da...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Pratikmsbsvm, Happy to help with this one. SAP S/4HANA to Databricks is one of the most common enterprise data migration scenarios, and there are several well-proven approaches depending on your requirements for data freshness, volume, and budget...

1 More Replies
bunny_9090
by New Contributor
  • 591 Views
  • 1 reply
  • 0 kudos

Precision Variance Observed in FLOAT to DOUBLE Data Migration to Delta Tables

Hi Team, We would like to bring to your attention a precision-related variance observed during data migration from our legacy platform into db Delta tables. In the legacy system, several numeric columns are defined using the FLOAT data type. During ing...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @bunny_9090, Let me walk you through this. Your analysis of the root cause is spot on. Let me expand on what is happening and walk through the recommended approaches to address it. WHY THIS HAPPENS -- IEEE 754 FLOATING-POINT REPRESENTATION Both FL...
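The reply above is cut off, so the following is only a minimal Python sketch of the general IEEE 754 effect the thread describes, not the code from the reply. It assumes the legacy FLOAT is a 32-bit value that gets widened to a 64-bit DOUBLE; the helper name `widen_float32` is hypothetical.

```python
import struct


def widen_float32(value: float) -> float:
    """Round `value` to 32-bit float precision, then return it as a 64-bit Python float.

    This mimics storing a number in a FLOAT column and reading it back as DOUBLE.
    """
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = 0.1                      # what a user typed into the legacy system
widened = widen_float32(original)   # the same value after FLOAT -> DOUBLE widening

print(repr(widened))                  # 0.10000000149011612
print(widened == original)            # False: extra digits become visible
print(round(widened, 7) == original)  # True: rounding to FLOAT's ~7 significant digits
```

The widening itself is exact: the extra digits were already part of the 32-bit value and simply become visible at DOUBLE precision. Rounding to about 7 significant digits, or comparing with a tolerance, is one way to reconcile the two sides under that assumption.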

Ham
by New Contributor II
  • 1597 Views
  • 1 reply
  • 1 kudos

Resolved! Best-practice guidance for routing Databricks SDK (Python) ingestion logs into Azure Monitor/Analytics

Hi everyone! I’m running a config-driven ingestion stack that uses the Databricks SDK (Python notebooks + GitHub Actions). All logging currently uses the standard Python logging module inside notebooks/jobs (example: ingest.py, logger.py). I’d like to ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Ham, This is a common scenario, and there are good solutions. There are several layers to getting "Databricks SDK (Python) ingestion logs" into Azure Monitor, depending on exactly which logs you need. I will walk through each approach from simple...

deployment_fail
by New Contributor
  • 782 Views
  • 2 replies
  • 0 kudos

CONVERT TO DELTA fails to merge file schema

I have a directory of Parquet files in Azure Data Lake Storage that I want to convert to a Delta Lake table. I run this: CONVERT TO DELTA parquet.`abfss://container@storage_account.dfs.core.windows.net/directory_name`; But it throws this error: "SparkE...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @deployment_fail, Good timing on this question. Let me explain what is happening and walk you through several approaches to resolve it. WHAT IS HAPPENING When you run CONVERT TO DELTA, Databricks reads the Parquet footer metadata from every file i...

1 More Replies
ramsai
by New Contributor II
  • 351 Views
  • 2 replies
  • 0 kudos

Jobs

Is there a way to find out how many workers or cores are being utilized in a job cluster? If so, could you please explain how to check this?

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @ramsai, Great question! There are several ways to check how many workers and cores are being utilized in a Databricks job cluster. I will walk through each option from simplest to most advanced. OPTION 1: CLUSTER METRICS TAB (QUICKEST WAY) While ...

1 More Replies
datastrange
by New Contributor
  • 691 Views
  • 1 reply
  • 1 kudos

Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks?

We're building a lakehouse on Azure Databricks with Unity Catalog. Our data lands in Azure Data Lake Storage Gen2 (Hierarchical Namespace enabled) as JSON files. The challenge is multi-tenancy: each tenant's data is written to a separate container in...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @datastrange, Great question -- this is a common architectural challenge in multi-tenant Azure Databricks environments, and you have already identified the key constraint: Auto Loader does not support wildcards in the container portion of the abfs...

aonurdemir
by Contributor
  • 1278 Views
  • 3 replies
  • 1 kudos

Resolved! Conflict between Predictive Optimization and High Frequency Writes

(Dear Moderators, why do you remove this question? It is a genuine question. Please do not.) We have a continuous DLT pipeline whose tables update every minute and are partitioned by the "partition_key" column. The table is 4 TB and has 16k files. Sometimes w...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @aonurdemir, This is a well-known conflict pattern in Delta Lake, and the root cause is clearly documented. Let me break it down and give you the concrete options. ROOT CAUSE The Databricks documentation on isolation levels and write conflicts exp...

2 More Replies
damirg
by New Contributor
  • 929 Views
  • 3 replies
  • 0 kudos

Switching Branches using code in notebooks?

Hi, I’m working on a project in a Databricks notebook and I’m trying to implement the following workflow: create a new branch from Python code, then, in the next cell, switch the notebook to that newly created branch. I’m able to create the branch without issues...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi, Great question! Yes, you can switch Git branches programmatically in Databricks -- there are a few approaches depending on your use case. OPTION 1: DATABRICKS PYTHON SDK (RECOMMENDED FOR NOTEBOOKS) The simplest approach from within a notebook is ...

2 More Replies
Vivek_Patil1
by New Contributor
  • 619 Views
  • 1 reply
  • 0 kudos

Config-Driven Data Harmonization Framework in Databricks (Silver → Harmonized_Silver)

Hi Community, We are currently designing a Data Harmonization framework in Databricks and would appreciate insights from anyone who has implemented something similar at scale. Context: We are ingesting data from multiple source systems where: - Different...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Vivek_Patil1, Great question -- this is a pattern we see frequently in enterprise data platforms, especially in healthcare and financial services where multi-source harmonization is critical. Here is a comprehensive architecture recommendation us...

Datalight
by Contributor
  • 1255 Views
  • 2 replies
  • 0 kudos

Data Observability in Databricks

This is a very general question, more on the design side of observability. There are 500+ data pipelines built in the healthcare domain using Azure and AWS Databricks. Could someone please help me design a system to: 1. Continuously track system health and be...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Datalight, Great question, and one that many organizations at your scale face. With 500+ pipelines across both Azure and AWS, you will want a layered observability approach that combines Databricks-native capabilities. Let me walk through a pract...

1 More Replies
swzzzsw
by Databricks Partner
  • 13053 Views
  • 5 replies
  • 9 kudos

"Run now with different parameters" - different parameters not recognized by jobs involving multiple tasks

I'm running a Databricks job involving multiple tasks and would like to run the job with a different set of task parameters. I can achieve that by editing each task and changing the parameter values. However, it gets very manual when I have a lot of tas...

Latest Reply
Dali1
New Contributor III
  • 9 kudos

Hello, has anyone found a better solution for this?

4 More Replies
manugarri
by New Contributor II
  • 23977 Views
  • 13 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements) but the internal dataset is on HDFS and we use Spark ...

Latest Reply
RheaC
New Contributor II
  • 2 kudos

+1 on LLMs. I would check this article on using Similarity API instead of rapidfuzz in 2026 especially for larger/growing datasets https://medium.com/p/0854593e380a
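For readers new to the thread, here is a minimal sketch of the basic idea being discussed, using only Python's standard-library `difflib` rather than rapidfuzz or any external API. The function name `best_match`, the sample company names, and the 0.6 cutoff are all illustrative assumptions, not something taken from the thread.

```python
import difflib
from typing import Optional, Sequence


def best_match(name: str, candidates: Sequence[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the candidate most similar to `name`, or None if nothing clears `cutoff`.

    Names are lower-cased and stripped before comparison; the original
    spelling of the winning candidate is returned.
    """
    normalized = {c.lower().strip(): c for c in candidates}
    hits = difflib.get_close_matches(
        name.lower().strip(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Hypothetical sample data standing in for the internal company list.
internal_names = ["ACME Corporation", "Globex Inc", "Initech LLC"]

print(best_match("Acme Corp", internal_names))
print(best_match("Zzzz", internal_names))
```

Because the client list is described as fitting in memory, a function along these lines could in principle be applied per internal record with the small list broadcast to the workers, though pairwise comparison grows expensive quickly and the thread's replies point to more scalable options.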

12 More Replies