Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Pratikmsbsvm
by Contributor
  • 3423 Views
  • 2 replies
  • 1 kudos

Data Migration from SAP S/4HANA to Databricks

Could someone please help me design the migration of SAP S/4HANA to Databricks? How should this be designed, and what do we need to consider in the LLD? 1. How does the data need to be extracted, and with which tool? Near-real-time replication is required. 2. Each layer for Da...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Pratikmsbsvm, Happy to help with this one. SAP S/4HANA to Databricks is one of the most common enterprise data migration scenarios, and there are several well-proven approaches depending on your requirements for data freshness, volume, and budget...

1 More Replies
bunny_9090
by New Contributor
  • 684 Views
  • 1 reply
  • 0 kudos

Precision Variance Observed in FLOAT to DOUBLE Data Migration to Delta Tables

Hi Team, We would like to bring to your attention a precision-related variance observed during data migration from our legacy platform into db Delta tables. In the legacy system, several numeric columns are defined using the FLOAT data type. During ing...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @bunny_9090, Let me walk you through this. Your analysis of the root cause is spot on. Let me expand on what is happening and walk through the recommended approaches to address it. WHY THIS HAPPENS -- IEEE 754 FLOATING-POINT REPRESENTATION Both FL...
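A minimal sketch of the variance described in this thread, assuming it is the usual effect of widening a 32-bit FLOAT to a 64-bit DOUBLE. The reply above is truncated, so this is an illustration rather than the reply's own code; the helper name `widen_float32` and the sample value 0.1 are hypothetical.

```python
import struct


def widen_float32(value: float) -> float:
    """Store value as a 32-bit float, then read it back as a Python float (64-bit)."""
    packed = struct.pack("<f", value)      # round to the nearest 32-bit float
    return struct.unpack("<f", packed)[0]  # widen back to a 64-bit double


legacy_value = 0.1                      # hypothetical value held in a FLOAT column
widened = widen_float32(legacy_value)   # what a DOUBLE column receives after widening

print(repr(widened))                       # shows the extra digits exposed by widening
print(widened == legacy_value)             # False: the float32 rounding error is now visible
print(abs(widened - legacy_value) < 1e-7)  # True: the difference itself is tiny
```

If a comparison only needs to hold to a fixed number of decimals, checking within a tolerance is one option; whether that suits this migration depends on details not shown in the snippet.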

Ham
by New Contributor II
  • 1874 Views
  • 1 reply
  • 1 kudos

Resolved! Best-practice guidance for routing Databricks SDK (Python) ingestion logs into Azure Monitor/Analytics

Hi everyone! I’m running a config-driven ingestion stack that uses the Databricks SDK (Python notebooks + GitHub Actions). All logging currently uses the standard Python logging module inside notebooks/jobs (example: ingest.py, logger.py). I’d like to ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Ham, This is a common scenario, and there are good solutions. There are several layers to getting "Databricks SDK (Python) ingestion logs" into Azure Monitor, depending on exactly which logs you need. I will walk through each approach from simple...

deployment_fail
by New Contributor
  • 918 Views
  • 2 replies
  • 0 kudos

CONVERT TO DELTA fails to merge file schema

I have a directory of Parquet files in Azure Data Lake Storage that I want to convert to a Delta Lake table. I run this: CONVERT TO DELTA parquet.`abfss://container@storage_account.dfs.core.windows.net/directory_name`; But it throws this error: "SparkE...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @deployment_fail, Good timing on this question. Let me explain what is happening and walk you through several approaches to resolve it. WHAT IS HAPPENING When you run CONVERT TO DELTA, Databricks reads the Parquet footer metadata from every file i...

1 More Replies
ramsai
by New Contributor II
  • 374 Views
  • 2 replies
  • 0 kudos

Jobs

Is there a way to find out how many workers or cores are being utilized in a job cluster? If so, could you please explain how to check this?

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @ramsai, Great question! There are several ways to check how many workers and cores are being utilized in a Databricks job cluster. I will walk through each option from simplest to most advanced. OPTION 1: CLUSTER METRICS TAB (QUICKEST WAY) While ...

1 More Replies
datastrange
by New Contributor
  • 787 Views
  • 1 reply
  • 1 kudos

Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks?

We're building a lakehouse on Azure Databricks with Unity Catalog. Our data lands in Azure Data Lake Storage Gen2 (Hierarchical Namespace enabled) as JSON files. The challenge is multi-tenancy: each tenant's data is written to a separate container in...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @datastrange, Great question -- this is a common architectural challenge in multi-tenant Azure Databricks environments, and you have already identified the key constraint: Auto Loader does not support wildcards in the container portion of the abfs...

aonurdemir
by Contributor
  • 1417 Views
  • 3 replies
  • 1 kudos

Resolved! Conflict between Predictive Optimization and High Frequency Writes

(Dear moderators, why was this question removed? It is a genuine question; please do not remove it.) We have a continuous DLT pipeline whose tables update every minute and are partitioned by the "partition_key" column. The table is 4 TB and has 16k files. Sometimes w...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @aonurdemir, This is a well-known conflict pattern in Delta Lake, and the root cause is clearly documented. Let me break it down and give you the concrete options. ROOT CAUSE The Databricks documentation on isolation levels and write conflicts exp...

2 More Replies
damirg
by New Contributor
  • 1140 Views
  • 3 replies
  • 0 kudos

Switching Branches using code in notebooks?

Hi, I’m working on a project in a Databricks notebook and I’m trying to implement the following workflow: (1) create a new branch from Python code; (2) in the next cell, switch the notebook to that newly created branch. I’m able to create the branch without issues...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi, Great question! Yes, you can switch Git branches programmatically in Databricks -- there are a few approaches depending on your use case. OPTION 1: DATABRICKS PYTHON SDK (RECOMMENDED FOR NOTEBOOKS) The simplest approach from within a notebook is ...

2 More Replies
Vivek_Patil1
by New Contributor
  • 670 Views
  • 1 reply
  • 0 kudos

Config-Driven Data Harmonization Framework in Databricks (Silver → Harmonized_Silver)

Hi Community, We are currently designing a Data Harmonization framework in Databricks and would appreciate insights from anyone who has implemented something similar at scale. Context: We are ingesting data from multiple source systems where: - Different...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Vivek_Patil1, Great question -- this is a pattern we see frequently in enterprise data platforms, especially in healthcare and financial services where multi-source harmonization is critical. Here is a comprehensive architecture recommendation us...

Datalight
by Contributor
  • 1459 Views
  • 2 replies
  • 0 kudos

Data Observability in Databricks

This is a very general question, more on the design side of observability. There are 500+ data pipelines built in the healthcare domain using Azure and AWS Databricks. Could someone please help me design a system to: 1. Continuously track system health and be...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Datalight, Great question, and one that many organizations at your scale face. With 500+ pipelines across both Azure and AWS, you will want a layered observability approach that combines Databricks-native capabilities. Let me walk through a pract...

1 More Replies
swzzzsw
by Databricks Partner
  • 13133 Views
  • 5 replies
  • 9 kudos

"Run now with different parameters" - different parameters not recognized by jobs involving multiple tasks

I'm running a Databricks job involving multiple tasks and would like to run the job with a different set of task parameters. I can achieve that by editing each task and changing the parameter values. However, it gets very manual when I have a lot of tas...

Latest Reply
Dali1
New Contributor III
  • 9 kudos

Hello, has anyone found a better solution for this?

4 More Replies
manugarri
by New Contributor II
  • 24199 Views
  • 13 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements) but the internal dataset is on HDFS and we use Spark ...

Latest Reply
RheaC
New Contributor II
  • 2 kudos

+1 on LLMs. I would check this article on using Similarity API instead of rapidfuzz in 2026 especially for larger/growing datasets https://medium.com/p/0854593e380a
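As a rough illustration of the matching problem in this thread, here is a small local sketch that uses Python's standard-library `difflib` rather than rapidfuzz, an LLM, or Spark. The function name `best_match`, the 0.8 cutoff, and the sample company names are all hypothetical; it only shows the idea of scoring one name against a small in-memory list.

```python
import difflib
from typing import Optional, Sequence


def best_match(name: str, candidates: Sequence[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the candidate most similar to name, or None if nothing reaches the cutoff."""
    # Normalize case and surrounding whitespace, remembering the original spelling.
    normalized = {c.lower().strip(): c for c in candidates}
    hits = difflib.get_close_matches(
        name.lower().strip(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Hypothetical sample data, standing in for the small client-provided list.
client_names = ["Acme Corporation", "Globex Inc", "Initech LLC"]

print(best_match("acme corporaton", client_names))  # misspelled input still finds a match
print(best_match("zzzz", client_names))             # nothing similar enough, so None
```

Because the client list fits in memory, one option is to ship it to the workers (for example as a broadcast variable) and apply a function like this per row of the larger dataset; that wiring is not shown here and would need tuning for real data volumes.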

12 More Replies
mebinjoy
by Databricks Partner
  • 6566 Views
  • 7 replies
  • 8 kudos

Resolved! Certificate not received.

I completed the Data Engineering Associate V3 certification this morning and have yet to receive my certificate. I received an email stating that I had passed and that the certificate would be mailed.

Latest Reply
varsha2
New Contributor II
  • 8 kudos

I completed my exam last week and still have not received my certificate. Please help as soon as possible. It's really urgent.

6 More Replies
neerajaN
by New Contributor II
  • 441 Views
  • 1 reply
  • 1 kudos

Resolved! schema check

Hi, I am running the query below in Databricks. First, job 5 was created with 10 partitions, and then job 6 started, where the actual processing began. Is job 5 identifying the schema? When is the schema check done for the new dataset? Is it checked by dr...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 1 kudos

Hi @neerajaN, You are right. Job 5 is a schema inference job. You can identify Job 5 as a schema/header inference job because it triggers immediately upon spark.read. Since header=True is set without a manual .schema(), Spark must launch a job to look ...

FranPérez
by New Contributor III
  • 18463 Views
  • 9 replies
  • 6 kudos

set PYTHONPATH when executing workflows

I set up a workflow using 2 tasks. Just for demo purposes, I'm using an interactive cluster for running the workflow. { "task_key": "prepare", "spark_python_task": { "python_file": "file...

Latest Reply
kenmyers-8451
Contributor II
  • 6 kudos

Just checking in again: has a way to do this emerged in the last few years? As Fran mentioned, `sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")` is not a great "fix" for the reasons already mentioned. I've found that you can do `pip ins...

8 More Replies