Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

seefoods
by Valued Contributor
  • 700 Views
  • 2 replies
  • 0 kudos

Resolved! databricks clusters failed

Hello everyone, when I run a process to parse PDFs with docling on a serverless cluster using a Python wheel, I get this error. Does anyone know what happened? Cordially. INTERNAL: [ENVIRONMENT_SETUP_ERROR.PYTHON_NOTEBOOK_ENVIRONMENT] An internal error occurred while...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @seefoods, Interesting scenario. docling is a powerful PDF parsing library and it is great that you are exploring it on Databricks. The ENVIRONMENT_SETUP_ERROR.PYTHON_NOTEBOOK_ENVIRONMENT error you are seeing is related to how serverless compute h...

1 More Replies
NishantTiwari
by New Contributor II
  • 739 Views
  • 5 replies
  • 1 kudos

Cluster Issue

Driver: c5.4xlarge · Workers: c5.4xlarge · 8 workers · On-demand and Spot · fall back to On-demand · DBR: 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12) · us-east-1c. In my Databricks job there is a step, NDS download, which we used to download files ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @NishantTiwari, I see you have already upgraded to DBR 14.3+ but are still hitting the same SSL errors. That makes sense, and here is why: the two errors you are seeing point to the 3rd party server using weak or outdated SSL certificates, not an ...

4 More Replies
QueryingQuail
by New Contributor III
  • 3435 Views
  • 6 replies
  • 1 kudos

Best practice for adding fixed metadata columns at point of ingestion

Hello all, We are currently working with ingestion of data from source systems using a mix of custom code and managed connectors (e.g. the Dynamics 365 (Synapse Link) connector) in conjunction with Auto CDC / Auto CDC from snapshot. I’m trying to unde...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @QueryingQuail, Good question -- I can see from the follow-up discussion that you are looking for practical guidance that goes beyond what a generic AI prompt would give you -- specifically how to handle this across both managed connectors (like t...

5 More Replies
DoredlaCharan
by New Contributor III
  • 679 Views
  • 5 replies
  • 1 kudos

MongoDB to databricks driver killed and compute re-attached

I started reading data from MongoDB using the Spark read, which uses the mongo-spark-connector. By default the sample size is 1000, meaning only 1000 documents in the collection are sampled to derive the columns of the DataFrame, so I increa...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @DoredlaCharan, The root cause here is straightforward: setting sampleSize to 100,000 forces the MongoDB Spark Connector to pull 100K documents onto your driver node just for schema inference. With 100+ keys per document and mergeSchema enabled, t...

4 More Replies
DylanStout
by Contributor
  • 1376 Views
  • 1 reply
  • 0 kudos

ODBC driver installation - help needed

Hello, I’m trying to use pyodbc inside Databricks to connect to a SQL Server database, but I’m working in a restricted, offline Databricks workspace (no outbound internet). What I’ve learned so far: Databricks clusters do not include Microsoft’s ODBC D...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @DylanStout, This is worth walking through carefully. It sounds like you have already done solid research on the constraints. Let me walk you through the most likely reason your init script is hanging and provide a complete working approach for of...

ravipal-global
by New Contributor II
  • 1344 Views
  • 4 replies
  • 0 kudos

delete and reload append only delta live tables with autoloader

We have a set of streaming DLT pipelines following a medallion pattern: S3 bucket -> Auto Loader -> bronze Delta tables -> silver Delta tables -> gold Delta tables. All Delta tables are in Unity Catalog under separate schemas. We need a solutio...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @ravipal-global, I have seen this pattern before. The behavior you are seeing is expected. Let me explain why it happens and then walk through several approaches that can help you achieve delete-and-reload without requiring a full refresh of your ...

3 More Replies
Pratikmsbsvm
by Contributor
  • 3452 Views
  • 2 replies
  • 1 kudos

Data Migration from SAP S/4HANA to Databricks

Could someone please help me design the migration of SAP S/4HANA to Databricks? How should this be designed, and what do we need to consider in the LLD? 1. How should the data be extracted, and with which tool? Near-real-time replication is required. 2. Each layer for Da...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Pratikmsbsvm, Happy to help with this one. SAP S/4HANA to Databricks is one of the most common enterprise data migration scenarios, and there are several well-proven approaches depending on your requirements for data freshness, volume, and budget...

1 More Replies
bunny_9090
by New Contributor
  • 694 Views
  • 1 reply
  • 0 kudos

Precision Variance Observed in FLOAT to DOUBLE Data Migration to Delta Tables

Hi Team, We would like to bring to your attention a precision-related variance observed during data migration from our legacy platform into db Delta tables. In the legacy system, several numeric columns are defined using the FLOAT data type. During ing...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @bunny_9090, Let me walk you through this. Your analysis of the root cause is spot on. Let me expand on what is happening and walk through the recommended approaches to address it. WHY THIS HAPPENS -- IEEE 754 FLOATING-POINT REPRESENTATION Both FL...
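The kind of variance discussed in this thread can be reproduced outside Databricks. The sketch below is an illustration only, assuming the legacy FLOAT column is a 32-bit IEEE 754 value and the Delta column is a 64-bit DOUBLE. The helper name `widen_float32` is hypothetical and does not come from the thread.

```python
import struct


def widen_float32(value: float) -> float:
    """Round `value` to 32-bit float precision, then return it as a 64-bit Python float.

    This mimics storing a number in a FLOAT column and later reading it as DOUBLE.
    """
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = 0.1
widened = widen_float32(original)

# The widened value exposes the float32 rounding error that was always there.
print(repr(widened))        # 0.10000000149011612
print(widened == original)  # False

# A 32-bit float carries roughly 7 significant decimal digits, so rounding
# to that precision recovers the value as it was originally written.
print(round(widened, 7) == original)  # True
```

Values that are exactly representable in binary, such as 0.5, pass through the widening unchanged, which is why only some rows show a difference.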

Ham
by New Contributor II
  • 1893 Views
  • 1 reply
  • 1 kudos

Resolved! Best-practice guidance for routing Databricks SDK (Python) ingestion logs into Azure Monitor/Analytics

Hi everyone! I’m running a config-driven ingestion stack that uses the Databricks SDK (Python notebooks + GitHub Actions). All logging currently uses the standard Python logging module inside notebooks/jobs (example: ingest.py, logger.py). I’d like to ...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @Ham, This is a common scenario, and there are good solutions. There are several layers to getting "Databricks SDK (Python) ingestion logs" into Azure Monitor, depending on exactly which logs you need. I will walk through each approach from simple...

deployment_fail
by New Contributor
  • 929 Views
  • 2 replies
  • 0 kudos

CONVERT TO DELTA fails to merge file schema

I have a directory of Parquet files in Azure Data Lake Storage that I want to convert to a Delta Lake table. I run this: CONVERT TO DELTA parquet.`abfss://container@storage_account.dfs.core.windows.net/directory_name`; But it throws this error: "SparkE...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @deployment_fail, Good timing on this question. Let me explain what is happening and walk you through several approaches to resolve it. WHAT IS HAPPENING When you run CONVERT TO DELTA, Databricks reads the Parquet footer metadata from every file i...

1 More Replies
ramsai
by New Contributor II
  • 376 Views
  • 2 replies
  • 0 kudos

Jobs

Is there a way to find out how many workers or cores are being utilized in a job cluster? If so, could you please explain how to check this?

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @ramsai, Great question! There are several ways to check how many workers and cores are being utilized in a Databricks job cluster. I will walk through each option from simplest to most advanced. OPTION 1: CLUSTER METRICS TAB (QUICKEST WAY) While ...

1 More Replies
datastrange
by New Contributor
  • 800 Views
  • 1 reply
  • 1 kudos

Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks?

We're building a lakehouse on Azure Databricks with Unity Catalog. Our data lands in Azure Data Lake Storage Gen2 (Hierarchical Namespace enabled) as JSON files. The challenge is multi-tenancy: each tenant's data is written to a separate container in...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @datastrange, Great question -- this is a common architectural challenge in multi-tenant Azure Databricks environments, and you have already identified the key constraint: Auto Loader does not support wildcards in the container portion of the abfs...

aonurdemir
by Contributor
  • 1428 Views
  • 3 replies
  • 1 kudos

Resolved! Conflict between Predictive Optimization and High Frequency Writes

(Dear Moderators, why do you remove this question? It is a genuine question. Please do not.) We have a continuous DLT pipeline whose tables update every minute and are partitioned by the "partition_key" column. The table is 4 TB and has 16k files. Sometimes w...

Latest Reply
SteveOstrowski
Databricks Employee
  • 1 kudos

Hi @aonurdemir, This is a well-known conflict pattern in Delta Lake, and the root cause is clearly documented. Let me break it down and give you the concrete options. ROOT CAUSE The Databricks documentation on isolation levels and write conflicts exp...

2 More Replies
damirg
by New Contributor
  • 1157 Views
  • 3 replies
  • 0 kudos

Switching Branches using code in notebooks?

Hi, I’m working on a project in a Databricks notebook and I’m trying to implement the following workflow: (1) create a new branch from Python code; (2) in the next cell, switch the notebook to that newly created branch. I’m able to create the branch without issues...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi, Great question! Yes, you can switch Git branches programmatically in Databricks -- there are a few approaches depending on your use case. OPTION 1: DATABRICKS PYTHON SDK (RECOMMENDED FOR NOTEBOOKS) The simplest approach from within a notebook is ...

2 More Replies
Vivek_Patil1
by New Contributor
  • 674 Views
  • 1 reply
  • 0 kudos

Config-Driven Data Harmonization Framework in Databricks (Silver → Harmonized_Silver)

Hi Community, We are currently designing a Data Harmonization framework in Databricks and would appreciate insights from anyone who has implemented something similar at scale. Context: We are ingesting data from multiple source systems where: - Different...

Latest Reply
SteveOstrowski
Databricks Employee
  • 0 kudos

Hi @Vivek_Patil1, Great question -- this is a pattern we see frequently in enterprise data platforms, especially in healthcare and financial services where multi-source harmonization is critical. Here is a comprehensive architecture recommendation us...
