Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

vartyg
by Visitor
  • 10 Views
  • 1 reply
  • 0 kudos

Scaling Declarative Streaming Pipelines for CDC from On-Prem Database to Lakehouse

We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse. Since LakeFlow Connect currently does not support direct mirroring from on...

Latest Reply
bidek56
Contributor
  • 0 kudos

Just use https://flink.apache.org

hgm251
by Visitor
  • 20 Views
  • 2 replies
  • 1 kudos

Online tables to synced tables: why is it creating a different service principal every time?

Hello! We started to move our online tables to synced_tables. We just couldn't figure out why a new service principal is created every time we run the same code we used for online tables. try: fe.create_feature_spec(name=feature_spec_name ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @hgm251, here are some things to consider. Things are working as designed: when you create a new Feature Serving or Model Serving endpoint, Databricks automatically provisions a dedicated service principal for that endpoint, and a fresh...

1 More Replies
Mathias_Peters
by Contributor II
  • 26 Views
  • 1 reply
  • 0 kudos

Reading MongoDB collections into an RDD

Hi, for a Spark job that does some custom computation, I need to read data from a MongoDB collection and access the elements as type Document. The reason is that I want to apply some custom type serialization which is already implemen...

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Greetings @Mathias_Peters, here are some suggestions for your consideration. Analysis: You're encountering a common challenge when migrating to newer versions of the MongoDB Spark Connector. The architecture changed significantly between versions 2.x ...

pooja_bhumandla
by New Contributor III
  • 12 Views
  • 1 reply
  • 0 kudos

Broadcast Join Failure in Streaming: Failed to store executor broadcast in BlockManager

Hi Databricks Community, I’m running a Structured Streaming job in Databricks with foreachBatch writing to a Delta table. Failed to store executor broadcast spark_join_relation_1622863(size = Some(67141632)) in BlockManager with storageLevel=StorageLev...

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Greetings @pooja_bhumandla, here are some helpful hints and tips. Diagnosis: Your error indicates that a broadcast join operation is attempting to send ~64MB of data to executors, but the BlockManager cannot store it due to memory constraints. This c...
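
The "~64MB" figure in this reply can be checked against the size quoted in the error message. A minimal sketch of that arithmetic, assuming the number in `size = Some(67141632)` is a byte count; the constant and function names are illustrative only:

```python
# Size reported in the error message quoted in the post (assumed to be bytes).
BROADCAST_SIZE_BYTES = 67_141_632


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


size_mib = bytes_to_mib(BROADCAST_SIZE_BYTES)
print(f"{size_mib:.2f} MiB")  # prints "64.03 MiB"
```

So the broadcast relation is 64 MiB plus 32 KiB, which matches the reply's estimate.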

pabloratache
by Visitor
  • 24 Views
  • 4 replies
  • 2 kudos

Resolved! [FREE TRIAL] Missing All-Purpose Clusters Access - New Account

Issue Description: I created a new Databricks Free Trial account ("For Work" plan with $400 credits) but I don't have access to All-Purpose Clusters or PySpark compute. My workspace only shows SQL-only features. Current Setup: - Account Email: ronel.ra...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

Ah, got it @pabloratache. I did some digging, and here is what I found (learned a few things myself). Thanks for the detailed context; this behavior is expected for the current Databricks 14‑day Free Trial (“For Work” plan). What’s happening with ...

3 More Replies
Danish11052000
by New Contributor II
  • 14 Views
  • 1 reply
  • 0 kudos

Looking for Advice: Robust Backup Strategy for Databricks System Tables

Hi, I’m planning to build a backup system for all Databricks system tables (audit, usage, price, history, etc.) to preserve data beyond retention limits. Currently, I’m using Spark Streaming with readStream + writeStream and checkpointing in LakeFlow ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Greetings @Danish11052000, here’s a pragmatic way to choose, based on the nature of Databricks system tables and the guarantees you want. Bottom line: For ongoing replication to preserve data beyond free retention, a Lakeflow Declarative Pipeline w...

SahiSammu
by New Contributor
  • 43 Views
  • 2 replies
  • 0 kudos

Resolved! Auto Loader vs Batch for Large File Loads

Hi everyone, I'm seeing a dramatic difference in processing times between batch and streaming (Auto Loader) approaches for reading about 250,000 files from S3 in Databricks. My goal is to read metadata from these files and register it as a table (even...

Data Engineering
autoloader
Directory Listing
ingestion
Latest Reply
SahiSammu
New Contributor
  • 0 kudos

Thank you, Anudeep. I plan to tune Auto Loader by increasing the maxFilesPerTrigger parameter to optimize performance. My decision to use Auto Loader is primarily driven by its built-in backup functionality via cloudFiles.cleanSource.moveDestination, ...

1 More Replies
noorbasha534
by Valued Contributor II
  • 2992 Views
  • 1 reply
  • 0 kudos

Databricks Jobs Failure Notification to Azure DevOps as incident

Dear all, Has anyone tried sending Databricks Jobs failure notifications to Azure DevOps as incidents? I see that webhook is an OOTB destination for jobs, and I am thinking of leveraging it. But I'd like to hear any success stories with it, or any other smart approaches....

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Yes, there are successful approaches and best practices for sending Databricks Job Failure notifications to Azure DevOps as incidents, primarily by leveraging the webhook feature as an out-of-the-box (OOTB) destination in Databricks Jobs. The workflo...

aonurdemir
by Contributor
  • 133 Views
  • 3 replies
  • 5 kudos

Resolved! Broken s3 file paths in File Notifications for auto loader

Suddenly, at "2025-10-23T14:12:48.409+00:00", file paths coming from the file notification queue started arriving URL-encoded. As a result, our pipeline fails with a file-not-found exception. I think something changed suddenly and broke the notification system. Here are th...

Latest Reply
K_Anudeep
Databricks Employee
  • 5 kudos

Hello @aonurdemir, Could you please re-run your pipeline now and check? This issue should be mitigated now. It is due to a recent internal bug that led to the unexpected handling of file paths with special characters. You should set ignoreMissingFile...
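
The symptom in this thread (paths arriving percent-encoded) can be illustrated with plain Python. This is only a sketch of how one might detect and decode such paths when inspecting them, not the fix Databricks applied; the helper names and the example bucket and path are hypothetical and do not come from the post:

```python
from urllib.parse import unquote


def looks_url_encoded(path: str) -> bool:
    """Return True if decoding percent-escapes would change the path."""
    return unquote(path) != path


def decode_path(path: str) -> str:
    """Decode percent-escapes (e.g. %3D -> '='); other characters are unchanged."""
    return unquote(path)


# Hypothetical example path, not taken from the original post.
encoded = "s3://example-bucket/events/dt%3D2025-10-23/file.json"
print(looks_url_encoded(encoded))  # prints "True"
print(decode_path(encoded))        # prints "s3://example-bucket/events/dt=2025-10-23/file.json"
```

One caveat with this kind of check: an object key that legitimately contains a literal percent sequence would also be rewritten, so it is better suited to diagnosis than to blind decoding in a pipeline.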

2 More Replies
der
by Contributor
  • 12 Views
  • 1 reply
  • 0 kudos

EXCEL_DATA_SOURCE_NOT_ENABLED Excel data source is not enabled in this cluster

I want to read an Excel xlsx file on DBR 17.3. On the cluster, the library dev.mauch:spark-excel_2.13:4.0.0_0.31.2 is installed. V1 implementation works fine: df = spark.read.format("dev.mauch.spark.excel").schema(schema).load(excel_file) display(df) V2...

Latest Reply
der
Contributor
  • 0 kudos

If I build the spark-excel library with another short name (example "excelv2"), everything works fine. https://github.com/nightscape/spark-excel/issues/896#issuecomment-3486861693

Dhruv-22
by Contributor II
  • 208 Views
  • 6 replies
  • 6 kudos

Reading empty json file in serverless gives error

I ran a Databricks notebook to do incremental loads from files in the raw layer to bronze layer tables. Today, I encountered a case where the delta file was empty. I tried running it manually on serverless compute and encountered an error. df = spark....

Latest Reply
K_Anudeep
Databricks Employee
  • 6 kudos

Hello @Dhruv-22, Can you share the schema of the df? Do you have a _corrupt_record column in your dataframe? If yes, where are you getting it from, given that you said it's an empty file? By design, Spark blocks queries that only referen...

5 More Replies
vinaykumar
by New Contributor III
  • 10522 Views
  • 7 replies
  • 0 kudos

Log files are not getting deleted automatically after logRetentionDuration interval

Hi team, Log files are not getting deleted automatically from the delta log folder after the logRetentionDuration interval, and after analysis, I see that checkpoint files are not getting created after 10 commits. Below are the table properties, set using spark.sql(    f"""  ...

No checkpoint.parquet
Latest Reply
alex307
Visitor
  • 0 kudos

Did anybody find a solution?

6 More Replies
somedeveloper
by New Contributor III
  • 934 Views
  • 2 replies
  • 1 kudos

Modifying size of /var/lib/lxc

Good morning, When running a library (sparkling water) for a very large dataset, I've noticed that during an export procedure the /var/lib/lxc storage is being used. Since the storage seems to be at a static 130GB of memory, this is a problem because ...

Latest Reply
Walter_C
Databricks Employee
  • 1 kudos

Unfortunately, this is a setting that cannot be increased on the customer side.

1 More Replies
dbdev
by Contributor
  • 27 Views
  • 2 replies
  • 0 kudos

Lakehouse Federation - fetch size parameter for optimization

Hi, We use Lakehouse Federation to connect to a database. A performance recommendation is to use 'fetchSize': Lakehouse Federation performance recommendations - Azure Databricks | Microsoft Learn SELECT * FROM mySqlCatalog.schema.table WITH ('fetchSiz...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @dbdev, You can try to set fetchSize using spark.read.option, as suggested in the article below: Redshift queries using Lakehouse Federation taking longer than expected - Databricks

1 More Replies
VikasSinha
by New Contributor
  • 6287 Views
  • 5 replies
  • 0 kudos

Which is better - Azure Databricks or GCP Databricks?

Which cloud hosting environment is best to use for Databricks? My question boils down to the fact that there must be some difference in latency, throughput, result consistency & reproducibility between different cloud hosting environments of ...

Latest Reply
bidek56
Contributor
  • 0 kudos

@VikasSinha Databricks is not stable regardless of the cloud; jobs and clusters keep crashing. Use Polars or DuckDB instead.

4 More Replies
