Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

dvd_lg_bricks
by New Contributor
  • 293 Views
  • 10 replies
  • 3 kudos

Questions About Workers and Executors Configuration in Databricks

Hi everyone, sorry, I’m new here. I’m considering migrating to Databricks, but I need to clarify a few things first. When I define and launch an application, I see that I can specify the number of workers, and then later configure the number of execut...

Latest Reply
Abeshek
New Contributor
  • 3 kudos

Regarding your Databricks question about workers versus executors: many teams encounter the same sizing and configuration issues when evaluating a migration. At Kanerika, we help companies plan cluster architecture, optimize Spark workloads, and avoid overspen...

9 More Replies
singhanuj2803
by Contributor
  • 251 Views
  • 4 replies
  • 1 kudos

Troubleshooting Azure Databricks Cluster Pools & spot_bid_max_price Validation Error

Hope you’re doing well! I’m reaching out for some guidance on an issue I’ve encountered while setting up Azure Databricks Cluster Pools to reduce cluster spin-up and scale times for our jobs. Background: To optimize job execution wait times, I’ve create...

Latest Reply
Poorva21
New Contributor II
  • 1 kudos

Possible reasons:
1. Setting spot_bid_max_price = -1 is not accepted by Azure pools. Azure Databricks only accepts:
   • 0 → on-demand only
   • positive numbers → max spot price
   -1 is allowed in cluster policies, but not inside pools, so validation never completes....
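For illustration, a minimal sketch of a pool-creation payload that follows this advice, using an explicit positive max price instead of -1. The workspace URL, token, pool name, node type, and price are placeholders, and the field layout follows the Instance Pools API:

    import requests

    payload = {
        "instance_pool_name": "jobs-pool",      # hypothetical pool name
        "node_type_id": "Standard_DS3_v2",      # hypothetical node type
        "azure_attributes": {
            "availability": "SPOT_AZURE",
            "spot_bid_max_price": 0.5,          # explicit positive max price; avoid -1 for pools
        },
    }
    resp = requests.post(
        "https://<workspace-url>/api/2.0/instance-pools/create",  # placeholder workspace URL
        headers={"Authorization": "Bearer <token>"},               # placeholder token
        json=payload,
    )
    print(resp.status_code, resp.text)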

3 More Replies
mordex
by New Contributor II
  • 304 Views
  • 4 replies
  • 1 kudos

Resolved! Why is Spark creating 5 jobs and 200 tasks?

I am trying to read 1000 small CSV files, each 30 KB in size, which are stored in a Databricks volume. Below is the query I am running: df = spark.read.options(header=True).csv('/path'); df.collect(). Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks, and 4 ha...

Latest Reply
Raman_Unifeye
Contributor III
  • 1 kudos

@mordex - yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism. Run the command below to get its value: spark.c...
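A hedged sketch of what the truncated command likely inspects, plus an optional session-level override (the path and the new value are placeholders, not recommendations from the thread):

    # Read the current file-listing parallelism setting:
    print(spark.conf.get("spark.sql.sources.parallelPartitionDiscovery.parallelism"))

    # Optionally raise it for the session before reading many small files:
    spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "400")
    df = spark.read.options(header=True).csv("/path")  # placeholder path from the post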

3 More Replies
__Aziz__
by New Contributor II
  • 148 Views
  • 1 reply
  • 1 kudos

Resolved! mongodb connector duplicate writes

Hi everyone, has anyone run into this issue? I’m using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is: overwrite the collection (very fast), then create the indexes. Occasionally, I’m seeing duplicates...

Latest Reply
bianca_unifeye
Contributor
  • 1 kudos

Hi Aziz, what you’re seeing is expected behaviour when combining Spark retries with non-idempotent writes. Spark’s write path is task-based and fault-tolerant: if a task fails part-way through writing to MongoDB, Spark will retry that task. From Spar...
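One common mitigation, sketched under the assumption of MongoDB Spark Connector 10.x option names (the URI, database, collection, and key field below are placeholders): replace documents by a stable key so a retried task overwrites rather than duplicates.

    (df.write
       .format("mongodb")
       .mode("append")
       .option("connection.uri", "mongodb://<host>:27017")  # placeholder URI
       .option("database", "mydb")                          # placeholder database
       .option("collection", "mycoll")                      # placeholder collection
       .option("operationType", "replace")                  # replace by key instead of insert
       .option("idFieldList", "business_key")               # hypothetical stable key
       .save())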

SRJDB
by New Contributor II
  • 148 Views
  • 1 reply
  • 0 kudos

Why am I getting a cast invalid input error when using display()?

I have a Spark DataFrame. It consists of a single string column with 28,750 values, each 10 digits long. I want to look at the data like this: my_dataframe.display(). But this returns the following error: [CAST_INVALID_IN...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @SRJDB, could you execute my_dataframe.printSchema() and attach the result here?

iFoxz17
by New Contributor II
  • 179 Views
  • 3 replies
  • 1 kudos

Databricks Academy setup error - Free Edition with Serverless Compute

Databricks is transitioning from the Community Edition to the Free Edition, which I am currently using. When executing the Includes/Classroom-setup notebooks, the following exception is raised: [CONFIG_NOT_AVAILABLE] Configuration dbacademy.deprecation.loggi...

Latest Reply
iFoxz17
New Contributor II
  • 1 kudos

@ManojkMohan as mentioned in the first post, I already used dict(spark.conf.getAll()).get(key, default) where possible. However, the problem persists when importing modules, like:
- from dbacademy import dbgems
- from dbacademy.dbhelper import DBAcademyHelp...
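For reference, a minimal sketch of that fallback pattern wrapped as a helper (the helper name and config key are hypothetical); as noted, it cannot fix lookups performed inside the imported dbacademy modules themselves:

    def conf_get_or_default(key, default=None):
        """Read a Spark conf, falling back to a default when it is unavailable."""
        try:
            return spark.conf.get(key)
        except Exception:  # e.g. CONFIG_NOT_AVAILABLE on serverless
            return default

    flag = conf_get_or_default("some.config.key", "false")  # hypothetical key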

2 More Replies
Charansai
by New Contributor III
  • 115 Views
  • 1 reply
  • 0 kudos

Serverless Compute – ADLS Gen2 Authorization Failure with RBAC

We are facing an authorization issue when using serverless compute with ADLS Gen2 storage. Queries fail with: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403 AuthorizationFailureDetai...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

Use Private Link from serverless, as you are probably not allowing public internet access. See Configure private connectivity to Azure resources - Azure Databricks | Microsoft Learn; you need to add both the dfs and blob endpoints.

jitendrajha11
by New Contributor II
  • 411 Views
  • 5 replies
  • 2 kudos

Want to see logs for lineage view run events

Hi All, I need your help. My jobs are running successfully, and when I click on a job there is a Lineage > View run events option. When I click on it, I see the steps below. Job Started: the job is triggered. Waiting for Cluster: the job wait...

Latest Reply
mitchellg-db
Databricks Employee
  • 2 kudos

Hi there, I vibe-coded* a query where I was able to derive most of your events from the system tables system.lakeflow.jobs, system.lakeflow.job_run_timeline, and system.lakeflow.job_task_run_timeline. If you have SELECT access to system tables, this could b...
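A hedged sketch of such a query (the column and job names are assumptions based on the documented lakeflow schemas; adjust to what your workspace exposes):

    events = spark.sql("""
        SELECT t.job_id,
               t.run_id,
               t.period_start_time,
               t.period_end_time,
               t.result_state
        FROM system.lakeflow.job_run_timeline AS t
        JOIN system.lakeflow.jobs AS j
          ON t.job_id = j.job_id
        WHERE j.name = 'my_job'  -- hypothetical job name
        ORDER BY t.period_start_time DESC
    """)
    events.display()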

4 More Replies
Brahmareddy
by Esteemed Contributor
  • 475 Views
  • 4 replies
  • 9 kudos

Future of Movie Discovery: How I Built an AI Movie Recommendation Agent on Databricks Free Edition

As a data engineer deeply passionate about how data and AI can come together to create real-world impact, I’m excited to share my project for the Databricks Free Edition Hackathon 2025 — Future of Movie Discovery (FMD). Built entirely on Databricks F...

Latest Reply
AlbertaBode
New Contributor II
  • 9 kudos

Really cool project! The mood-based movie matching and conversational memory make the whole discovery experience feel way more intuitive. It’s interesting because most people still browse platforms manually, like on streaming apps, but your system s...

3 More Replies
Naveenkumar1811
by New Contributor III
  • 244 Views
  • 5 replies
  • 2 kudos

Reduce the Time for First Spark Streaming Run Kickoff

Hi Team, currently I have a Silver Delta table (external) loading via streaming, and the Gold is on batch. I need to make the Gold Delta streaming as well. In my first run, I can see the stream initialization process taking an hour or so, as my Silver ta...

Latest Reply
Prajapathy_NKR
Contributor
  • 2 kudos

@Naveenkumar1811 Since your Silver is a streaming job, there can be lots of files and metadata being created, depending on your write interval and the frequency of new data. If many files are created every few minutes, it potentially leads to small file...
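A minimal sketch of the compaction ideas this reply points toward, assuming a Delta table named silver_table (a placeholder) and the standard Databricks auto-compaction table properties:

    # Enable optimized writes and auto-compaction for future streaming writes:
    spark.sql("""
        ALTER TABLE silver_table SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact' = 'true'
        )
    """)

    # One-off compaction of the existing small files:
    spark.sql("OPTIMIZE silver_table")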

4 More Replies
Johan_Van_Noten
by New Contributor III
  • 247 Views
  • 3 replies
  • 2 kudos

Long-running Python http POST hangs

As one of the steps in my data engineering pipeline, I need to perform a POST request to an http (not -s) server. This all works fine, except for the situation described below: it then hangs indefinitely. Environment: Azure Databricks Runtime 13.3 LTS, Pyt...

Latest Reply
siva-anantha
Contributor
  • 2 kudos

Hello, IMHO, having an HTTP-related task in a Spark cluster is an anti-pattern. This kind of code executes at the driver; it will be synchronous and adds overhead. This is one of the reasons DLT (or SDP, Spark Declarative Pipelines) does not have REST...
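Separately from the architectural point: if the POST has to stay in the pipeline for now, an explicit timeout avoids the indefinite hang the question describes (the URL and payload are placeholders):

    import requests

    resp = requests.post(
        "http://internal-server/endpoint",  # placeholder URL
        json={"key": "value"},              # placeholder payload
        timeout=(10, 600),                  # (connect, read) seconds; tune to the server
    )
    resp.raise_for_status()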

2 More Replies
mafzal669
by New Contributor
  • 142 Views
  • 1 replies
  • 0 kudos

Admin user creation

Hi, I have created an Azure account using my personal email ID. I want to add this email ID as a group ID in the Databricks admin console, but when I add a new user it says a user with this email ID already exists. Could someone please help? I use...

Latest Reply
Raman_Unifeye
Contributor III
  • 0 kudos

Because user IDs and group IDs share the same namespace in Databricks, you cannot create a group with the same email address as one that is already registered as a user in your Databricks account. You should give the group a different name instead.

pooja_bhumandla
by New Contributor III
  • 214 Views
  • 2 replies
  • 1 kudos

Error: Executor Memory Issue with Broadcast Joins in Structured Streaming – Unable to Store 69–80 MB

Hi Community, I encountered the following error: Failed to store executor broadcast spark_join_relation_1622863 (size = Some(67141632)) in BlockManager with storageLevel=StorageLevel(memory, deserialized, 1 replicas) in a Structured S...

Latest Reply
Yogesh_Verma_
Contributor II
  • 1 kudos

What Spark does during a broadcast join:
- Spark identifies the smaller table (say 80 MB).
- The driver collects this small table to a single JVM.
- The driver serializes the table into a broadcast variable.
- The broadcast variable is shipped to all executors.
- Ex...
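A hedged sketch of the usual mitigations for this error: inspect the broadcast threshold, then either disable automatic broadcasting or steer the join with a hint (the DataFrame names are placeholders):

    # Check the current threshold (default is around 10 MB):
    print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

    # Disable automatic broadcasting so the ~80 MB side is not broadcast:
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    # Or steer a single join toward sort-merge with a hint:
    joined = big_df.join(small_df.hint("merge"), "id")  # placeholder DataFrames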

1 More Reply
RIDBX
by Contributor
  • 478 Views
  • 6 replies
  • 1 kudos

Pushing data from databricks (cloud) to Oracle (on-prem) instance?

Thanks for reviewing my threads. I found some threads on this subject dated 2022 by @Ajay-Pandey (Databricks to Oracle). We find many...

Latest Reply
iyashk-DB
Databricks Employee
  • 1 kudos

Option 1: Spark JDBC write from Databricks to Oracle (recommended for “push”/ingestion). Use the built-in Spark JDBC writer with Oracle’s JDBC driver. It’s the most direct path for writing into on-prem Oracle and gives you control over batching, paral...
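A minimal sketch of that JDBC write (the host, service, credentials, table, and secret scope are placeholders; the Oracle JDBC driver must be installed on the cluster):

    (df.write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//onprem-host:1521/ORCLPDB1")   # placeholder host/service
       .option("dbtable", "SCHEMA.TARGET_TABLE")                         # placeholder table
       .option("user", "dbuser")                                         # placeholder user
       .option("password", dbutils.secrets.get("scope", "oracle-pwd"))   # placeholder secret
       .option("driver", "oracle.jdbc.OracleDriver")
       .option("batchsize", "10000")   # rows per JDBC batch
       .option("numPartitions", "8")   # parallel connections to Oracle
       .mode("append")
       .save())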

5 More Replies