Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Joost1024
by New Contributor
  • 172 Views
  • 4 replies
  • 0 kudos

Read Array of Arrays of Objects JSON file using Spark

Hi Databricks Community! This is my first post in this forum, so I hope you can forgive me if it's not according to the forum best practices. After lots of searching, I decided to share the peculiar issue I'm running into in this community. I try to lo...

Latest Reply
Joost1024
New Contributor
  • 0 kudos

I guess I was a bit overenthusiastic in accepting the answer. When I run the following on the single-object array of arrays (as shown in the original post), I get a single row with column "value" and value null: from pyspark.sql import functions as F,...
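
For anyone landing here with the same top-level array-of-arrays JSON: one workaround (a minimal sketch, with hypothetical field names, since the original schema is truncated above) is to read the file as a single text value and parse it with an explicit array<array<struct>> schema, then flatten with two explodes:

    from pyspark.sql import functions as F

    # Read the whole JSON document as one row (the column is named "value").
    raw = spark.read.text("/path/to/file.json", wholetext=True)

    # Parse with an explicit array-of-arrays schema, then flatten twice.
    # "id" and "name" are placeholder field names.
    schema = "array<array<struct<id:bigint,name:string>>>"
    parsed = (
        raw.select(F.from_json("value", schema).alias("outer"))
           .select(F.explode("outer").alias("inner"))
           .select(F.explode("inner").alias("obj"))
           .select("obj.*")
    )
    parsed.show()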

3 More Replies
ismaelhenzel
by Contributor III
  • 25 Views
  • 0 replies
  • 0 kudos

Declarative Pipelines - Dynamic Overwrite

Regarding the limitations of declarative pipelines—specifically the inability to use replaceWhere—I discovered through testing that materialized views actually support dynamic overwrites. This handles several scenarios where replaceWhere would typica...
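
For reference, the classic (non-pipeline) equivalent of a dynamic overwrite on a partitioned Delta table looks like the sketch below; the table name is made up, and it assumes a runtime whose Delta writer supports the partitionOverwriteMode option:

    # Only the partitions present in df are replaced; all others are kept.
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .saveAsTable("main.sales.daily_orders"))  # hypothetical table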

Shimon
by New Contributor
  • 177 Views
  • 2 replies
  • 0 kudos

Jackson version conflict

Hi, I am trying to implement the Spark TableProvider API and I am experiencing a JAR conflict (I am using the 17.3 runtime): com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.15.2 requires Jackson Databind version >= 2.15.0 and < 2.1...

Latest Reply
Shimon
New Contributor
  • 0 kudos

For now we are trying to contact Databricks; in the worst case we were planning to shade the dependencies we need. Would love to hear what has worked for you. Best, Shimon
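
If it helps others triage the same conflict, one quick diagnostic (a sketch, assuming driver access via spark._jvm) is to ask Jackson itself which jackson-databind version the JVM actually loaded:

    # PackageVersion is Jackson's own version-marker class; comparing its
    # value against what the Scala module expects pinpoints the mismatch.
    v = spark._jvm.com.fasterxml.jackson.databind.cfg.PackageVersion.VERSION
    print(v.toString())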

1 More Replies
dvd_lg_bricks
by New Contributor II
  • 486 Views
  • 10 replies
  • 3 kudos

Questions About Workers and Executors Configuration in Databricks

Hi everyone, sorry, I’m new here. I’m considering migrating to Databricks, but I need to clarify a few things first. When I define and launch an application, I see that I can specify the number of workers, and then later configure the number of execut...

Latest Reply
Abeshek
New Contributor II
  • 3 kudos

Regarding your Databricks question about workers versus executors: many teams encounter the same sizing and configuration issues when evaluating a migration. At Kanerika, we help companies plan cluster architecture, optimize Spark workloads, and avoid overspen...
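
On the underlying question: on Databricks each worker node hosts a single executor by default, so the worker count effectively is the executor count. A quick way to confirm what the cluster actually applied (a sketch; sc is the SparkContext predefined in Databricks notebooks):

    # Executor settings as resolved by the cluster (defaults if unset).
    print(spark.conf.get("spark.executor.memory", "not set"))
    print(spark.conf.get("spark.executor.cores", "not set"))

    # Total task slots across all executors.
    print(sc.defaultParallelism)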

9 More Replies
singhanuj2803
by Contributor
  • 283 Views
  • 4 replies
  • 1 kudos

Troubleshooting Azure Databricks Cluster Pools & spot_bid_max_price Validation Error

Hope you’re doing well! I’m reaching out for some guidance on an issue I’ve encountered while setting up Azure Databricks Cluster Pools to reduce cluster spin-up and scale times for our jobs. Background: To optimize job execution wait times, I’ve create...

Latest Reply
Poorva21
New Contributor II
  • 1 kudos

Possible reasons:
1. Setting spot_bid_max_price = -1 is not accepted by Azure pools. Azure Databricks only accepts:
  • 0 → on-demand only
  • positive numbers → max spot price
-1 is allowed in cluster policies, but not inside pools, so validation never completes....
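
Following that reasoning, a pool definition via the Databricks SDK for Python would use an explicit non-negative cap; a sketch, assuming the databricks-sdk compute classes, with hypothetical pool and node-type names:

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import compute

    w = WorkspaceClient()

    # Per the reply: pools accept 0 (on-demand) or a positive spot cap,
    # not -1. Here the spot bid is capped at $0.50/hour.
    pool = w.instance_pools.create(
        instance_pool_name="jobs-spot-pool",
        node_type_id="Standard_DS3_v2",
        azure_attributes=compute.InstancePoolAzureAttributes(
            availability=compute.InstancePoolAzureAttributesAvailability.SPOT_AZURE,
            spot_bid_max_price=0.5,
        ),
    )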

3 More Replies
mordex
by New Contributor II
  • 329 Views
  • 4 replies
  • 1 kudos

Resolved! Why is spark creating 5 jobs and 200 tasks?

I am trying to read 1000 small CSV files, each 30 KB in size, which are stored in a Databricks volume. Below is the query I am running:

    df = spark.read.options(header=True).csv('/path')
    df.collect()

Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks, 4 ha...

Latest Reply
Raman_Unifeye
Contributor III
  • 1 kudos

@mordex - yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism. Run the command below to get its value: spark.c...
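
For completeness, a sketch of inspecting (and, if your runtime honours session-level overrides for it, raising) that setting before the read:

    key = "spark.sql.sources.parallelPartitionDiscovery.parallelism"
    print(spark.conf.get(key, "unset"))   # current listing parallelism

    # Optionally raise it before reading many small files.
    spark.conf.set(key, "400")
    df = spark.read.options(header=True).csv("/path")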

3 More Replies
__Aziz__
by New Contributor II
  • 170 Views
  • 1 reply
  • 1 kudos

Resolved! mongodb connector duplicate writes

Hi everyone, has anyone run into this issue? I’m using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is: overwrite the collection (very fast), then create the indexes. Occasionally, I’m seeing duplicates...

Latest Reply
bianca_unifeye
Contributor
  • 1 kudos

Hi Aziz, what you’re seeing is expected behaviour when combining Spark retries with non-idempotent writes. Spark’s write path is task-based and fault-tolerant. If a task fails part-way through writing to MongoDB, Spark will retry that task. From Spar...
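
One common mitigation, sketched below with MongoDB Spark Connector 10.x option names (the database, collection, and key column are hypothetical): write with replace semantics keyed on a stable business key, so a retried task replaces the same documents instead of inserting fresh copies with new _ids:

    (df.write
       .format("mongodb")
       .mode("overwrite")
       .option("connection.uri", "<connection-uri>")
       .option("database", "mydb")
       .option("collection", "mycol")
       .option("operationType", "replace")     # replace, not blind insert
       .option("idFieldList", "business_key")  # stable unique key in df
       .save())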

SRJDB
by New Contributor II
  • 171 Views
  • 1 reply
  • 0 kudos

Why am I getting a cast invalid input error when using display()?

I have a Spark DataFrame. It consists of a single column, in string format, with 28750 values in it. The values are all 10 digits long. I want to look at the data, like this: my_dataframe.display(). But this returns the following error: [CAST_INVALID_IN...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @SRJDB, could you execute my_dataframe.printSchema() and attach the result here?
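
If the schema turns out not to be plain strings, one way to sidestep the cast while inspecting the data (a sketch; the column name is hypothetical):

    from pyspark.sql import functions as F

    my_dataframe.printSchema()

    # Force the column to string before rendering, so display() never
    # attempts a numeric cast on the raw values.
    my_dataframe.select(F.col("value").cast("string").alias("value")).display()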

iFoxz17
by New Contributor II
  • 192 Views
  • 3 replies
  • 1 kudos

Databricks academy error setup - Free Edition with Serverless Compute

Databricks is moving from the Community Edition to the Free Edition, which I am currently using. When executing the Includes/Classroom-setup notebooks, the following exception is raised: [CONFIG_NOT_AVAILABLE] Configuration dbacademy.deprecation.loggi...

Latest Reply
iFoxz17
New Contributor II
  • 1 kudos

@ManojkMohan, as mentioned in the first post, I already used dict(spark.conf.getAll()).get(key, default) where possible. However, the problem remains when importing modules, like:
- from dbacademy import dbgems
- from dbacademy.dbhelper import DBAcademyHelp...

2 More Replies
Charansai
by New Contributor III
  • 138 Views
  • 1 reply
  • 0 kudos

Serverless Compute – ADLS Gen2 Authorization Failure with RBAC

We are facing an authorization issue when using serverless compute with ADLS Gen2 storage. Queries fail with: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403 AuthorizationFailureDetai...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 0 kudos

You likely need a private link from serverless, as you are probably not allowing public internet access. See "Configure private connectivity to Azure resources - Azure Databricks | Microsoft Learn". You need to add both the dfs and blob endpoints.

jitendrajha11
by New Contributor II
  • 456 Views
  • 5 replies
  • 2 kudos

Want to see logs for lineage view run events

Hi all, I need your help. The jobs I am running complete successfully. When I click on a job, there is a Lineage > View run events option; when I click on it, I see the steps below. Job Started: The job is triggered. Waiting for Cluster: The job wait...

Latest Reply
mitchellg-db
Databricks Employee
  • 2 kudos

Hi there, I vibe-coded* a query where I was able to derive most of your events from the system tables: system.lakeflow.jobs, system.lakeflow.job_run_timeline, system.lakeflow.job_task_run_timeline. If you have SELECT access to system tables, this could b...
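
A sketch of that kind of query (the column names are my assumption of the lakeflow system-table schemas; check DESCRIBE on your workspace, and note that system.lakeflow.jobs keeps one row per job version, so the join can fan out):

    events = spark.sql("""
        SELECT t.job_id,
               j.name AS job_name,
               t.run_id,
               t.period_start_time,
               t.period_end_time,
               t.result_state
        FROM system.lakeflow.job_run_timeline AS t
        LEFT JOIN system.lakeflow.jobs AS j
          ON t.workspace_id = j.workspace_id AND t.job_id = j.job_id
        ORDER BY t.period_start_time DESC
    """)
    events.display()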

4 More Replies
Brahmareddy
by Esteemed Contributor
  • 515 Views
  • 4 replies
  • 9 kudos

Future of Movie Discovery: How I Built an AI Movie Recommendation Agent on Databricks Free Edition

As a data engineer deeply passionate about how data and AI can come together to create real-world impact, I’m excited to share my project for the Databricks Free Edition Hackathon 2025 — Future of Movie Discovery (FMD). Built entirely on Databricks F...

Latest Reply
AlbertaBode
New Contributor II
  • 9 kudos

Really cool project! The mood-based movie matching and conversational memory make the whole discovery experience feel way more intuitive. It’s interesting because most people still browse platforms manually, like on a streaming app, but your system s...

3 More Replies
Naveenkumar1811
by New Contributor III
  • 264 Views
  • 5 replies
  • 2 kudos

Reduce the Time for First Spark Streaming Run Kick off

Hi team, currently I have a Silver Delta table (external) loading via streaming, and the Gold is loaded in batch. I need to make the Gold Delta table streaming as well. In my first run, I can see the stream initialization process taking an hour or so, as my Silver ta...

Latest Reply
Prajapathy_NKR
Contributor
  • 2 kudos

@Naveenkumar1811 Since your silver is a streaming job, there can be lots of files and metadata being created, depending on your write interval and the frequency of new data. If more files are created every few minutes, it potentially leads to small file...
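
Two common mitigations, sketched below (the table name is hypothetical): compact the silver table's small files, and bound how many files each micro-batch of the initial snapshot processes:

    # Compact small files accumulated by the streaming writes.
    spark.sql("OPTIMIZE main.lakehouse.silver_events")

    # Cap files per micro-batch so the first run makes steady progress
    # instead of planning the whole table at once.
    gold_stream = (
        spark.readStream
             .format("delta")
             .option("maxFilesPerTrigger", 1000)
             .table("main.lakehouse.silver_events")
    )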

4 More Replies
Johan_Van_Noten
by New Contributor III
  • 284 Views
  • 3 replies
  • 2 kudos

Long-running Python http POST hangs

As one of the steps in my data engineering pipeline, I need to perform a POST request to an HTTP (not HTTPS) server. This all works fine, except in the situation described below, where it hangs indefinitely. Environment: Azure Databricks Runtime 13.3 LTS, Pyt...

Latest Reply
siva-anantha
Contributor
  • 2 kudos

Hello, IMHO, having an HTTP-related task in a Spark cluster is an anti-pattern. This kind of code executes on the driver; it is synchronous and adds overhead. This is one of the reasons DLT (or SDP, Spark Declarative Pipelines) does not have REST...
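
Whatever the architecture, the indefinite hang itself is usually avoidable with explicit timeouts on the client; a sketch with requests (URL and payload are placeholders):

    import requests

    resp = requests.post(
        "http://example.internal/endpoint",
        json={"key": "value"},
        timeout=(10, 300),  # 10 s to connect, 300 s to wait for the reply
    )
    resp.raise_for_status()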

2 More Replies