Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by smpa01 (New Contributor III)
  • 50 Views
  • 1 reply
  • 1 kudos

Resolved! tbl name as parameter marker

I am getting an error here when I do this:
// this works fine
declare sqlStr = 'select col1 from catalog.schema.tbl LIMIT (?)';
declare arg1 = 500;
EXECUTE IMMEDIATE sqlStr USING arg1;
// this does not
declare sqlStr = 'select col1 from (?) LIMIT (?)';...

Latest Reply
LRALVA
Contributor III
  • 1 kudos

@smpa01 In SQL EXECUTE IMMEDIATE, you can only parameterize values, not identifiers like table names, column names, or database names. That is, placeholders (?) can only replace constant values, not object names (tables, schemas, columns, etc.). SELECT...
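A minimal PySpark sketch of that distinction, assuming a runtime with named parameter markers and the IDENTIFIER clause (recent DBR versions): the LIMIT value binds as an ordinary parameter, while the table name must be routed through IDENTIFIER(); "catalog.schema.tbl" is the thread's placeholder name.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The value parameter binds via a marker; the table name resolves through
# IDENTIFIER(), since markers alone cannot stand in for object names.
df = spark.sql(
    "SELECT col1 FROM IDENTIFIER(:tbl) LIMIT :n",
    args={"tbl": "catalog.schema.tbl", "n": 500},
)
df.show()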

by 397973 (New Contributor III)
  • 57 Views
  • 1 reply
  • 0 kudos

Several unavoidable for loops are slowing this PySpark code. Is it possible to improve it?

Hi. I have a PySpark notebook that takes 25 minutes to run, as opposed to one minute on on-prem Linux + Pandas. How can I speed it up? It's not a volume issue. The input is around 30k rows. Output is the same because there's no filtering or aggregation...

Latest Reply
LRALVA
Contributor III
  • 0 kudos

@397973 Spark is optimized for 100s of GB or millions of rows, NOT small in-memory lookups with heavy control flow (unless engineered carefully). That's why Pandas is much faster for your specific case now. Pre-load and broadcast all mappings: instead of...
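The truncated reply points at broadcasting the lookup mappings. A minimal PySpark sketch of that pattern, with tiny made-up frames standing in for the real data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the thread's data: a ~30k-row input and a
# small lookup mapping that the per-row loop used to consult.
rows = spark.createDataFrame([(1, "A"), (2, "B"), (3, "A")], ["id", "code"])
mapping = spark.createDataFrame([("A", "Alpha"), ("B", "Beta")], ["code", "label"])

# One broadcast join replaces the Python-side loop: the small mapping is
# shipped to every executor, so the main frame is never shuffled.
result = rows.join(F.broadcast(mapping), on="code", how="left")
result.show()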

by minhhung0507 (Contributor III)
  • 643 Views
  • 15 replies
  • 3 kudos

API for Restarting Individual Failed Tasks within a Job?

Hi everyone, I'm exploring ways to streamline my workflow in Databricks and could really use some expert advice. In my current setup, I have a job (named job_silver) with multiple tasks (e.g., task 1, task 2, task 3). When one of these tasks fails—say...

Latest Reply
aayrm5
Valued Contributor III
  • 3 kudos

Hey @minhhung0507 - quick question - what is the cluster type you're using to run your workflow? I'm using a shared, interactive cluster, so I'm passing the parameter {'existing_cluster_id': task['existing_cluster_id']} in the payload. This parameter ...
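For reference, a hedged sketch of the repair-run call this thread converges on: POST /api/2.1/jobs/runs/repair re-runs only the named failed tasks of an existing run. Host, token, run id, and task key below are placeholders.

import requests

HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "run_id": 123456,           # the run that contains the failed task
        "rerun_tasks": ["task_2"],  # only this task is re-executed
    },
)
resp.raise_for_status()
print(resp.json())  # the response carries a repair_id for the new attempt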

14 More Replies
by daan_dw (New Contributor)
  • 226 Views
  • 1 reply
  • 0 kudos

Writing files using multithreading to dbfs

Hello, I am reading in XML files from AWS S3 and storing them on dbfs:/ using multithreaded code. The code itself seems to be fine, as for the first ±100,000 files it works without issues and I can see the data arriving on DBFS. However, it will always...

[Attachment: Screenshot 2025-04-11 at 16.14.04.png]
Latest Reply
SP_6721
New Contributor III
  • 0 kudos

Hi @daan_dw I think this issue mainly comes from using multithreading to handle XML files while interacting with both S3 and DBFS. When the thread count gets too high, it likely causes race conditions. To avoid this: try reducing the number of threads....
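A small illustration of the bounded-concurrency suggestion, assuming a hypothetical copy_one helper that does the S3-to-DBFS work for a single file:

import concurrent.futures as cf

def copy_one(path: str) -> str:
    # ... fetch the XML from S3 and write it under dbfs:/ ...
    return path

paths = [f"s3://bucket/raw/file_{i}.xml" for i in range(1_000)]  # placeholders

# A small fixed pool caps how many S3 reads / DBFS writes are in flight
# at once, instead of letting thousands of threads pile up.
with cf.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(copy_one, p) for p in paths]
    for fut in cf.as_completed(futures):
        fut.result()  # re-raises per-file errors instead of failing silently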

by Mano99 (New Contributor II)
  • 258 Views
  • 2 replies
  • 0 kudos

Resolved! Databricks external table maximum row size

Hi Databricks Team/Community, We have created a Databricks external table on top of ADLS Gen2, both Parquet and Delta tables. We are loading a nested JSON structure into a table. A few columns will have huge nested JSON data. I'm getting results too large...

Latest Reply
dennis65
New Contributor II
  • 0 kudos

@Mano99 wrote: Hi Databricks Team/Community, We have created a Databricks external table on top of ADLS Gen2, both Parquet and Delta tables. We are loading a nested JSON structure into a table. A few columns will have huge nested JSON data. I'm getting...

1 More Reply
by smpa01 (New Contributor III)
  • 176 Views
  • 1 reply
  • 0 kudos

Resolved! global temp view issue

I am following doc1 and doc2 but I am getting an error. I was under the impression from the documentation that this is doable in pure SQL. What am I doing wrong? I know how to do this in Python using the DataFrame API, and I am not looking for that soluti...

[Attachment: smpa01_0-1745253862295.png]
Latest Reply
smpa01
New Contributor III
  • 0 kudos

It was just missing a ';'.
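For anyone landing here, a sketch of the flow in question, one statement per call; in a single SQL cell the two statements must be separated by the ';' that was missing. View and column names are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the global temp view, then query it through the global_temp schema.
spark.sql("CREATE OR REPLACE GLOBAL TEMPORARY VIEW my_view AS SELECT 1 AS col1")
spark.sql("SELECT col1 FROM global_temp.my_view").show()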

by minhhung0507 (Contributor III)
  • 139 Views
  • 1 reply
  • 0 kudos

Handling Hanging Pipelines in Real-Time Environments: Leveraging Databricks’ Idle Event Monitoring

Hi everyone, I’m running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. While most pipelines are running smoothly, I’ve noticed that a few of them occasionally get “stuck” or hang for several hours w...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

May I ask why you use thread pools? With jobs you can define multiple tasks which do the same. I'm asking because thread pools and Spark resource management can interfere with each other.
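To make the suggestion concrete, a hedged sketch of the same fan-out expressed as parallel job tasks via the Jobs 2.1 create API; paths and names are placeholders, and cluster settings are omitted for brevity.

import requests

HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "realtime_pipelines",
    "tasks": [  # tasks with no depends_on between them run in parallel
        {"task_key": "pipeline_a", "notebook_task": {"notebook_path": "/pipelines/a"}},
        {"task_key": "pipeline_b", "notebook_task": {"notebook_path": "/pipelines/b"}},
    ],
}
resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()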

by Dnirmania (Contributor)
  • 662 Views
  • 4 replies
  • 0 kudos

Read file from AWS S3 using Azure Databricks

Hi Team, I am currently working on a project to read CSV files from an AWS S3 bucket using an Azure Databricks notebook. My ultimate goal is to set up an Auto Loader in Azure Databricks that reads new files from S3 and loads the data incrementally. However...

[Attachment: Dnirmania_0-1744106993274.png]
Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 0 kudos

No, it is very easy. Follow this guide and it will work: https://github.com/aviral-bhardwaj/MyPoCs/blob/main/SparkPOC/ETLProjectsAWS-S3toDatabricks.ipynb
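A hedged sketch of the Auto Loader setup being attempted, with placeholder bucket, credentials, and checkpoint path. Real keys belong in a secret scope, and some clusters need these set in the cluster config with a spark.hadoop. prefix instead.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AWS credentials must be supplied explicitly on Azure (placeholders here).
spark.conf.set("fs.s3a.access.key", "<aws-access-key-id>")
spark.conf.set("fs.s3a.secret.key", "<aws-secret-access-key>")

stream = (
    spark.readStream.format("cloudFiles")  # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "dbfs:/checkpoints/s3_csv/_schema")
    .load("s3a://my-bucket/incoming/")
)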

3 More Replies
by mrstevegross (Contributor III)
  • 541 Views
  • 3 replies
  • 0 kudos

Graviton & containers?

Currently, DBR does not permit a user to run a containerized job on Graviton machines (per these docs). In our case, we're running containerized jobs on a pool. We are exploring adopting Graviton, but--per those docs--DBR won't let us do that. Are t...

Latest Reply
Isi
Contributor III
  • 0 kudos

Hey @mrstevegross Steve, I have found these docs from Databricks about environments; as you can see, it is in public preview... If you find my previous answer helpful, feel free to mark it as the solution so it can help others as well. Thanks! Isi

2 More Replies
by vishaldevarajan (New Contributor II)
  • 474 Views
  • 3 replies
  • 0 kudos

Unable to read Excel files in Azure Databricks (UC-enabled workspace)

Hello, After adding the Maven library com.crealytics:spark-excel_2.12:0.13.5 to the artifact allowlist, I installed it at the Azure Databricks cluster level (shared, Unity Catalog enabled, runtime 15.4). Then I tried to create a df for the exc...

Labels: Data Engineering, Azure Databricks, Excel File
Latest Reply
BigRoux
Databricks Employee
  • 0 kudos

I did a little more digging and found further information: Unity Catalog does not natively support reading Excel files directly. Based on the provided context, there are a few key points to consider. Third-party libraries: reading Excel files in D...
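For reference, a hedged sketch of reading an Excel file with the spark-excel library this thread installs; the storage path and sheet address are placeholders, and on a shared UC cluster the custom data source may still be blocked, which is the limitation the reply describes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("dataAddress", "'Sheet1'!A1")  # sheet name and top-left cell
    .option("inferSchema", "true")
    .load("abfss://container@account.dfs.core.windows.net/raw/report.xlsx")
)
df.show()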

2 More Replies
by Tommabip (New Contributor II)
  • 289 Views
  • 3 replies
  • 2 kudos

Resolved! Databricks Cluster Policies

Hi, I'm trying to create a Terraform script that does the following:
- create a policy where I specify env variables and libraries
- create a cluster that inherits from that policy and uses the env variables specified in the policy.
I saw in the docume...

Latest Reply
BigRoux
Databricks Employee
  • 2 kudos

You're correct in observing this discrepancy. When a cluster policy is defined and applied through the Databricks UI, fixed environment variables (`spark_env_vars`) specified in the policy automatically propagate to clusters created under that policy...
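A hedged sketch of such a policy, created here through the Python SDK with an illustrative name and variable; the same definition JSON is what Terraform's databricks_cluster_policy resource takes in its definition field.

import json
from databricks.sdk import WorkspaceClient

# Auth is assumed to come from the environment (e.g. DATABRICKS_HOST/TOKEN).
w = WorkspaceClient()

# A "fixed" policy entry pins the env var on every cluster created under
# the policy, which is the propagation behavior the reply describes.
definition = {
    "spark_env_vars.MY_ENV": {"type": "fixed", "value": "prod"},
}
policy = w.cluster_policies.create(
    name="env-pinned-policy",
    definition=json.dumps(definition),
)
print(policy.policy_id)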

2 More Replies
by valde (New Contributor)
  • 144 Views
  • 1 reply
  • 0 kudos

Window function vs. groupBy + map

Let's say we have an RDD like this: RDD(id: Int, measure: Int, date: LocalDate). Let's say we want to apply some function that compares 2 consecutive measures by date, outputs a number, and we want to get the sum of those numbers by id. The function is b...

Latest Reply
Renu_
Contributor
  • 0 kudos

Hi @valde, those two approaches give the same result, but they don’t work the same way under the hood. Spark SQL uses optimized window functions that handle things like shuffling and memory more efficiently, often making it faster and lighter. On the o...
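A small PySpark sketch of the window-function formulation, with made-up data: lag() compares consecutive measures per id ordered by date, then a groupBy sums the per-row results.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10, "2024-01-01"), (1, 13, "2024-01-02"), (2, 7, "2024-01-01")],
    ["id", "measure", "date"],
)

# Compare each measure with the previous one within the same id.
w = Window.partitionBy("id").orderBy("date")
deltas = df.withColumn("delta", F.col("measure") - F.lag("measure").over(w))

# Then aggregate the comparison results per id.
result = deltas.groupBy("id").agg(F.sum("delta").alias("total"))
result.show()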
