Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Johan_Van_Noten
by New Contributor III
  • 113 Views
  • 3 replies
  • 2 kudos

Long-running Python http POST hangs

As one of the steps in my data engineering pipeline, I need to perform a POST request to an HTTP (not HTTPS) server. This all works fine, except for the situation described below, where it hangs indefinitely. Environment: Azure Databricks Runtime 13.3 LTS, Pyt...

Latest Reply
siva-anantha
New Contributor III
  • 2 kudos

Hello, IMHO, having an HTTP-related task in a Spark cluster is an anti-pattern. This kind of code executes on the driver, it is synchronous, and it adds overhead. This is one of the reasons DLT (or SDP, Spark Declarative Pipelines) does not have REST...
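For context, the classic mitigation for the hang itself is an explicit timeout on the request; a minimal sketch, with a placeholder endpoint and payload:

```python
# Sketch: an HTTP POST with explicit connect/read timeouts so an
# unresponsive server cannot hang the driver indefinitely.
# The endpoint and payload are placeholders, not from the thread.
import requests

resp = requests.post(
    "http://example-host/ingest",   # plain http, as in the question
    json={"run_id": 42},
    timeout=(10, 300),              # (connect, read) seconds
)
resp.raise_for_status()
```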

2 More Replies
mafzal669
by New Contributor
  • 45 Views
  • 1 reply
  • 0 kudos

Admin user creation

Hi, I have created an Azure account using my personal email ID. I want to add this email ID as a group ID in the Databricks admin console, but when I add a new user it says a user with this email ID already exists. Could someone please help? I use...

Latest Reply
Raman_Unifeye
Contributor III
  • 0 kudos

As user IDs and group IDs share the same namespace in Databricks, you cannot create a group with an email address that is already registered as a user in your Databricks account. You would be better off renaming the group.

pooja_bhumandla
by New Contributor III
  • 83 Views
  • 2 replies
  • 1 kudos

Error: Executor Memory Issue with Broadcast Joins in Structured Streaming – Unable to Store 69–80 MB

Hi Community, I encountered the following error: "Failed to store executor broadcast spark_join_relation_1622863 (size = Some(67141632)) in BlockManager with storageLevel=StorageLevel(memory, deserialized, 1 replicas)" in a Structured S...

[screenshot attached]
Latest Reply
Yogesh_Verma_
Contributor
  • 1 kudos

What Spark does during a broadcast join: Spark identifies the smaller table (say 80 MB). The driver collects this small table to a single JVM. The driver serializes the table into a broadcast variable. The broadcast variable is shipped to all executors. Ex...
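A sketch of the usual knobs when a broadcast relation stops fitting in executor memory; the values below are illustrative only, not a recommendation:

```python
# Illustrative settings only; tune to your workload.
# Lower the size cap under which Spark will auto-broadcast a relation:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(32 * 1024 * 1024))  # 32 MB

# Or disable automatic broadcasting entirely, so Spark falls back to a
# shuffle-based join for the problematic join:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```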

1 More Reply
RIDBX
by Contributor
  • 266 Views
  • 6 replies
  • 1 kudos

Pushing data from databricks (cloud) to Oracle (on-prem) instance?

Thanks for reviewing my threads. I found some threads on this subject dated 2022 by @Ajay-Pandey (Databricks to Oracle). We find many...

Latest Reply
iyashk-DB
Databricks Employee
  • 1 kudos

Option 1: Spark JDBC write from Databricks to Oracle (recommended for “push”/ingestion). Use the built‑in Spark JDBC writer with Oracle’s JDBC driver. It’s the most direct path for writing into on‑prem Oracle and gives you control over batching, paral...
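A minimal sketch of that JDBC push, assuming an existing DataFrame df and placeholder host, secret scope, and table names:

```python
# Sketch of a Spark JDBC write to on-prem Oracle; all names are placeholders.
(df.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
    .option("dbtable", "TARGET_SCHEMA.TARGET_TABLE")
    .option("user", dbutils.secrets.get("oracle", "user"))
    .option("password", dbutils.secrets.get("oracle", "password"))
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("batchsize", "10000")    # rows per JDBC batch
    .option("numPartitions", "8")    # parallel connections
    .mode("append")
    .save())
```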

5 More Replies
dkhodyriev1208
by New Contributor
  • 288 Views
  • 4 replies
  • 2 kudos

Spark SQL INITCAP not capitalizing letters after periods in abbreviations

Using SELECT INITCAP("text (e.g., text, text, etc.)"), abbreviations with periods like "e.g." are not being fully capitalized. Current behavior: Input: "text (e.g., text, text, etc.)" Output: "Text (e.g., Text, Text, Etc.)" Expected behavior: Output: "Text ...

Latest Reply
iyashk-DB
Databricks Employee
  • 2 kudos

Yes, similar to what @Coffee77 has said, you can alternatively create a SQL function with the custom logic using a regexp and use it directly: CREATE OR REPLACE FUNCTION PROPER_WITH_ABBREVIATIONS(input STRING) RETURNS STRING RETURN regexp_replace(INI...
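The reply's function body is truncated above; a hedged reconstruction of the idea only (the regexp and the choice of abbreviation are my assumptions, not the original):

```python
# Assumed reconstruction, not the original truncated UDF: apply INITCAP,
# then post-process one dotted abbreviation back to lowercase.
spark.sql(r"""
  CREATE OR REPLACE FUNCTION proper_with_abbreviations(input STRING)
  RETURNS STRING
  RETURN regexp_replace(INITCAP(input), 'Etc\\.', 'etc.')
""")

spark.sql(
    "SELECT proper_with_abbreviations('text (e.g., text, text, etc.)')"
).show(truncate=False)
```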

3 More Replies
Swathik
by New Contributor II
  • 124 Views
  • 1 reply
  • 0 kudos

Resolved! Best Practices for implementing DLT, Autoloader in Workflows

I am in the process of designing a Medallion architecture where the data sources include REST API calls, JSON files, SQL Server, and Azure Event Hubs. For the Silver and Gold layers, I plan to leverage Delta Live Tables (DLT). However, I am seeking gu...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The optimal approach for implementing the Bronze layer in a Medallion architecture with Delta Live Tables (DLT) involves balancing batch and streaming ingestion patterns, especially when combining DLT and Autoloader. The trigger(availableNow=True) op...
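As a concrete illustration of that pattern, a minimal sketch using plain Structured Streaming with Auto Loader (rather than DLT syntax); volume paths and the target table are hypothetical:

```python
# Sketch of batch-style Bronze ingestion with Auto Loader; paths and
# table names are hypothetical.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
    .load("/Volumes/main/landing/orders")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
    .trigger(availableNow=True)   # drain all available files, then stop
    .toTable("main.bronze.orders"))
```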

SupunK
by New Contributor
  • 107 Views
  • 1 reply
  • 1 kudos

Databricks always loads built-in BigQuery connector (0.22.2), can’t override with 0.43.x

I am using Databricks Runtime 15.4 (Spark 3.5 / Scala 2.12) on AWS. My goal is to use the latest Google BigQuery connector because I need the direct write method (BigQuery Storage Write API): option("writeMethod", "direct"). This allows writing directly ...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

There is no supported way on Databricks Runtime 15.4 to override or replace the built-in BigQuery connector to use your own version (such as 0.43.x) in order to access the direct write method. Databricks clusters come preloaded with their own managed...

Naveenkumar1811
by New Contributor II
  • 122 Views
  • 4 replies
  • 1 kudos

Reduce the Time for First Spark Streaming Run Kick off

Hi Team, currently I have a Silver Delta table (external) loading via streaming, and the Gold is on batch. I need to make the Gold Delta streaming as well. In my first run, I can see the stream-initializing process takes an hour or so, as my Silver ta...

Latest Reply
Naveenkumar1811
New Contributor II
  • 1 kudos

Hi stbjelcevic, our Silver source is loaded by a streaming process... The Gold right now runs as a 10-minute batch, and this is running in prod now... Since it is a prod scenario and the source is a streaming load, I am worried about the data loss we might g...

3 More Replies
Naveenkumar1811
by New Contributor II
  • 278 Views
  • 9 replies
  • 1 kudos

How do i Create a workspace object with SP ownership

Hi Team, I have a scenario where I have a JAR file (24 MB) to be put in a workspace directory, but the ownership should be associated with the SP rather than any individual ID. I tried the Databricks CLI export option, but it has a limitation of 10 MB max. Plea...

Latest Reply
Coffee77
Contributor III
  • 1 kudos

Inspecting the underlying HTTP traffic while using the Databricks UI to import files into the Workspace, it turns out (as expected) that the Databricks API is used, with requests similar to [screenshot omitted]. So, @Naveenkumar1811, use the Databricks API with the SP identity in a similar way, as expect...
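A hedged sketch of that API call from Python, authenticating as the SP; the host, token acquisition, and paths are placeholders, not from the thread:

```python
# Sketch: Workspace Import API with an SP bearer token.
# Host, token, and paths are placeholders.
import base64
import requests

host = "https://<workspace-url>"
token = "<sp-oauth-or-pat-token>"

with open("app.jar", "rb") as f:
    content = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Workspace/Shared/libs/app.jar",
        "format": "AUTO",
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()
```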

8 More Replies
Naveenkumar1811
by New Contributor II
  • 217 Views
  • 4 replies
  • 2 kudos

Resolved! SkipChangeCommit to True Scenario on Data Loss Possibility

Hi Team, I have the below scenario: I have a Spark Streaming job with a processing-time trigger of 3 secs, running continuously 365 days a year. We perform a weekly delete job on the source of this streaming job based on a custom retention policy. It is a D...
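For reference, the option named in the thread title is set on the streaming read; a minimal sketch with a hypothetical table name:

```python
# Sketch: stream from a Delta table while skipping commits that only
# update or delete existing rows (e.g., weekly retention deletes).
# The table name is hypothetical.
df = (spark.readStream
        .option("skipChangeCommits", "true")
        .table("main.silver.events"))
```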

Latest Reply
Naveenkumar1811
New Contributor II
  • 2 kudos

Hi szymon/Raman, my question was about the commit performed by the insert/append via my streaming job versus the delete operation by the weekly maintenance job... Is there a way that both transactions would fall into the same commit? I need to understand that por...

3 More Replies
oye
by New Contributor II
  • 227 Views
  • 4 replies
  • 3 kudos

Resolved! Using a cluster of type SINGLE_USER to run parallel python tasks in one job

Hi, I have set up a job with multiple Spark Python tasks running in parallel. I have set up only one job cluster: single node, data security mode SINGLE_USER, using Databricks Runtime version 14.3.x-scala2.12. These parallel Spark Python tasks share so...

Latest Reply
Raman_Unifeye
Contributor III
  • 3 kudos

@oye - A variable's scope is local to the individual task and does not interfere with other tasks, even if the underlying cluster is the same. In fact, the issue is normally the other way around: if we have to share a variable across tasks, then the solu...
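If a value does need to cross task boundaries within one job run, task values are the supported mechanism; a sketch with illustrative task and key names:

```python
# Runs inside job tasks; the task and key names are illustrative.
# In an upstream task:
dbutils.jobs.taskValues.set(key="row_count", value=123)

# In a downstream task, referencing the upstream task's key:
n = dbutils.jobs.taskValues.get(taskKey="ingest", key="row_count", default=0)
```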

3 More Replies
200649021
by New Contributor
  • 136 Views
  • 0 replies
  • 0 kudos

Data System & Architecture - PySpark Assignment

Title: Spark Structured Streaming – Airport Counts by Country. This notebook demonstrates how to set up a Spark Structured Streaming job in Databricks Community Edition. It reads new CSV files from a Unity Catalog volume, processes them to count airport...
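A minimal sketch of the job the notebook describes, assuming a hypothetical volume path and CSV schema:

```python
# Sketch only: the path and schema are assumptions.
from pyspark.sql import functions as F

airports = (spark.readStream
    .format("csv")
    .option("header", "true")
    .schema("id INT, name STRING, country STRING")
    .load("/Volumes/main/default/airports_in"))

counts = airports.groupBy("country").agg(F.count("*").alias("airport_count"))

(counts.writeStream
    .outputMode("complete")     # required for a full aggregation
    .format("memory")
    .queryName("airport_counts")
    .start())
```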

jitendrajha11
by New Contributor II
  • 257 Views
  • 4 replies
  • 1 kudos

Want to see logs for lineage view run events

Hi All, I need your help. As I run jobs they complete successfully. When I click on a job, there is a lineage > View run events option; when I click on it, I see the below steps. Job Started: the job is triggered. Waiting for Cluster: the job wait...

Latest Reply
jitendrajha11
New Contributor II
  • 1 kudos

Hi Team/Member, as I run jobs they complete successfully. When I click on a job, there is a lineage > View run events option; when I click on it, we find the below steps (screenshot also added). I want the logs for those stages, where I wil...

3 More Replies
Sainath368
by Contributor
  • 279 Views
  • 4 replies
  • 5 kudos

Resolved! Autoloader Managed File events

Hi all, we are in the process of migrating from directory listing to managed file events in Azure Databricks. Our data is stored in an Azure Data Lake container with the following folder structure (screenshot below). To enable file events in Unity Catalog (UC), I created...

[screenshot attached]
Latest Reply
Raman_Unifeye
Contributor III
  • 5 kudos

Recommended approach to continue your existing pattern: keep the External Location enabled for file events at the high-level path (/Landing), and run a separate Structured Streaming job for each table, specifying the full sub-path in the .load() function (...
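A sketch of one such per-table stream, with hypothetical paths under /Landing; the file-events option name here is my assumption from the managed file events documentation, so verify it against your runtime:

```python
# Sketch: one stream per table, loading a sub-path of the /Landing
# external location. Paths, account, and table names are hypothetical;
# the useManagedFileEvents option name should be verified against docs.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useManagedFileEvents", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/meta/_schemas/orders")
    .load("abfss://landing@myaccount.dfs.core.windows.net/Landing/orders")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/meta/_checkpoints/orders")
    .toTable("main.bronze.orders"))
```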

3 More Replies
smoortema
by Contributor
  • 259 Views
  • 3 replies
  • 4 kudos

Resolved! how to know which join type was used (broadcast, shuffle hash or sort merge join) for a query?

What is the best way to know which kind of join was used for a SQL query: broadcast, shuffle hash, or sort merge? How can the Spark UI or the query plan be interpreted?
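For context, the quickest check is the physical plan, which names the join node; a minimal sketch with placeholder table names:

```python
# The physical plan names the join strategy: look for BroadcastHashJoin,
# ShuffledHashJoin, or SortMergeJoin. The same node appears in the
# Spark UI's SQL/DataFrame tab for the query. Table names are placeholders.
df = (spark.table("main.demo.orders")
        .join(spark.table("main.demo.customers"), "customer_id"))
df.explain(mode="formatted")
```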

Latest Reply
Louis_Frolio
Databricks Employee
  • 4 kudos

@smoortema, Spark performance tuning is one of the hardest topics to teach or learn, and it’s even tougher to do justice to in a forum thread. That said, I’m really glad to see you asking the question. Tuning is challenging precisely because there a...

2 More Replies