Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

philHarasz
by New Contributor III
  • 3530 Views
  • 4 replies
  • 0 kudos

Resolved! Writing a small pyspark dataframe to a table is taking a very long time

My experience with Databricks pyspark up to this point has always been to execute a SQL query against existing Databricks tables, then write the resulting pyspark dataframe into a new table. For the first time, I am now getting data via an API which ...

Latest Reply
philHarasz
New Contributor III
  • 0 kudos

After reading the suggested documentation, I tried the "Parse nested XML (from_xml and schema_of_xml)" approach. I used this code from the doc: df = spark.createDataFrame([(8, xml_data)], ["number", "payload"]) schema = schema_of_xml(df.select("payload"...

3 More Replies
vaibhavaher2025
by New Contributor
  • 3274 Views
  • 2 replies
  • 2 kudos

Serverless compute vs Job cluster

Hi guys, for running a job with a varying workload, what should I use: serverless compute or a job cluster? What are the positives and negatives? (I'll be running my notebook from Azure Data Factory.)

Latest Reply
KaranamS
Contributor III
  • 2 kudos

It depends on the cost, performance, and startup time needed for your use case. Serverless compute is usually the preferred choice because of its fast startup time and dynamic scaling. However, if your workload is long-running and predictable, job compute with...

1 More Replies
Phani1
by Valued Contributor II
  • 1100 Views
  • 1 replies
  • 1 kudos

Databricks Vs Fabric use case

Hi team, we've noticed that for some use cases, customers are proposing an architecture with A) Fabric in the Gold layer and reporting in Azure Power BI, while using Databricks for the Bronze and Silver layers. However, we can also have the B) Gold lay...

Latest Reply
MariuszK
Valued Contributor III
  • 1 kudos

Keeping the Gold layer in Databricks and connecting it to Power BI is a good option. However, if you need some Fabric capabilities because your team prefers T-SQL, Direct Lake, Python notebooks, or low-code tools like Data Factory, MS Fabr...

dzsuzs
by New Contributor II
  • 2594 Views
  • 3 replies
  • 2 kudos

OOM Issue in Streaming with foreachBatch()

I have a stateless streaming application that uses foreachBatch. This function executes between 10-400 times each hour based on custom logic.  The logic within foreachBatch includes: collect() on very small DataFrames (a few megabytes) --> driver mem...

Latest Reply
gardnmi1983
New Contributor II
  • 2 kudos

Did you ever figure out what is causing the memory leak?  We are experiencing a nearly identical issue where the memory gradually increases over time and OOM after a few days.  I did track down this open bug ticket that states there is a memory leak ...

2 More Replies
robertomatus
by New Contributor II
  • 1526 Views
  • 3 replies
  • 1 kudos

Autoloader inferring struct as a string when reading JSON data

Hi everyone, trying to read JSON files with Auto Loader fails to infer the schema correctly; every nested or struct column is being inferred as a string. spark.readStream.format("cloudFiles") .option("cloudFiles.format", "json") .option("cloud...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 1 kudos

Hi @robertomatus, you're right: it would be much better if we didn't have to rely on workarounds. The reason Auto Loader infers schema differently from spark.read.json() is that it's optimized for streaming large-scale data efficiently. Unlike spark.re...
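A common workaround (a configuration sketch, not taken from this thread) is to tell Auto Loader to infer real column types and to pin known nested columns with schema hints. `cloudFiles.inferColumnTypes` and `cloudFiles.schemaHints` are documented Auto Loader options; the paths and column names below are hypothetical, and the snippet assumes a Databricks notebook where `spark` is defined.

```python
# Configuration sketch: make Auto Loader infer non-string types, and pin
# a known nested column with a schema hint instead of relying on inference.
# Paths and the "payload" column name are hypothetical examples.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")  # infer structs/ints instead of strings
    .option("cloudFiles.schemaHints", "payload STRUCT<id: BIGINT, name: STRING>")
    .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical schema-tracking path
    .load("/tmp/input")  # hypothetical input path
)
```

With `inferColumnTypes` left at its default, Auto Loader deliberately falls back to strings for JSON to avoid schema-evolution surprises, which matches the behavior described in the post.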

2 More Replies
Phani1
by Valued Contributor II
  • 2120 Views
  • 1 replies
  • 0 kudos

Databricks vs Snowflake use case comparison

Hi Databricks team, we see Databricks and Snowflake as very close in terms of features. When trying to convince customers of Databricks' products, we would like to know the key comparisons between Databricks and Snowflake by use case. Regards, Phani

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @Phani1, You can read these resources: https://www.databricks.com/databricks-vs-snowflake https://www.databricks.com/blog/2018/08/27/by-customer-demand-databricks-and-snowflake-integration.html

Katalin555
by New Contributor II
  • 989 Views
  • 2 replies
  • 0 kudos

df.isEmpty() and df.fillna(0).isEmpty() throws error

In our code we usually use a single-user cluster on 13.3 LTS with Spark 3.4.1 when loading data from a Delta table to Azure SQL Hyperscale, and we did not experience any issues, but starting last week our pipeline has been failing with the following er...

Latest Reply
Katalin555
New Contributor II
  • 0 kudos

Hi @Alberto_Umana, yes, I checked and did not see any other information. We are using Driver: Standard_DS5_v2 · Workers: Standard_E16a_v4 · 1-6 workers. At the stage where the pipeline fails, the shuffle information was: Shuffle Read Size / Records: 257...

1 More Replies
AnudeepKolluri
by New Contributor II
  • 862 Views
  • 4 replies
  • 0 kudos
Latest Reply
Jim_Anderson
Databricks Employee
  • 0 kudos

It looks like your completion email was distributed on Feb 04, but I will DM you the certification discount code again for your reference.

3 More Replies
nagsky1
by New Contributor II
  • 528 Views
  • 1 replies
  • 0 kudos

Resolved! Coupon code not received

I have completed the Databricks Learning Festival but haven't yet received the coupon code.

Latest Reply
Jim_Anderson
Databricks Employee
  • 0 kudos

@nagsky1 Please feel free to DM me with the email address associated with your Academy account so I can verify your participation.

LearnDB1234
by New Contributor III
  • 1325 Views
  • 4 replies
  • 0 kudos

How to store SQL query output columns as variables to be used as parameters for API data call in DAT

I have a SQL query which provides the output below: Select FirstName, LastName, Title From Default.Name → Tony Gonzalez Mr | Tom Brady Mr | Patricia Carroll Miss. I would like to store the FirstName, LastName & Title column output...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi @LearnDB1234, here is the approach: you can make your API call dynamic by first running your SQL query and storing the results in a DataFrame. Then loop through each row in the DataFrame and extract the FirstName and LastName values, pas...
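The loop described above can be sketched in plain Python. Here `rows` stands in for the collected query result (e.g. `spark.sql("SELECT FirstName, LastName, Title FROM Default.Name").collect()`), and `build_params` is a hypothetical helper showing how each row becomes one set of API call parameters; the parameter names are illustrative, not from the thread.

```python
# Sketch: the small query result has been collected to the driver as
# row-like dicts; each row is mapped to the parameters for one API call.
rows = [
    {"FirstName": "Tony", "LastName": "Gonzalez", "Title": "Mr"},
    {"FirstName": "Tom", "LastName": "Brady", "Title": "Mr"},
    {"FirstName": "Patricia", "LastName": "Carroll", "Title": "Miss"},
]

def build_params(row):
    # Map one result row to the query parameters for a single API request.
    return {
        "first_name": row["FirstName"],
        "last_name": row["LastName"],
        "title": row["Title"],
    }

# One parameter set per row; each would drive one call to the API client.
all_params = [build_params(r) for r in rows]
```

In a Databricks job you would pass each `build_params(r)` result to your HTTP client (e.g. as query parameters); collecting to the driver is fine here because the result set is only a handful of rows.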

3 More Replies
SOlivero
by New Contributor III
  • 776 Views
  • 1 replies
  • 0 kudos

Scheduling Jobs with Multiple Git Repos on a Single Job Cluster

Hi, I'm trying to create a scheduled job that runs notebooks from three different repos. However, since a job can only be associated with one repo, I've had to create three separate jobs and a master job that triggers them sequentially. This setup work...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi @SOlivero, try configuring a shared all-purpose cluster and set each job to use this existing cluster rather than creating new job-specific clusters, which keeps the cluster warm and avoids startup delays. Another option is to restructure you...

Erik
by Valued Contributor III
  • 5854 Views
  • 6 replies
  • 4 kudos

Resolved! Powerbi databricks connector should import column description

I posted this idea on ideas.powerbi.com as well, but it is quite unclear to me whether the Power BI Databricks connector is in fact made by MS or by Databricks, so I am posting it here too! It is possible to add comments/descriptions to Databricks database ...

Latest Reply
capstone
New Contributor II
  • 4 kudos

You can use this C# script in Tabular Editor to achieve this. Basically, all the comments can be accessed via the information_schema in Databricks. Import the relevant columns from the schema using this query: select * from samples.information_schem...

5 More Replies
NaeemS
by New Contributor III
  • 2285 Views
  • 2 replies
  • 0 kudos

Handling Aggregations in Feature Function

Hi, is it possible to handle aggregations using Feature Functions somehow? As we know, the logic defined in a feature function is applied to a single row when a join is performed. But do we have any mechanism to handle aggregations too, someho...

Data Engineering
Feature Functions
Feature Store
Latest Reply
rafaelsass
New Contributor II
  • 0 kudos

Hi @NaeemS! Have you managed to achieve this by any means? I'm facing the same question right now.

1 More Replies
Paul92S
by New Contributor III
  • 13262 Views
  • 6 replies
  • 5 kudos

Resolved! DELTA_EXCEED_CHAR_VARCHAR_LIMIT

Hi, I am having an issue loading source data into a Delta table / Unity Catalog. The error we are receiving is the following: grpc_message:"[DELTA_EXCEED_CHAR_VARCHAR_LIMIT] Exceeds char/varchar type length limitation. Failed check: (isnull(\'metric_...

Latest Reply
willflwrs
New Contributor III
  • 5 kudos

Setting this config before running the write command solved it for us: spark.conf.set("spark.sql.legacy.charVarcharAsString", True)

5 More Replies
DataEnginerrOO1
by New Contributor II
  • 2950 Views
  • 5 replies
  • 0 kudos

Access for delta lake with serverless

I have an issue when trying to use the command display(dbutils.fs.ls("abfss://test@test.dfs.core.windows.net")). When I execute the command on my personal cluster, it works, and I can see the files. Before that, I set the following configurations: spa...

Latest Reply
Rjdudley
Honored Contributor
  • 0 kudos

Can your serverless compute access any storage in that storage account? Something else to check is whether your NCC is configured correctly: Configure private connectivity from serverless compute - Azure Databricks | Microsoft Learn. However, if your se...

4 More Replies
