Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

seefoods
by Valued Contributor
  • 3644 Views
  • 6 replies
  • 7 kudos

Resolved! Auto Loader write strategy (APPEND, MERGE, UPDATE, COMPLETE, OVERWRITE)

Hello guys, I want to know whether operations like overwrite, merge, and update in a static write behave the same when we use Auto Loader. I'm confused about the behavior of output modes (complete, update, and append). After that, I want to know what the co...

Latest Reply
chanukya-pekala
Contributor III
  • 7 kudos

Thanks for the discussion. I have a tiny suggestion. Based on my experience working with streaming loads, I often find the checkpoint location hard to inspect for offset information, or to delete for a fresh load of data. Hence I h...

5 More Replies
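For readers skimming this thread: the complete and update output modes only apply to aggregating streaming queries; an Auto Loader file source writes with append, and merge/upsert logic is typically done inside foreachBatch. A minimal sketch, assuming a Databricks runtime where a SparkSession is available (all paths and table names are hypothetical placeholders):

```python
def start_autoloader_append(spark, source_dir, checkpoint_dir, target_table):
    """Minimal Auto Loader sketch: incrementally pick up new files from
    source_dir and append them to a Delta table. Only runs on a
    Databricks runtime; spark is passed in rather than imported here."""
    stream = (
        spark.readStream
        .format("cloudFiles")                                 # Auto Loader source
        .option("cloudFiles.format", "json")                  # format of the incoming files
        .option("cloudFiles.schemaLocation", checkpoint_dir)  # where the inferred schema is tracked
        .load(source_dir)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_dir)  # offset bookkeeping for exactly-once
        .outputMode("append")                          # file sources support append
        .toTable(target_table)
    )
```

For MERGE-style upserts, the usual pattern is replacing `.toTable(...)` with `.foreachBatch(...)` and running a Delta MERGE per micro-batch.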
SatyaKoduri
by New Contributor II
  • 1793 Views
  • 1 reply
  • 1 kudos

Resolved! YAML file to DataFrame

Hi, I'm trying to read YAML files using pyyaml and convert them into a Spark DataFrame with createDataFrame, without specifying a schema—allowing flexibility for potential YAML schema changes over time. This approach worked as expected on Databricks ...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 1 kudos

Hi @SatyaKoduri, this is a known issue with newer Spark versions (3.5+) that came with Databricks Runtime 15.4. Schema inference has become stricter and struggles with deeply nested structures like your YAML's nested maps. Here are a few solution...

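A workaround sketch for cases like this one: instead of handing pyyaml's nested dicts straight to createDataFrame, serialize each parsed document to a JSON line and let spark.read.json infer the nested schema, which tends to be more tolerant. The helper below is plain Python; the commented Spark usage at the end assumes a Databricks runtime and a hypothetical file path.

```python
import json

def docs_to_json_lines(docs):
    """Convert already-parsed YAML documents (Python dicts/lists) into
    JSON-lines text that spark.read.json can infer a nested schema from."""
    return "\n".join(json.dumps(doc) for doc in docs)

# Example with a nested document similar to a YAML config:
docs = [{"name": "job_a", "settings": {"retries": 3, "tags": ["etl", "daily"]}}]
print(docs_to_json_lines(docs))

# On Databricks (illustrative only; pyyaml and spark assumed available):
# import yaml
# parsed = list(yaml.safe_load_all(open("/Volumes/.../config.yaml")))
# df = spark.read.json(spark.sparkContext.parallelize([docs_to_json_lines(parsed)]))
```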
tuckera
by New Contributor
  • 558 Views
  • 1 reply
  • 0 kudos

Governance in pipelines

How does everyone track and deploy their pipelines and generated data assets? DABs? Terraform? Manual? Something else entirely?

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @tuckera, the data engineering landscape shows a pretty diverse mix of approaches for tracking and deploying pipelines and data assets, often varying by company size, maturity, and specific needs. Infrastructure as Code (IaC) tools like Terraform an...

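For comparison purposes, a minimal Databricks Asset Bundles sketch (hypothetical names and host) showing how a job can be declared in databricks.yml and deployed with `databricks bundle deploy`:

```yaml
# databricks.yml -- hypothetical minimal bundle
bundle:
  name: my_pipelines

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-1234567890.12.azuredatabricks.net  # placeholder

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../notebooks/ingest.py
```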
Edoa
by New Contributor
  • 1703 Views
  • 1 reply
  • 0 kudos

SFTP Connection Timeout on Job Cluster but Works on Serverless Compute

Hi all, I'm experiencing inconsistent behavior when connecting to an SFTP server using Paramiko in Databricks. When I run the code on Serverless Compute, the connection to xxx.yyy.com via SFTP works correctly. When I run the same code on a Job Cluster, ...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @Edoa, this is a common networking issue in Databricks related to the different network configurations between Serverless Compute and Job Clusters. Here are the key differences and potential solutions. Root cause: Serverless Compute runs in Databricks'...

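A sketch of the Paramiko side, with explicit timeouts so a blocked egress path fails fast instead of hanging. Hostname and credentials are placeholders, and for the Job Cluster case the actual fix is usually a firewall/NAT rule allowing outbound port 22 from the cluster's VNet, not a code change.

```python
def open_sftp(host, username, password, port=22, timeout=30):
    """Open an SFTP session with explicit timeouts. Requires paramiko
    (e.g. %pip install paramiko on Databricks); imported lazily so this
    file still loads where paramiko is absent. All arguments are
    placeholders for illustration."""
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        hostname=host,
        port=port,
        username=username,
        password=password,
        timeout=timeout,          # TCP connect timeout
        banner_timeout=timeout,   # SSH banner exchange timeout
        auth_timeout=timeout,     # authentication timeout
    )
    return client.open_sftp()
```

If this raises a timeout quickly on the Job Cluster but not on Serverless, that points at network egress rules rather than the SFTP server.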
oeztuerk82
by New Contributor II
  • 1461 Views
  • 2 replies
  • 3 kudos

Deletion of Resource Group on Azure and Impact on Databricks Workspace

Hello together, I would like to confirm the data retention and deletion behavior associated with an Azure Databricks workspace, particularly in the context of deleting the Azure resource group that the Databricks workspace resides in. Recently, I deleted an...

Latest Reply
SAKBAR
New Contributor II
  • 3 kudos

Once a resource group is deleted it cannot be recovered, just like ADLS, so it is not possible to restore the workspace or any resource under the resource group. Microsoft support may be able to recover it if you are under a premium plan with them. For the future, it is always bet...

1 More Reply
DarioB
by New Contributor III
  • 1873 Views
  • 1 reply
  • 1 kudos

Resolved! DAB for_each_task - Passing task values

I am trying to deploy a job with a for_each_task using DAB and Terraform and I am unable to properly pass the task value into the subsequent task.These are my job tasks definition in the YAML:      tasks:        - task_key: FS_batching          job_c...

Latest Reply
DarioB
New Contributor III
  • 1 kudos

We have been testing and found the issue (I just realized that my anonymization of the names removed the source of the error). We have tracked it down to the inputs parameter of the for_each_task. It seems that it is unable to reference task names with das...

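For anyone hitting the same thing, a hypothetical DAB fragment of a for_each_task reading a task value via the `{{tasks.<task_key>.values.<key>}}` dynamic reference; per the finding above, keeping task keys underscore-only (no dashes) looks like the safe choice. Names, paths, and the value key are all assumptions for illustration.

```yaml
tasks:
  - task_key: fs_batching            # underscores, not dashes
    notebook_task:
      notebook_path: ../notebooks/build_batches.py
  - task_key: process_batches
    depends_on:
      - task_key: fs_batching
    for_each_task:
      inputs: "{{tasks.fs_batching.values.batches}}"  # task value set upstream
      concurrency: 2
      task:
        task_key: process_one_batch
        notebook_task:
          notebook_path: ../notebooks/process_batch.py
          base_parameters:
            batch: "{{input}}"        # current element of the iteration
```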
alonisser
by Contributor II
  • 1722 Views
  • 4 replies
  • 0 kudos

Controlling the name of the downloaded csv file from a notebook

I've got a notebook with multiple display() commands in various cells, and users are currently downloading the result CSV from each cell. I want the downloads to be named after the cell (or any other method by which I can make each download have a dif...

Latest Reply
Isi
Honored Contributor III
  • 0 kudos

Hey @alonisser, once the file is stored in the volume (whether in S3, GCS, or ADLS) you'll be able to see it with a custom name defined by the customer or project. Additionally, the files may be saved in different folders, making it easier to identify...

3 More Replies
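A small sketch of the approach described above: write each cell's result to a Unity Catalog volume under a name derived from the cell, instead of relying on display()'s generic download name. The path builder is plain Python; the volume root and the to_csv usage are hypothetical.

```python
import posixpath

def volume_csv_path(volume_root, cell_name):
    """Build a per-cell target path so each exported CSV has a distinct,
    predictable name. volume_root is a placeholder UC volume path."""
    safe = cell_name.strip().lower().replace(" ", "_")
    return posixpath.join(volume_root, f"{safe}.csv")

print(volume_csv_path("/Volumes/main/default/exports", "Daily Revenue"))
# → /Volumes/main/default/exports/daily_revenue.csv

# On Databricks, instead of the display() download button (illustrative only):
# df.toPandas().to_csv(volume_csv_path("/Volumes/main/default/exports", "Daily Revenue"), index=False)
```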
korasino
by New Contributor II
  • 1395 Views
  • 2 replies
  • 0 kudos

Photon and Predictive I/O vs. Liquid Clustering

Hi, quick question about optimizing our Delta tables: Photon and Predictive I/O vs. Liquid Clustering (LC). We have UUIDv4 columns (random, high-cardinality) used in both WHERE uuid = … filters and joins. From what I understand, Photon (on Serverless wa...

Latest Reply
korasino
New Contributor II
  • 0 kudos

Hey, thanks for the reply. Could you share some documentation links around those bullet points in your answer? thanks!

1 More Reply
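For reference alongside this thread: liquid clustering is enabled per table with CLUSTER BY. A tiny helper that builds the DDL string (table and column names are placeholders; run the result with spark.sql on Databricks):

```python
def cluster_by_ddl(table, columns):
    """Return ALTER TABLE DDL enabling liquid clustering on the given
    columns, e.g. a high-cardinality UUID column used in point lookups
    and joins. Placeholder names; execute via spark.sql(ddl)."""
    cols = ", ".join(columns)
    return f"ALTER TABLE {table} CLUSTER BY ({cols})"

ddl = cluster_by_ddl("main.sales.events", ["uuid"])
print(ddl)  # ALTER TABLE main.sales.events CLUSTER BY (uuid)

# Existing data is re-laid-out incrementally by OPTIMIZE (illustrative):
# spark.sql("OPTIMIZE main.sales.events")
```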
seefoods
by Valued Contributor
  • 1839 Views
  • 3 replies
  • 1 kudos

Resolved! Build an Auto Loader PySpark job

Hello guys, I have built an ETL in PySpark which uses Auto Loader. I want to know the best way to use Auto Loader on Databricks, and the best way to vacuum checkpoint files on /Volumes. Hope to have your ideas about that. Cordially,

Latest Reply
seefoods
Valued Contributor
  • 1 kudos

Hello @intuz, thanks for your reply. Cordially

2 More Replies
yathish
by New Contributor II
  • 4464 Views
  • 6 replies
  • 0 kudos

Upstream request timeout in Databricks Apps when using the Databricks SQL connector

Hi, I am building an application in Databricks Apps. Sometimes when I try to fetch data using the Databricks SQL connector in an API, it takes time to hit the SQL warehouse, and if that time exceeds 60 seconds it gives an upstream timeout error. I h...

Latest Reply
epistoteles
Databricks Partner
  • 0 kudos

@Alberto_Umana Any news on this? I am having similar issues and am also using a (running) serverless SQL warehouse.

5 More Replies
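A sketch of the query side, assuming the databricks-sql-connector package; hostname, HTTP path, and token are placeholders. Retrying helps with transient failures, but the usual fix for a ~60 s upstream timeout in Databricks Apps is moving long queries out of the request/response path (background job plus polling, or caching), not retrying harder.

```python
def run_query(server_hostname, http_path, access_token, query, attempts=3):
    """Run a query via the Databricks SQL connector, retrying transient
    failures. Imported lazily so this file loads without the connector
    installed; all connection arguments are placeholders."""
    from databricks import sql  # pip install databricks-sql-connector

    last_err = None
    for _ in range(attempts):
        try:
            with sql.connect(server_hostname=server_hostname,
                             http_path=http_path,
                             access_token=access_token) as conn:
                with conn.cursor() as cur:
                    cur.execute(query)
                    return cur.fetchall()
        except Exception as err:  # deliberately broad for a sketch
            last_err = err
    raise last_err
```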
EndreM
by New Contributor III
  • 3017 Views
  • 8 replies
  • 1 kudos

Replay a stream after converting to liquid clustering fails

I have a problem replaying a stream. I need to replay it because conversion from liquid clustering to partitioning doesn't work. I see a lot of garbage collection, and memory maxes out immediately; then the driver restarts. To debug the problem I try to force only ...

Latest Reply
EndreM
New Contributor III
  • 1 kudos

After increasing the compute to one with 500 GB of memory, the job was able to transfer ca. 300 GB of data, but it produced a large number of files: 26,000. The old partitioned table without liquid clustering had 4,000 files with a total of 1.2 TB of ...

7 More Replies
anilsampson
by New Contributor III
  • 878 Views
  • 1 reply
  • 0 kudos

Resolved! databricks dashboard via workflow task.

Hello, I am trying to trigger a Databricks dashboard via a workflow task. 1. When I deploy the job triggering the dashboard task via the local "Deploy bundle" command, deployment is successful. 2. When I try to deploy to a different environment via CI/CD while ...

Latest Reply
anilsampson
New Contributor III
  • 0 kudos

I think I figured out the issue; it had to do with the version of the CLI. I updated the CI/CD to use the latest version of the CLI: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh

arendon
by New Contributor II
  • 1516 Views
  • 2 replies
  • 1 kudos

Resolved! Asset Bundles: How to mute job failure notifications until final retry?

I'm trying to configure a job to only send failure notifications on the final retry failure (not on intermediate retry failures). This feature is available in the Databricks UI as "Mute notifications until the last retry", but I can't get this to wor...

Latest Reply
arendon
New Contributor II
  • 1 kudos

Thank you for the response, @lingareddy_Alva! I'll take a look at the workarounds you shared.

1 More Reply
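For later readers: the UI toggle "Mute notifications until the last retry" appears to correspond to a task-level notification setting in the Jobs API. A hypothetical DAB fragment, with field names worth verifying against the current Jobs API reference:

```yaml
resources:
  jobs:
    my_job:
      name: my-job
      tasks:
        - task_key: main
          max_retries: 3
          notebook_task:
            notebook_path: ../notebooks/main.py
          email_notifications:
            on_failure:
              - alerts@example.com   # placeholder address
          notification_settings:
            alert_on_last_attempt: true  # suppress alerts for intermediate retries
```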
dataminion01
by New Contributor II
  • 815 Views
  • 1 reply
  • 0 kudos

create streaming table using variable file path

Is it possible to use a variable for the file path based on dates? Files are stored in folders in the format yyyy/mm. CREATE OR REFRESH STREAMING TABLE test AS SELECT * FROM STREAM read_files("/Volumes/.....", format => "parquet");

Latest Reply
Rishabh-Pandey
Databricks MVP
  • 0 kudos

@dataminion01 Yes, it is possible to use a variable or dynamic file path based on dates in some data processing frameworks, but not directly in static SQL DDL statements like CREATE OR REFRESH STREAMING TABLE unless the environment you're working e...

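Complementing the reply above: if the streaming table is defined in Python (e.g. a pipeline notebook) rather than static SQL, the yyyy/mm path can be computed first and passed in. A plain-Python path builder, with a placeholder volume root:

```python
from datetime import date

def monthly_source_path(base, day=None):
    """Build a yyyy/mm folder path under a (placeholder) volume root,
    suitable for passing to read_files or an Auto Loader load()."""
    day = day or date.today()
    return f"{base}/{day.year:04d}/{day.month:02d}"

print(monthly_source_path("/Volumes/main/default/raw", date(2025, 6, 10)))
# → /Volumes/main/default/raw/2025/06
```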
Akshay_Petkar
by Valued Contributor
  • 2137 Views
  • 1 reply
  • 1 kudos

How to Convert MySQL SELECT INTO OUTFILE and LOAD DATA INFILE to Databricks SQL?

Hi Community, I have some existing MySQL code: SELECT * FROM [table_name] INTO OUTFILE 'file_path' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'; LOAD DATA INFILE 'file_path' REPLACE INTO TABLE [database].[table_name] FIELDS ...

Latest Reply
krishnakhadka28
New Contributor II
  • 1 kudos

Databricks SQL does not directly support MySQL’s SELECT INTO OUTFILE or LOAD DATA INFILE syntax. However, equivalent functionality can be achieved using Databricks features like saving to and reading from external locations like dbfs, s3 etc. I have ...

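A sketch of what rough Databricks SQL equivalents could look like, built as strings to pass to spark.sql; paths, table names, and the exact read_files/CSV option names are assumptions to verify against the docs.

```python
def export_stmt(table, target_dir):
    """Rough analogue of SELECT ... INTO OUTFILE: write the table out
    as tab-delimited CSV to an external location (placeholder path)."""
    return (f"INSERT OVERWRITE DIRECTORY '{target_dir}' "
            f"USING CSV OPTIONS (header 'true', delimiter '\\t') "
            f"SELECT * FROM {table}")

def load_stmt(table, source_dir):
    """Rough analogue of LOAD DATA INFILE ... REPLACE: read the files
    back and overwrite the target table (COPY INTO is the
    append/idempotent-load alternative)."""
    return (f"INSERT OVERWRITE {table} "
            f"SELECT * FROM read_files('{source_dir}', "
            f"format => 'csv', header => true, sep => '\\t')")

print(export_stmt("main.db.src", "/Volumes/main/db/export"))
print(load_stmt("main.db.dst", "/Volumes/main/db/export"))
```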