Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Linda22
by New Contributor II
  • 3766 Views
  • 7 replies
  • 5 kudos

Can we execute a single task in isolation from a multi-task Databricks job?

A task may be used to process some data. If we have 10 such tasks in a job and we want to process only a couple of datasets through a couple of those tasks, is that possible?

Latest Reply
slimbnsalah
New Contributor II
  • 5 kudos

Generally available!

6 More Replies
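A minimal sketch of how selective task re-runs can be driven programmatically: the Jobs API repair-run endpoint accepts a `rerun_tasks` list, so only the named task keys of an existing run are re-executed. The run ID and task names below are invented for illustration; check the endpoint details for your workspace.

```python
import json

def build_repair_payload(run_id: int, task_keys: list[str]) -> str:
    """Build the JSON body for POST /api/2.1/jobs/runs/repair,
    re-running only the named tasks of an existing job run."""
    payload = {
        "run_id": run_id,          # the run whose tasks we want to re-execute
        "rerun_tasks": task_keys,  # only these task keys are re-run
    }
    return json.dumps(payload)

# Example: re-run just two of the job's ten tasks (hypothetical names)
body = build_repair_payload(1234, ["ingest_dataset_a", "ingest_dataset_b"])
print(body)
```

The same selection is available interactively via "Repair run" on a job run page.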
Livingstone
by New Contributor II
  • 1669 Views
  • 5 replies
  • 3 kudos

Install a Maven package on a serverless cluster

My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, ...

Latest Reply
BigRoux
Databricks Employee
  • 3 kudos

As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities. However, there are alternative approaches to export data to Excel with minimal latency. Solutions to Export Excel Fi...

4 More Replies
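One practical detail for any Excel export path that avoids spark-excel: an xlsx worksheet holds at most 1,048,576 rows, so large results usually need to be split into sheet-sized chunks before being handed to a writer such as openpyxl or pandas `to_excel`. A stdlib-only sketch of the chunking step (the writer call itself is omitted):

```python
EXCEL_MAX_ROWS = 1_048_576  # hard per-worksheet row limit in the xlsx format

def sheet_chunks(rows, max_rows=EXCEL_MAX_ROWS - 1):
    """Yield lists of rows, each small enough for one worksheet
    (one row is reserved for the header)."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == max_rows:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Example: 2.5 "sheets" worth of rows, with a tiny limit for illustration
chunks = list(sheet_chunks(range(25), max_rows=10))
print([len(c) for c in chunks])  # [10, 10, 5]
```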
VKe
by New Contributor III
  • 2299 Views
  • 5 replies
  • 5 kudos

Issue with HTML Table Styling in Databricks Alerts

Hi Community,I’m trying to create an alert in Databricks with a custom email notification that includes the results of a SQL query displayed in an HTML table. However, I am facing issues with styling the table, specifically with adding borders and ba...

Latest Reply
skyatall
New Contributor II
  • 5 kudos

I am facing the same issue. When I use {{#QUERY_RESULT_ROWS}} and {{/QUERY_RESULT_ROWS}}, it gives me "Unable to display preview, an invalid template was provided".

4 More Replies
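Separately from the template error, a pattern that tends to survive email rendering: put borders and backgrounds as inline `style` attributes on every cell rather than in a `<style>` block, which many email clients strip. A small Python sketch of generating such rows (in a real alert template, the mustache `{{#QUERY_RESULT_ROWS}}` section would wrap the row markup):

```python
from html import escape

def styled_row(cells, header=False):
    """Render one table row with inline CSS on each cell, since email
    renderers commonly ignore <style> blocks."""
    tag = "th" if header else "td"
    cell_style = "border:1px solid #ccc;padding:4px 8px;"
    inner = "".join(
        f'<{tag} style="{cell_style}">{escape(str(c))}</{tag}>' for c in cells
    )
    return f"<tr>{inner}</tr>"

table = (
    '<table style="border-collapse:collapse;">'
    + styled_row(["name", "count"], header=True)
    + styled_row(["events", 42])
    + "</table>"
)
print(table)
```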
krishnakmr512
by New Contributor
  • 152 Views
  • 1 reply
  • 1 kudos

Resolved! Missed my certification Exam Reschedule is required

Hi Team, @data_help @helpdesk @Cert-Team @Cert-TeamOPS I missed my certification exam yesterday due to an emergency. Is there any possibility it can be rescheduled to any time today or tomorrow? I am not able to reschedule opti...

Data Engineering
@Cert-Team
Latest Reply
Cert-Team
Databricks Employee
  • 1 kudos

@krishnakmr512 Usually the fastest way to get assistance is by filing a ticket with our support team. I was able to reschedule your exam to a future date. Please log into your account and reschedule to a date and time that suits you.

db_eswar
by New Contributor
  • 187 Views
  • 2 replies
  • 1 kudos

What is iowait, and will it impact the performance of my job?

One job is taking more than 7 hrs; when I added the configuration below it took under 2:30, but after deployment with the same parameters it is taking 7+ hrs again. 1) spark.conf.set("spark.sql.shuffle.partitions", 500) --> spark.conf.set("spark.sql.shuffle.parti...

Latest Reply
SP_6721
New Contributor III
  • 1 kudos

Hi @db_eswar, high iowait in your Spark jobs is probably caused by storage or disk bottlenecks, not CPU or memory issues. The slowdown you're seeing could be due to a cold cache, slower disks, or increased resource usage. To troubleshoot, you can use t...

1 More Replies
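For context on what the metric means: iowait is CPU time spent idle while outstanding disk I/O is pending, and on Linux it can be read from the cumulative tick counters in /proc/stat. A stdlib sketch of computing the iowait share from one such line (the tick counts below are invented):

```python
def iowait_fraction(proc_stat_cpu_line: str) -> float:
    """Compute the iowait share of total CPU time from a /proc/stat
    'cpu' line: user nice system idle iowait irq softirq steal ..."""
    fields = proc_stat_cpu_line.split()
    assert fields[0].startswith("cpu")
    times = [int(x) for x in fields[1:]]
    iowait = times[4]            # the 5th counter is iowait (in ticks)
    return iowait / sum(times)

# Sample line with made-up counters: 15% of CPU time spent waiting on I/O
line = "cpu 1000 0 500 7000 1500 0 0 0 0 0"
print(round(iowait_fraction(line), 2))  # 0.15
```

Comparing two snapshots of these counters gives the iowait rate over an interval, which is what tools like iostat report.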
AlessandroM
by New Contributor
  • 191 Views
  • 1 replies
  • 1 kudos

PySpark Structured Streaming job doesn't unpersist DataFrames

Hi community, I am currently developing a PySpark job (running on Runtime 14.3 LTS) using Structured Streaming. Our streaming job uses foreachBatch, and inside it we call persist (and a subsequent unpersist) on two DataFrames. We are noticing fro...

Latest Reply
BigRoux
Databricks Employee
  • 1 kudos

The issue you’re encountering—where unpersist() does not seem to release memory for persisted DataFrames in your Structured Streaming job—likely relates to nuances of the Spark caching mechanism and how it interacts with the lifecycle of micro-batch ...

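One commonly suggested mitigation (not confirmed as the fix for this specific report) is to make the release synchronous with `unpersist(blocking=True)` after every action that reuses the cached frame has finished, so executors drop the blocks before the next micro-batch starts. A Spark-free sketch of the call pattern, using a stub DataFrame so it runs anywhere:

```python
class StubDF:
    """Minimal stand-in for a Spark DataFrame that records cache calls."""
    def __init__(self):
        self.calls = []
    def persist(self):
        self.calls.append("persist")
        return self
    def count(self):
        self.calls.append("count")
        return 0
    def unpersist(self, blocking=False):
        self.calls.append(f"unpersist(blocking={blocking})")

def process_batch(df, batch_id):
    # Cache once, run all actions that reuse the frame, then release
    # synchronously -- the try/finally guarantees the unpersist happens
    # even if an action in the batch fails.
    df.persist()
    try:
        df.count()  # ...all work that reuses the cached data goes here
    finally:
        df.unpersist(blocking=True)

df = StubDF()
process_batch(df, batch_id=0)
print(df.calls)
```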
RameshChejarla
by New Contributor II
  • 195 Views
  • 2 replies
  • 1 kudos

Databricks

Hi Everyone, I have implemented Auto Loader and it is working as expected. I need to track the files which are loaded into the stage table. Here is the issue: the file tracking table needs to be created in Snowflake, and from there I need to track the files. How to connect data...

Latest Reply
RameshChejarla
New Contributor II
  • 1 kudos

Thanks for your reply, will try and let you know

1 More Replies
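For the Snowflake side of the question: Databricks can write through the Snowflake Spark connector, whose standard options include sfUrl, sfUser, sfDatabase, sfSchema and sfWarehouse. A sketch of assembling those options before a `df.write.format("snowflake")` call; all values below are placeholders, and in practice the credentials should come from a secret scope, not literals.

```python
def snowflake_options(url, user, password, database, schema, warehouse):
    """Option map for the Snowflake Spark connector (format 'snowflake')."""
    return {
        "sfUrl": url,
        "sfUser": user,
        "sfPassword": password,   # in practice: dbutils.secrets.get(...)
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }

opts = snowflake_options(
    "myaccount.snowflakecomputing.com", "loader", "***",
    "TRACKING", "PUBLIC", "LOAD_WH",
)
# Usage sketch (on a cluster with the connector installed):
# (df.write.format("snowflake").options(**opts)
#    .option("dbtable", "FILE_TRACKING").mode("append").save())
print(sorted(opts))
```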
ankit001mittal
by New Contributor III
  • 478 Views
  • 8 replies
  • 3 kudos

DLT Query History

Hi guys, I can see that DLT pipelines have a query history section where we can see the duration of each table and the number of rows read. Is this information stored somewhere in the system catalog? Can I query this information?

Latest Reply
aayrm5
Honored Contributor
  • 3 kudos

Hey @ankit001mittal - if any of the above responses answered your question, kindly mark it as the solution. Thanks,

7 More Replies
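Much of what the UI shows is also available in the DLT event log, which can be queried with the `event_log()` table-valued function; per-flow row counts appear in `flow_progress` events. A sketch that only builds the SQL text (the table name is a placeholder, and the exact `details` field paths may vary by runtime, so inspect your own event log's `details` column first):

```python
def dlt_flow_progress_sql(pipeline_table: str) -> str:
    """Build a SQL query over the DLT event log for per-flow row counts.
    Field paths under `details` are an assumption -- verify against your
    own event log before relying on them."""
    return f"""
        SELECT timestamp,
               origin.flow_name,
               details:flow_progress.metrics.num_output_rows AS num_output_rows
        FROM event_log(TABLE({pipeline_table}))
        WHERE event_type = 'flow_progress'
        ORDER BY timestamp DESC
    """

sql = dlt_flow_progress_sql("my_catalog.my_schema.my_table")
print(sql)  # pass to spark.sql(sql) on Databricks
```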
Prashanth24
by New Contributor III
  • 178 Views
  • 2 replies
  • 0 kudos

Databricks Autoloader processing old files

I have implemented Databricks Auto Loader and found that every time I execute the code, it still reads all old existing files plus the new files. As per the concept of Auto Loader, it should read and process only new files. Below is the code. Please hel...

Latest Reply
RameshChejarla
New Contributor II
  • 0 kudos

Hi Prashanth, Auto Loader is reading only new files for me; can you please go through the below script? df = (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv").option("cloudFiles.schemaLocation", "path").option("recursiveFileLooku...

1 More Replies
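A common cause of this symptom is the checkpoint: Auto Loader only tracks already-ingested files in the stream's checkpoint location, so if that path changes (or is deleted) between runs, every run starts from scratch. The `cloudFiles.includeExistingFiles` option additionally controls whether the first run backfills pre-existing files. A sketch of the relevant options (paths are placeholders):

```python
def autoloader_read_options(fmt: str, schema_location: str,
                            include_existing: bool = False) -> dict:
    """Auto Loader reader options. Incremental behaviour comes from
    reusing the SAME checkpointLocation on the writeStream across runs;
    includeExistingFiles only affects the very first run's backfill."""
    return {
        "cloudFiles.format": fmt,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.includeExistingFiles": str(include_existing).lower(),
    }

opts = autoloader_read_options("csv", "/Volumes/ckpt/schema")
print(opts["cloudFiles.includeExistingFiles"])  # false
# Usage sketch:
#   (spark.readStream.format("cloudFiles").options(**opts).load(src_path)
#        .writeStream.option("checkpointLocation", "/Volumes/ckpt/stream")
#        .toTable("bronze.events"))
```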
kaushalshelat
by New Contributor II
  • 273 Views
  • 2 replies
  • 4 kudos

Resolved! I cannot see the output when using pandas_api() on spark dataframe

Hi all, I started learning Spark and Databricks recently, along with Python. While running the below lines of code it did not throw any error and seemed to run OK, but it didn't show me any output either: test = cust_an_inc1.pandas_api(); test.show(), where cust_an_inc1 is...

Latest Reply
aayrm5
Honored Contributor
  • 4 kudos

Hi @kaushalshelat Ideally, `test.show()` should've thrown an error, as test is a pandas-on-Spark DataFrame now. `.show()` is a Spark DataFrame method and won't work with pandas. If you want to see a subset of the data, try `.head()` or `.tail(n)` rather than `.show...

1 More Replies
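The distinction is easy to see with plain pandas standing in for the pandas-API object returned by `pandas_api()` (the column names below are invented): inspection happens through pandas-style methods like `.head()`, and there is no `.show()` at all.

```python
import pandas as pd

# Stand-in for the object you get back from df.pandas_api(): it follows
# the pandas API, so there is no Spark-style .show() method on it.
pdf = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

print(pdf.head(2))                 # first rows, pandas style
print(hasattr(pdf, "show"))        # False: .show() belongs to Spark DataFrames
```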
jeremy98
by Contributor III
  • 338 Views
  • 4 replies
  • 0 kudos

how to fallback the entire job in case of failure of the cluster?

Hi community, my team and I are using a job that is triggered based on dynamic scheduling, with the schedule defined within some of the job's tasks. However, this job is attached to a cluster that is always running and never terminated. I understand th...

Latest Reply
aayrm5
Honored Contributor
  • 0 kudos

Hey @jeremy98 Have you had a chance to experiment with the Databricks serverless offering? Serverless spin-up times are around ~1 min, and it has built-in autoscaling based on the workload, which seems a good fit for your use case. Check out more info f...

3 More Replies
suja
by New Contributor
  • 149 Views
  • 1 reply
  • 0 kudos

Exploring parallelism for multiple tables

I am new to Databricks. The app we need to build reads from Hive tables, goes through bronze, silver and gold layers, and stores the results in relational DB tables. There are multiple Hive tables with no dependencies. What is the best way to achieve parallelism? Do w...

Latest Reply
LRALVA
Honored Contributor
  • 0 kudos

Hi @suja Use Databricks Workflows (Jobs) with task parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for a table. Tasks can: 1. Run in parallel ...

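If the thread-based route mentioned in the question is ever needed instead of separate job tasks, it can be sketched with the stdlib: Spark actions submitted from separate threads of one driver run as concurrent jobs on the same cluster. The table names and the per-table function below are placeholders for the real bronze-to-gold pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

TABLES = ["orders", "customers", "payments"]  # hypothetical Hive tables

def process_table(name: str) -> str:
    # Placeholder for the real bronze -> silver -> gold work for one table.
    return f"{name}:done"

# Independent tables are processed concurrently; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_table, TABLES))

print(results)  # ['orders:done', 'customers:done', 'payments:done']
```

Separate job tasks remain the more observable option, since each table then gets its own retries, timeouts, and run history.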
ABINASH
by New Contributor
  • 185 Views
  • 1 reply
  • 0 kudos

Flattening VARIANT column.

Hi Team, I am facing an issue: I have a JSON file which is around 700 KB and contains only 1 record, so after reading the data and flattening the file the record count is now 620 million. Now while I am writing the DataFrame into Delta Lake it is taking ...

Latest Reply
samshifflett46
New Contributor II
  • 0 kudos

Hey @ABINASH, the JSON file being flattened to 620 million records suggests the area to optimize is the structure of the JSON file itself. My initial thought is that the JSON file is extremely nested, which is causing a large amount of redundant...

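To make the 700 KB to 620 million jump concrete: flattening sibling arrays takes their cartesian product, so row counts multiply at every exploded level. A stdlib sketch of a single-level version of that explosion (the sample record is invented):

```python
from itertools import product

def explode_record(record: dict) -> list[dict]:
    """Flatten one record by taking the cartesian product of its
    top-level list fields -- sibling arrays multiply, which is why a
    tiny nested document can flatten into millions of rows."""
    scalars = {k: v for k, v in record.items() if not isinstance(v, list)}
    lists = {k: v for k, v in record.items() if isinstance(v, list)}
    rows = []
    for combo in product(*lists.values()):
        row = dict(scalars)                 # repeat the scalar fields
        row.update(zip(lists.keys(), combo))  # one element per array
        rows.append(row)
    return rows

# One record with three sibling arrays of lengths 3, 4 and 5 -> 60 rows.
rec = {"id": 1, "a": [1, 2, 3], "b": list("wxyz"), "c": list(range(5))}
print(len(explode_record(rec)))  # 60
```

Exploding only the arrays that are actually needed downstream, or keeping unrelated arrays in separate output tables, avoids the cross product entirely.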
sondergaard
by New Contributor II
  • 304 Views
  • 2 replies
  • 0 kudos

Simba ODBC driver // .Net Core

Hi, I have been looking into the Simba Spark ODBC driver to see if it can simplify our integration with .NET Core. The first results were promising, but when I started to process larger queries I noticed out-of-memory exceptions in the conta...

Latest Reply
Rjdudley
Honored Contributor
  • 0 kudos

Something we're considering for a similar purpose (.NET Core service pulling data from Databricks) is the ADO.NET connector from CData: Databricks Driver: ADO.NET Provider | Create & integrate .NET apps

1 More Replies
van45678
by New Contributor
  • 702 Views
  • 1 reply
  • 0 kudos

Getting connection reset issue while connecting to a SQL server

Hello All, I am unable to connect from Databricks to a SQL Server instance that is installed in an on-premises network. I am able to successfully reach the server from the notebook using this command [nc -vz <hostname> <port>], which means I am able to e...

Data Engineering
Databricks
sqlserver
timeout
Latest Reply
Kebadu
New Contributor II
  • 0 kudos

Hi, I ran into a similar problem. Were you able to find a solution? Thanks

