Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Hubert-Dudek
by Esteemed Contributor III
  • 6081 Views
  • 1 reply
  • 3 kudos

Workflow timeout

Always set a timeout for your jobs! It not only safeguards against unforeseen hang-ups but also optimizes resource utilization. Equally essential is to consider having a threshold warning. This can alert you before a potential failure, allowing proac...
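For reference, both a job-level timeout and a duration warning can be set in the job settings. Below is a minimal sketch using the Jobs 2.1 REST API as I understand it; the workspace URL, token, job name, notebook path, cluster ID, and threshold values are placeholders, not recommendations.

import requests

HOST = "https://<workspace>.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                    # placeholder token

job_settings = {
    "name": "nightly-etl",                           # hypothetical job name
    "timeout_seconds": 3600,                         # hard stop after 1 hour
    "health": {                                      # warn before the hard timeout
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 2700}
        ]
    },
    "tasks": [{
        "task_key": "main",
        "notebook_task": {"notebook_path": "/Repos/etl/main"},   # hypothetical path
        "existing_cluster_id": "<cluster-id>",
    }],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=job_settings)
resp.raise_for_status()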

Latest Reply
jose_gonzalez
Moderator
  • 3 kudos

Thank you for sharing this @Hubert-Dudek 

  • 3 kudos
YSDPrasad
by New Contributor III
  • 4171 Views
  • 4 replies
  • 3 kudos

Resolved! NoClassDefFoundError: scala/Product$class

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import com.microsoft.azure.sqldb.spark.query._
val query = "Truncate table tablename"
val config = Config(Map(
  "url" -> dbutils.secrets.get(scope = ...
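This error is typically a Scala version mismatch: the azure-sqldb-spark connector is, as far as I know, only published for Scala 2.11, while recent Databricks runtimes use Scala 2.12. One workaround is Spark's built-in JDBC source, which needs no extra library. A rough sketch only; the server, database, secret scope/keys, and table names are placeholders, and spark/dbutils are the notebook globals:

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
user = dbutils.secrets.get(scope="my-scope", key="sql-user")          # placeholder scope/keys
password = dbutils.secrets.get(scope="my-scope", key="sql-password")

# Hypothetical Delta source to export into Azure SQL.
df = spark.table("main.bronze.events_to_export")

# mode("overwrite") + truncate keeps the SQL table definition and empties it before loading,
# which covers the "Truncate table" step from the original snippet.
(df.write.format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.tablename")
   .option("user", user)
   .option("password", password)
   .option("truncate", "true")
   .mode("overwrite")
   .save())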

Latest Reply
Kaniz_Fatma
Community Manager
  • 3 kudos

Hi @Someswara Durga Prasad Yaralgadda, we haven't heard from you since the last response from @Suteja Kanuri, and I was checking back to see if her suggestions helped you. Otherwise, if you have any solution, please share it w...

  • 3 kudos
3 More Replies
adriennn
by Contributor
  • 1405 Views
  • 2 replies
  • 1 kudos

Resolved! Delay when updating Bronze and Silver tables in the same notebook (DBR 13.1)

I created a notebook that uses Autoloader to load data from storage and append it to a bronze table in the first cell; this works fine and Autoloader picks up new data when it arrives (the notebook is run using a Job). In the same notebook, a few cell...
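One pattern that usually removes this delay (a sketch only; the paths, file format, and table names below are made up) is to run the Autoloader stream with an availableNow trigger and block on it, so the later cells read a bronze table that is already fully updated:

# Bronze: ingest whatever is currently available, then stop.
bronze_query = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/landing/_schema")
         .load("/mnt/landing/events")
         .writeStream
         .option("checkpointLocation", "/mnt/bronze/_checkpoint/events")
         .trigger(availableNow=True)
         .toTable("bronze.events")
)
bronze_query.awaitTermination()   # block so the next cells see the new bronze rows

# Silver: runs only after the bronze load has finished.
(spark.read.table("bronze.events")
      .where("event_type = 'purchase'")        # hypothetical transformation
      .write.mode("append").saveAsTable("silver.purchases"))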

Latest Reply
adriennn
Contributor
  • 1 kudos

Thanks @Kaniz_Fatma, in a case where it's not possible or not practical to implement a pipeline with DLTs, what would that "retry mechanism" be based on? I.e., is there an API other than the table history that can be leveraged to retry until "it wo...

  • 1 kudos
1 More Replies
Nino
by Contributor
  • 1103 Views
  • 2 replies
  • 1 kudos

cluster nodes unavailable scenarios

Concerning job cluster configuration, I'm trying to figure out what happens if AWS node type availability is smaller than the minimum number of workers specified in the configuration JSON (either availability < num_workers or, for autoscaling, availabil...
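For context, these are the fields involved; the values below are only a sketch (placeholders, not recommendations), showing a fixed-size cluster versus autoscaling and the AWS attributes that control fallback behaviour when spot capacity is short:

# Illustrative job-cluster settings, in the Jobs/Clusters API shape.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,                            # fixed size...
    # "autoscale": {"min_workers": 2, "max_workers": 8},   # ...or autoscaling instead
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",    # fall back to on-demand if spot is short
        "first_on_demand": 1,                    # keep the driver on an on-demand node
        "zone_id": "auto",                       # let Databricks pick a zone with capacity
    },
}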

Latest Reply
Nino
Contributor
  • 1 kudos

Thanks, @Kaniz_Fatma, useful info! My specific scenario is running a notebook task with Job Clusters, and I've noticed that I get the best overall notebook run time by going without Autoscaling, setting the cluster configuration with a fixed `num_wor...

  • 1 kudos
1 More Replies
DE-cat
by New Contributor III
  • 1249 Views
  • 1 reply
  • 1 kudos

Resolved! DatabricksStreamingQueryListener Stopping the stream

I am running the following structured streaming Scala code in a DB 13.3 LTS job:
val query = spark.readStream.format("delta")
  .option("ignoreDeletes", "true")
  .option("maxFilesPerTrigger", maxEqlPerBatch)
  .load(tblPath)
  .writeStream
  .qu...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @DE-cat,
  • The given code is structured streaming Scala code that reads data from a Delta table, processes it, and writes the output to a streaming sink.
  • The job gets cancelled around 30 minutes after starting with error messages like DAGSche...

  • 1 kudos
Fiona
by New Contributor II
  • 3047 Views
  • 3 replies
  • 1 kudos

Resolved! Reading a protobuf file in a Databricks notebook

I have proto files (offline data storage) that I'd like to read from a Databricks notebook. I found this documentation (https://docs.databricks.com/structured-streaming/protocol-buffers.html), but it only covers how to read the protobuf data once the...

Latest Reply
StephanK
New Contributor II
  • 1 kudos

If you have proto files in offline data storage, you should be able to read them with:
input_df = spark.read.format("binaryFile").load(data_path)
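To actually decode the messages after the binaryFile read, recent runtimes (Spark 3.4+) ship from_protobuf. A sketch, assuming each file holds one serialized message and that a compiled descriptor file is available; the paths and message name are made up:

from pyspark.sql.protobuf.functions import from_protobuf

# Each row carries the raw file bytes in the `content` column.
input_df = spark.read.format("binaryFile").load("/mnt/offline/protos/")

# Descriptor produced with: protoc --descriptor_set_out=event.desc event.proto
decoded_df = input_df.select(
    from_protobuf("content", "Event", descFilePath="/dbfs/schemas/event.desc").alias("event")
)
decoded_df.select("event.*").show(truncate=False)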

  • 1 kudos
2 More Replies
DE-cat
by New Contributor III
  • 1240 Views
  • 2 replies
  • 0 kudos

err:setfacl: Option -m: Invalid argument LibraryDownloadManager error

When starting a DB job using a 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12) cluster, I am seeing a lot of these errors in the log4j output. Any ideas? Thx
23/09/11 13:24:14 ERROR CommandLineHelper$: Command [REDACTED] failed with exit code 2 out: err...

Data Engineering
LibraryDownloadManager
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @DE-cat, To configure an AWS instance connection in Databricks, you need to follow these steps:
1. Create an access policy and a user with access keys in the AWS Console:
   - Go to the IAM service.
   - Click the Policies tab in the sidebar.
   - Click...

  • 0 kudos
1 More Replies
DBUser2
by New Contributor II
  • 983 Views
  • 2 replies
  • 0 kudos

Databricks sql using odbc issue

Hi, I'm connecting to a Databricks instance on Azure from a Windows application using the Simba ODBC driver, and when running SQL statements on Delta tables, like INSERT, UPDATE, and DELETE commands using Execute, the result doesn't indicate the no. of rows a...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @DBUser2 ,  When using the Simba ODBC driver to connect to Databricks on Azure and running SQL statements like INSERT, UPDATE, or DELETE, it's common to encounter a result of -1 for the number of rows affected. This behaviour is not specific to th...
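A small pyodbc sketch of the behaviour and one possible workaround (the DSN and table names are placeholders): for Delta tables, the affected-row counts can be read back from the table history instead of the driver's rowcount.

import pyodbc

conn = pyodbc.connect("DSN=Databricks", autocommit=True)   # hypothetical DSN
cur = conn.cursor()

cur.execute("DELETE FROM main.sales.orders WHERE status = 'cancelled'")
print(cur.rowcount)   # frequently -1: the driver does not report rows affected for Delta DML

# Workaround: ask Delta itself; operationMetrics includes e.g. numDeletedRows.
cur.execute("DESCRIBE HISTORY main.sales.orders LIMIT 1")
print(cur.fetchone())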

  • 0 kudos
1 More Replies
yzhang
by New Contributor III
  • 1690 Views
  • 3 replies
  • 0 kudos

How to trigger a "Git provider" job with commit?

I have "Git provider" job created and running fine on the remote git. The problem is that I have to manually trigger it. Is there a way to run the job automatically whenever a new commit to the branch? (In "Schedules & Triggers section", I can find a...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @yzhang, To automatically trigger a job whenever there is a new commit to the branch in a remote Git repository, you can follow these steps:
1. Go to your job's "Schedules and Triggers" section.
2. Click on the "Add Trigger" button.
3. In the trigge...
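If the built-in triggers don't cover commit events, another common pattern is to let the Git provider's push webhook or CI pipeline call the Jobs run-now endpoint. A rough sketch; the workspace URL, token, and job ID are placeholders:

import requests

HOST = "https://<workspace>.azuredatabricks.net"    # placeholder workspace URL
TOKEN = "<personal-access-token>"                    # e.g. injected as a CI secret
JOB_ID = 123456789                                   # the "Git provider" job's ID

# Call this from the webhook / CI step so every new commit starts a run.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
resp.raise_for_status()
print("Triggered run:", resp.json().get("run_id"))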

  • 0 kudos
2 More Replies
Ludo
by New Contributor III
  • 4268 Views
  • 7 replies
  • 2 kudos

Resolved! Jobs with multi-tasking are failing to retry; how to fix this issue?

Hello, this is a question about our platform with `Databricks Runtime 11.3 LTS`. I'm running a job with multiple tasks in parallel using a shared cluster. Each task runs a dedicated Scala class within a JAR library attached as a dependency. One of the tasks fails (c...

Latest Reply
YoshiCoppens61
New Contributor II
  • 2 kudos

Hi, this actually should not be marked as solved. We are having the same problem: whenever a Shared Job Cluster crashes for some reason (generally OOM), all tasks keep failing indefinitely, with the error message described above. This is ac...

  • 2 kudos
6 More Replies
Kratik
by New Contributor III
  • 1332 Views
  • 1 reply
  • 0 kudos

Spark submit job running python file

I have a spark-submit job which runs one Python file called main.py. The other file is alert.py, which is imported in main.py. main.py also uses multiple config files. alert.py is passed in --py-files and the other config files are passed as ...

Data Engineering
pyfiles
spark
submit
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Kratik, To run the Spark submit job in Databricks and pass the --py-files and --files options, you can use the dbx command-line tool.
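Whichever tool submits the job, the --py-files and --files switches end up in the spark-submit parameters of the task definition. A sketch of what that task can look like; all the DBFS paths below are placeholders:

# Illustrative spark_submit_task entry for a Jobs API / dbx deployment spec.
task = {
    "task_key": "main",
    "spark_submit_task": {
        "parameters": [
            "--py-files", "dbfs:/jobs/etl/alert.py",
            "--files", "dbfs:/jobs/etl/config/app.conf,dbfs:/jobs/etl/config/db.conf",
            "dbfs:/jobs/etl/main.py",     # the entry-point script goes last
        ]
    },
}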

  • 0 kudos
TimB
by New Contributor III
  • 3691 Views
  • 1 reply
  • 0 kudos

Create external table using multiple paths/locations

I want to create an external table from more than a single path. I have configured my storage creds and added an external location, and I can successfully create a table using the following code:
create table test.base.Example using csv options ( h...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @TimB, you can import data from multiple paths using wildcards or similar patterns when creating an external table in Databricks. To do so, modify the location parameter in the CREATE TABLE stateme...
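Two sketches of what that can look like; the catalog, schema, and storage paths are made up, and whether globs are allowed may also depend on your external-location grants:

# Option 1: a glob in the path option covers several sibling directories.
spark.sql("""
  CREATE TABLE test.base.Example
  USING csv
  OPTIONS (header = 'true', path = 'abfss://data@myaccount.dfs.core.windows.net/exports/2023-*/')
""")

# Option 2: read an explicit list of paths and expose the result as a view.
df = spark.read.option("header", "true").csv([
    "abfss://data@myaccount.dfs.core.windows.net/exports/2023-01/",
    "abfss://data@myaccount.dfs.core.windows.net/exports/2023-02/",
])
df.createOrReplaceTempView("example_multi_path")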

  • 0 kudos
marcuskw
by Contributor
  • 2258 Views
  • 2 replies
  • 1 kudos

Resolved! whenNotMatchedBySourceUpdate ConcurrentAppendException Partition

ConcurrentAppendException requires a good partitioning strategy; here my logic works without fault for the "whenMatchedUpdate" and "whenNotMatchedInsert" logic. When using "whenNotMatchedBySourceUpdate", however, it seems that the condition doesn't isolate...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @marcuskw, Based on the provided information and the given code snippet, it seems that the condition in the whenNotMatchedBySourceUpdate clause does not isolate the specific partition in the Delta table. This can lead to a ConcurrentAppendExc...
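A sketch of what pinning that branch to the touched partition can look like in Python; the table, partition column, and condition values are hypothetical:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forName(spark, "main.sales.orders")       # partitioned by `region`
updates = spark.table("staging.orders_batch").where(F.col("region") == "EMEA")

(target.alias("t")
    .merge(updates.alias("s"), "t.region = 'EMEA' AND t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    # Without a partition predicate here, this branch considers every partition of the
    # target and can collide with concurrent writers; constrain it to the same partition.
    .whenNotMatchedBySourceUpdate(
        condition="t.region = 'EMEA'",
        set={"is_active": F.lit(False)},
    )
    .execute())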

  • 1 kudos
1 More Replies
Ajay-Pandey
by Esteemed Contributor III
  • 3678 Views
  • 5 replies
  • 0 kudos

How can we send Databricks logs to Azure Application Insights?

Hi All, I want to send Databricks logs to Azure Application Insights. Is there any way we can do it? Any blog or doc would help me.

Latest Reply
floringrigoriu
New Contributor II
  • 0 kudos

Hi @Debayan, in https://learn.microsoft.com/en-us/azure/architecture/databricks-monitoring/application-logs there is a GitHub repository mentioned, https://github.com/mspnp/spark-monitoring. That repository is marked as maintenance mode. Just...
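If the spark-monitoring library is a concern, one lighter-weight option for application-level logs (not Spark metrics) is attaching an Application Insights handler to the Python logger from your notebooks or an init script. A sketch, assuming the opencensus-ext-azure package is installed as a cluster library; the connection string and logger/job names are placeholders:

import logging
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("databricks.job")
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000"
))

logger.info("job started", extra={"custom_dimensions": {"job": "nightly-etl"}})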

  • 0 kudos
4 More Replies
pvm26042000
by New Contributor III
  • 2708 Views
  • 4 replies
  • 2 kudos

Benefit of using vectorized pandas UDFs instead of the standard PySpark UDFs?

What is the benefit of using vectorized pandas UDFs instead of the standard PySpark UDFs?

Latest Reply
Sai1098
New Contributor II
  • 2 kudos

Vectorized Pandas UDFs offer improved performance compared to standard PySpark UDFs by leveraging the power of Pandas and operating on entire columns of data at once, rather than row by row.They provide a more intuitive and familiar programming inter...
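A small sketch of the difference; the column name and the toy transformation are made up:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

df = spark.range(1_000_000).withColumn("price", F.rand() * 100)

# Standard UDF: Python is called once per row.
@F.udf(DoubleType())
def add_vat_udf(price):
    return price * 1.2

# Vectorized pandas UDF: Python is called once per Arrow batch, on whole pd.Series.
@F.pandas_udf(DoubleType())
def add_vat_pandas(price: pd.Series) -> pd.Series:
    return price * 1.2

df.select(add_vat_udf("price"), add_vat_pandas("price")).show(5)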

  • 2 kudos
3 More Replies
