Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

Nino
by Contributor
  • 1622 Views
  • 1 replies
  • 0 kudos

cluster nodes unavailable scenarios

Concerning job cluster configuration, I'm trying to figure out what happens if AWS node type availability is smaller than the minimum number of workers specified in the configuration JSON (either availability<num_workers or, for autoscaling, availabil...

Latest Reply
Nino
Contributor
  • 0 kudos

Thanks, @Retired_mod, useful info! My specific scenario is running a notebook task with Job Clusters, and I've noticed that I get the best overall notebook run time by going without Autoscaling, setting the cluster configuration with a fixed `num_wor...
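
For anyone comparing the two setups discussed here, this is a minimal sketch of the two shapes a job cluster spec can take: a fixed worker count via `num_workers`, or a range via `autoscale` with `min_workers` and `max_workers`. The runtime version and node type are placeholder assumptions, not values from this thread, and the helper names are made up for illustration.

```python
import json


def fixed_size_cluster(num_workers, node_type_id="i3.xlarge",
                       spark_version="13.3.x-scala2.12"):
    """Job cluster spec with a fixed number of workers (no autoscaling)."""
    return {
        "spark_version": spark_version,  # placeholder runtime version
        "node_type_id": node_type_id,    # placeholder AWS node type
        "num_workers": num_workers,
    }


def autoscaling_cluster(min_workers, max_workers, node_type_id="i3.xlarge",
                        spark_version="13.3.x-scala2.12"):
    """Job cluster spec that scales between min_workers and max_workers."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


if __name__ == "__main__":
    # Print both variants so they can be pasted into a job's new_cluster block.
    print(json.dumps(fixed_size_cluster(8), indent=2))
    print(json.dumps(autoscaling_cluster(2, 8), indent=2))
```

A spec carries either `num_workers` or `autoscale`, so keeping the two builders separate makes it harder to mix them by accident.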

  • 0 kudos
Fiona
by New Contributor II
  • 3981 Views
  • 3 replies
  • 1 kudos

Resolved! Reading a protobuf file in a Databricks notebook

I have proto files (offline data storage) that I'd like to read from a Databricks notebook. I found this documentation (https://docs.databricks.com/structured-streaming/protocol-buffers.html), but it only covers how to read the protobuf data once the...

Latest Reply
StephanK
New Contributor II
  • 1 kudos

If you have proto files in offline data storage, you should be able to read them with: `input_df = spark.read.format("binaryFile").load(data_path)`
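
A rough sketch of how that read could be followed by decoding, assuming Spark 3.4 or later with the protobuf functions available, and assuming each file holds exactly one serialized message (the `binaryFile` source puts a whole file into one row's `content` column). The function names, `message_name`, and `descriptor_path` are hypothetical; the descriptor file would be one produced by `protoc --descriptor_set_out`.

```python
def load_proto_files(spark, data_path):
    """Read every file under data_path as one row with a binary `content` column."""
    return spark.read.format("binaryFile").load(data_path)


def decode_messages(binary_df, message_name, descriptor_path):
    """Decode `content` into a struct column.

    Assumption: one serialized protobuf message per file. Files that hold
    several length-delimited messages would need to be split first.
    """
    # Imported lazily so this sketch can be loaded where PySpark is absent.
    from pyspark.sql.protobuf.functions import from_protobuf

    return binary_df.select(
        "path",
        from_protobuf(
            "content", message_name, descFilePath=descriptor_path
        ).alias("message"),
    )
```

In a notebook this would be used as `decode_messages(load_proto_files(spark, data_path), "MyMessage", "/path/to/my.desc")`, where both arguments are placeholders.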

  • 1 kudos
2 More Replies
DE-cat
by New Contributor III
  • 1633 Views
  • 1 replies
  • 0 kudos

err:setfacl: Option -m: Invalid argument LibraryDownloadManager error

When starting a DB job on a 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12) cluster, I am seeing lots of these errors in the log4j output. Any ideas? Thx. 23/09/11 13:24:14 ERROR CommandLineHelper$: Command [REDACTED] failed with exit code 2 out: err...

Data Engineering
LibraryDownloadManager
DBUser2
by New Contributor III
  • 1423 Views
  • 1 replies
  • 0 kudos

Databricks SQL using ODBC issue

Hi, I'm connecting to a Databricks instance on Azure from a Windows application using the Simba ODBC driver, and when running SQL statements on Delta tables, such as INSERT, UPDATE, and DELETE commands using Execute, the result doesn't indicate the no. of rows a...

DE-cat
by New Contributor III
  • 1657 Views
  • 0 replies
  • 0 kudos

DatabricksStreamingQueryListener Stopping the stream

I am running the following Structured Streaming Scala code in a DB 13.3 LTS job: val query = spark.readStream.format("delta") .option("ignoreDeletes", "true") .option("maxFilesPerTrigger", maxEqlPerBatch) .load(tblPath) .writeStream .qu...

yzhang
by New Contributor III
  • 2421 Views
  • 2 replies
  • 0 kudos

How to trigger a "Git provider" job with commit?

I have a "Git provider" job created and running fine on the remote Git repo. The problem is that I have to trigger it manually. Is there a way to run the job automatically whenever a new commit is pushed to the branch? (In the "Schedules & Triggers" section, I can find a...

Latest Reply
yzhang
New Contributor III
  • 0 kudos

Here is my screen after I clicked "Add Trigger". I don't see a "Git provider" option as a trigger type. Is there something else I should do? See attached.
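
One possible workaround, offered as an assumption rather than an answer from this thread: have the CI pipeline that runs on each push call the Jobs REST API `run-now` endpoint. The sketch below only builds the request; the host, token, and job ID are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id):
    """Build (but do not send) a POST to the Jobs API run-now endpoint."""
    url = f"{host.rstrip('/')}/api/2.1/jobs/run-now"
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # personal access token (placeholder)
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_run_now_request(
        "https://example.cloud.databricks.com", "<token>", 123
    )
    # A CI step would send it with: urllib.request.urlopen(req)
    print(req.get_method(), req.full_url)
```

Keeping the request construction separate from sending makes it easy to check in CI without touching a real workspace.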

  • 0 kudos
1 More Replies
Ludo
by New Contributor III
  • 5865 Views
  • 7 replies
  • 2 kudos

Resolved! Jobs with multi-tasking are failing to retry; how to fix this issue?

Hello, this is a question about our platform with `Databricks Runtime 11.3 LTS`. I'm running a Job with multiple tasks in parallel using a shared cluster. Each task runs a dedicated Scala class within a JAR library attached as a dependency. One of the tasks fails (c...

Latest Reply
YoshiCoppens61
New Contributor II
  • 2 kudos

Hi, this actually should not be marked as solved. We are having the same problem: whenever a Shared Job Cluster crashes for some reason (generally OOM), all tasks keep failing indefinitely with the error message described above. This is ac...

  • 2 kudos
6 More Replies
Pbarbosa154
by New Contributor III
  • 5166 Views
  • 7 replies
  • 2 kudos

Ingest Data into Databricks with Kafka

I am trying to ingest data into Databricks with Kafka. I have Kafka installed on a virtual machine where I already have the data I need in a Kafka topic stored as JSON. In Databricks, I have the following code: ```df = (spark.readStream .format("kaf...

Latest Reply
jose_gonzalez
Databricks Employee
  • 2 kudos

You need to check the driver's logs while your stream is initializing. Please check the log4j output in the driver's logs. If there is an issue connecting to your Kafka broker, you will be able to see it there.
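
Alongside the log check, a quick reachability test run from a notebook cell (so it executes on the driver) can help narrow things down. This is only a sketch: it checks that a TCP connection to the broker opens, not that the Kafka protocol or the broker's advertised listeners are set up correctly. The host and port in the usage line are placeholders.

```python
import socket


def broker_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Placeholder address; replace with the VM's host and the broker's port.
    print(broker_reachable("kafka-vm.example.com", 9092))
```

If this returns False, the problem is likely at the network level (firewall, security group, wrong address) rather than in the streaming code.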

  • 2 kudos
6 More Replies
marcuskw
by Contributor II
  • 3774 Views
  • 1 replies
  • 1 kudos

Resolved! whenNotMatchedBySourceUpdate ConcurrentAppendException Partition

ConcurrentAppendException requires a good partitioning strategy. Here my logic works without fault for the "whenMatchedUpdate" and "whenNotMatchedInsert" clauses. When using "whenNotMatchedBySourceUpdate", however, it seems that the condition doesn't isolate...

Ajay-Pandey
by Esteemed Contributor III
  • 5197 Views
  • 5 replies
  • 0 kudos

How can we send Databricks logs to Azure Application Insights?

Hi all, I want to send Databricks logs to Azure Application Insights. Is there any way we can do it? Any blog or doc would help.

Latest Reply
floringrigoriu
New Contributor II
  • 0 kudos

Hi @Debayan, the page https://learn.microsoft.com/en-us/azure/architecture/databricks-monitoring/application-logs mentions a GitHub repository, https://github.com/mspnp/spark-monitoring. That repository is marked as being in maintenance mode. Just...

  • 0 kudos
4 More Replies
pvm26042000
by New Contributor III
  • 4716 Views
  • 4 replies
  • 2 kudos

Benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?

What is the benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?

Latest Reply
Sai1098
New Contributor II
  • 2 kudos

Vectorized pandas UDFs offer improved performance compared to standard PySpark UDFs by leveraging pandas and operating on whole batches of column values at once, rather than row by row. They provide a more intuitive and familiar programming inter...
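
To make the difference concrete, here is a small sketch that wraps the same arithmetic both ways. The Celsius-to-Fahrenheit conversion and all function names are made up for illustration. The row-wise UDF is called once per value, while the pandas UDF receives a `pandas.Series` per batch.

```python
def c_to_f(celsius):
    """Same formula for a single float or, unchanged, for a whole pandas Series."""
    return celsius * 9.0 / 5.0 + 32.0


def build_udfs():
    """Return (row_udf, vectorized_udf) wrapping c_to_f."""
    # Imported lazily so this sketch can be loaded where PySpark/pandas are absent.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, udf

    @udf(returnType="double")
    def to_f_row(c):
        # Invoked once per row, with a plain Python value.
        return None if c is None else c_to_f(c)

    @pandas_udf("double")
    def to_f_vectorized(c: pd.Series) -> pd.Series:
        # Invoked once per batch, with a Series of values.
        return c_to_f(c)

    return to_f_row, to_f_vectorized
```

Both would be applied the same way, for example `df.withColumn("f", to_f_vectorized("c"))`, where `df` and the column name `c` are placeholders.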

  • 2 kudos
3 More Replies
MUA
by New Contributor
  • 3807 Views
  • 2 replies
  • 1 kudos

OSError: [Errno 7] Argument list too long

Getting this error in Databricks and don't know how to solve it: OSError: [Errno 7] Argument list too long: '/dbfs/databricks/aaecz/dev/w000aaecz/etl-framework-adb/0.4.31-20230503.131701-1/etl_libraries/utils/datadog/restart_datadog.sh'. Can anyone help?

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

@MUA Just a friendly follow-up. Did any of the responses help you resolve your question? If one did, please mark it as best. Otherwise, please let us know if you still need help.

  • 1 kudos
1 More Replies
lawrence009
by Contributor
  • 5221 Views
  • 3 replies
  • 1 kudos

Troubleshooting Spill

I am trying to troubleshoot why spill occurred during DeltaOptimizeWrite. I am running a 64-core cluster with 256 GB RAM, which I expect to be able to handle this amount of data (see attached DAG).

IMG_1085.jpeg
Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

You can resolve the memory spill by increasing the number of shuffle partitions, but 16 GB of spill should not have a major impact on your job execution. Could you share more details on the actual source code that you are running?
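
As a rough starting point for picking a partition count, here is a sketch of one common rule of thumb. The 128 MiB target per partition and the rounding to a multiple of the core count are assumptions, not something stated in this thread. The shuffle size should be read from the stage's shuffle metrics in the Spark UI.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_partition_bytes=128 * 1024**2):
    """Heuristic: about target_partition_bytes per partition, rounded up
    to a whole multiple of total_cores so each wave of tasks fills the cluster."""
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores


if __name__ == "__main__":
    # Illustrative numbers only: a 16 GiB shuffle on a 64-core cluster.
    n = suggest_shuffle_partitions(16 * 1024**3, 64)
    print(n)
    # In a notebook, apply it with:
    # spark.conf.set("spark.sql.shuffle.partitions", str(n))
```

Smaller partitions mean less data held per task, which is what reduces spill. Too many tiny partitions add scheduling overhead, so treat the result as a value to tune from.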

  • 1 kudos
2 More Replies
JKR
by Contributor
  • 3822 Views
  • 4 replies
  • 1 kudos

Resolved! Got Failure: com.databricks.backend.common.rpc.SparkDriverExceptions$ReplFatalException error

Got the failure below on a scheduled job on an interactive cluster, and the next scheduled run executed fine. I want to know why this error occurred and how I can prevent it from happening again. And how can I debug these errors in the future? com.databricks.backend.commo...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

@JKR Just a friendly follow-up. Did any of the responses help you resolve your question? If one did, please mark it as best. Otherwise, please let us know if you still need help.

  • 1 kudos
3 More Replies
mbejarano89
by New Contributor III
  • 1363 Views
  • 1 replies
  • 1 kudos

Resolved! Cloning content of Repos into shared Workspace

Hello, I have a git repository on Databricks with notebooks that are meant to be shared with other users. The reason these notebooks are in git as opposed to the "shared" workspace already is because they are to be continuously improved and need sepa...

Latest Reply
User16539034020
Databricks Employee
  • 1 kudos

Hello, thanks for contacting Databricks Support. I presume you're looking to transfer files from external repositories to the Databricks workspace. I'm afraid there is currently no direct support for this. You may consider using the REST API, which allows for...

  • 1 kudos
