Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Data_Engineer3
by Contributor III
  • 2534 Views
  • 5 replies
  • 0 kudos

Default maximum Spark streaming chunk size for Delta files in each batch?

Working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL] #[Spark streaming] #[Spark structured streaming] #Spark

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Docs: https://docs.databricks.com/en/structured-streaming/delta-lake.html. Also, what is the challenge you are seeing while using foreachBatch?

4 More Replies
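For context, a minimal PySpark sketch of the options covered on that docs page: the Delta streaming source reads up to 1000 files per micro-batch by default (maxFilesPerTrigger), and maxBytesPerTrigger sets a soft size cap instead. Paths below are placeholders.

```
# Minimal sketch: cap how much the Delta streaming source pulls per micro-batch.
stream = (
    spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 500)     # files per micro-batch (default 1000)
    .option("maxBytesPerTrigger", "1g")    # soft cap on data per micro-batch
    .load("/path/to/source_delta_table")   # placeholder path
)

query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")  # placeholder path
    .start("/path/to/target_delta_table")                 # placeholder path
)
```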
CarterM
by New Contributor III
  • 5595 Views
  • 4 replies
  • 2 kudos

Resolved! Why is Spark Streaming from S3 returning thousands of files when there are only 9?

I am attempting to stream JSON endpoint responses from an S3 bucket into a Spark DLT. I have been very successful with this previously, but the difference this time is that I am storing the responses from multiple endpoints in the same S3 buck...

[Attached screenshots: endpoint response structure; 9 endpoint responses in the same S3 bucket]
Latest Reply
williamyoung
New Contributor II
  • 2 kudos

Hello everyone, it seems like the issue you're encountering could be related to how Spark Streaming interprets the S3 file structure, especially when dealing with multiple sources. When files from multiple endpoints are stored in the same bucket, Spar...

3 More Replies
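A sketch of the fix hinted at in the reply, assuming each endpoint's responses live under their own S3 prefix: point the DLT/Auto Loader stream at that prefix (optionally with a glob filter) rather than the bucket root. The bucket, prefix, and table name are placeholders.

```
import dlt

@dlt.table(name="soccer_endpoint_raw")  # placeholder table name
def soccer_endpoint_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("pathGlobFilter", "*.json")          # optional: only pick up JSON files
        .load("s3://my-bucket/endpoints/soccer/")    # placeholder: one endpoint's prefix
    )
```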
YFL
by New Contributor III
  • 6220 Views
  • 11 replies
  • 6 kudos

Resolved! When delta is a streaming source, how can we get the consumer lag?

Hi, I want to keep track of the streaming lag from the source table, which is a Delta table. I see that the query progress logs contain some information about the last version and the last file in the version for the end offset, but this doesn't give ...

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hey @Yerachmiel Feltzman, I hope all is well. Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you. Thanks!

10 More Replies
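One way to approximate the lag, sketched below under the assumption that comparing Delta table versions is good enough: read the reservoirVersion from the query's endOffset and compare it with the source table's latest version. The path and the running `query` handle are placeholders.

```
import json

progress = query.lastProgress                       # a running StreamingQuery
end_offset = progress["sources"][0]["endOffset"]
if isinstance(end_offset, str):                     # may arrive as a JSON string
    end_offset = json.loads(end_offset)
consumed_version = end_offset["reservoirVersion"]   # version the stream has reached

latest_version = (
    spark.sql("DESCRIBE HISTORY delta.`/path/to/source_table` LIMIT 1")  # placeholder path
    .collect()[0]["version"]
)

print(f"Versions behind: {latest_version - consumed_version}")
```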
SamarthJain
by New Contributor II
  • 5574 Views
  • 4 replies
  • 2 kudos

Hi All, I'm facing an issue with my Spark Streaming job. It gets stuck in the "Stream Initializing" phase for more than 3 hours. Need your...

Hi All, I'm facing an issue with my Spark Streaming job. It gets stuck in the "Stream Initializing" phase for more than 3 hours. Need your help here to understand what happens internally in the "Stream Initializing" phase of a Spark Streaming job tha...

Latest Reply
MohsenJ
Contributor
  • 2 kudos

I'm facing the same issue when I try to run this example: Create a monitor using the API | Databricks on AWS (Inference Lakehouse Monitor regression example notebook). Any idea?

3 More Replies
avnish26
by New Contributor III
  • 10514 Views
  • 4 replies
  • 8 kudos

Spark 3.3.0 connect kafka problem

I am trying to connect to my Kafka from Spark but am getting an error. Kafka version: 2.4.1. Spark version: 3.3.0. I am using a Jupyter notebook to execute the PySpark code below: ```from pyspark.sql.functions import * from pyspark.sql.types import * #import libr...

Latest Reply
jose_gonzalez
Databricks Employee
  • 8 kudos

Hi @avnish26, did you add the JAR files to the cluster? Do you still have issues? Please let us know.

3 More Replies
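A sketch of the usual fix, assuming the error is a missing Kafka connector: outside Databricks you have to pull in the spark-sql-kafka package matching your Spark/Scala build (3.3.0 / Scala 2.12 here), whereas on a Databricks cluster the connector is already bundled. Broker and topic names are placeholders.

```
from pyspark.sql import SparkSession

# Pull in the Kafka connector that matches Spark 3.3.0 / Scala 2.12.
spark = (
    SparkSession.builder
    .appName("kafka-stream")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
    .getOrCreate()
)

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "my_topic")                   # placeholder
    .option("startingOffsets", "latest")
    .load()
)
```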
lnights
by New Contributor II
  • 4505 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...

[Attached chart: transactions in Azure storage]
Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...

4 More Replies
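A sketch of the tuning the reply describes, assuming a Delta sink and placeholder paths and values: each micro-batch writes roughly one file per shuffle partition, so lowering the partition count (or coalescing before the sink) cuts the number of small writes billed as storage transactions.

```
# Fewer shuffle partitions -> fewer files (and storage transactions) per micro-batch.
spark.conf.set("spark.sql.shuffle.partitions", "8")   # placeholder value

query = (
    transformed_df                         # placeholder: the aggregated streaming DataFrame
    .coalesce(1)                           # optional: collapse to one file per batch
    .writeStream.format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")  # placeholder
    .start("/path/to/output")                              # placeholder
)
```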
sanjay
by Valued Contributor II
  • 8528 Views
  • 8 replies
  • 3 kudos

How to stop a continuously running streaming job over the weekend

I have a continuously running streaming job that I would like to stop over the weekend and start again on Monday. Here is my streaming job code: (spark.readStream.format("delta").load(input_path).writeStream.option("checkpointLocation", input_checkpoint_p...

Latest Reply
NDK
New Contributor II
  • 3 kudos

@sanjay Any luck with that? I am also looking for a solution to the same issue.

7 More Replies
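One common pattern (a sketch, not the only option): switch the stream to an availableNow trigger and start it from a Databricks job scheduled only on weekdays; the query drains whatever has arrived and then stops on its own, so nothing runs over the weekend. Variable names follow the snippet in the question; output_path is a placeholder.

```
query = (
    spark.readStream.format("delta")
    .load(input_path)                                     # from the question
    .writeStream.format("delta")
    .option("checkpointLocation", input_checkpoint_path)  # from the question
    .trigger(availableNow=True)                           # process the backlog, then stop
    .start(output_path)                                   # placeholder
)
query.awaitTermination()
```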
sparkstreaming
by New Contributor III
  • 5402 Views
  • 5 replies
  • 4 kudos

Resolved! Missing rows while processing records using foreachBatch in Spark Structured Streaming from Azure Event Hub

I am new to real-time scenarios and I need to create Spark Structured Streaming jobs in Databricks. I am trying to apply some rule-based validations from backend configurations on each incoming JSON message. I need to do the following actions on th...

Latest Reply
Rishi045
New Contributor III
  • 4 kudos

Were you able to find a solution? If yes, could you please share it?

4 More Replies
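One frequent cause worth checking, sketched below as an assumption rather than a confirmed diagnosis: if the micro-batch DataFrame is written to several destinations inside foreachBatch without persisting it, it gets recomputed per write and rows can appear to go missing. The table names, validation rule, and events_df handle are placeholders.

```
def process_batch(batch_df, batch_id):
    batch_df.persist()                                          # compute the batch once
    valid = batch_df.filter("validation_status = 'ok'")         # placeholder rule
    invalid = batch_df.exceptAll(valid)
    valid.write.mode("append").saveAsTable("bronze_valid")      # placeholder table
    invalid.write.mode("append").saveAsTable("bronze_rejected") # placeholder table
    batch_df.unpersist()

query = (
    events_df.writeStream                                       # placeholder Event Hub stream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/path/to/checkpoint")        # placeholder
    .start()
)
```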
Ryan_Chynoweth
by Esteemed Contributor
  • 2342 Views
  • 2 replies
  • 2 kudos

medium.com

Hi All, I recently published a streaming data comparison between Snowflake and Databricks. Hope you enjoy! Please let me know what you think! https://medium.com/@24chynoweth/data-streaming-at-scale-databricks-and-snowflake-ca65a2401649

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Nicely done. 

1 More Replies
gg_047320_gg_94
by New Contributor II
  • 7948 Views
  • 1 replies
  • 1 kudos

DLT Spark readstream fails on the source table which is overwritten

I am reading the source table which gets updated every day. It is usually append/merge with updates and is occasionally overwritten for other reasons. df = spark.readStream.schema(schema).format("delta").option("ignoreChanges", True).option('starting...

Latest Reply
Debayan
Databricks Employee
  • 1 kudos

Hi, could you please confirm the DLT and DBR versions? Also, please tag @Debayan in your next response, which will notify me. Thank you!

  • 1 kudos
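A sketch of one way to handle the overwrites, assuming a recent DBR/DLT runtime: skipChangeCommits is the newer replacement for ignoreChanges and tells the stream to skip commits that rewrite existing data instead of failing (at the cost of not propagating those changes downstream). The path is a placeholder.

```
df = (
    spark.readStream.format("delta")
    .option("skipChangeCommits", "true")   # skip update/overwrite commits instead of failing
    .load("/path/to/source_table")         # placeholder
)
```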
RateVan
by New Contributor II
  • 2338 Views
  • 3 replies
  • 0 kudos

Spark last window doesn't flush in append mode

The problem is very simple: when you use a tumbling window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop the incoming streaming data, the last window will NEVER...

Latest Reply
RateVan
New Contributor II
  • 0 kudos

No, the problem remains the same; nothing changes just because you increased the timeout a little. The window did not close, and it will not close until a new message arrives.

2 More Replies
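For reference, a minimal sketch of the setup being discussed: in append mode a tumbling window is emitted only after the watermark passes its end, and the watermark only advances when newer events arrive, which is why the final window stays open once the input stops. Column names, paths, and the events_df handle are placeholders.

```
from pyspark.sql.functions import window, count

agg = (
    events_df                                        # placeholder streaming DataFrame
    .withWatermark("event_time", "10 minutes")       # watermark advances only with new events
    .groupBy(window("event_time", "5 minutes"))      # tumbling window
    .agg(count("*").alias("events"))
)

query = (
    agg.writeStream
    .outputMode("append")                            # emits a window only after it closes
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")  # placeholder
    .start("/path/to/output")                             # placeholder
)
```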
tech2cloud
by New Contributor II
  • 2590 Views
  • 3 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Ravi Vishwakarma, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

2 More Replies
chanansh
by Contributor
  • 1131 Views
  • 1 replies
  • 0 kudos

stream from azure credentials

I am trying to read a stream from Azure: (spark.readStream .format("cloudFiles") .option('cloudFiles.clientId', CLIENT_ID) .option('cloudFiles.clientSecret', CLIENT_SECRET) .option('cloudFiles.tenantId', TENTANT_ID) .option("header", "true") .opti...

  • 1131 Views
  • 1 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Hanan Shteingart: It looks like you're using the Azure Blob Storage connector for Spark to read data from Azure. The error message suggests that the credentials you provided are not being used by the connector. To specify the credentials, you can se...

  • 0 kudos
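A sketch of the usual split, assuming ADLS Gen2 with a service principal: the cloudFiles.clientId/clientSecret/tenantId options only feed Auto Loader's file-notification setup, while the actual storage reads need the fs.azure.account OAuth settings (or an equivalent cluster/secret-scope configuration). The account, container, and path are placeholders; CLIENT_ID, CLIENT_SECRET, and TENANT_ID stand in for the service principal values from the question.

```
account = "mystorageaccount"  # placeholder storage account

# Service-principal OAuth settings for the storage account itself.
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", CLIENT_ID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", CLIENT_SECRET)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/token")

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .load(f"abfss://mycontainer@{account}.dfs.core.windows.net/input/")  # placeholder
)
```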
Sandesh87
by New Contributor III
  • 1628 Views
  • 3 replies
  • 0 kudos

parse and combine multiple datasets within a single file

An application receives messages from Event Hub. Below is a message received from Event Hub and loaded into a dataframe with one column:
name,gender,id
sam,m,001
-----
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
-----
event,a,b,c
034,1,5,6
073,4,2...

Latest Reply
Vartika
Databricks Employee
  • 0 kudos

Hi @Sandesh Puligundla, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you...

2 More Replies
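One possible approach, sketched under the assumption that each message packs several CSV sections separated by "-----" lines: split the raw string into sections, then parse each section with its own header row. The single column is assumed to be called "value", and the resulting names just follow the example above.

```
# Split one multi-section message into separate DataFrames.
raw = df.first()["value"]                        # assumed column name for the message body
sections = [s.strip() for s in raw.split("-----") if s.strip()]

parsed = []
for section in sections:
    lines = section.splitlines()
    header = lines[0].split(",")                 # first line of each section is its header
    rows = [line.split(",") for line in lines[1:]]
    parsed.append(spark.createDataFrame(rows, header))

people_df, telemetry_df, event_df = parsed       # name/gender/id, time/x/y/z/long/lat, event/a/b/c
```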
RengarLee
by Contributor
  • 7485 Views
  • 10 replies
  • 3 kudos

Resolved! Databricks writes to Azure Data Explorer suddenly become slower

I write to Azure Data Explorer using Spark Streaming. One day, writes suddenly became slower, and a restart has no effect. I have a question about Spark Streaming to Azure Data Explorer. Q1: What should I do to get performance to recover? Figure 1 shows th...

Latest Reply
RengarLee
Contributor
  • 3 kudos

I'm so sorry, I just thought the issue wasn't resolved. Solution: set maxFilesPerTrigger and maxBytesPerTrigger, and enable auto optimize. Reason: on the first day it processes larger files and then eventually processes smaller files. Detailed reason: B...

9 More Replies
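A sketch of the accepted fix described above, with placeholder limits and paths: bound how much each micro-batch reads so writes to Azure Data Explorer stay evenly sized, and turn on optimized writes/auto compaction on the Delta side.

```
# Keep Delta output files compact (auto optimize).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Bound each micro-batch so per-batch write volume stays predictable.
source = (
    spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 200)       # placeholder limit
    .option("maxBytesPerTrigger", "512m")    # placeholder limit
    .load("/path/to/source_delta_table")     # placeholder
)
```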