Data Engineering

Forum Posts

SamarthJain
by New Contributor II
  • 2792 Views
  • 4 replies
  • 2 kudos

Hi All, I'm facing an issue with my Spark Streaming job. It gets stuck in the "Stream Initializing" phase for more than 3 hours. Need your help here to understand what happens internally at the "Stream Initializing" phase of the Spark Streaming job tha...

Latest Reply
MohsenJ
New Contributor III
  • 2 kudos

I'm facing the same issue when I try to run this example: "Create a monitor using the API | Databricks on AWS" (Inference Lakehouse Monitor regression example notebook). Any idea?

3 More Replies
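
A minimal sketch of how one might watch whether a query ever moves past initialization, assuming a Delta source (the paths here are placeholders, not from the thread):

```
# Hedged sketch: watching a streaming query's status during startup.
# "Stream Initializing" covers checkpoint recovery and source resolution;
# these fields show whether the query moves past that phase.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream.format("delta")
    .load("/tmp/input")                        # placeholder source path
    .writeStream.format("noop")                # throwaway sink for testing
    .option("checkpointLocation", "/tmp/chk")  # placeholder checkpoint
    .start()
)

print(query.status)          # e.g. {'message': 'Initializing sources', ...}
print(query.lastProgress)    # None until the first micro-batch finishes
print(query.recentProgress)  # progress reports; empty while initializing
```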
avnish26
by New Contributor III
  • 6914 Views
  • 4 replies
  • 8 kudos

Spark 3.3.0 connect Kafka problem

I am trying to connect to my Kafka from Spark but getting an error. Kafka version: 2.4.1. Spark version: 3.3.0. I am using a Jupyter notebook to execute the PySpark code below: ```from pyspark.sql.functions import * from pyspark.sql.types import * #import libr...

Latest Reply
jose_gonzalez
Moderator
  • 8 kudos

Hi @avnish26, did you add the JAR files to the cluster? Do you still have issues? Please let us know.

3 More Replies
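
For context, a minimal sketch of wiring in the Kafka connector JAR that the reply asks about, assuming Spark 3.3.0 built for Scala 2.12 (broker address and topic are placeholders):

```
# Without the spark-sql-kafka package on the classpath,
# readStream.format("kafka") fails with "Failed to find data source: kafka".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-read")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
    .getOrCreate()
)

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "my_topic")                      # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka rows arrive as binary key/value; cast to strings for inspection.
parsed = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```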
lnights
by New Contributor II
  • 2104 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...

[Attached image: transactions in Azure storage]
Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...

4 More Replies
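
A short sketch of the fix PetePP describes: lowering shuffle partitions so each micro-batch writes a few larger files instead of the default 200 (the value and path below are assumptions to tune):

```
# Each streaming aggregation batch writes one file per shuffle partition,
# so the default of 200 produces 200 tiny files (and 200 storage writes).
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Alternatively, coalesce inside foreachBatch before writing
# (sink path is a placeholder):
def write_batch(batch_df, batch_id):
    batch_df.coalesce(1).write.mode("append").parquet("/tmp/out")
```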
sanjay
by Valued Contributor II
  • 4724 Views
  • 8 replies
  • 3 kudos

How to stop continuous running streaming job over weekend

I have a continuously running streaming job that I would like to stop over the weekend and start again on Monday. Here is my streaming job code: (spark.readStream.format("delta").load(input_path).writeStream.option("checkpointLocation", input_checkpoint_p...

Latest Reply
NDK
New Contributor II
  • 3 kudos

@sanjay Any luck on that? I am also looking for a solution to the same issue.

7 More Replies
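
One common pattern for this (an assumption, not a confirmed answer from the thread): run the same pipeline from a scheduled weekday job with an available-now trigger, so it drains the backlog and stops on its own instead of running 24/7. Variable names are taken from sanjay's snippet; output_path is assumed, since the snippet is truncated:

```
# Scheduled (e.g. Mon-Fri) job body: processes everything available in the
# source, then shuts down, so nothing keeps running over the weekend.
(
    spark.readStream.format("delta")
    .load(input_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", input_checkpoint_path)
    .trigger(availableNow=True)  # drain available data, then stop
    .start(output_path)          # assumed sink path
)
```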
sparkstreaming
by New Contributor III
  • 3103 Views
  • 5 replies
  • 4 kudos

Resolved! Missing rows while processing records using foreachBatch in Spark Structured Streaming from Azure Event Hub

I am new to real-time scenarios and I need to create Spark Structured Streaming jobs in Databricks. I am trying to apply some rule-based validations from backend configurations on each incoming JSON message. I need to do the following actions on th...

Latest Reply
Rishi045
New Contributor III
  • 4 kudos

Were you able to find a solution? If yes, could you please share it?

4 More Replies
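
A hedged sketch of a foreachBatch handler for this kind of rule-based routing. One common cause of apparently missing rows is re-evaluating the batch DataFrame across multiple writes, which persisting avoids; all names and the validation rule below are placeholders:

```
# Route each micro-batch into valid/invalid tables. Persisting the batch
# keeps both writes working from the same snapshot of the data.
def process_batch(batch_df, batch_id):
    batch_df.persist()
    valid = batch_df.filter("body IS NOT NULL")  # placeholder rule
    invalid = batch_df.subtract(valid)
    valid.write.mode("append").saveAsTable("bronze_valid")
    invalid.write.mode("append").saveAsTable("bronze_invalid")
    batch_df.unpersist()

(
    spark.readStream.format("eventhubs").options(**ehConf).load()
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/tmp/chk")  # placeholder checkpoint
    .start()
)
```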
Ryan_Chynoweth
by Honored Contributor III
  • 1022 Views
  • 2 replies
  • 2 kudos

medium.com

Hi All, I recently published a streaming data comparison between Snowflake and Databricks. Hope you enjoy! Please let me know what you think! https://medium.com/@24chynoweth/data-streaming-at-scale-databricks-and-snowflake-ca65a2401649

Latest Reply
babyhari
New Contributor II
  • 2 kudos

Nicely done. 

1 More Replies
gg_047320_gg_94
by New Contributor II
  • 2611 Views
  • 1 reply
  • 1 kudos

DLT Spark readStream fails on the source table which is overwritten

I am reading the source table, which gets updated every day. It is usually append/merge with updates and is occasionally overwritten for other reasons. df = spark.readStream.schema(schema).format("delta").option("ignoreChanges", True).option('starting...

Latest Reply
Debayan
Esteemed Contributor III
  • 1 kudos

Hi, could you please confirm the DLT and DBR versions? Also, please tag @Debayan in your next response, which will notify me. Thank you!

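
For reference, a sketch of the two Delta source options relevant to an occasionally overwritten table; which one is available depends on the DBR version, hence the question above. The path is a placeholder:

```
df = (
    spark.readStream
    .format("delta")
    # Older option: continues past UPDATE/MERGE/OVERWRITE commits but
    # re-emits rewritten rows, so downstream must handle duplicates.
    .option("ignoreChanges", "true")
    # Newer alternative (recent DBR): skip data-rewriting commits entirely
    # instead of failing or re-emitting rows.
    # .option("skipChangeCommits", "true")
    .load("/tmp/source_table")
)
```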
RateVan
by New Contributor II
  • 1226 Views
  • 3 replies
  • 0 kudos

Spark's last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (+ watermark logic). In the current implementation, if you stop incoming streaming data, the last window will NEVER...

Latest Reply
RateVan
New Contributor II
  • 0 kudos

No, the problem remains the same. The meaning doesn't change because you increased the timeout a little bit: the window did not close, and it does not close until a new message arrives.

2 More Replies
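
A minimal reproduction sketch of the behavior RateVan describes (the source and column names are placeholders): in append mode a window is emitted only after the watermark passes its end, and the watermark advances only when new events arrive.

```
from pyspark.sql.functions import window, col

# `events` stands for any streaming DataFrame with an event-time column "ts".
agg = (
    events
    .withWatermark("ts", "10 minutes")
    .groupBy(window(col("ts"), "5 minutes"))
    .count()
)

# The 5-minute window ending at time T is flushed only once some event with
# ts > T + 10 minutes arrives; with no further input, the last window
# stays open indefinitely.
query = agg.writeStream.outputMode("append").format("console").start()
```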
tech2cloud
by New Contributor II
  • 1335 Views
  • 3 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Ravi Vishwakarma, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

2 More Replies
chanansh
by Contributor
  • 608 Views
  • 1 reply
  • 0 kudos

Stream from Azure credentials

I am trying to read a stream from Azure: (spark.readStream .format("cloudFiles") .option('cloudFiles.clientId', CLIENT_ID) .option('cloudFiles.clientSecret', CLIENT_SECRET) .option('cloudFiles.tenantId', TENTANT_ID) .option("header", "true") .opti...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Hanan Shteingart: It looks like you're using the Azure Blob Storage connector for Spark to read data from Azure. The error message suggests that the credentials you provided are not being used by the connector. To specify the credentials, you can se...

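
A hedged sketch along the lines of the reply: the cloudFiles.* credential options configure Auto Loader's file-notification services, while the storage reads themselves authenticate through the standard ABFS OAuth Hadoop configs. The storage account, container, and path are placeholders; CLIENT_ID, CLIENT_SECRET, and TENANT_ID are the values from the post:

```
storage_account = "mystorageaccount"  # placeholder
base = f"{storage_account}.dfs.core.windows.net"

# Standard ABFS OAuth (service principal) settings used by the read path.
spark.conf.set(f"fs.azure.account.auth.type.{base}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{base}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{base}", CLIENT_ID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{base}", CLIENT_SECRET)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{base}",
               f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/token")

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .load(f"abfss://container@{base}/path")  # placeholder location
)
```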
Data_Engineer3
by Contributor II
  • 1509 Views
  • 2 replies
  • 0 kudos

Default maximum Spark streaming chunk size in Delta files in each batch?

Working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL] #[Spark streaming] #[Spark structured streaming] #Spark

Latest Reply
NandiniN
Valued Contributor II
  • 0 kudos

Hello @KARTHICK N, the default value for spark.sql.files.maxPartitionBytes is 128 MB. These defaults are in the Apache Spark documentation: https://spark.apache.org/docs/latest/sql-performance-tuning.html (unless there might be some overrides). To che...

1 More Replies
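
A sketch of how to check that default, plus the two Delta source options that actually cap how much data a streaming micro-batch pulls (the path is a placeholder):

```
# Returns the configured value, "134217728b" (128 MB) unless overridden.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

df = (
    spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 1000)   # Delta default: 1000 files/batch
    .option("maxBytesPerTrigger", "10g")  # soft cap on bytes per batch
    .load("/tmp/delta_table")             # placeholder table path
)
```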
Sandesh87
by New Contributor III
  • 723 Views
  • 3 replies
  • 0 kudos

parse and combine multiple datasets within a single file

An application receives messages from Event Hub. Below is a message received from Event Hub and loaded into a dataframe with one column:
name,gender,id
sam,m,001
-----
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
-----
event,a,b,c
034,1,5,6
073,4,2...

Latest Reply
Vartika
Moderator
  • 0 kudos

Hi @Sandesh Puligundla, hope all is well! Just wanted to check in if you were able to resolve your issue. Would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you...

2 More Replies
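
A hedged sketch of one way to split such a payload into its three sections, using the "-----" delimiter visible in the example (the function name and parsing details are assumptions):

```
# Split the raw payload on "-----" and parse each section as its own
# header + rows, since each section has a different schema.
def split_sections(message: str):
    sections = [s.strip() for s in message.split("-----")]
    parsed = []
    for s in sections:
        lines = s.splitlines()
        header = lines[0].split(",")
        rows = [r.split(",") for r in lines[1:]]
        parsed.append((header, rows))
    return parsed

msg = "name,gender,id\nsam,m,001\n-----\ntime,x,y,z,long,lat\n160,22,45,51,83,56"
for header, rows in split_sections(msg):
    print(header, rows)
```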
RengarLee
by Contributor
  • 3949 Views
  • 11 replies
  • 3 kudos

Resolved! Databricks write to Azure Data Explorer writes suddenly become slower

Now, I write to Azure Data Explorer using Spark Streaming. One day, writes suddenly became slower, and a restart had no effect. I have a question about Spark Streaming to Azure Data Explorer. Q1: What should I do to get performance to recover? Figure 1 shows th...

Latest Reply
RengarLee
Contributor
  • 3 kudos

I'm so sorry, I just thought the issue wasn't resolved. Solution: set maxFilesPerTrigger and maxBytesPerTrigger, and enable auto-optimize. Reason: on the first day it processes larger files and then eventually processes smaller files. Detailed reason: B...

10 More Replies
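
A sketch of the accepted fix: bound each micro-batch read from the Delta source and enable auto-optimize so the table stops accumulating small files. The table name, path, and limits are placeholders:

```
# Databricks Delta table properties for auto-optimize.
spark.sql("""
  ALTER TABLE source_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")

# Cap how much each micro-batch pulls, so batch sizes stay predictable.
df = (
    spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 200)
    .option("maxBytesPerTrigger", "1g")
    .load("/tmp/source_table")
)
```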
chanansh
by Contributor
  • 2651 Views
  • 9 replies
  • 9 kudos

copy files from azure to s3

I am trying to copy files from Azure to S3. I've created a solution by comparing file lists and copying manually to a temp file and uploading. However, I just found Auto Loader and I would like to use that: https://docs.databricks.com/ingestion/auto-loader/i...

Latest Reply
Falokun
New Contributor II
  • 9 kudos

Just use tools like GoodSync or GS RichCopy 360 to copy directly from Blob to S3; I think you will never face problems like that.

8 More Replies
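
A hedged sketch of the Auto Loader approach being considered: incrementally pick up new files on Azure and mirror them to S3. The URIs, file format, and checkpoint are placeholders, and credentials for both clouds must already be configured on the cluster:

```
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")            # placeholder format
    .load("abfss://container@account.dfs.core.windows.net/input")
    .writeStream
    .format("parquet")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/azure-copy")
    .trigger(availableNow=True)  # run as a batch-like incremental copy
    .start("s3://my-bucket/mirror")
)
```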
chanansh
by Contributor
  • 4113 Views
  • 3 replies
  • 0 kudos

Relative path in absolute URI when reading a folder with files containing ":" colons in filename

I am trying to read a folder with partition files, where each partition is date/hour/timestamp.csv and timestamp is the exact timestamp in ISO format, e.g. 09-2022-12-05T20:35:15.2786966Z. It seems like Spark has issues with reading files with col...

Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @Hanan Shteingart, we haven't heard from you since the last response from @Debayan Mukherjee, and I was checking back to see if his suggestions helped you. Or else, if you have a solution, please share it with the co...

2 More Replies
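
The failure comes from Hadoop's Path parser, which reads a colon in a file name as a URI scheme separator, so a common workaround is renaming the files at the storage layer before Spark lists the folder. A sketch with the azure-storage-file-datalake SDK; the account, container, directory, and credential are all placeholders:

```
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # placeholder
    credential="<account-key>",                            # placeholder
)
fs = service.get_file_system_client("container")           # placeholder

# Rename any file whose name contains ":" so Hadoop's Path can parse it.
for path in fs.get_paths("data/"):
    if ":" in path.name:
        file_client = fs.get_file_client(path.name)
        # rename_file expects "<filesystem>/<new path>".
        file_client.rename_file("container/" + path.name.replace(":", "_"))
```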