Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Sandesh87
by New Contributor III
  • 4972 Views
  • 3 replies
  • 2 kudos

Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.streaming.DataStreamWriter

I have a getS3Objects function to get (JSON) objects located in AWS S3: object client_connect extends Serializable { val s3_get_path = "/dbfs/mnt/s3response" def getS3Objects(s3ObjectName: String, s3Client: AmazonS3): String = { val...
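The usual culprit here is a closure that captures a non-serializable object (the S3 client, or the DataStreamWriter itself) and ships it to the executors. The standard fix is to construct the client inside the function that runs on the workers instead of capturing it from the outer scope; in Scala, a @transient lazy val achieves the same. A minimal PySpark sketch of that pattern, using a hypothetical boto3 client, bucket, column name, and paths:

```python
import boto3  # assumed dependency; any non-picklable client behaves the same way

def process_batch(batch_df, batch_id):
    # Runs once per micro-batch on the driver; the distributed work happens below.
    def fetch_objects(rows):
        # The client is built here, on the executor, so Spark never serializes it.
        s3 = boto3.client("s3")
        for row in rows:
            obj = s3.get_object(Bucket="my-bucket", Key=row.s3_object_name)  # hypothetical names
            yield (row.s3_object_name, obj["Body"].read().decode("utf-8"))

    responses = batch_df.rdd.mapPartitions(fetch_objects).toDF(["key", "body"])
    responses.write.format("delta").mode("append").save("/mnt/s3response")

(spark.readStream.table("events")  # hypothetical streaming source
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/checkpoints/s3-fetch")
    .start())
```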

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hey there @Sandesh Puligundla! Hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear f...

2 More Replies
lnights
by New Contributor II
  • 5028 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...

[Chart attachment: transactions in Azure Storage]
Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...
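For readers hitting the same bill: the two knobs that usually matter are the shuffle partition count (the default of 200 yields 200 files, and 200 storage writes, per micro-batch) and the trigger interval. A sketch of both, assuming a Delta sink and placeholder paths:

```python
# Fewer shuffle partitions -> fewer, larger files and far fewer storage transactions.
spark.conf.set("spark.sql.shuffle.partitions", "8")  # tune to your actual data volume

query = (df.coalesce(1)                   # or a small explicit file count per batch
    .writeStream
    .format("delta")                      # assumed sink; the effect is similar elsewhere
    .option("checkpointLocation", "/mnt/checkpoints/eventhub-out")  # placeholder
    .trigger(processingTime="5 minutes")  # batching less often also cuts transactions
    .start("/mnt/bronze/events"))         # placeholder
```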

4 More Replies
UmaMahesh1
by Honored Contributor III
  • 7609 Views
  • 7 replies
  • 17 kudos

Spark Structured Streaming: data write to ADLS is too slow.

I'm a bit new to Spark Structured Streaming, so do ask all the relevant questions if I missed any. I have a notebook which consumes the events from a Kafka topic and writes those records into ADLS. The topic is JSON serialized, so I'm just writing...
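Without more detail it's hard to say where the time goes, but the knobs people usually reach for first are bounding the micro-batch size and writing Delta rather than many small JSON files. A sketch under those assumptions, with placeholder brokers, topic, and paths:

```python
# Kafka -> ADLS sketch; brokers, topic, and abfss paths are placeholders.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("maxOffsetsPerTrigger", "100000")  # bound each micro-batch so it finishes quickly
    .load())

(raw.selectExpr("CAST(value AS STRING) AS json")
    .writeStream
    .format("delta")  # Delta absorbs frequent small writes better than raw JSON files
    .option("checkpointLocation", "abfss://bronze@myaccount.dfs.core.windows.net/_chk/events")
    .start("abfss://bronze@myaccount.dfs.core.windows.net/events"))
```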

Latest Reply
Miletto
New Contributor II
  • 17 kudos

 

6 More Replies
sparkstreaming
by New Contributor III
  • 6084 Views
  • 5 replies
  • 4 kudos

Resolved! Missing rows while processing records using foreachbatch in spark structured streaming from Azure Event Hub

I am new to real-time scenarios and I need to create Spark Structured Streaming jobs in Databricks. I am trying to apply some rule-based validations from backend configurations on each incoming JSON message. I need to do the following actions on th...
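One frequent cause of rows going missing in this setup is reusing the micro-batch DataFrame for several sinks without pinning it first, so each sink re-evaluates the source and may see different data. A defensive foreachBatch sketch, with a hypothetical validation rule and placeholder paths:

```python
from pyspark.sql import functions as F

def validate_and_route(batch_df, batch_id):
    batch_df.persist()  # pin the micro-batch so every sink sees the same rows
    try:
        valid = batch_df.filter(F.col("amount") > 0)  # hypothetical rule-based validation
        invalid = batch_df.exceptAll(valid)
        valid.write.format("delta").mode("append").save("/mnt/silver/valid")        # placeholder
        invalid.write.format("delta").mode("append").save("/mnt/silver/quarantine") # placeholder
    finally:
        batch_df.unpersist()

(stream_df.writeStream  # stream_df: the Event Hub stream from the question
    .foreachBatch(validate_and_route)
    .option("checkpointLocation", "/mnt/checkpoints/validation")
    .start())
```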

Latest Reply
Rishi045
New Contributor III
  • 4 kudos

Were you able to find a solution? If yes, could you please share it?

4 More Replies
tlecomte
by New Contributor III
  • 5406 Views
  • 6 replies
  • 3 kudos

Resolved! Enabling Adaptive Query Execution and Cost-Based Optimizer in Structured Streaming foreachBatch

Dear Databricks community, I am using Spark Structured Streaming to move data from silver to gold in an ETL fashion. The source stream is the change data feed of a Delta table in silver. The streaming dataframe is transformed and joined with a couple ...

Latest Reply
Lingesh
Databricks Employee
  • 3 kudos

It's not recommended to enable AQE on a streaming query, for the same reason you shared in the description. It has been documented here.
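For anyone who wants to experiment anyway, the relevant session flags look like this; whether AQE actually applies to the batch plans inside foreachBatch depends on the runtime version, so treat this as a sketch, not a recommendation:

```python
# Inspect and toggle the flags per session; names are standard Spark SQL confs.
print(spark.conf.get("spark.sql.adaptive.enabled"))
spark.conf.set("spark.sql.adaptive.enabled", "false")  # the conservative choice for streaming
spark.conf.set("spark.sql.cbo.enabled", "true")        # CBO is only useful with statistics

# CBO needs statistics on the dimension tables being joined (hypothetical table name):
spark.sql("ANALYZE TABLE silver.dim_customer COMPUTE STATISTICS FOR ALL COLUMNS")
```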

5 More Replies
scalasparkdev
by New Contributor
  • 2640 Views
  • 2 replies
  • 0 kudos

PySpark Structured Streaming Avro integration with Azure Schema Registry and Kafka/Event Hubs in a Databricks environment.

I am looking for a simple way to have a Structured Streaming pipeline that would automatically register a schema with Azure Schema Registry when converting a DataFrame column into Avro, and that would be able to deserialize an Avro column based on a schema registry ur...
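The open-source Avro functions have no built-in Azure Schema Registry binding, so a common approach is to fetch the schema string separately (for example with the azure-schemaregistry SDK) and pass it to from_avro/to_avro. A sketch of the Spark side only, with the schema inlined as a stand-in for whatever the registry returns:

```python
from pyspark.sql.avro.functions import from_avro, to_avro

# Stand-in for a schema string fetched from Azure Schema Registry via its SDK.
schema_json = """
{"type": "record", "name": "Event",
 "fields": [{"name": "id", "type": "string"},
            {"name": "value", "type": "double"}]}
"""

# df: stream assumed to carry the Avro payload in a binary "body" column.
decoded = df.select(from_avro("body", schema_json).alias("event"))       # Avro bytes -> struct
reencoded = decoded.select(to_avro("event", schema_json).alias("body"))  # struct -> Avro bytes
```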

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Tomas Sedlon, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

1 More Replies
pranathisg97
by New Contributor III
  • 3420 Views
  • 2 replies
  • 1 kudos

readStream query throws an exception if there's no data in the Delta location.

Hi, I have a scenario where a writeStream query writes the stream data to a bronze location, and I have to read from bronze, do some processing, and finally write it to silver. I use an S3 location for the Delta tables. But for the very first execution, readStream ...
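A common workaround for the first run is to create the bronze Delta table up front (even empty, with the expected schema) so readStream always finds a valid table. A sketch with hypothetical paths and schema:

```python
from delta.tables import DeltaTable
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

bronze_path = "s3://my-bucket/bronze/events"  # placeholder

if not DeltaTable.isDeltaTable(spark, bronze_path):
    # First run: create an empty Delta table with the expected schema
    # so the downstream readStream finds a valid table immediately.
    schema = StructType([
        StructField("id", StringType()),
        StructField("ts", TimestampType()),
    ])
    spark.createDataFrame([], schema).write.format("delta").save(bronze_path)

silver_input = spark.readStream.format("delta").load(bronze_path)
```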

Latest Reply
Vartika
Databricks Employee
  • 1 kudos

Hi @Pranathi Girish, hope all is well! Checking in: if @Suteja Kanuri's answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information? We'd love to hear from you. Thanks!

1 More Replies
Starki
by New Contributor III
  • 1093 Views
  • 1 reply
  • 0 kudos

Maintaining Custom State in Structured Streaming

I am consuming an IoT stream with thousands of different signals using Structured Streaming. During processing of the stream, I need to know the previous timestamp and value for each signal in the micro batch. The signal stream is eventually written ...

Latest Reply
Soma
Valued Contributor
  • 0 kudos

@Suteja Kanuri I tried the above on a streaming DataFrame but I'm facing the error below: AttributeError: 'DataFrame' object has no attribute 'groupByKey'. Can you please let me know the DBR runtime?
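For context: groupByKey/mapGroupsWithState belong to the Scala Dataset API, which is why the attribute is missing on a PySpark DataFrame. The PySpark counterpart is applyInPandasWithState, available from Spark 3.4 (recent DBR runtimes). A sketch that carries the previous timestamp and value per signal across micro-batches, with hypothetical column names:

```python
from pyspark.sql.streaming.state import GroupStateTimeout

def track_previous(key, pdf_iter, state):
    # State holds (prev_ts, prev_value) for this signal across micro-batches.
    prev_ts, prev_value = state.get if state.exists else (None, None)
    for pdf in pdf_iter:
        pdf = pdf.sort_values("ts")
        # Simplistic: tags every row with the value carried in from the last batch.
        pdf["prev_ts"], pdf["prev_value"] = prev_ts, prev_value
        prev_ts, prev_value = pdf["ts"].iloc[-1], pdf["value"].iloc[-1]
        yield pdf
    state.update((prev_ts, prev_value))

out = (df.groupBy("signal")  # df: the streaming DataFrame of IoT signals
    .applyInPandasWithState(
        track_previous,
        outputStructType="signal string, ts timestamp, value double, "
                         "prev_ts timestamp, prev_value double",
        stateStructType="prev_ts timestamp, prev_value double",
        outputMode="append",
        timeoutConf=GroupStateTimeout.NoTimeout))
```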

Ossian
by New Contributor
  • 2008 Views
  • 1 reply
  • 0 kudos

Driver restarts and job dies after 10-20 hours (Structured Streaming)

I am running a Java/JAR Structured Streaming job on a single-node cluster (Databricks Runtime 8.3). The job contains a single query which reads records from multiple Azure Event Hubs using the Spark Kafka functionality and outputs results to an MSSQL dat...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 0 kudos

It seems that when your nodes scale up, the cluster looks for the init script and fails, so you could use reserved instances for this activity instead of spot instances; it will increase your overall cost. Or alternatively, you can use dependent librar...

Ashok1
by New Contributor II
  • 1424 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hey there @Ashok ch, hope everything is going great. Does @Ivan Tang's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? Otherwise, please let us know if you need more hel...

1 More Replies
Imran_Anwar
by New Contributor II
  • 879 Views
  • 0 replies
  • 1 kudos

Structured Streaming vs Confluent KStream

For an ultra-low-latency customer-facing app, I am curious about the cost efficiency of Structured Streaming versus KStreams: which works better in terms of cost, while still achieving the ultra-low latency and quality outcome? Appreciate any thoughts from p...

drewster
by New Contributor III
  • 13321 Views
  • 13 replies
  • 13 kudos

Resolved! Spark streaming autoloader slow second batch - checkpoint issues?

I am running a massive history of about 250 GB (~6 million phone call transcriptions, JSON read in as raw text) through a raw -> bronze pipeline in Azure Databricks using PySpark. The source is mounted storage and is continuously having files added, and we do n...
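When the second batch stalls like this, it is often the source re-listing that hurts rather than the checkpoint itself. Two Auto Loader options usually help: file-notification mode, which avoids repeated directory listings (it needs extra cloud permissions to set up), and a cap on files per trigger during the backfill. A sketch with placeholder paths:

```python
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")              # raw JSON ingested as text, as in the post
    .option("cloudFiles.useNotifications", "true")    # avoid re-listing millions of files
    .option("cloudFiles.maxFilesPerTrigger", "5000")  # keep backfill batches bounded
    .load("/mnt/raw/transcriptions"))                 # placeholder mount

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/raw-to-bronze")  # placeholder
    .start("/mnt/bronze/transcriptions"))             # placeholder
```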

Latest Reply
Brooksjit
New Contributor III
  • 13 kudos

Thank you for the explanation.

12 More Replies
Constantine
by Contributor III
  • 3076 Views
  • 1 reply
  • 1 kudos

Resolved! Can we reuse checkpoints in Spark Streaming?

I am reading data from a Kafka topic, say topic_a. I have an application, app_one, which uses Spark Streaming to read data from topic_a. I have a checkpoint location, loc_a, to store the checkpoint. Now, app_one has read data till offset 90. Can I creat...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi @John Constantine, it is not recommended to share a checkpoint across queries; every streaming query should have its own checkpoint. If you want to start at offset 90 in another query, you can define it when starting your job. You can ...
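Concretely, the Kafka source accepts per-partition starting offsets as JSON; note this is honored only when the new query's checkpoint does not exist yet. A sketch mirroring the example in the question, with hypothetical brokers:

```python
# New query, own checkpoint, explicit starting point (partition 0 -> offset 90).
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
    .option("subscribe", "topic_a")
    .option("startingOffsets", """{"topic_a": {"0": 90}}""")
    .load())
```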

dataslicer
by Contributor
  • 5567 Views
  • 7 replies
  • 2 kudos

Resolved! Exploring additional cost saving options for structured streaming 24x7x365 uptime workloads

I currently have multiple jobs (each running its own job cluster) for my Spark Structured Streaming pipelines that are long running 24x7x365 on DBR 9.x/10.x LTS. My SLAs are 24x7x365 with 1-minute latency. I have already accomplished the following co...
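One further lever that fits a 1-minute SLA: run a processingTime trigger at the SLA interval instead of the default back-to-back micro-batches, so the cluster does fewer, larger batches. A sketch with a placeholder sink:

```python
# One micro-batch per minute instead of back-to-back batches; still inside a
# 1-minute SLA, but the cluster spends less time on overhead per tiny batch.
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/pipeline")  # placeholder
    .trigger(processingTime="1 minute")
    .start("/mnt/gold/output"))  # placeholder
```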

Latest Reply
Anonymous
Not applicable
  • 2 kudos


6 More Replies