Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

lnights
by New Contributor II
  • 4736 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...
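In sketch form, the read-and-write-back pattern described here looks like the following (assuming an ehConf dict built for the azure-event-hubs-spark connector; the separate write conf and checkpoint path are illustrative):

    # Read the stream from Event Hubs (azure-event-hubs-spark connector)
    df = (spark.readStream
          .format("eventhubs")
          .options(**ehConf)
          .load())

    # ... transformations ...

    # Write back to Event Hubs; the payload must sit in a "body" column
    (df.selectExpr("CAST(body AS STRING) AS body")
       .writeStream
       .format("eventhubs")
       .options(**ehWriteConf)  # second conf pointing at the target hub (assumed)
       .option("checkpointLocation", "/mnt/checkpoints/eh-roundtrip")  # illustrative
       .start())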

Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...
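A minimal sketch of the fix described above, assuming the default shuffle setting is the culprit (the value 8 is illustrative; tune it to your volume):

    # With the default of 200 shuffle partitions, each micro-batch writes
    # ~200 files regardless of volume; every file write is a billed storage
    # transaction, hence the cost. Lower the setting to match your volume.
    spark.conf.set("spark.sql.shuffle.partitions", "8")  # illustrative value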

4 More Replies
blackcoffeeAR
by Contributor
  • 12329 Views
  • 10 replies
  • 5 kudos

How to use/access a Scala library installed from a JAR file in a Python notebook?

I'm using the Azure Event Hubs Connector https://github.com/Azure/azure-event-hubs-spark to connect to an Event Hub. When I install this library from Maven, everything works; I can access lib classes using the JVM: connection_string = "<connection_string>" s...
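For reference, the JVM-gateway pattern the post alludes to looks roughly like this, per the connector's documented PySpark usage (a sketch; it assumes the library is installed as a cluster library from Maven so its classes are on the driver classpath):

    connection_string = "<connection_string>"

    # The connector is a Scala library; from Python you reach its classes
    # through the JVM gateway. This only resolves when the JAR is on the
    # driver classpath (e.g. installed as a cluster library).
    ehConf = {
        "eventhubs.connectionString":
            sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
    }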

Latest Reply
Anonymous
Not applicable
  • 5 kudos

Hi @blackcoffeeAR, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answer...

9 More Replies
guru1
by New Contributor II
  • 3901 Views
  • 2 replies
  • 0 kudos

Resolved! Facing the issue mentioned in the body when connecting Event Hub with Databricks; followed an earlier discussion on this but found no solution

ERROR: Query termination received for [id=37bada03-131b-4fbb-8992-a427263fef2c, runId=cf3d7c18-780e-43ae-aed0-9daf2939b823], with exception: java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit at java.util.Base64$Decoder...

Latest Reply
Annapurna_Hiriy
Databricks Employee
  • 0 kudos

The issue could be due to a mismatch between the Event Hub JAR and the dependencies added, or not all of the required dependencies may have been added. Suggestions: use the azure_eventhubs_spark_2_12_.jar Event Hub Spark JAR along with the following dependencies...

1 More Replies
Gilg
by Contributor II
  • 5129 Views
  • 4 replies
  • 5 kudos

Avro Deserialization from Event Hub capture and Autoloader

Hi all, I am getting data from Event Hub capture in Avro format and using Auto Loader to process it. I got to the point where I can read the Avro by casting the Body into a string. Now I want to deserialize the Body column so it will be in table forma...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 5 kudos

If you still want to go with the above approach and don't want to provide the schema manually, you can fetch a tiny batch with 1 record and build the schema into a variable using the .schema option. Once done, you can add a new Body column by providin...
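A sketch of that approach, assuming the capture files land under a hypothetical capture_path, the Body holds JSON, and stream_df is the Auto Loader stream:

    from pyspark.sql.functions import col, from_json

    # 1) Fetch a tiny batch and infer the payload schema once
    sample = (spark.read.format("avro").load(capture_path)  # capture_path assumed
              .select(col("Body").cast("string").alias("json"))
              .limit(1))
    payload_schema = spark.read.json(sample.rdd.map(lambda r: r.json)).schema

    # 2) Reuse that schema to unpack Body in the streaming dataframe
    parsed = (stream_df  # stream_df assumed
              .withColumn("Body", from_json(col("Body").cast("string"), payload_schema))
              .select("Body.*"))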

3 More Replies
VN11111
by New Contributor III
  • 9014 Views
  • 5 replies
  • 6 kudos

Resolved! ERROR: Some streams terminated before this command could finish!

I have a Databricks notebook which reads a stream from Azure Event Hub. My code does the following: 1. Configure the path for Event Hubs. 2. Read the stream: df_read_stream = (spark.readStream .format("eventhubs") .options(**conf)...

Latest Reply
guru1
New Contributor II
  • 6 kudos

I am also facing the same issue, using cluster 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12) and library com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.21. Please help me with the same. conf = {} conf["eventhubs.connectionString"] = "Endpoint=sb://xxxx.ser...

4 More Replies
Rahul_Tiwary
by New Contributor II
  • 6022 Views
  • 1 reply
  • 4 kudos

Getting error "java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException" while writing streaming data to Event Hub; it works fine when writing to another Databricks table

import org.apache.spark.sql._
import scala.collection.JavaConverters._
import com.microsoft.azure.eventhubs._
import java.util.concurrent._
import scala.collection.immutable._
import org.apache.spark.eventhubs._
import scala.concurrent.Future
import scala.c...

Latest Reply
Gepap
New Contributor II
  • 4 kudos

The dataframe to write needs to have the following schema:

Column          | Type
----------------|------------------
body (required) | string or binary
partitionId (*optional) | string
partitionKey...
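A sketch of shaping a dataframe to that contract before writing (ehConf and the checkpoint path are assumed):

    from pyspark.sql.functions import to_json, struct

    # Pack every column into the single required "body" column as JSON
    out = df.select(to_json(struct(*df.columns)).alias("body"))

    (out.writeStream
        .format("eventhubs")
        .options(**ehConf)
        .option("checkpointLocation", "/mnt/checkpoints/eh-out")  # illustrative
        .start())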

databricksuser2
by New Contributor II
  • 1294 Views
  • 1 reply
  • 2 kudos

Structured streaming job sees throughput being capped after running normally for a few days

The job (written in PySpark) uses Azure Event Hub as the source and a Databricks Delta table as the sink; it is hosted in Azure Databricks. The transformation part is simple: the message body is converted from bytes to a JSON string, and the JSON string is then a...

Latest Reply
Noopur_Nigam
Databricks Employee
  • 2 kudos

Hi @Databricks User10293847, you can try using auto-inflate and let the throughput units (TUs) increase automatically. The feature then scales automatically to the maximum limit of TUs you need, depending on the increase in your traffic. You can check the below doc: htt...

Aran_Oribu
by New Contributor II
  • 4282 Views
  • 5 replies
  • 2 kudos

Resolved! Create and update a CSV/JSON file in ADLS Gen2 with Event Hub in Databricks streaming

Hello, this is my first post here and I am a total beginner with Databricks and Spark. Working on an IoT cloud project with Azure, I'm looking to set up continuous stream processing of data. A current architecture already exists thanks to Stream Ana...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

So the Event Hub creates files (JSON/CSV) on ADLS. You can read those files into Databricks with the spark.read.csv/json method. If you want to read many files in one go, you can use wildcards, e.g. spark.read.json("/mnt/datalake/bronze/directory/*/*...
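In code, the wildcard read looks like this (path illustrative; adjust the depth to your capture directory layout):

    # Wildcards cover many files in one read
    df = spark.read.json("/mnt/datalake/bronze/directory/*/*/*.json")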

4 More Replies
RengarLee
by Contributor
  • 3915 Views
  • 5 replies
  • 0 kudos

Resolved! How to improve Spark Streaming writer Input Rate and Processing rate?

Hi! I have many questions about Spark Streaming and Event Hubs. Can you help me? Q1: How do I improve the Spark Streaming writer input rate and processing rate? I connect to Azure Event Hubs using Spark Streaming (Azure Databricks), but I found that if I use display, this ...

Latest Reply
RengarLee
Contributor
  • 0 kudos

My problem was that setMaxEventsPerTrigger did not match numInputRows.
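For context, a sketch of how that cap is set in PySpark (key name as commonly documented for the connector; verify against your connector version, and the value is illustrative). The cap is an upper bound, so numInputRows can legitimately land below it:

    # maxEventsPerTrigger is an upper bound across all partitions per
    # micro-batch; numInputRows comes in below it when the hub holds
    # less backlog than the cap.
    ehConf["maxEventsPerTrigger"] = 100000  # illustrative cap

    df = (spark.readStream
          .format("eventhubs")
          .options(**ehConf)
          .load())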

4 More Replies
Jreco
by Contributor
  • 4905 Views
  • 6 replies
  • 4 kudos

Resolved! Messages from Event Hub stop flowing after a time

Hi team, I'm trying to build a real-time solution using Databricks and Event Hubs. Something weird happens a while after the process starts. At the beginning, the messages flow through the process as expected at this rate: please note that the last ...

Latest Reply
Jreco
Contributor
  • 4 kudos

Thanks for your answer @Hubert Dudek, it is already specified. What do you mean by this? That is the weird part: the data flows fine, but at some point it is as if the job stops reading or something like that, and if I restart the ...

5 More Replies
Jreco
by Contributor
  • 12797 Views
  • 13 replies
  • 3 kudos

Event Hub streaming: improve processing rate

Hi all, I'm working with Event Hubs and Databricks to process and enrich data in real time. Doing a "simple" test, I'm getting some weird values (input rate vs processing rate) and I think I'm losing data. As you can see, there is a peak with 5k record...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi @Jhonatan Reyes, how many Event Hubs partitions are you reading from? Your micro-batch takes a few milliseconds to complete, which I think is a good time, but I would like to understand better what you are trying to improve here. Also, in this case ...
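One way to pin those numbers down is the standard Structured Streaming progress API (a sketch; query is the handle returned by .start()):

    # Inspect the latest micro-batch to compare input vs processing rate
    p = query.lastProgress
    print(p["numInputRows"], p["inputRowsPerSecond"], p["processedRowsPerSecond"])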

12 More Replies
User16868770416
by Contributor
  • 4227 Views
  • 1 replies
  • 0 kudos

What is the best way to decode protobuf using pyspark?

I am using Spark Structured Streaming to read protobuf-encoded messages from the Event Hub. We use a lot of Delta tables, but there isn't a simple way to integrate this. We are currently using K-SQL to transform into Avro on the fly and then use Dat...

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Hi @Will Block, I think a related question was asked in the past; I believe it was this one. I found this library; I hope it helps.
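The library linked above isn't preserved in this archive. As an aside, Spark 3.4+ ships native protobuf support that covers the same need; a sketch, with the message name and descriptor path being illustrative:

    from pyspark.sql.protobuf.functions import from_protobuf

    # Decode the binary body against a compiled .desc descriptor file
    decoded = df.select(
        from_protobuf("body", "MyMessage",
                      descFilePath="/dbfs/schemas/my_message.desc").alias("msg")
    ).select("msg.*")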
