Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I have a getS3Object function to get (json) objects located in aws s3 object client_connect extends Serializable {
val s3_get_path = "/dbfs/mnt/s3response"
def getS3Objects(s3ObjectName: String, s3Client: AmazonS3): String = {
val...
Hey there @Sandesh Puligundla Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear f...
Hi there, I read data from Azure Event Hub and after manipulating with data I write the dataframe back to Event Hub (I use this connector for that): #read data
df = (spark.readStream
.format("eventhubs")
.options(**ehConf)
...
I had the same problem when starting with databricks. As outlined above, it is the shuffle partitions setting that results in number of files equal to number of partitions. Thus, you are writing low data volume but get taxed on the amount of write (a...
I'm a bit new to spark structured streaming stuff so do ask all the relevant questions if I missed any.I have a notebook which consumes the events from a kafka topic and writes those records into adls. The topic is json serialized so I'm just writing...
I am new to real time scenarios and I need to create a spark structured streaming jobs in databricks. I am trying to apply some rule based validations from backend configurations on each incoming JSON message. I need to do the following actions on th...
Dear Databricks community,I am using Spark Structured Streaming to move data from silver to gold in an ETL fashion. The source stream is the change data feed of a Delta table in silver. The streaming dataframe is transformed and joined with a couple ...
I am looking for a simple way to have a structured streaming pipeline that would automatically register a schema to Azure schema registry when converting a df col into avro and that would be able to deserialize an avro col based on schema registry ur...
Hi @Tomas Sedlon Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...
Hi,I have a scenario where writeStream query writes the stream data to bronze location and I have to read from bronze, do some processing and finally write it to silver. I use S3 location for delta tablesBut for the very first execution , readStream ...
Hi @Pranathi Girish,Hope all is well!Checking in. If @Suteja Kanuri's answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information?We'd love to hear from you.Thanks!
I am consuming an IoT stream with thousands of different signals using Structured Streaming. During processing of the stream, I need to know the previous timestamp and value for each signal in the micro batch. The signal stream is eventually written ...
@Suteja Kanuri Tried the above on streaming DFBut facing the below errorAttributeError: 'DataFrame' object has no attribute 'groupByKey'Can you please let me know DBR runtime
I am running a java/jar Structured Streaming job on a single node cluster (Databricks runtime 8.3). The job contains a single query which reads records from multiple Azure Event Hubs using Spark Kafka functionality and outputs results to a mssql dat...
its seems that when your nodes are increasing it is seeking for init script and it is failing so you can use reserve instances for this activity instead of spot instances it will increase your overall costor alternatively, you can use depended librar...
Hey there @Ashok ch Hope everything is going great.Does @Ivan Tang's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? Else please let us know if you need more hel...
For Ultra low latency customer facing App, I am curious on cost efficiency between Structured streaming and Kstream; which work better in terms of cost ? Though still achieving the ultra low latency and quality outcome. Appreciate any thoughts from p...
I am running a massive history of about 250gb ~6mil phone call transcriptions (json read in as raw text) from a raw -> bronze pipeline in Azure Databricks using pyspark. The source is mounted storage and is continuously having files added and we do n...
I have timeseries data in k Kafka topics. I would like to read this data into windows of length 10 minutes. For each window, I want to run N SQL queries and materialize result. The specific N queries to run depends on the kafka topic name. How should...
I am reading data from a Kafka topic, say topic_a. I have an application, app_one which uses Spark Streaming to read data from topic_a. I have a checkpoint location, loc_a to store the checkpoint. Now, app_one has read data till offset 90.Can I creat...
Hi @John Constantine,Is not recommended to share the checkpoint with your queries. Every streaming query should have their own checkpoint. If you can to start at the offset 90 in another query, then you can define it when starting your job. You can ...
I currently have multiple jobs (each running its own job cluster) for my spark structured streaming pipelines that are long running 24x7x365 on DBR 9.x/10.x LTS. My SLAs are 24x7x365 with 1 minute latency. I have already accomplished the following co...