Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I am trying to connect to my Kafka from Spark but am getting an error. Kafka version: 2.4.1, Spark version: 3.3.0. I am using a Jupyter notebook to execute the PySpark code below:
```
from pyspark.sql.functions import *
from pyspark.sql.types import *
# import libr...
```
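The error text in the post is cut off, but for comparison, here is a minimal, hedged sketch of a working Structured Streaming read from Kafka on Spark 3.3.0, assuming the default Scala 2.12 build and placeholder broker/topic names:

```
# Minimal sketch; "localhost:9092" and "my_topic" are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-read-sketch")
         # Outside Databricks, the Kafka connector must be added explicitly
         # and must match the Spark/Scala version.
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
      .option("subscribe", "my_topic")                      # placeholder
      .option("startingOffsets", "earliest")
      .load())

# Kafka delivers key/value as binary; cast before inspecting.
query = (df.select(col("key").cast("string"), col("value").cast("string"))
         .writeStream
         .format("console")
         .start())
```

A frequent cause of failures in Jupyter specifically is a missing or mismatched spark-sql-kafka package, which the config line above addresses.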
Below is the error we received when trying to read the stream:
Caused by: kafkashaded.org.apache.kafka.common.KafkaException: Failed to load SSL keystore /dbfs/FileStore/Certs/client.keystore.jks
Caused by: java.nio.file.NoSuchFileException: /dbfs...
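The NoSuchFileException suggests the keystore path is not visible where the Kafka client runs. A hedged sketch of checking the path and wiring the SSL options (the truststore path and the passwords are placeholders, not values from this thread):

```
# Verify the keystore exists under /dbfs, then pass the SSL options through.
import os
assert os.path.exists("/dbfs/FileStore/Certs/client.keystore.jks"), "keystore missing"

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")      # placeholder
      .option("subscribe", "my_topic")                       # placeholder
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.keystore.location",
              "/dbfs/FileStore/Certs/client.keystore.jks")
      .option("kafka.ssl.keystore.password", "REDACTED")     # placeholder
      .option("kafka.ssl.truststore.location",
              "/dbfs/FileStore/Certs/client.truststore.jks") # placeholder
      .option("kafka.ssl.truststore.password", "REDACTED")   # placeholder
      .load())
```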
OK, scrub that: the problem in my case was that I was using the 14.0 Databricks Runtime, which appears to have a bug relating to abfss paths here. Switching back to the 13.3 LTS release resolved it for me. So if you're in the same boat finding abfss...
I am trying to read data from Kafka, which is installed on my local system. I am using Databricks Community Edition with a cluster version of 12.2. However, I am unable to read data from Kafka. My use case is to read data from Kafka installed on my l...
I have an Avro schema for my Kafka topic, and that schema defines defaults. I would like to exclude the defaulted columns on the Databricks side and just let them default to an empty array. Sample Avro below; I am trying not to provide the UserFields because I can't...
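One hedged approach, assuming the OSS pyspark.sql.avro.functions API: serialize with a writer schema that omits the defaulted field, so consumers reading with the full schema fall back to the default. The schema and column names below are hypothetical:

```
# Sketch: write without the defaulted UserFields column; readers that use the
# full schema will fill in its default. Schema and names are hypothetical.
from pyspark.sql.avro.functions import to_avro
from pyspark.sql.functions import struct

writer_schema = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id",   "type": "string"},
    {"name": "name", "type": "string"}
  ]
}
"""

out = df.select(to_avro(struct("id", "name"), writer_schema).alias("value"))

(out.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("topic", "my_topic")                       # placeholder
    .save())
```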
I was trying to find information about configuring consumer groups for a Kafka stream in Databricks, so that I can parallelize the stream and load it into Databricks tables. Does Databricks handle this internally? If we can configure th...
Hi, we have a few examples of stream processing with Kafka (https://docs.databricks.com/structured-streaming/kafka.html), but there is no dedicated public document on Kafka consumer group creation. You can refer to https://kafka.apache.org/documentation...
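For reference, Structured Streaming tracks offsets in its own checkpoint and by default generates a unique consumer group id per query; parallelism comes from the topic's partition count rather than from the group. A hedged sketch of pinning a group id (supported since Spark 3.0; the broker, topic, and group names are placeholders):

```
# Sketch: pin the consumer group id; Spark still manages offsets itself via
# the checkpoint, so this mainly helps with broker-side ACLs and monitoring.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
      .option("subscribe", "my_topic")                   # placeholder
      .option("kafka.group.id", "my-databricks-group")   # placeholder
      .load())
```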
When using Delta tables with DBR jobs, or even with DLT pipelines, upserts (especially updates, matched on key and timestamp) are taking much longer than expected to update the table data (~2 minutes even for a 1-record poll). Inserts are lightni...
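For context, a hedged sketch of the kind of upsert being described (the table, columns, and updates_df are all hypothetical names). Slow updates are usually dominated by rewriting data files, so the tighter the merge predicate can prune files, the better:

```
# Sketch of an upsert on key + timestamp; all names here are hypothetical.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events")

(target.alias("t")
 .merge(updates_df.alias("s"),
        "t.key = s.key AND t.ts = s.ts")  # add partition filters here to prune
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```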
Hi @Surya Agarwal, hope everything is going great. Just wanted to check in on whether you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so...
py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.spark.sql.SQLContext
I already disabled ACLs for the cluster using "...
Hi @Ravi Teja, just a friendly follow-up: do you still need help? If you do, please share more details, such as the DBR version and whether it is a Standard or High Concurrency cluster.
I am trying to write a DataFrame to a Kafka topic with an Avro schema for the key and value, using a schema registry URL. The to_avro function is not writing to the topic and is throwing an exception with code 40403. Is there an alternate way to do thi...
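In the Confluent Schema Registry REST API, 404xx error codes generally indicate a missing subject, version, or schema, so the subject name is the first thing to check. As a fallback, a hedged sketch of fetching the schema yourself and serializing with the OSS to_avro (the registry URL, subject, and broker are placeholders):

```
# Sketch: fetch the value schema from the registry, then serialize with the
# OSS to_avro. URL, subject, and topic names are placeholders.
import requests
from pyspark.sql.avro.functions import to_avro
from pyspark.sql.functions import struct

registry = "https://my-registry:8081"  # placeholder
resp = requests.get(f"{registry}/subjects/my_topic-value/versions/latest")
value_schema = resp.json()["schema"]   # Avro schema JSON as a string

out = df.select(
    to_avro(struct(*[df[c] for c in df.columns]), value_schema).alias("value"))

(out.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("topic", "my_topic")                       # placeholder
    .save())
```

One caveat: the OSS to_avro does not produce Confluent's wire format (magic byte plus schema id), so consumers that use Confluent deserializers will not read these records as-is.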
Hello, I am a newbie in this field and am trying to access a Confluent Kafka stream in Azure Databricks, based on a beginner's video by Databricks. I have a free trial Databricks cluster right now. When I run the notebook below, it errors out on line 5 o...
For testing, create it without a secret scope. It is unsafe, but you can paste secrets as strings in the notebook while testing. Here is the code I used for loading data from Confluent:
```
inputDF = (spark
  .readStream
  .format("kafka")
  .option("kafka.b...
```
I'm working on configuring Kafka installed on my machine (laptop), and I want to connect it to my Databricks account hosted on AWS. Secondly, I have CSV files that I want to use for real-time processing from Kafka to Databri...
For CSV, you just need to readStream in the notebook and append the output to CSV using the foreachBatch method. Your Kafka on the PC needs to have a public address, or you need to set up an AWS VPN and connect from your laptop so that you are in the same VPC as Databricks.
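A hedged sketch of that foreachBatch pattern; the broker, topic, and paths are placeholders:

```
# Append each micro-batch to CSV files; all names and paths are placeholders.
def write_batch_to_csv(batch_df, batch_id):
    (batch_df.selectExpr("CAST(value AS STRING) AS value")
     .write.mode("append")
     .csv("/mnt/output/kafka_csv"))  # placeholder path

(spark.readStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "public-host:9092")  # placeholder
 .option("subscribe", "my_topic")                        # placeholder
 .load()
 .writeStream
 .foreachBatch(write_batch_to_csv)
 .option("checkpointLocation", "/mnt/chk/kafka_csv")     # placeholder
 .start())
```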
This is the code that I am using to read from Kafka:
```
inputDF = (spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("kafka.ssl.endpoint.identification.algorithm", "https")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("ka...
```
I have a large stream of data read from Confluent Kafka, 500+ million rows. When I initialize the stream, I cannot control the batch sizes that are read. I've tried setting options on the readStream, such as maxBytesPerTrigger, maxOffsetsPerTrigger, fetc...
Hi @Adam Rink, just checking for further info on your question: how are you deducing that the batch sizes are larger than what you are providing as maxOffsetsPerTrigger?
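For reference, maxOffsetsPerTrigger caps the total number of offsets consumed per micro-batch, split across partitions, and the actual batch size is visible in each progress report. A hedged sketch (broker and topic are placeholders):

```
# Sketch: cap each micro-batch at ~100k offsets; names are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
      .option("subscribe", "my_topic")                   # placeholder
      .option("maxOffsetsPerTrigger", "100000")          # total per batch
      .load())

q = df.writeStream.format("noop").start()  # throughput-test sink
# q.lastProgress["numInputRows"] shows how many rows the last batch actually read.
```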
I am receiving an SSL handshake error even though the truststore I created is based on the server certificate, and the fingerprint in the certificate matches the truststore fingerprint.
kafkashaded.org.apache.kafka.common.errors.SslAuthenticationExcept...
Hi @Jayanth Goulla, worth a try: https://stackoverflow.com/questions/54903381/kafka-failed-authentication-due-to-ssl-handshake-failed
Did you follow https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/kafka?
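A common culprit in that Stack Overflow thread is hostname verification: the handshake fails when the broker certificate does not match the advertised host name. A hedged, debugging-only sketch of blanking the identification algorithm (the truststore path and password are placeholders):

```
# Debugging only: an empty identification algorithm skips hostname
# verification. All paths and passwords are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")                  # placeholder
      .option("subscribe", "my_topic")                                   # placeholder
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/dbfs/certs/trust.jks")  # placeholder
      .option("kafka.ssl.truststore.password", "REDACTED")               # placeholder
      .option("kafka.ssl.endpoint.identification.algorithm", "")         # debug only
      .load())
```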
Hello, I'm trying to use Databricks on Azure with a Spark Structured Streaming job and am having a very mysterious issue. I boiled the job down to its basics for testing: reading from a Kafka topic and writing to the console in a foreachBatch. On local, ever...
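For reference, a hedged sketch of the boiled-down test described above (the broker, topic, and checkpoint path are placeholders):

```
# Minimal repro sketch: read from Kafka, print each micro-batch.
def show_batch(batch_df, batch_id):
    print(f"batch {batch_id}: {batch_df.count()} rows")
    batch_df.show(truncate=False)

(spark.readStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")       # placeholder
 .option("subscribe", "my_topic")                        # placeholder
 .option("startingOffsets", "earliest")
 .load()
 .selectExpr("CAST(value AS STRING) AS value")
 .writeStream
 .foreachBatch(show_batch)
 .option("checkpointLocation", "/tmp/chk/console_test")  # placeholder
 .start())
```

On Databricks, print output from inside foreachBatch may land in the driver log rather than the notebook cell, which is worth checking before concluding the stream produced nothing.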