Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I have an Avro schema for my Kafka topic, and that schema has defaults. I would like to exclude the defaulted columns from Databricks and just let them default to an empty array. Sample Avro below, trying not to provide the UserFields because I can't...
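A hedged sketch of one possible workaround, assuming a PySpark DataFrame and the UserFields name from the question (the sample DataFrame below is a made-up placeholder): Avro defaults are applied by readers rather than writers, so instead of omitting the column you can populate it explicitly with an empty array before serializing with to_avro.

```python
from pyspark.sql import functions as F
from pyspark.sql.avro.functions import to_avro

# Placeholder DataFrame standing in for the real Kafka payload.
df = spark.createDataFrame([("k1", "some-value")], ["key", "payload"])

# Supply the defaulted field explicitly as an empty array before serializing,
# since Avro defaults are applied by readers, not writers.
df_out = df.withColumn("UserFields", F.expr("cast(array() as array<string>)"))

payload = df_out.select(
    to_avro(F.struct(*[F.col(c) for c in df_out.columns])).alias("value")
)
```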
Hello, I need to add custom metadata to an Avro file. The Avro file contains data. We have tried to use "option" within the write function, but it is not picked up and no error is generated: df.write.format("avro").option("avro.codec", "snappy").option...
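Spark's Avro writer options don't obviously expose arbitrary file-level metadata, so one possible workaround (a sketch, not a confirmed Databricks feature) is to write the file with the fastavro library, which accepts a metadata argument. The schema, records, and output path below are placeholders, and fastavro would need to be installed on the cluster (e.g. %pip install fastavro).

```python
import fastavro

# Placeholder schema and records standing in for the real data; the key point
# is the `metadata` argument, which fastavro writes into the Avro file header.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [{"name": "id", "type": "string"}],
})
records = [{"id": "1"}, {"id": "2"}]

with open("/dbfs/tmp/events_with_metadata.avro", "wb") as out:
    fastavro.writer(
        out,
        schema,
        records,
        codec="deflate",  # "snappy" also works if python-snappy is installed
        metadata={"source_system": "kafka", "ingested_by": "databricks-job-42"},
    )
```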
I am trying to write a DataFrame to a Kafka topic with an Avro schema for the key and value using a schema registry URL. The to_avro function is not writing to the topic and is throwing an exception with code 40403 or something similar. Is there an alternate way to do this...
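One hedged alternative, sketched below with made-up registry, subject, topic, and column names: fetch the registered schema yourself over the Schema Registry REST API and pass it to to_avro explicitly, then write the result to Kafka. Note the caveat in the comments: plain to_avro output does not include the Confluent wire-format header (magic byte plus schema ID), so consumers that rely on Confluent deserializers will not decode it as-is.

```python
import requests
from pyspark.sql import functions as F
from pyspark.sql.avro.functions import to_avro

# Placeholder data; the real DataFrame must match the registered value schema.
df = spark.createDataFrame([("k1", 42)], ["key", "amount"])

# Pull the latest registered value schema from the registry (URL and subject
# name are placeholders).
registry = "https://my-schema-registry:8081"
value_schema = requests.get(
    f"{registry}/subjects/my-topic-value/versions/latest"
).json()["schema"]

# Serialize the value columns with the fetched schema. The key is written as a
# plain string here; it could be handled the same way with its own subject.
out = df.select(
    F.col("key").cast("string").alias("key"),
    to_avro(F.struct("amount"), value_schema).alias("value"),
)

# Caveat: this produces "raw" Avro bytes without the Confluent wire-format
# header, so it is readable with from_avro but not with Confluent's
# KafkaAvroDeserializer.
(out.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "my-topic")
    .save())
```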
Hi all, I am getting data from Event Hub capture in Avro format and using Auto Loader to process it. I got to the point where I can read the Avro by casting the Body into a string. Now I want to deserialize the Body column so it will be in table forma...
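A minimal sketch of the Auto Loader read described above, with placeholder capture and schema-location paths:

```python
from pyspark.sql import functions as F

raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "avro")
       .option("cloudFiles.schemaLocation", "/tmp/schemas/eventhub_capture")
       .load("/mnt/eventhub-capture/"))

# Event Hub capture stores the payload in a binary Body column; cast it to a
# string so it can be parsed further.
decoded = raw.withColumn("Body", F.col("Body").cast("string"))
```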
If you still want to go with the above approach and don't want to provide the schema manually, then you can fetch a tiny batch with one record and store its schema in a variable using .schema. Once done, you can add a new Body column by providin...
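A sketch of that idea, continuing from the decoded stream in the previous snippet (paths are placeholders): read one record as a batch, infer the JSON schema of the payload, then apply it to the streaming Body column with from_json.

```python
from pyspark.sql import functions as F

# Infer the payload schema from a single sample record read as a batch.
sample = (spark.read.format("avro")
          .load("/mnt/eventhub-capture/")
          .select(F.col("Body").cast("string").alias("Body"))
          .limit(1))

body_schema = spark.read.json(sample.rdd.map(lambda r: r.Body)).schema

# Apply the inferred schema to the streaming column and flatten it into
# top-level fields.
parsed = decoded.withColumn("Body", F.from_json("Body", body_schema))
flattened = parsed.select("Body.*")
```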
I am loading Avro files into Delta tables. I am doing this for multiple tables, and some files are big (2-3 GB) while most of them are small (a few MB). I am using Auto Loader to load the data into the Delta tables. My question is: what is the ...
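For context, one common shape for such a pipeline on recent runtimes (not an answer to the truncated question, just a sketch with placeholder paths, a placeholder table name, and a batch-size cap chosen to illustrate handling the larger files):

```python
(spark.readStream
     .format("cloudFiles")
     .option("cloudFiles.format", "avro")
     .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
     .option("cloudFiles.maxBytesPerTrigger", "1g")  # cap micro-batch size for the 2-3 GB files
     .load("/mnt/landing/orders/")
     .writeStream
     .option("checkpointLocation", "/tmp/checkpoints/orders")
     .trigger(availableNow=True)  # process the backlog in batches, then stop
     .toTable("bronze.orders"))
```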
Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...
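For reference, the task being sized might look roughly like the sketch below (input/output paths and the view name are placeholders):

```python
# Read the batch of Avro files, rewrite them as Parquet, and (re)create a
# persistent view over the Parquet output.
df = spark.read.format("avro").load("/mnt/input/batch/*.avro")

(df.write
   .mode("overwrite")
   .parquet("/mnt/output/batch_parquet"))

spark.sql("""
  CREATE OR REPLACE VIEW analytics.batch_view AS
  SELECT * FROM parquet.`/mnt/output/batch_parquet`
""")
```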
If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.
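If you go that route, a single-node jobs cluster can be described roughly as below when submitting through the Jobs API or SDK (the runtime version and Azure node type are assumptions, not recommendations):

```python
# Sketch of a single-node job cluster spec.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```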
I know that most of the time the Parquet format is great for different workloads, but I still see Avro files in use. In what type of scenario would Avro be a better choice than Parquet?
Hi, I have files hosted on an Azure Data Lake Store which I can connect to from Azure Databricks, configured as per the instructions here. I can read JSON files fine; however, I'm getting the following error when I try to read an Avro file: spark.read.format("c...
Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options.
Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sou...
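A sketch of what that looks like in a Python notebook, assuming an ADLS Gen1 account accessed with a service principal (the credential values and path are placeholders, and the Hadoop configuration is reached through the underlying JVM SparkContext):

```python
# Set the ADLS credentials on the Hadoop configuration so the spark-avro
# reader picks them up, then read the Avro file.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hconf.set("dfs.adls.oauth2.client.id", "<application-id>")
hconf.set("dfs.adls.oauth2.credential", "<service-credential>")
hconf.set("dfs.adls.oauth2.refresh.url",
          "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = (spark.read
      .format("com.databricks.spark.avro")  # just "avro" on newer runtimes
      .load("adl://<datalake-account>.azuredatalakestore.net/path/to/file.avro"))
```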