Data Engineering

Avro Deserialization from Event Hub capture and Autoloader

Gilg
Contributor II

Hi All,

I am getting data from Event Hub capture in Avro format and using Auto Loader to process it.

I have reached the point where I can read the Avro files and cast the Body column to a string.

Next, I wanted to deserialize the Body column so that it ends up in table format. I managed to do this by constructing a json_schema using StructType() and passing that json_schema to the from_json() function, and then doing a writeStream into a Delta table.
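For readers following along, the approach described above might look roughly like the sketch below. It is not the original poster's code: the payload field names, the paths, and the helper names (`body_parse_expr`, `read_capture_stream`, `write_to_delta`) are all hypothetical, and the schema is written as a DDL string passed to the SQL form of `from_json` rather than as a `StructType()`.

```python
# Hypothetical payload fields, written as a DDL string that from_json accepts.
BODY_SCHEMA_DDL = "deviceId STRING, temperature DOUBLE, eventTime TIMESTAMP"


def body_parse_expr(schema_ddl):
    """Build a SQL expression that casts Body to a string and parses it as JSON."""
    return f"from_json(CAST(Body AS STRING), '{schema_ddl}') AS parsed"


def read_capture_stream(spark, source_path, schema_location,
                        schema_ddl=BODY_SCHEMA_DDL):
    """Read Event Hub capture Avro files with Auto Loader and flatten Body."""
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "avro")
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )
    # Keep only the parsed payload columns; the capture system columns are dropped.
    return raw.selectExpr(body_parse_expr(schema_ddl)).select("parsed.*")


def write_to_delta(parsed, checkpoint_path, table_name):
    """Start a streaming write of the parsed rows into a Delta table."""
    return (
        parsed.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .toTable(table_name)
    )
```

On a Databricks cluster this would be called with the notebook's `spark` session, for example `write_to_delta(read_capture_stream(spark, source_path, schema_location), checkpoint_path, "events")`, where every argument is a placeholder for your own locations and table name.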


Question: is there a way to deserialize the Avro data without constructing a schema? The Event Hub schema registry is one option I am looking at, but I am not sure how to use it within Auto Loader using PySpark.

Cheers,

Gil

5 REPLIES 5

UmaMahesh1
Honored Contributor III

Does your cluster have a schema registry service? If so, you do not need to provide the schema explicitly.

UmaMahesh1
Honored Contributor III

If you still want to go with the above approach but don't want to provide the schema manually, you can fetch a tiny batch with one record and store its schema in a variable using .schema. Once that is done, you can add a new column parsed from Body by passing that variable to from_json in the same way as before. Since the data is in JSON format, you can then use the explode function and process it further.

Hi @Uma Maheswara Rao Desula,

Even if I call .schema on a tiny batch, it only returns the schema of the original Avro file, i.e. the generated system columns plus the Body column that holds the actual data.


UmaMahesh1
Honored Contributor III

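You can build your schema from the JSON strings themselves like this, where columnNamewithValue stands for the string column that holds the JSON:

schema = spark.read.json(capture_tmp.rdd.map(lambda row: row.columnNamewithValue)).schema

Then parse that column with the inferred schema:

df.withColumn('new_col', from_json(col('yourcolumnname'), schema))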
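Put together, the suggestion above could be sketched as follows. This is a minimal illustration under assumptions, not a tested recipe: the helper names, the `sample_path` argument, and the sample size are hypothetical. The sample is loaded with a plain batch read because `.rdd` cannot be used on a streaming DataFrame, and Body is cast to a string first so that the inference runs on the JSON payload rather than on the Avro envelope.

```python
def infer_body_schema(spark, sample_path, sample_size=100):
    """Infer the JSON schema of Body from a small batch of capture files."""
    sample = (
        spark.read.format("avro")
        .load(sample_path)
        .limit(sample_size)
        # Body is binary in the capture files; cast it to a JSON string.
        .selectExpr("CAST(Body AS STRING) AS body_json")
    )
    # spark.read.json accepts an RDD of JSON strings and infers a StructType.
    return spark.read.json(sample.rdd.map(lambda row: row.body_json)).schema


def parse_body(df, schema):
    """Parse Body with the inferred schema and expand it into columns."""
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql.functions import col, from_json

    return (
        df.withColumn("parsed", from_json(col("Body").cast("string"), schema))
        .select("parsed.*")
    )
```

The schema returned by `infer_body_schema` only reflects the fields present in the sampled records, so a larger `sample_size` may be needed if the payload shape varies between events.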

Kaniz
Community Manager

Hi @Gil Gonong, we haven't heard from you since the last response from @Uma Maheswara Rao Desula, and I was checking back to see if her suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
