PySpark Structured Streaming Avro integration with Azure Schema Registry and Kafka/Event Hubs in a Databricks environment

scalasparkdev
New Contributor

I am looking for a simple way to build a Structured Streaming pipeline that automatically registers a schema with Azure Schema Registry when converting a DataFrame column to Avro, and that can deserialize an Avro column based on the schema registry URL.

For example, when using Scala with the Confluent schema registry, there is a great Structured Streaming integration library from Absa: https://github.com/AbsaOSS/ABRiS

Ideally, I am looking for something similar on Azure that could be used from Python with Azure Schema Registry.

I found this Databricks guide, https://docs.databricks.com/structured-streaming/avro-dataframe.html#language-python, which looks pretty close, but it only covers integration with the Confluent schema registry, not how to use and authenticate against an Azure one.
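For reference, the Python example in that guide talks to a Confluent registry roughly like this (argument names as shown in the guide; the registry address, topic, and subject below are placeholders), and what I'm after is the Azure Schema Registry equivalent:

from pyspark.sql.functions import col
from pyspark.sql.avro.functions import from_avro

# Placeholder Confluent-compatible registry address from the guide's example.
schema_registry_addr = "https://myhost:8081"

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "t")
    .load()
)

# Databricks-specific variant: the schema is resolved from the registry via
# subject + schemaRegistryAddress instead of an inline JSON schema string.
decoded = df.select(
    from_avro(
        data=col("value"),
        options={},
        subject="t-value",
        schemaRegistryAddress=schema_registry_addr,
    ).alias("value")
)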

On a semi-related note, I have issues importing the correct to_avro / from_avro functions on Azure Databricks; trying to pass a schemaRegistryUrl to them raises:

TypeError: to_avro() takes from 1 to 2 positional arguments but 3 were given

so these seem to be the vanilla Spark Avro functions, not the Databricks ones?

My environment is Azure Databricks - DBR 11.3 LTS

I am importing them via

from pyspark.sql.avro.functions import from_avro, to_avro

Are they on some different path, or in a shaded JAR?
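For completeness, the vanilla usage that does work for me only takes a JSON schema string, which matches the error above. A minimal sketch with a made-up schema and columns:

from pyspark.sql.functions import col, struct
from pyspark.sql.avro.functions import from_avro, to_avro

df = spark.createDataFrame([(1, "a")], ["id", "name"])

# Hypothetical Avro schema, for illustration only.
value_schema = """
{
  "type": "record",
  "name": "Value",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

# Open-source to_avro: 1 to 2 positional arguments (column, optional JSON schema),
# so passing a third schemaRegistryUrl argument raises the TypeError above.
encoded = df.select(to_avro(struct(col("id"), col("name")), value_schema).alias("value"))

# Open-source from_avro: column + JSON schema string (+ optional options dict).
decoded = encoded.select(from_avro(col("value"), value_schema, {"mode": "PERMISSIVE"}).alias("v"))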

Thanks!

2 REPLIES

Anonymous
Not applicable

@Tomas Sedlon:

It sounds like you're looking for a way to integrate Azure Schema Registry with your Python-based structured streaming pipeline in Databricks, and you've found some resources that are close to what you need but not quite there yet.

Regarding the from_avro() and to_avro() functions: it's possible that the ones you're importing are not the Databricks-specific variants that support schema registry integration. One way to verify this is to inspect the signatures of the functions you imported and check the documentation for your DBR version; if the registry-aware variants aren't available there, you may need a different approach for Azure Schema Registry.
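A quick way to verify which variants you actually imported is to inspect them at runtime; a small sketch (nothing Azure-specific here):

import inspect
from pyspark.sql.avro.functions import from_avro, to_avro

# The open-source functions expose (data, jsonFormatSchema, options) and
# (data, jsonFormatSchema); registry-aware builds accept schema registry
# arguments instead of an inline JSON schema.
print(from_avro.__module__)
print(inspect.signature(from_avro))
print(inspect.signature(to_avro))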

As for integrating Azure Schema Registry with your pipeline, I would recommend the Azure Databricks documentation on Azure Event Hubs, which includes a section on Schema Registry: https://docs.databricks.com/spark/latest/structured-streaming/streaming-event-hubs.html#schema-regis...

This should give you an idea of how to set up the integration in your Databricks environment and use it in your pipeline. Additionally, you may want to look into Python libraries for working with Azure Schema Registry, since they can simplify the process, as in the sketch below.
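For example, one pragmatic pattern is to fetch the schema definition from Azure Schema Registry with the azure-schemaregistry SDK and hand it to the vanilla from_avro as a JSON string. This is only a rough sketch: it assumes the azure-schemaregistry and azure-identity packages, uses placeholder namespace, topic, and schema ID values, and whether the payload decodes cleanly depends on how the producer framed it.

from azure.identity import DefaultAzureCredential
from azure.schemaregistry import SchemaRegistryClient
from pyspark.sql.avro.functions import from_avro

# Placeholder Event Hubs namespace; the client authenticates with Azure AD.
registry_client = SchemaRegistryClient(
    fully_qualified_namespace="<namespace>.servicebus.windows.net",
    credential=DefaultAzureCredential(),
)

# Fetch the registered Avro schema definition (a JSON string) by its ID.
avro_schema = registry_client.get_schema("<schema-id>").definition

# Streaming source over the Event Hubs Kafka endpoint (SASL options omitted).
kafka_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "<topic>")
    .load()
)

# Decode the Avro payload with the vanilla from_avro and the fetched schema.
decoded = kafka_df.select(from_avro(kafka_df.value, avro_schema).alias("payload"))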

Hope this suggestion helps you!

Anonymous
Not applicable

Hi @Tomas Sedlon,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
