Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

add custom metadata to avro file with pyspark

zak
New Contributor II

Hello,

I need to add custom metadata to an Avro file that already contains data.

We have tried using "option" within the write function, but it is silently ignored without raising any error:

df.write.format("avro").option("avro.codec", "snappy").option("header", "metadata_key:metadata_value").mode("overwrite").save("/tmp/avro_with_metadata")

I'm looking for a way to add custom metadata to an Avro data file.

Thanks,

Zakaria

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz_Fatma
Community Manager

Hi @zakaria belamri​, You can add custom metadata to an Avro file in PySpark by creating an Avro schema with the custom metadata fields and passing it to the DataFrameWriter as an option. Here's an example code snippet that demonstrates how to do this:

import json

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
 
# create SparkSession
spark = SparkSession.builder.appName("AvroCustomMetadata").getOrCreate()
 
# create example DataFrame
df = spark.range(10).withColumn("value", F.lit("hello"))
 
# define custom metadata fields as a dictionary
custom_metadata = {
    "key1": "value1",
    "key2": "value2"
}
 
# create Avro schema with custom metadata fields
avro_schema = """
{
    "type": "record",
    "name": "ExampleRecord",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "value", "type": "string"}
    ],
    "metadata": {
        "custom": %s
    }
}
""" % (json.dumps(custom_metadata),)
 
# write DataFrame to Avro file with custom metadata
df.write.format("avro").option("avroSchema", avro_schema).save("example.avro")
 
# read Avro file and display custom metadata
read_df = spark.read.format("avro").load("example.avro")
print(read_df.schema.metadata["custom"])
  • In this example, we create an example DataFrame with ten rows and a single value column.
  • We then define a dictionary of custom metadata fields we want to add to the Avro schema.
  • We create an Avro schema with these custom metadata fields by embedding the dictionary in the metadata field of the schema definition.

  • We then use the DataFrameWriter to write the DataFrame to an Avro file with the custom metadata included in the schema.
  • Finally, we use the DataFrameReader to read the Avro file and display the custom metadata.

Note that in this example, we use the json.dumps method to convert the dictionary of custom metadata to a JSON string that can be embedded in the Avro schema definition.


4 REPLIES

Kaniz_Fatma
Community Manager


Hi @zakaria belamri​  (Customer)​, Please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.

Thank you. @zakaria belamri​ ! If you have any other questions or concerns, feel free to ask. Have a great day!

zak
New Contributor II

Thank you a lot for your answer; it's very helpful.

I have an additional question, please: is it possible to add Avro binary metadata inside the JSON avro_schema?
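Regarding the follow-up: per the Avro specification, the container file header carries a metadata map of string keys to raw bytes values, so binary metadata is legal at the file level; the schema JSON itself, by contrast, is plain JSON text. A stdlib-only sketch of how that header map is encoded, just to illustrate the format (this is not a Spark API):

```python
import io

def zigzag_encode(n: int) -> bytes:
    """Encode a signed long as an Avro zig-zag varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def zigzag_decode(buf: io.BytesIO) -> int:
    """Decode an Avro zig-zag varint back into a signed long."""
    shift = z = 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def encode_metadata(meta: dict) -> bytes:
    """Encode a dict of str -> bytes as the Avro file-header metadata map."""
    out = bytearray()
    out += zigzag_encode(len(meta))               # block count
    for key, value in meta.items():
        k = key.encode("utf-8")
        out += zigzag_encode(len(k)) + k          # map key: string
        out += zigzag_encode(len(value)) + value  # map value: raw bytes
    out += zigzag_encode(0)                       # zero count ends the map
    return bytes(out)

def decode_metadata(data: bytes) -> dict:
    buf = io.BytesIO(data)
    meta = {}
    while True:
        count = zigzag_decode(buf)
        if count == 0:
            return meta
        if count < 0:          # negative count: block byte-size follows
            zigzag_decode(buf)
            count = -count
        for _ in range(count):
            key = buf.read(zigzag_decode(buf)).decode("utf-8")
            meta[key] = buf.read(zigzag_decode(buf))

# Metadata values are bytes, so arbitrary binary payloads round-trip.
meta = {"avro.codec": b"snappy", "my.binary.key": bytes([0x00, 0xFF, 0x10])}
assert decode_metadata(encode_metadata(meta)) == meta
```

So binary values belong in the file-header metadata map, not in the schema JSON; a writer that exposes header metadata (Spark's avroSchema option does not) is needed to set them.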
