add custom metadata to avro file with pyspark

zak
New Contributor II

Hello,

I need to add custom metadata to an Avro file. The Avro file contains data.

We have tried to use "option" within the write function, but it is not taken into account and no error is generated.

df.write.format("avro").option("avro.codec", "snappy").option("header", "metadata_key:metadata_value").mode("overwrite").save("/tmp/avro_with_metadata")

I'm seeking a solution to add custom metadata to an Avro data file.

Thanks,

Zakaria

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager

Hi @zakaria belamri, you can add custom metadata to an Avro file in PySpark by creating an Avro schema with the custom metadata fields and passing it to the DataFrameWriter as an option. Here's an example code snippet that demonstrates how to do this:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import json
 
# create SparkSession
spark = SparkSession.builder.appName("AvroCustomMetadata").getOrCreate()
 
# create example DataFrame
df = spark.range(10).withColumn("value", F.lit("hello"))
 
# define custom metadata fields as a dictionary
custom_metadata = {
    "key1": "value1",
    "key2": "value2"
}
 
# create Avro schema with custom metadata fields
avro_schema = """
{
    "type": "record",
    "name": "ExampleRecord",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "value", "type": "string"}
    ],
    "metadata": {
        "custom": %s
    }
}
""" % (json.dumps(custom_metadata),)
 
# write DataFrame to Avro file with custom metadata
df.write.format("avro").option("avroSchema", avro_schema).save("example.avro")
 
# read the Avro file back; note that Spark does not expose the schema-level
# custom attribute on read_df.schema, so verify it in the file header instead
read_df = spark.read.format("avro").load("example.avro")
read_df.show()
  • In this example, we create an example DataFrame with ten rows and a single value column.
  • We then define a dictionary of custom metadata fields we want to add to the Avro schema.
  • We create an Avro schema with these custom metadata fields by embedding the dictionary as a "metadata" attribute of the schema definition.
  • We then use the DataFrameWriter to write the DataFrame to an Avro file with the custom metadata included in the writer schema.
  • Finally, we use the DataFrameReader to read the Avro file back; the custom metadata itself lives in the file's header schema, and one way to inspect it is sketched below.

Note that in this example, we use the json.dumps function to convert the dictionary of custom metadata to a JSON string that can be embedded in the Avro schema definition.
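
If you want to confirm that the custom attribute actually made it into the written files, one option is to open one of the part files with an Avro library and inspect the writer schema stored in the file header. Below is a minimal sketch, assuming the fastavro package is installed and that the output path and part-file pattern are the ones from the example above (on Databricks you would point open() at a path it can reach, for example under /dbfs). Whether spark-avro preserves the extra "metadata" attribute in the writer schema can depend on the library version, so treat this as a way to verify rather than a guarantee:

import glob
import json
 
import fastavro
 
# Spark writes a directory of part files; pick one of them
# (the "example.avro" path comes from the example above)
part_file = glob.glob("example.avro/part-*.avro")[0]
 
with open(part_file, "rb") as fo:
    reader = fastavro.reader(fo)
    # the Avro container header stores the writer schema as a JSON string
    writer_schema = json.loads(reader.metadata["avro.schema"])
 
# if the extra attribute was preserved, it should appear here
print(writer_schema.get("metadata"))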


4 REPLIES 4

Kaniz
Community Manager

Hi @zakaria belamri, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.

Kaniz
Community Manager

Thank you, @zakaria belamri! If you have any other questions or concerns, feel free to ask. Have a great day!

zak
New Contributor II

Thank you a lot for your answer, it's very helpful.

I have an additional question, please: is it possible to add Avro binary metadata inside the JSON avro_schema?
