Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

read base64 json column with Autoloader and inferschema.

MerelyPerfect
New Contributor II

I have JSON files landing in our blob storage with two fields: 1. offset (integer), 2. value (base64).

The value column is JSON with Unicode characters, so it was sent as base64. The challenge is that this JSON is very large, with 100+ fields, so we cannot define the full schema. We can only provide some schema hints, so Autoloader is the best fit.

I tried Autoloader with schema hints and other options. It always picks up value as a string and cannot parse it back to JSON without being given the schema. I want Databricks to infer the schema. Any help is much appreciated. Thanks in advance.

Attached the sample JSON!

3 REPLIES

Anonymous
Not applicable

@MerelyPerfect Per:

When using Autoloader (Auto Loader) in Databricks, schema inference is based on a sample of the input files. If your JSON files have a consistent structure, set "cloudFiles.inferColumnTypes" to "true" in the Auto Loader options so that inferred columns are typed rather than defaulting to strings, and use "cloudFiles.schemaHints" for any fields you want pinned to a specific type.

Here's an example of how you can use Auto Loader with schema inference in Databricks:

python

from pyspark.sql.functions import col, unbase64

# Auto Loader options: declare the file format, a location to persist the
# inferred schema, and type hints for the two top-level fields
options = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "abfss://mycontainer@myaccount.dfs.core.windows.net/schemas/myfolder",
    "cloudFiles.schemaHints": "offset int, value string",
    "cloudFiles.inferColumnTypes": "true"
}

# Load the data using Auto Loader (a streaming source)
df = (
    spark.readStream.format("cloudFiles")
    .options(**options)
    .load("abfss://mycontainer@myaccount.dfs.core.windows.net/myfolder/")
)

# Decode the "value" column from base64 to its JSON text
df = df.withColumn("value", unbase64(col("value")).cast("string"))

# Preview the resulting stream in a notebook
display(df)

In this example, the Auto Loader options declare the file format, pin the types of the two top-level fields, "offset" and "value", and set "cloudFiles.inferColumnTypes" to "true", which tells Auto Loader to infer the types of the remaining columns from the data.

We then load the data using Auto Loader and decode the "value" column from base64 to its JSON text using the unbase64 function. Finally, we display the resulting stream.

Note that the decoded "value" is still a JSON string: parsing it into a struct with from_json requires a schema, because Auto Loader infers schemas at the file level and cannot look inside a base64-encoded column. If the payload has a structure that cannot be inferred, you may need to define its schema manually or through schema hints.

The JSON is complex, nested 10 to 15 levels deep, so manually defining the JSON schema is impossible and not maintainable, since it changes.

The above code doesn't work either.

I need to think of another way.

Can I rewrite the JSON message to another JSON file, just by converting the binary to JSON?

Whatever I try, it converts the binary data and writes it as a string instead of JSON. How can I write it as JSON without defining the schema for the JSON?

Every approach I have tried writes the base64 value as an escaped string.

e.g., the output looks like

{
  "offset": 1,
  "value": "{\"key1\": \"value\"..................}"
}
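A minimal plain-Python sketch of why that happens (using a hypothetical one-field payload): decoding base64 yields the JSON text as a string, and until that string is run through a JSON parser it re-serializes as an escaped string, which is exactly the output shown above.

```python
import base64
import json

# Hypothetical record as it lands in blob storage
inner = json.dumps({"key1": "value"})
record = {"offset": 1,
          "value": base64.b64encode(inner.encode("utf-8")).decode("ascii")}

# Decoding base64 yields the JSON *text*, still just a string
decoded = base64.b64decode(record["value"]).decode("utf-8")

# Writing the record back without parsing re-escapes the string
escaped = json.dumps({"offset": record["offset"], "value": decoded})
print(escaped)  # {"offset": 1, "value": "{\"key1\": \"value\"}"}

# Parsing the decoded text first produces a real nested JSON object
nested = json.dumps({"offset": record["offset"], "value": json.loads(decoded)})
print(nested)   # {"offset": 1, "value": {"key1": "value"}}
```

The same distinction applies in Spark: `unbase64(...).cast("string")` only gives you the decoded text, and turning it into a struct still requires parsing it with a schema (e.g. via `from_json`).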

Anonymous
Not applicable

Hi @MerelyPerfect Per

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
