Databricks

MerelyPerfect · ‎03-24-2023

I have json files falling in our blob with two fields, 1. offset(integer), 2. value(base64).

This value column is json with unicode. so they sent it as base64. Challenge is this json is very large with 100+ fields. so we cannot define the schema. We can only have some schemahints. so Autoloader is the best fit.

I tried with Autoloader with schemahints and other options. It always picks value as string and unable to parse it back to json without providing the schema. I want databricks to infer the schema. Any help is much appreciated. Thanks Advance.

Attached the sameple json!

Anonymous · ‎03-24-2023

@MerelyPerfect Per :

When using Autoloader in Databricks, the schema inference is done based on the first few rows of the data. If your JSON files have a consistent structure, you can try setting the "inferSchema" option to "true" in the Autoloader options. This will make Autoloader attempt to infer the schema from the first few rows of the data.

Here's an example of how you can use Autoloader with schema inference in Databricks:

python

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
 
# Define the schema hints
schema_hints = StructType([
    StructField("offset", IntegerType()),
    StructField("value", StringType())
])
 
# Define the Autoloader options
options = {
  "schema": schema_hints.json(),
  "inferSchema": "true"
}
 
# Load the data using Autoloader
df = spark.read.format("cloudFiles") \
    .options(**options) \
    .load("abfss://mycontainer@myaccount.dfs.core.windows.net/myfolder/*.json")
 
# Parse the "value" column from base64 to JSON
df = df.withColumn("value", from_json(col("value").cast("string"), schema_hints["value"].dataType))
 
# Show the resulting DataFrame
df.show()

In this example, we first define the schema hints for the two fields, "offset" and "value". Then, we define the Autoloader options and set "inferSchema" to "true". This tells Autoloader to attempt to infer the schema from the data.

We then load the data using Autoloader and parse the "value" column from base64 to JSON using the

from_json function. Finally, we show the resulting DataFrame.

If your JSON files have a more complex structure that cannot be inferred from the first few rows of the data, you may need to manually define the schema using the schema hints.

MerelyPerfect · ‎03-26-2023

The json is complex with nested of 10 to 15 levels. so manually defining the json schema is impossible and not maintainable since it changes.

The above code doesnt work either.

I need to think of another way.

Can I rewrite json message to another json file, just by convert the binary to json.

Whatever I try, it converts the binary data and write it as string instead of json. how can I write it as json without defining the schema for the json.

All possible way writes the base64 value as escaped string.

eg. output looks like

{

"offset": 1,

"value": "{\"key1\": \"value\"..................}"

}

Anonymous · ‎03-25-2023

Hi @MerelyPerfect Per

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.

We'd love to hear from you.

Thanks!

Databricks

read base64 json column with Autoloader and inferschema.

Registration now open! Databricks Data + AI Summit 2024

Meet DBRX, the New Standard for High-Quality LLMs

Data Warehousing in the Era of AI