Data Engineering

[VARIANT_SIZE_LIMIT] Cannot build variant bigger than 16.0 MiB in parse_json

bbastian
New Contributor

I have a table coming from PostgreSQL, with one column containing JSON data in string format. We have been using parse_json to convert that to a VARIANT column, but lately it has started failing with the VARIANT_SIZE_LIMIT error.

When I isolated the row that gives the error, the JSON string contains an array of 75 items (with four fields each). When I saved that column's data as a text/JSON file, it takes only about 30 KB of disk space. When I tried the from_json function with a predefined schema, I was able to convert it to a StructType column successfully.
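Roughly what we are doing, with simplified table and column names, and an element schema that is only illustrative:

-- Current approach: parse the JSON string into a VARIANT column.
-- This is the call that now fails with VARIANT_SIZE_LIMIT on the large row.
SELECT parse_json(json_col) AS json_variant
FROM source_table;

-- Workaround that does succeed: from_json with a predefined schema,
-- returning an ARRAY<STRUCT<...>> instead of a VARIANT.
SELECT from_json(
         json_col,
         'ARRAY<STRUCT<id: BIGINT, name: STRING, qty: INT, price: DOUBLE>>'
       ) AS json_struct
FROM source_table;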

We would like to keep the column as VARIANT, as we don't want it tied to a specific schema. Any suggestions?

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Esteemed Contributor III

Hi @bbastian ,

Unfortunately, as of now there is a strict size limitation: a variant column cannot contain a value larger than 16 MiB.

Variant support in Delta Lake | Databricks on AWS

And to be honest, you cannot compare the size of the JSON string on disk to how it is represented internally in memory, because a lot happens there (e.g. the JSON is tokenized into objects, arrays, keys, and values).

In your case you have 3 options:

1. Store this column as a JSON string. You can keep the data in a single string column using standard JSON formatting and then query fields in the JSON using the : notation (see the sketch after this list). A string column supports values of arbitrary length.

2. Use structs - they provide great read performance, but you lose flexibility (you need to define the schema upfront).

3. Preprocess your data - it depends on your data, but you can try dividing the array of items into 2 or 3 arrays and storing them in separate variant columns (also sketched below). This way you can stay under the 16 MiB size limit.
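To make options 1 and 3 a bit more concrete, here is a rough sketch. The table name, column name and element schema are placeholders - adjust them to your data. Note that the split in option 3 uses from_json once, only to cut the array in half before re-serializing each half:

-- Option 1: keep the raw JSON in a STRING column and read fields with : notation.
SELECT json_col:items[0].name::string  AS first_item_name,
       json_col:items[0].price::double AS first_item_price
FROM source_table;

-- Option 3: split the oversized array into halves and store each half
-- in its own VARIANT column, each staying below the 16 MiB limit.
WITH parsed AS (
  SELECT from_json(
           json_col,
           'ARRAY<STRUCT<id: BIGINT, name: STRING, qty: INT, price: DOUBLE>>'
         ) AS arr
  FROM source_table
)
SELECT parse_json(to_json(slice(arr, 1, 40)))  AS items_part1,  -- first 40 items
       parse_json(to_json(slice(arr, 41, 40))) AS items_part2   -- remaining items
FROM parsed;

The 40/41 split point is just for the ~75-item array from the question; you can compute it dynamically with size(arr) instead.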


