Hi, I have a streaming source loading data to a raw table, which has a string type col (whose value is JSON) to hold all data. I want to use colon sign operator to get fields from the JSON string. Is this going to have some perf issues vs. I use a schema+from_json function?
Hi @Brad, When dealing with JSON data in your streaming source, you have a couple of options for extracting fields.
Letโs explore both approaches:
Colon Sign Operator (:) for JSON Fields:
The colon sign operator allows you to directly access fields within a JSON string. For example, if your JSON string looks like this: {"name": "Alice", "age": 30}, you can extract the โnameโ field using json_column.name.
Pros:
Simplicity: Itโs straightforward and concise.
No need to define a schema explicitly.
Cons:
Performance: Using the colon sign operator may have some performance overhead, especially when dealing with large datasets. Parsing the JSON string for each record can be computationally expensive.
Lack of type validation: The colon sign operator does not validate the data types of the extracted fields.
Use Case:
If youโre dealing with small datasets and simplicity is your priority, the colon sign operator can be convenient.
Schema + from_json Function:
In this approach, you define a schema for your JSON data and use the from_json function to parse the JSON string into a structured format.
Pros:
Performance: Defining a schema and using from_json can be more efficient, especially for large datasets. The schema provides type validation and optimization.
Type safety: You can enforce data types for each field.
Cons:
Slightly more verbose: You need to define the schema explicitly.
Use Case:
If performance and type safety are critical, especially for production-grade systems, using a schema and from_json is recommended.
Yes, I did some testing. With some schema, I read the same data source and write the parsing results to diff tables. For 586K rows, the perf diff is 9sec vs. 37sec. For 2.3 million rows, 16sec vs. 133sec.
Welcome to Databricks Community: Lets learn, network and celebrate together
Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections.