<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Validate a schema of json in column in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34562#M25304</link>
    <description>&lt;P&gt;What I was trying to say is that json.loads will not work on a Spark column (I was not very clear).&lt;/P&gt;&lt;P&gt;A column is not just a list of values, so the for-loop will not work either.&lt;/P&gt;&lt;P&gt;Instead you should use a Spark function to check the validity of the JSON string, e.g. to_json.&lt;/P&gt;&lt;P&gt;And what I mean by passing in the df as a function parameter is just def is_json(df, ...).&lt;/P&gt;&lt;P&gt;It is sometimes necessary to work with a column name (so not the column itself, only the name) as well as with the column itself (so the actual df column with the values).&lt;/P&gt;&lt;P&gt;If that is the case, you also have to bring the DF into the picture.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 25 Nov 2021 08:02:18 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-11-25T08:02:18Z</dc:date>
    <item>
      <title>Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34552#M25294</link>
      <description>&lt;P&gt;I have a dataframe like below with col2 holding key-value pairs. I would like to filter col2 down to only the rows with a valid schema.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="df"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2303iF9362E706756E2AC/image-size/large?v=v2&amp;amp;px=999" role="button" title="df" alt="df" /&gt;&lt;/span&gt;There could be many pairs, sometimes fewer, sometimes more, and this is fine as long as the structure is fine. Nulls in col2 are also allowed.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The invalid values look like cases 4 and 5, where one of "name" or "value" is missing or the brackets [] are missing.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;schema:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;[
{
"name": "aa",
"value": "abc"
},
{
"name": "bb",
"value": "12"
},
{
"name": "cc",
"value": "3"
}
]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data sample:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;col1	col2
1	        [{"name":"aaa","value":"5"},{"name":"bbb","value":"500"},{"name":"ccc","value":"300"}]
2	        [{"name":"aaa","value":"5"},{"name":"bbb","value":"500"}]
3	
4	        {"name":"aaa","value":"5"},{"name":"bbb","value":"500"}
5	        [{"name":"aaa"},{"name":"bbb","value":"500"}]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Nov 2021 14:50:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34552#M25294</guid>
      <dc:creator>Braxx</dc:creator>
      <dc:date>2021-11-23T14:50:41Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34554#M25296</link>
      <description>&lt;P&gt;When reading file sources you can set a 'badRecordsPath':&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html#input-file-contains-bad-record" alt="https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html#input-file-contains-bad-record" target="_blank"&gt;https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html#input-file-contains-bad-record&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Nov 2021 16:12:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34554#M25296</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-23T16:12:22Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34555#M25297</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;From what I read, this method requires CSV or JSON as a source and is applied while reading from a file. The schema must be defined for the whole file, not a particular column. In my case, I have a dataframe with a JSON column, so it looks like my case is different.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 08:53:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34555#M25297</guid>
      <dc:creator>Braxx</dc:creator>
      <dc:date>2021-11-24T08:53:19Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34556#M25298</link>
      <description>&lt;P&gt;So you have this data already written in parquet?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 09:04:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34556#M25298</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-24T09:04:01Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34557#M25299</link>
      <description>&lt;P&gt;The data comes from a Hive table created by another process. I could have it in Parquet if necessary.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 09:36:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34557#M25299</guid>
      <dc:creator>Braxx</dc:creator>
      <dc:date>2021-11-24T09:36:01Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34558#M25300</link>
      <description>&lt;P&gt;Not a literal answer:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html" alt="https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html" target="_blank"&gt;https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html" alt="https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html" target="_blank"&gt;https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can try to achieve this by defining a schema and then reading the table, or with a more advanced technique like regex.&lt;/P&gt;&lt;P&gt;There are quite a few options when working with nested data, as you will see.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You could even work with Delta Lake and CHECK constraints.&lt;/P&gt;&lt;P&gt;(https://docs.microsoft.com/en-us/azure/databricks/delta/delta-constraints)&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 10:25:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34558#M25300</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-24T10:25:29Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34559#M25301</link>
      <description>&lt;P&gt;To verify that the structure is correct I tried json.loads. It should raise an error when the structure is wrong and return True when it is correct. Then I would keep only the True rows. Unfortunately I am getting an error with the code below, which could work for me once solved.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;"TypeError: Column is not iterable"&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;def is_json(myjson):
  try:
    for x in myjson:
      json.loads(x)
  except ValueError as e:
    return False
  return True
&amp;nbsp;
df = data\
  .withColumn("Col3", is_json(col("Col2")))\
  .filter(col("Col3") == True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 14:12:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34559#M25301</guid>
      <dc:creator>Braxx</dc:creator>
      <dc:date>2021-11-24T14:12:21Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34560#M25302</link>
      <description>&lt;P&gt;You want to loop over a DataFrame column; that is the issue, as Column is not iterable.&lt;/P&gt;&lt;P&gt;If you pass the column name instead of the column itself, it should get past that error.&lt;/P&gt;&lt;P&gt;But json.loads might not know what to do with it; in that case you could also add the dataframe as a function parameter.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 15:37:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34560#M25302</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-24T15:37:49Z</dc:date>
    </item>
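(Editor's note, not part of the original thread: the row-level check the question is after can be expressed as a plain Python function and then wrapped in a PySpark UDF so Spark applies it to each value of the column, instead of looping over the Column object. A minimal sketch; the function name is hypothetical:)

```python
import json

def is_valid_json(s):
    """Return True if s parses as JSON; nulls are treated as allowed."""
    if s is None:
        return True
    try:
        json.loads(s)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

# To apply this per row on a Spark column, wrap it in a UDF (requires pyspark):
# from pyspark.sql.functions import udf, col
# from pyspark.sql.types import BooleanType
# df = data.withColumn("col3", udf(is_valid_json, BooleanType())(col("col2")))
```

On the sample data this returns True for rows 1-3 and False for row 4, whose concatenated objects are not a single valid JSON document.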
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34561#M25303</link>
      <description>&lt;P&gt;Thanks. I don't really understand what you mean by "add the dataframe as a function parameter". Could you draft it? I could then test it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 17:52:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34561#M25303</guid>
      <dc:creator>Braxx</dc:creator>
      <dc:date>2021-11-24T17:52:20Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34562#M25304</link>
      <description>&lt;P&gt;What I was trying to say is that json.loads will not work on a Spark column (I was not very clear).&lt;/P&gt;&lt;P&gt;A column is not just a list of values, so the for-loop will not work either.&lt;/P&gt;&lt;P&gt;Instead you should use a Spark function to check the validity of the JSON string, e.g. to_json.&lt;/P&gt;&lt;P&gt;And what I mean by passing in the df as a function parameter is just def is_json(df, ...).&lt;/P&gt;&lt;P&gt;It is sometimes necessary to work with a column name (so not the column itself, only the name) as well as with the column itself (so the actual df column with the values).&lt;/P&gt;&lt;P&gt;If that is the case, you also have to bring the DF into the picture.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Nov 2021 08:02:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34562#M25304</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-25T08:02:18Z</dc:date>
    </item>
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34563#M25305</link>
      <description>&lt;P&gt;Have this finally resolved.&lt;/P&gt;&lt;P&gt;Corrupted rows are flagged with 1 and can then be easily filtered out.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# define a schema for col2
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import from_json, when, col

json_schema = ArrayType(StructType([StructField("name", StringType(), nullable=True), StructField("value", StringType(), nullable=True)]))

# from_json is used to validate whether col2 matches the schema. If yes -&amp;gt; correct_json = col2; if no -&amp;gt; correct_json = null.
# null is the default value returned by from_json when a valid json could not be created.
# rows with corrupted json are flagged with 1 by comparing col2 before and after validation:
# if col2 was not null but correct_json is null, the json is corrupted.
df = data\
  .withColumn("correct_json", from_json(col("col2"), json_schema))\
  .withColumn("json_flag", when(col("col2").isNotNull() &amp;amp; col("correct_json").isNull(), 1).otherwise(0))\
  .drop("correct_json")

display(df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 01 Dec 2021 10:41:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34563#M25305</guid>
      <dc:creator>Braxx</dc:creator>
      <dc:date>2021-12-01T10:41:11Z</dc:date>
    </item>
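(Editor's note, not part of the original thread: one caveat with the accepted answer is that from_json tends to fill missing struct fields with null rather than returning null for the whole value, so a row like case 5, where "value" is absent, may not be flagged. A stricter pure-Python structure check, with a hypothetical helper name, could be applied via a UDF instead:)

```python
import json

def matches_schema(s):
    """True if s is null or a JSON array of objects each carrying 'name' and 'value'."""
    if s is None:
        return True  # nulls in col2 are allowed per the question
    try:
        data = json.loads(s)
    except ValueError:
        return False
    return isinstance(data, list) and all(
        isinstance(item, dict) and {"name", "value"} <= item.keys()
        for item in data
    )
```

Applied to the question's five sample rows, this accepts rows 1-3 and rejects rows 4 (missing brackets) and 5 (missing "value").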
    <item>
      <title>Re: Validate a schema of json in column</title>
      <link>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34564#M25306</link>
      <description>&lt;P&gt;@Bartosz Wachocki - Thank you for sharing your solution and marking it as best.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Dec 2021 16:41:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/validate-a-schema-of-json-in-column/m-p/34564#M25306</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-12-01T16:41:53Z</dc:date>
    </item>
  </channel>
</rss>

