<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: PySpark JSON read with strict schema check and mark the valid and invalid records based on the n in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-json-read-with-strict-schema-check-and-mark-the-valid/m-p/109399#M43300</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/148228"&gt;@sujitmk77&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;To ensure that valid records are processed while invalid records are marked appropriately, you can use the following PySpark code. It reads the JSON files with schema enforcement and handles invalid records by marking them as corrupt.&lt;/P&gt;
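&lt;P&gt;from pyspark.sql.functions import input_file_name&lt;BR /&gt;from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType, TimestampType&lt;/P&gt;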
&lt;P&gt;# Define the schema&lt;BR /&gt;estate_schema = StructType(&lt;BR /&gt;[&lt;BR /&gt;StructField(&lt;BR /&gt;"meta",&lt;BR /&gt;StructType(&lt;BR /&gt;[&lt;BR /&gt;StructField("id", StringType(), False),&lt;BR /&gt;StructField("timestamp", TimestampType(), False),&lt;BR /&gt;StructField("version", IntegerType(), False),&lt;BR /&gt;]&lt;BR /&gt;),&lt;BR /&gt;False,&lt;BR /&gt;),&lt;BR /&gt;StructField(&lt;BR /&gt;"data",&lt;BR /&gt;ArrayType(&lt;BR /&gt;StructType(&lt;BR /&gt;[&lt;BR /&gt;StructField("data_col_1", IntegerType(), False),&lt;BR /&gt;StructField("data_col_2", StringType(), False),&lt;BR /&gt;StructField("data_col_3", IntegerType(), True),&lt;BR /&gt;StructField("data_col_4", IntegerType(), True)&lt;BR /&gt;]&lt;BR /&gt;)&lt;BR /&gt;),&lt;BR /&gt;False&lt;BR /&gt;)&lt;BR /&gt;]&lt;BR /&gt;)&lt;/P&gt;
&lt;P&gt;# Read the JSON files with schema enforcement and handle invalid records&lt;BR /&gt;invalid_df = (&lt;BR /&gt;spark.read.schema(estate_schema)&lt;BR /&gt;.option("mode", "PERMISSIVE")&lt;BR /&gt;.option("columnNameOfCorruptRecord", "_corrupt_record")&lt;BR /&gt;.option("multiline", "true")&lt;BR /&gt;.json("/data/json_files/")&lt;BR /&gt;.withColumn("src_filename", input_file_name())&lt;BR /&gt;)&lt;/P&gt;
&lt;P&gt;# Show the DataFrame with invalid records marked&lt;BR /&gt;invalid_df.show(truncate=False)&lt;/P&gt;</description>
    <pubDate>Fri, 07 Feb 2025 13:11:59 GMT</pubDate>
    <dc:creator>Alberto_Umana</dc:creator>
    <dc:date>2025-02-07T13:11:59Z</dc:date>
    <item>
      <title>PySpark JSON read with strict schema check and mark the valid and invalid records based on the non-n</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-json-read-with-strict-schema-check-and-mark-the-valid/m-p/109362#M43287</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a use case where I have to read JSON files from the "/data/json_files/" location with an enforced schema.&lt;BR /&gt;For completeness, we want to mark the invalid records. A record may be invalid because a mandatory field is null, a data type does not match, or the JSON itself is invalid.&lt;/P&gt;&lt;P&gt;I have tried the approaches below, but nothing has worked so far. It would be great to hear from anyone who has already handled this use case and has a solution, or who is knowledgeable in this area.&lt;/P&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Example Schema:&lt;/DIV&gt;&lt;DIV&gt;schema = StructType(&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; [&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField(&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "meta",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructType(&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; [&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField("id", StringType(), False),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField("timestamp", TimestampType(), False),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField("version", IntegerType(), False),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ]&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; False,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField(&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ArrayType(&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructType(&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; [&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField("data_col_1", IntegerType(), False),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField("data_col_2", StringType(), False),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; StructField("data_col_3", IntegerType(), True),&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;StructField("data_col_4", IntegerType(), True)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ]&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ),&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; False&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; ]&lt;/DIV&gt;&lt;DIV&gt;)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;JSON 
file:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;json_1.json&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;"data_col_4" is having wrong data type.&lt;/DIV&gt;&lt;DIV&gt;"data_col_2" is mandatory as per schema but got null.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;{&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; "meta": {&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "id": "abcd1234",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "timestamp": "2025-02-07T07:59:12.123Z",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "version": 1,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; },&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; "tasks": [&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; {&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_1": 12,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_2": "Required",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_3": 9,&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"data_col_4": 7&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; },&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_1": 13,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_2": "Required",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_3": 10,&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"data_col_4": "Wrong data type"&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; },&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_1": 14,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
&amp;nbsp; &amp;nbsp; "data_col_2": null,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_3": 11,&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"data_col_4": 8&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; }&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; ]&lt;/DIV&gt;&lt;DIV&gt;}&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;json_2.json&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;the "data_col_1" is missing in the tasks.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;{&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; "meta": {&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "id": "efgh5678",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "timestamp": "2025-02-07T07:59:12.123Z",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "version": 1,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; },&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; "tasks": [&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; {&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_2": "Required",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_3": 9,&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"data_col_4": 7,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; },&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_1": 22,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_2": "Required",&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "data_col_3": 10,&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"data_col_4": 11&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; }&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; 
]&lt;/DIV&gt;&lt;DIV&gt;}&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;PySpark Code:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;raw_df = (&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; spark.read.schema(estate_schema)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .option("mode", "PERMISSIVE")&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .option("multiline", "true")&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .json("/data/json_files/")&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withColumn("src_filename", input_file_name())&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;OR&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;invalid_df = (&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;spark.read.schema(estate_schema)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;.option("mode", "PERMISSIVE")&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp;.option("columnNameOfCorruptRecord", "_corrupt_record")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp;.option("multiline", "true")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp;.json("/data/json_files/")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp;.withColumn("src_filename", input_file_name())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Expected Outcome:&lt;/DIV&gt;&lt;DIV&gt;All the valid records of meta and within the tasks array 
should be processed, and invalid ones (missing mandatory field, incorrect data type, or invalid JSON) should be marked as invalid for those particular records.&lt;/DIV&gt;</description>
      <pubDate>Fri, 07 Feb 2025 08:29:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-json-read-with-strict-schema-check-and-mark-the-valid/m-p/109362#M43287</guid>
      <dc:creator>sujitmk77</dc:creator>
      <dc:date>2025-02-07T08:29:56Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark JSON read with strict schema check and mark the valid and invalid records based on the n</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-json-read-with-strict-schema-check-and-mark-the-valid/m-p/109399#M43300</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/148228"&gt;@sujitmk77&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;To ensure that valid records are processed while invalid records are marked appropriately, you can use the following PySpark code. It reads the JSON files with schema enforcement and handles invalid records by marking them as corrupt.&lt;/P&gt;
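&lt;P&gt;from pyspark.sql.functions import input_file_name&lt;BR /&gt;from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType, TimestampType&lt;/P&gt;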
&lt;P&gt;# Define the schema&lt;BR /&gt;estate_schema = StructType(&lt;BR /&gt;[&lt;BR /&gt;StructField(&lt;BR /&gt;"meta",&lt;BR /&gt;StructType(&lt;BR /&gt;[&lt;BR /&gt;StructField("id", StringType(), False),&lt;BR /&gt;StructField("timestamp", TimestampType(), False),&lt;BR /&gt;StructField("version", IntegerType(), False),&lt;BR /&gt;]&lt;BR /&gt;),&lt;BR /&gt;False,&lt;BR /&gt;),&lt;BR /&gt;StructField(&lt;BR /&gt;"data",&lt;BR /&gt;ArrayType(&lt;BR /&gt;StructType(&lt;BR /&gt;[&lt;BR /&gt;StructField("data_col_1", IntegerType(), False),&lt;BR /&gt;StructField("data_col_2", StringType(), False),&lt;BR /&gt;StructField("data_col_3", IntegerType(), True),&lt;BR /&gt;StructField("data_col_4", IntegerType(), True)&lt;BR /&gt;]&lt;BR /&gt;)&lt;BR /&gt;),&lt;BR /&gt;False&lt;BR /&gt;)&lt;BR /&gt;]&lt;BR /&gt;)&lt;/P&gt;
&lt;P&gt;# Read the JSON files with schema enforcement and handle invalid records&lt;BR /&gt;invalid_df = (&lt;BR /&gt;spark.read.schema(estate_schema)&lt;BR /&gt;.option("mode", "PERMISSIVE")&lt;BR /&gt;.option("columnNameOfCorruptRecord", "_corrupt_record")&lt;BR /&gt;.option("multiline", "true")&lt;BR /&gt;.json("/data/json_files/")&lt;BR /&gt;.withColumn("src_filename", input_file_name())&lt;BR /&gt;)&lt;/P&gt;
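&lt;P&gt;As a rough illustration of the per-element checks described in the question (a mandatory field that is missing or null, or a value of the wrong type), here is a plain-Python sketch that does not use Spark. The names TASK_RULES and validate_task are hypothetical, the rules simply mirror the nullability flags of the array element struct in the schema above, and the sample elements are taken from the JSON files in the question. It is only a sketch of the validation rule itself, not a description of how PERMISSIVE mode behaves.&lt;/P&gt;

```python
import json

# Hypothetical rule table mirroring the array element struct in the schema:
# field name, expected Python type, and whether the field is mandatory.
TASK_RULES = [
    ("data_col_1", int, True),
    ("data_col_2", str, True),
    ("data_col_3", int, False),
    ("data_col_4", int, False),
]


def validate_task(task):
    """Return a list of problems found in one array element (empty if valid)."""
    problems = []
    for name, expected_type, mandatory in TASK_RULES:
        value = task.get(name)
        if value is None:
            # Missing keys and explicit nulls are treated the same way.
            if mandatory:
                problems.append(name + " is mandatory but missing or null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(name + " has the wrong data type")
    return problems


# Array elements copied from json_1.json and json_2.json in the question.
sample = json.loads("""
[
  {"data_col_1": 12, "data_col_2": "Required", "data_col_3": 9, "data_col_4": 7},
  {"data_col_1": 13, "data_col_2": "Required", "data_col_3": 10, "data_col_4": "Wrong data type"},
  {"data_col_1": 14, "data_col_2": null, "data_col_3": 11, "data_col_4": 8},
  {"data_col_2": "Required", "data_col_3": 9, "data_col_4": 7}
]
""")

# Mark each element as valid or invalid together with the reasons.
results = [(task, validate_task(task)) for task in sample]
for task, problems in results:
    status = "invalid" if problems else "valid"
    print(status, problems)
```

&lt;P&gt;Each element is checked independently, so one bad element does not invalidate the others, which matches the expected outcome stated in the question.&lt;/P&gt;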
&lt;P&gt;# Show the DataFrame with invalid records marked&lt;BR /&gt;invalid_df.show(truncate=False)&lt;/P&gt;</description>
      <pubDate>Fri, 07 Feb 2025 13:11:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-json-read-with-strict-schema-check-and-mark-the-valid/m-p/109399#M43300</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-02-07T13:11:59Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark JSON read with strict schema check and mark the valid and invalid records based on the n</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-json-read-with-strict-schema-check-and-mark-the-valid/m-p/109506#M43333</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;There was a typo in the schema name; it should be "estate_schema".&lt;BR /&gt;&lt;BR /&gt;However, the issue remains the same: I do not see any difference between my code and the code you provided. Let me know if that is not the case.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Feb 2025 19:47:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-json-read-with-strict-schema-check-and-mark-the-valid/m-p/109506#M43333</guid>
      <dc:creator>sujitmk77</dc:creator>
      <dc:date>2025-02-08T19:47:24Z</dc:date>
    </item>
  </channel>
</rss>

