<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Issues while writing into bad_records path in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issues-while-writing-into-bad-records-path/m-p/71680#M34379</link>
    <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;I would like your input on a scenario that I see while writing to the bad_records file.&lt;/P&gt;&lt;P&gt;I am reading a ‘Ԓ’-delimited CSV file using a schema that I have already defined. I have enabled error handling on the read so that rows with a schema mismatch are written to a badRecordsPath.&lt;/P&gt;&lt;P&gt;The source file contains newline characters, which push a few columns onto the next line. Since those new rows do not align with the defined schema, they are written to a file in the bad_records path that I have specified.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This works well for almost all scenarios EXCEPT when the schema defines a column as DateType(). If a non-date value arrives in this column, then instead of writing the whole row to the bad_records path, it creates blank files in the bad_records folder. It also creates another folder named bad_files and writes a file in it that shows the error:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;EM&gt;"reason":"org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark &amp;gt;= 3.0:\nFail to parse '009-7-4-23&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ' in the new parser. You can set \"spark.sql.legacy.timeParserPolicy\" to \"LEGACY\" to restore the behavior before Spark 3.0, or set to \"CORRECTED\" and treat it as an invalid datetime string."&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get this error only when the data type is DateType(). For testing purposes, I tried replacing it with IntegerType, TimestampType, DoubleType, etc., and all of them write the erroneous rows to the bad_records file as expected.&lt;/P&gt;&lt;P&gt;Any leads on why this happens?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below is the sample code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.types import DateType, IntegerType, StringType, StructField, StructType

modified_schema = StructType(
    [
        StructField(".....", StringType(), True),
        # .....
        StructField("ENTRYDATE", DateType(), True),
        # .....
        StructField(".....", IntegerType(), True),
    ]
)

# badRecordsPath and filepath are defined elsewhere
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("sep", "Ԓ")
    .schema(modified_schema)
    .option("badRecordsPath", badRecordsPath)
    .load(filepath)
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below are the two folders generated inside my badRecordsPath.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Alok1_0-1717548996735.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8042i4196DBE9477E08DD/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="Alok1_0-1717548996735.png" alt="Alok1_0-1717548996735.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Below are the files generated inside the bad_records folder; they contain no information on the erroneous rows.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Alok1_1-1717549044696.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8043i8D0E0A5617DD7F4F/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="Alok1_1-1717549044696.png" alt="Alok1_1-1717549044696.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 05 Jun 2024 01:01:47 GMT</pubDate>
    <dc:creator>AlokThampi</dc:creator>
    <dc:date>2024-06-05T01:01:47Z</dc:date>
    <item>
      <title>Issues while writing into bad_records path</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-while-writing-into-bad-records-path/m-p/71680#M34379</link>
      <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;I would like your input on a scenario that I see while writing to the bad_records file.&lt;/P&gt;&lt;P&gt;I am reading a ‘Ԓ’-delimited CSV file using a schema that I have already defined. I have enabled error handling on the read so that rows with a schema mismatch are written to a badRecordsPath.&lt;/P&gt;&lt;P&gt;The source file contains newline characters, which push a few columns onto the next line. Since those new rows do not align with the defined schema, they are written to a file in the bad_records path that I have specified.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This works well for almost all scenarios EXCEPT when the schema defines a column as DateType(). If a non-date value arrives in this column, then instead of writing the whole row to the bad_records path, it creates blank files in the bad_records folder. It also creates another folder named bad_files and writes a file in it that shows the error:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;EM&gt;"reason":"org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark &amp;gt;= 3.0:\nFail to parse '009-7-4-23&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ' in the new parser. You can set \"spark.sql.legacy.timeParserPolicy\" to \"LEGACY\" to restore the behavior before Spark 3.0, or set to \"CORRECTED\" and treat it as an invalid datetime string."&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get this error only when the data type is DateType(). For testing purposes, I tried replacing it with IntegerType, TimestampType, DoubleType, etc., and all of them write the erroneous rows to the bad_records file as expected.&lt;/P&gt;&lt;P&gt;Any leads on why this happens?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below is the sample code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.types import DateType, IntegerType, StringType, StructField, StructType

modified_schema = StructType(
    [
        StructField(".....", StringType(), True),
        # .....
        StructField("ENTRYDATE", DateType(), True),
        # .....
        StructField(".....", IntegerType(), True),
    ]
)

# badRecordsPath and filepath are defined elsewhere
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("sep", "Ԓ")
    .schema(modified_schema)
    .option("badRecordsPath", badRecordsPath)
    .load(filepath)
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below are the two folders generated inside my badRecordsPath.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Alok1_0-1717548996735.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8042i4196DBE9477E08DD/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="Alok1_0-1717548996735.png" alt="Alok1_0-1717548996735.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Below are the files generated inside the bad_records folder; they contain no information on the erroneous rows.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Alok1_1-1717549044696.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8043i8D0E0A5617DD7F4F/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="Alok1_1-1717549044696.png" alt="Alok1_1-1717549044696.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Jun 2024 01:01:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-while-writing-into-bad-records-path/m-p/71680#M34379</guid>
      <dc:creator>AlokThampi</dc:creator>
      <dc:date>2024-06-05T01:01:47Z</dc:date>
    </item>
  </channel>
</rss>

