<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Unable to enforce schema on data read from jsonl file in Azure Databricks using pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/unable-to-enforce-schema-on-data-read-from-jsonl-file-in-azure/m-p/69291#M33888</link>
    <description>&lt;P&gt;&lt;SPAN&gt;I'm trying to build an ETL pipeline that reads JSONL files from Azure Blob Storage, transforms the data, and loads it into Delta tables in Databricks. I have defined the following schema for loading my data:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DoubleType, TimestampType
)

schema = StructType([
    StructField("restaurantId", IntegerType(), nullable=False),
    StructField("reviewId", IntegerType(), nullable=False),
    StructField("text", StringType(), nullable=False),
    StructField("rating", DoubleType(), nullable=False),
    StructField("publishedAt", TimestampType(), nullable=False),
    StructField("_corrupt_record", StringType(), nullable=True)
])&lt;/LI-CODE&gt;&lt;P&gt;I read the JSONL files with the code below:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schema) \
    .json(raw_file_location).cache()&lt;/LI-CODE&gt;&lt;P&gt;But the following query returns no results:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col

display(df.select(col("_corrupt_record")).where(col("_corrupt_record").isNotNull()))&lt;/LI-CODE&gt;&lt;P&gt;Many columns in my raw data are null; those rows are inserted into the table as null, and the&lt;SPAN&gt;&amp;nbsp;_corrupt_record&amp;nbsp;&lt;/SPAN&gt;column is also null in those cases. Please let me know how to resolve this issue.&lt;/P&gt;&lt;P&gt;My expectation is to see the corrupt record (a record that does not match the defined schema) populated for the failed records. I have also tried SQL queries to manually create the schema and load the data, but that still doesn't work.&lt;/P&gt;</description>
    <pubDate>Fri, 17 May 2024 15:44:35 GMT</pubDate>
    <dc:creator>prabhu26</dc:creator>
    <dc:date>2024-05-17T15:44:35Z</dc:date>
    <item>
      <title>Unable to enforce schema on data read from jsonl file in Azure Databricks using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-enforce-schema-on-data-read-from-jsonl-file-in-azure/m-p/69291#M33888</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I'm trying to build an ETL pipeline that reads JSONL files from Azure Blob Storage, transforms the data, and loads it into Delta tables in Databricks. I have defined the following schema for loading my data:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DoubleType, TimestampType
)

schema = StructType([
    StructField("restaurantId", IntegerType(), nullable=False),
    StructField("reviewId", IntegerType(), nullable=False),
    StructField("text", StringType(), nullable=False),
    StructField("rating", DoubleType(), nullable=False),
    StructField("publishedAt", TimestampType(), nullable=False),
    StructField("_corrupt_record", StringType(), nullable=True)
])&lt;/LI-CODE&gt;&lt;P&gt;I read the JSONL files with the code below:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schema) \
    .json(raw_file_location).cache()&lt;/LI-CODE&gt;&lt;P&gt;But the following query returns no results:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col

display(df.select(col("_corrupt_record")).where(col("_corrupt_record").isNotNull()))&lt;/LI-CODE&gt;&lt;P&gt;Many columns in my raw data are null; those rows are inserted into the table as null, and the&lt;SPAN&gt;&amp;nbsp;_corrupt_record&amp;nbsp;&lt;/SPAN&gt;column is also null in those cases. Please let me know how to resolve this issue.&lt;/P&gt;&lt;P&gt;My expectation is to see the corrupt record (a record that does not match the defined schema) populated for the failed records. I have also tried SQL queries to manually create the schema and load the data, but that still doesn't work.&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 15:44:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-enforce-schema-on-data-read-from-jsonl-file-in-azure/m-p/69291#M33888</guid>
      <dc:creator>prabhu26</dc:creator>
      <dc:date>2024-05-17T15:44:35Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to enforce schema on data read from jsonl file in Azure Databricks using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-enforce-schema-on-data-read-from-jsonl-file-in-azure/m-p/69305#M33893</link>
      <description>&lt;P&gt;Try this: add &lt;CODE&gt;.option("multiline", "true")&lt;/CODE&gt; to your reader.&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 19:15:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-enforce-schema-on-data-read-from-jsonl-file-in-azure/m-p/69305#M33893</guid>
      <dc:creator>DataEngineer</dc:creator>
      <dc:date>2024-05-17T19:15:38Z</dc:date>
    </item>
  </channel>
</rss>