<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Issues reading json files with databricks vs oss pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80387#M36008</link>
    <description>&lt;P&gt;Hi Everyone,&lt;/P&gt;&lt;P&gt;I'm currently developing an application that reads JSON files with a nested structure. I developed the code locally on my laptop using the open-source version of PySpark (3.5.1), with code similar to this:&lt;/P&gt;&lt;P&gt;sample_schema:&lt;/P&gt;&lt;P&gt;schema = StructType([StructField('DATA', StructType([StructField('-ext_end', StringType(), True), StructField('-ext_start', StringType(), True), StructField('-xml_creation_date', StringType(), True), StructField('FILE', ArrayType(StructType([StructField('F', StructType([StructField('ROW', StructType([StructField('F1', StringType(), True)]), True)]), True), StructField('G', StructType([StructField('ROW', StructType([StructField('G1', StringType(), True), StructField('G2', StringType(), True), StructField('G3', StringType(), True), StructField('G4', StringType(), True)]), True)]), True)]), True), True)]), True)])&lt;/P&gt;&lt;P&gt;# JSON reader&lt;/P&gt;&lt;P&gt;spark.readStream.json(path="input", schema=schema, multiLine=True)&lt;/P&gt;&lt;P&gt;Test scenario:&lt;/P&gt;&lt;P&gt;1. The input files are sometimes incomplete, e.g. the F struct field may be empty; if we load such a file with schema inference, F is inferred as a string with value null.&lt;/P&gt;&lt;P&gt;-&amp;gt; When reading this incomplete data on the OSS version of Spark, the schema is applied correctly, and where the file data is incomplete as described above the fields are populated with default values, meaning F1 is populated with null.&lt;/P&gt;&lt;P&gt;However, executing the same code on Databricks results in null for the whole record.&lt;/P&gt;&lt;P&gt;Sample outputs:&lt;BR /&gt;OSS pyspark&lt;BR /&gt;|DATA|&lt;BR /&gt;|{"-ext_end":"sample", "-ext_start":"sample", ...}|&lt;/P&gt;&lt;P&gt;Databricks&lt;BR /&gt;|DATA|&lt;BR /&gt;|null|&lt;/P&gt;&lt;P&gt;Is there a way to replicate the OSS PySpark behavior on Databricks? What am I missing here?&lt;/P&gt;&lt;P&gt;I hope someone can point me in the right direction. Thanks!&lt;/P&gt;</description>
    <pubDate>Wed, 24 Jul 2024 13:38:37 GMT</pubDate>
    <dc:creator>aalanis</dc:creator>
    <dc:date>2024-07-24T13:38:37Z</dc:date>
    <item>
      <title>Issues reading json files with databricks vs oss pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80387#M36008</link>
      <description>&lt;P&gt;Hi Everyone,&lt;/P&gt;&lt;P&gt;I'm currently developing an application that reads JSON files with a nested structure. I developed the code locally on my laptop using the open-source version of PySpark (3.5.1), with code similar to this:&lt;/P&gt;&lt;P&gt;sample_schema:&lt;/P&gt;&lt;P&gt;schema = StructType([StructField('DATA', StructType([StructField('-ext_end', StringType(), True), StructField('-ext_start', StringType(), True), StructField('-xml_creation_date', StringType(), True), StructField('FILE', ArrayType(StructType([StructField('F', StructType([StructField('ROW', StructType([StructField('F1', StringType(), True)]), True)]), True), StructField('G', StructType([StructField('ROW', StructType([StructField('G1', StringType(), True), StructField('G2', StringType(), True), StructField('G3', StringType(), True), StructField('G4', StringType(), True)]), True)]), True)]), True), True)]), True)])&lt;/P&gt;&lt;P&gt;# JSON reader&lt;/P&gt;&lt;P&gt;spark.readStream.json(path="input", schema=schema, multiLine=True)&lt;/P&gt;&lt;P&gt;Test scenario:&lt;/P&gt;&lt;P&gt;1. The input files are sometimes incomplete, e.g. the F struct field may be empty; if we load such a file with schema inference, F is inferred as a string with value null.&lt;/P&gt;&lt;P&gt;-&amp;gt; When reading this incomplete data on the OSS version of Spark, the schema is applied correctly, and where the file data is incomplete as described above the fields are populated with default values, meaning F1 is populated with null.&lt;/P&gt;&lt;P&gt;However, executing the same code on Databricks results in null for the whole record.&lt;/P&gt;&lt;P&gt;Sample outputs:&lt;BR /&gt;OSS pyspark&lt;BR /&gt;|DATA|&lt;BR /&gt;|{"-ext_end":"sample", "-ext_start":"sample", ...}|&lt;/P&gt;&lt;P&gt;Databricks&lt;BR /&gt;|DATA|&lt;BR /&gt;|null|&lt;/P&gt;&lt;P&gt;Is there a way to replicate the OSS PySpark behavior on Databricks? What am I missing here?&lt;/P&gt;&lt;P&gt;I hope someone can point me in the right direction. Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jul 2024 13:38:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80387#M36008</guid>
      <dc:creator>aalanis</dc:creator>
      <dc:date>2024-07-24T13:38:37Z</dc:date>
    </item>
    <item>
      <title>Re: Issues reading json files with databricks vs oss pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80451#M36017</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/113563"&gt;@aalanis&lt;/a&gt;, I'd like to try replicating your scenario. Do you mind sharing a sample file so I can test it locally?&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jul 2024 21:42:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80451#M36017</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-07-24T21:42:47Z</dc:date>
    </item>
    <item>
      <title>Re: Issues reading json files with databricks vs oss pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80458#M36020</link>
      <description>&lt;P&gt;Hi, I'd like to try the scenario and find a solution. Would you mind sharing a sample file?&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jul 2024 02:03:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80458#M36020</guid>
      <dc:creator>sushmithajk</dc:creator>
      <dc:date>2024-07-25T02:03:57Z</dc:date>
    </item>
    <item>
      <title>Re: Issues reading json files with databricks vs oss pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80459#M36021</link>
      <description>&lt;P&gt;Please try extracting specific fields to flatten the JSON, and handle the optional nested fields explicitly for the special scenarios, e.g.: when(col("FILE.F").isNull(), None).otherwise(col("FILE.F.ROW.F1")).alias("F1")&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jul 2024 02:14:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-reading-json-files-with-databricks-vs-oss-pyspark/m-p/80459#M36021</guid>
      <dc:creator>sushmithajk</dc:creator>
      <dc:date>2024-07-25T02:14:10Z</dc:date>
    </item>
  </channel>
</rss>

