<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Can't read JSON file with just 1.75 MiB? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cant-read-json-file-with-just-1-75-mib/m-p/69132#M33823</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am relatively new to Databricks, although I am aware of lazy evaluation, transformations and actions, and persistence.&lt;/P&gt;&lt;P&gt;I have a complex, nested JSON file of about 1.73 MiB.&lt;/P&gt;&lt;P&gt;When I run&lt;/P&gt;&lt;P&gt;df = spark.read.option("multiLine", "false").json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')&lt;/P&gt;&lt;P&gt;Spark runs indefinitely without finishing the job. Eventually I get the error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."&lt;/P&gt;&lt;P&gt;Reading this file on my local computer is a no-brainer!&lt;/P&gt;&lt;P&gt;You can get the file by sending a POST request:&lt;/P&gt;&lt;P&gt;table_07129 = "https://data.ssb.no/api/v0/no/table/07129/"&lt;BR /&gt;query_07129 = {"query": [], "response": {"format": "json-stat2"}}&lt;BR /&gt;resultat = requests.post(table_07129, json=query_07129)&lt;/P&gt;&lt;P&gt;I am using a multi-node cluster (max 2 workers) of Standard D16ads_v5 nodes, each with 64 GB and 16 cores.&lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
    <pubDate>Thu, 16 May 2024 07:48:53 GMT</pubDate>
    <dc:creator>NTRT</dc:creator>
    <dc:date>2024-05-16T07:48:53Z</dc:date>
    <item>
      <title>Can't read JSON file with just 1.75 MiB?</title>
      <link>https://community.databricks.com/t5/data-engineering/cant-read-json-file-with-just-1-75-mib/m-p/69132#M33823</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am relatively new to Databricks, although I am aware of lazy evaluation, transformations and actions, and persistence.&lt;/P&gt;&lt;P&gt;I have a complex, nested JSON file of about 1.73 MiB.&lt;/P&gt;&lt;P&gt;When I run&lt;/P&gt;&lt;P&gt;df = spark.read.option("multiLine", "false").json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')&lt;/P&gt;&lt;P&gt;Spark runs indefinitely without finishing the job. Eventually I get the error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."&lt;/P&gt;&lt;P&gt;Reading this file on my local computer is a no-brainer!&lt;/P&gt;&lt;P&gt;You can get the file by sending a POST request:&lt;/P&gt;&lt;P&gt;table_07129 = "https://data.ssb.no/api/v0/no/table/07129/"&lt;BR /&gt;query_07129 = {"query": [], "response": {"format": "json-stat2"}}&lt;BR /&gt;resultat = requests.post(table_07129, json=query_07129)&lt;/P&gt;&lt;P&gt;I am using a multi-node cluster (max 2 workers) of Standard D16ads_v5 nodes, each with 64 GB and 16 cores.&lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
      <pubDate>Thu, 16 May 2024 07:48:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cant-read-json-file-with-just-1-75-mib/m-p/69132#M33823</guid>
      <dc:creator>NTRT</dc:creator>
      <dc:date>2024-05-16T07:48:53Z</dc:date>
    </item>
    <item>
      <title>Re: Can't read JSON file with just 1.75 MiB?</title>
      <link>https://community.databricks.com/t5/data-engineering/cant-read-json-file-with-just-1-75-mib/m-p/69154#M33825</link>
      <description>&lt;P&gt;This can be resolved by defining the schema explicitly and passing it to the reader, so that Spark does not have to infer it from the file.&lt;/P&gt;&lt;P&gt;from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType&lt;/P&gt;&lt;P&gt;# Define the schema according to the JSON structure (field1 and field2 are placeholders)&lt;BR /&gt;schema = StructType([&lt;BR /&gt;StructField("field1", StringType(), True),&lt;BR /&gt;StructField("field2", IntegerType(), True),&lt;BR /&gt;# Add fields according to the JSON structure&lt;BR /&gt;])&lt;/P&gt;&lt;P&gt;# Read the JSON file with the defined schema&lt;BR /&gt;df = spark.read.schema(schema).json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')&lt;BR /&gt;df.show()&lt;/P&gt;</description>
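As a minimal sketch of the explicit-schema idea above, the schema can also be written as a DDL string instead of a StructType. The helper name `build_ddl_schema`, the file names, and the field names below are hypothetical placeholders, not taken from the actual file; the real field list has to come from the structure of each JSON document.

```python
from typing import Mapping


def build_ddl_schema(fields: Mapping[str, str]) -> str:
    """Build a Spark DDL schema string such as '`field1` STRING, `field2` INT'."""
    if not fields:
        raise ValueError("at least one field is required")
    parts = []
    for name, spark_type in fields.items():
        # Backtick-quote the column name; a literal backtick is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        parts.append(f"{quoted} {spark_type.upper()}")
    return ", ".join(parts)


# Hypothetical per-file field lists (assumption); replace with the real layout of each file.
SCHEMAS = {
    "file_a.json": {"field1": "string", "field2": "int"},
    "file_b.json": {"field1": "string", "field3": "double"},
}

# One DDL string per file, ready to hand to the Spark reader.
ddl_by_file = {path: build_ddl_schema(fields) for path, fields in SCHEMAS.items()}
```

Each string can then be passed to the reader, for example `spark.read.schema(ddl_by_file["file_a.json"]).json(path)`, since `DataFrameReader.schema` accepts a DDL-formatted string as well as a `StructType`. Whether this resolves the driver restart for this particular file is not confirmed here.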
      <pubDate>Thu, 16 May 2024 11:16:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cant-read-json-file-with-just-1-75-mib/m-p/69154#M33825</guid>
      <dc:creator>koushiknpvs</dc:creator>
      <dc:date>2024-05-16T11:16:07Z</dc:date>
    </item>
    <item>
      <title>Re: Can't read JSON file with just 1.75 MiB?</title>
      <link>https://community.databricks.com/t5/data-engineering/cant-read-json-file-with-just-1-75-mib/m-p/69162#M33829</link>
      <description>&lt;P&gt;Thanks for your reply. In my case I'll need to read different JSON files in a loop, and they do not share the same schema. How should I proceed in that case? Thanks.&lt;/P&gt;</description>
      <pubDate>Thu, 16 May 2024 13:08:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cant-read-json-file-with-just-1-75-mib/m-p/69162#M33829</guid>
      <dc:creator>NTRT</dc:creator>
      <dc:date>2024-05-16T13:08:22Z</dc:date>
    </item>
  </channel>
</rss>

