Can't read a JSON file of just 1.75 MiB? (topic in Data Engineering)

NTRT — Thu, 16 May 2024 07:48:53 GMT

Hi,

I am relatively new to Databricks, although I am aware of lazy evaluation, transformations and actions, and persistence.

I have a JSON file (complex, nested) of about 1.73 MiB.

When I run

df = spark.read.option("multiLine", "false").json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')

Spark runs indefinitely without finishing the job. Eventually I get the error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

Reading this file on my local computer is a no-brainer!

You can get the file by sending a POST request to:

import requests

table_07129 = "https://data.ssb.no/api/v0/no/table/07129/"
query_07129 = {"query": [], "response": {"format": "json-stat2"}}

# POST the query; the response body is the JSON-stat2 document
resultat = requests.post(table_07129, json=query_07129)

I am using a multi-node cluster (max 2 workers), each node with 64 GB and 16 cores (Standard_D16ads_v5).

Thanks for your help.

Re: Can't read a JSON file of just 1.75 MiB?

koushiknpvs — Thu, 16 May 2024 11:16:07 GMT

This can be resolved by defining the schema explicitly and using that schema to read the file, so that Spark does not have to infer it from the data.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define the schema according to the JSON structure
schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", IntegerType(), True),
    # Add fields according to the JSON structure
])

# Read the JSON file with the defined schema
df = spark.read.schema(schema).json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')
df.show()
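
As an alternative for a file this small, here is a minimal sketch that parses the document with Python's json module on the driver and flattens it into plain rows, so that Spark never has to infer a nested schema. It assumes the file follows the JSON-stat 2.0 layout that the "json-stat2" format suggests ("id", "dimension", "value"). The function name jsonstat2_rows is hypothetical, and this has not been tested against the actual SSB file.

```python
import itertools


def jsonstat2_rows(dataset: dict) -> list:
    """Flatten a JSON-stat 2.0 dataset into one dict per observation.

    Assumption: `dataset` is the parsed top-level object with the
    "id", "dimension" and "value" members described by JSON-stat 2.0.
    """
    dim_ids = dataset["id"]
    codes_per_dim = []
    for dim_id in dim_ids:
        category = dataset["dimension"][dim_id]["category"]
        index = category.get("index")
        if index is None:
            # "index" may be omitted for a single-category dimension.
            codes = list(category["label"])
        elif isinstance(index, dict):
            # {code: position} mapping: order the codes by position.
            codes = sorted(index, key=index.get)
        else:
            # Plain list of codes, already in order.
            codes = list(index)
        codes_per_dim.append(codes)

    values = dataset["value"]
    rows = []
    # Values are laid out in row-major order over the dimensions in "id",
    # i.e. the last dimension varies fastest, like itertools.product.
    for pos, combo in enumerate(itertools.product(*codes_per_dim)):
        if isinstance(values, dict):
            # Sparse form: positions are string keys, gaps are missing.
            value = values.get(str(pos))
        else:
            value = values[pos]
        row = dict(zip(dim_ids, combo))
        row["value"] = value
        rows.append(row)
    return rows
```

The file could be loaded with json.load (on Databricks the dbfs:/mnt/... location is normally also reachable as a local path under /dbfs/mnt/...), and the resulting list passed to spark.createDataFrame, preferably together with an explicit flat schema. Because the dimension names are read from the file itself, the same function should work for tables with different dimensions.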

Re: Can't read a JSON file of just 1.75 MiB?

NTRT — Thu, 16 May 2024 13:08:22 GMT

Thanks for your reply. In my case I'll need to read different JSON files in a loop, and they do not all have the same schema. How should I proceed in that case? Thanks.