Databricks Community

NTRT · ‎05-16-2024

Hi,

I am relatively new to Databricks, although I am familiar with lazy evaluation, transformations and actions, and persistence.

I have a JSON file (complex, nested) of about 1.73 MiB.

When I run

df = spark.read.option("multiLine", "false").json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')

Spark goes on forever without finishing the job. Eventually I get the error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

Reading this file on my local computer is a no-brainer!

You can get the file by sending a POST request to:

import requests

table_07129 = "https://data.ssb.no/api/v0/no/table/07129/"
query_07129 = {"query": [], "response": {"format": "json-stat2"}}

resultat = requests.post(table_07129, json=query_07129)

I am using a multi-node cluster (max 2 workers) of standard d16ads_v5 nodes, each with 64 GB and 16 cores.

Thanks for your help.

koushiknpvs · ‎05-16-2024

This can be resolved by defining the schema explicitly and passing it to the reader, so that Spark does not have to infer it from the file.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define the schema according to the JSON structure
schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", IntegerType(), True),
    # Add fields according to the JSON structure
])

# Read the JSON file with the defined schema
df = spark.read.schema(schema).json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')
df.show()
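Before replacing the placeholder fields above, it may help to list which keys the file actually contains. The following is a minimal sketch in plain Python (no Spark), assuming the file is small enough to parse on the driver. `describe_paths` is a hypothetical helper, not part of any library, and it only inspects the first element of each array.

```python
import json


def describe_paths(node, prefix=""):
    """Map each nested key path to the Python type name of its value."""
    paths = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths[path] = type(value).__name__
            paths.update(describe_paths(value, path))
    elif isinstance(node, list) and node:
        # Assumption: array elements are homogeneous, so the first one is representative.
        paths.update(describe_paths(node[0], prefix + "[]"))
    return paths


# Small made-up document standing in for the real file contents.
sample = json.loads('{"a": 1, "b": {"c": "x"}, "d": [{"e": 1.5}]}')
paths = describe_paths(sample)
for path, type_name in paths.items():
    print(path, type_name)
```

The number of entries in `paths` gives a rough idea of how many fields an explicit schema would need to cover.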

NTRT · ‎05-16-2024

Thanks for your reply. In my case I'll need to read different JSON files in a loop, and they do not all have the same schema. How should I proceed in that case? Thanks.
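One possible way to avoid a per-file schema, sketched here under assumptions: the request above asks for the json-stat2 format, in which a dataset lists its dimension names in `id`, the category codes of each dimension under `dimension.<name>.category.index`, and the observations in a flat `value` array in row-major order. If every file follows that layout, and `value` is an array rather than a sparse object, one generic function can flatten any of them into rows. `jsonstat_to_rows` is a hypothetical helper name, and the sample document is made up for illustration.

```python
import itertools


def jsonstat_to_rows(doc):
    """Flatten a JSON-stat 2.0 dataset (already parsed into a dict) into flat row dicts."""
    dim_ids = doc["id"]
    codes_per_dim = []
    for dim_id in dim_ids:
        # Assumption: every dimension carries a category "index".
        index = doc["dimension"][dim_id]["category"]["index"]
        if isinstance(index, dict):
            # Object form: category code -> position.
            codes = sorted(index, key=index.get)
        else:
            # Array form: category codes already in order.
            codes = list(index)
        codes_per_dim.append(codes)

    values = doc["value"]
    rows = []
    # itertools.product varies the last dimension fastest, i.e. row-major order.
    for position, combo in enumerate(itertools.product(*codes_per_dim)):
        row = dict(zip(dim_ids, combo))
        row["value"] = values[position]
        rows.append(row)
    return rows


# Made-up two-dimensional dataset for illustration.
sample = {
    "id": ["region", "year"],
    "size": [2, 2],
    "dimension": {
        "region": {"category": {"index": {"A": 0, "B": 1}}},
        "year": {"category": {"index": ["2023", "2024"]}},
    },
    "value": [1, 2, 3, None],
}
rows = jsonstat_to_rows(sample)
print(rows[0])
```

The resulting list of flat dicts has one column per dimension plus `value`, whatever the table, so it could then be handed to `spark.createDataFrame` inside the loop instead of reading the nested file with `spark.read.json`.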

Can't read a JSON file of just 1.75 MiB?
