
Performance issues when reading json-stat2

NTRT
New Contributor III

Hi,

I am relatively new to Databricks, although I am familiar with lazy evaluation, transformations and actions, and persistence.

I have a complex, nested JSON file of about 1.73 MiB.

When I run

df = spark.read.option("multiLine", "false").json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')

Spark goes on forever without finishing the job. Eventually I get the error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."
 
Reading this file on my local computer is a no-brainer!
 
You can get the file by sending a POST request to:
 
import requests

table_07129 = "https://data.ssb.no/api/v0/no/table/07129/"
query_07129 = {"query": [], "response": {"format": "json-stat2"}}
resultat = requests.post(table_07129, json=query_07129)
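For reference, this is roughly how the response could then be landed in DBFS before Spark reads it (just a sketch; it assumes the code runs in a Databricks notebook where dbutils is available, and the target path simply mirrors the one used above):

# Persist the raw json-stat2 payload to DBFS so Spark can read it from there
# (dbutils is available in Databricks notebooks; the path mirrors the one above)
dbutils.fs.put("dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json", resultat.text, True)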
 
I am using a multi-node cluster (max 2 workers) of Standard_D16ads_v5 instances, each with 64 GB of memory and 16 cores.
 
Thanks for your help.
 

 

 

1 ACCEPTED SOLUTION


koushiknpvs
New Contributor III

Please give me kudos if this works.


Efficiency in Data Collection: Using .collect() on large datasets can lead to out-of-memory errors, because it brings all rows to the driver node. If the dataset is large, consider alternatives such as extracting only the necessary parts of the data, or performing operations that do not require collecting the entire DataFrame. You could replace the collect section with first(). For example:

json_string = batch_df.toJSON().first()

batch_df.select("label").first()['label']
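For context, here is a self-contained sketch of the same idea (the tiny DataFrame below is made up and only stands in for batch_df):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for batch_df, just to illustrate first() vs collect()
batch_df = spark.createDataFrame(
    [("07129", '{"class": "dataset"}')],
    ["label", "payload"],
)

# collect() would pull every row to the driver; first() fetches a single row
json_string = batch_df.toJSON().first()                  # one JSON string on the driver
label_value = batch_df.select("label").first()["label"]  # one field, no full collect
print(json_string)
print(label_value)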


4 REPLIES



koushiknpvs
New Contributor III

This can be resolved by redefining the schema structure explicitly and using that schema to read the file. 

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define the schema according to the JSON structure
schema = StructType([
StructField("field1", StringType(), True),
StructField("field2", IntegerType(), True),
# Add fields according to the JSON structure
])

# Read the JSON file with the defined schema
df = spark.read.schema(schema).json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')
df.show()
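As a side note, a possible refinement (just a sketch, assuming the schema can be inferred successfully at least once, for example on a small sample file) is to capture an inferred schema and reuse it instead of typing it out by hand:

import json
from pyspark.sql.types import StructType

# Infer the schema once (assumes this read succeeds on a sample file),
# keep its JSON representation, and rebuild it for later reads
inferred = spark.read.json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json').schema
schema = StructType.fromJson(json.loads(inferred.json()))

# Reuse the captured schema so subsequent reads skip inference
df = spark.read.schema(schema).json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')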

NTRT
New Contributor III

Thanks for your reply. I did think about that solution, but what if I have several JSON files that I need to read with different schemas (similar, but not exactly alike)?
