05-15-2024 12:15 PM - edited 05-16-2024 12:34 AM
Hi,
I am relatively new to Databricks, although I am familiar with lazy evaluation, transformations and actions, and persistence.
I have a complex, nested JSON file of about 1.73 MiB.
The query I am working with looks like this:
query_07129 = {"query": [], "response": {"format": "json-stat2"}}
Accepted Solutions
05-15-2024 12:29 PM
Please give kudos if this works.
Efficiency in data collection: using .collect() on large datasets can lead to out-of-memory errors, since it pulls all rows to the driver node. If the dataset is large, consider alternatives such as extracting only the necessary parts of the data, or performing operations that do not require collecting the entire DataFrame. You could replace the collect section with first(). For example:
# Return only the first row as a JSON string instead of collecting every row
json_string = batch_df.toJSON().first()
# Fetch a single column value from the first row
batch_df.select("label").first()['label']
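For context, here is a minimal runnable sketch of the same pattern. The batch_df name and the label column come from the snippet above; the sample rows are made up for illustration:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the batch DataFrame from the thread
batch_df = spark.createDataFrame(
    [("a", 1), ("b", 2)],
    ["label", "value"],
)

# .first() returns a single row to the driver, unlike .collect(),
# which materializes every row there
json_string = batch_df.toJSON().first()   # e.g. '{"label":"a","value":1}'
record = json.loads(json_string)          # parse it into a Python dict

label = batch_df.select("label").first()["label"]
print(record, label)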
05-16-2024 04:13 AM
This can be resolved by defining the schema explicitly and using that schema to read the file.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define the schema according to the JSON structure
schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", IntegerType(), True),
    # Add fields according to the JSON structure
])

# Read the JSON file with the defined schema
df = spark.read.schema(schema).json('dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json')
df.show()
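Since the file in the question is nested, the flat fields above would need to become nested StructType/ArrayType entries. A minimal sketch, assuming the file mirrors the query_07129 structure shown earlier (the field types, including the string element type of "query", are guesses for illustration):

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Hypothetical nested schema mirroring
# {"query": [...], "response": {"format": "json-stat2"}}
nested_schema = StructType([
    StructField("query", ArrayType(StringType()), True),
    StructField("response", StructType([
        StructField("format", StringType(), True),
    ]), True),
])

df_nested = spark.read.schema(nested_schema).json(
    'dbfs:/mnt/makro/bronze/json_ssb/07129_20240514.json'
)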
05-16-2024 04:16 AM
Thanks for your reply. I did think about that solution, but what if I have several JSON files that I need to read with different schemas (similar, but not exactly alike)?

