Re: performance issues when transforming json-stat2

koushiknpvs · ‎05-15-2024

Please give me a kudos if this works.

Efficiency in Data Collection: Using .collect() on a large dataset can cause out-of-memory errors because it brings every row to the driver node. If the dataset is large, consider extracting only the parts of the data you need, or using operations that do not require collecting the entire DataFrame. If only the first row is needed, you could replace the collect() section with first(), which fetches a single row instead of all of them. For example:
json_string = batch_df.toJSON().first()

batch_df.select("label").first()['label']
