I am having trouble efficiently reading and parsing a large number of stream files in PySpark!
Context
Here is the schema of the JSON stream files I am reading. Blank lines are redactions for confidentiality purposes.
```
root
 |-- location_info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- restaurant_type: string (nullable = true)
 |    |    |
 |    |    |
 |    |    |-- other_data: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- other_data_1: string (nullable = true)
 |    |    |    |    |-- other_data_2: string (nullable = true)
 |    |    |    |    |-- other_data_3: string (nullable = true)
 |    |    |    |    |-- other_data_4: string (nullable = true)
 |    |    |    |    |-- other_data_5: string (nullable = true)
 |    |    |
 |    |    |-- latitude: string (nullable = true)
 |    |    |-- longitude: string (nullable = true)
 |    |    |-- timezone: string (nullable = true)
 |-- restaurant_id: string (nullable = true)
```
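In case it helps, here is that schema written out as an explicit PySpark StructType. This is a sketch based only on the fields visible above (the redacted lines are omitted), so it is an approximation rather than the exact production schema:
```
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Approximation of the printSchema() output above; the redacted fields
# are omitted, so this is a sketch, not the exact production schema
schema = StructType([
    StructField('location_info', ArrayType(StructType([
        StructField('restaurant_type', StringType(), True),
        StructField('other_data', ArrayType(StructType([
            StructField('other_data_1', StringType(), True),
            StructField('other_data_2', StringType(), True),
            StructField('other_data_3', StringType(), True),
            StructField('other_data_4', StringType(), True),
            StructField('other_data_5', StringType(), True),
        ])), True),
        StructField('latitude', StringType(), True),
        StructField('longitude', StringType(), True),
        StructField('timezone', StringType(), True),
    ])), True),
    StructField('restaurant_id', StringType(), True),
])
```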
Current Method of Reading & Parsing (which works but takes TOO long)
```
from pyspark.sql.functions import col, explode

# Read multiple files in the dir: wholeTextFiles loads each file as a single
# string, then the concatenated records are split back onto their own lines
source_df_1 = spark.read.json(
    sc.wholeTextFiles("file_path/*")
      .values()
      .flatMap(lambda x: x.replace('{"restaurant_id', '\n{"restaurant_id').split('\n'))
)

# Explode here to have restaurant_id and the nested data side by side
exploded_source_df_1 = source_df_1.select(
    col('restaurant_id'),
    explode(col('location_info')).alias('location_info')
)

# Via SQL operation: this will solve the problem for parsing
exploded_source_df_1.createOrReplaceTempView('result_1')
subset_data_1 = spark.sql('''
    SELECT restaurant_id,
           location_info.latitude,
           location_info.longitude,
           location_info.timezone
    FROM result_1
''')
```
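One direction I am considering (an untested sketch that reuses the `schema` StructType from above): `spark.read.json` makes an extra pass over the data to infer the schema when none is supplied, so passing one explicitly should avoid re-scanning every file:
```
# Same split trick as above, but with an explicit schema so Spark skips
# the schema-inference pass over the whole dataset
json_lines = (
    sc.wholeTextFiles("file_path/*")
      .values()
      .flatMap(lambda x: x.replace('{"restaurant_id', '\n{"restaurant_id').split('\n'))
      .filter(lambda line: line.strip())  # drop empty fragments left by the split
)
source_df_1 = spark.read.json(json_lines, schema=schema)

# The flattening can also be done in one select chain, no temp view required
subset_data_1 = (
    source_df_1
    .select(col('restaurant_id'), explode(col('location_info')).alias('location_info'))
    .select(
        'restaurant_id',
        col('location_info.latitude'),
        col('location_info.longitude'),
        col('location_info.timezone'),
    )
)
```
How much this helps presumably depends on whether the time is going into schema inference or into `wholeTextFiles` itself, which I do not know how to measure.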
Things I would love help with: a faster way to read and parse these files than the approach above.
You can refer to this thread for how I arrived at this solution in the first place: link. Thank you very much for your time!