Performance issue writing an extract of a huge unpartitioned single-column dataframe
01-17-2025 03:23 PM
I have a huge df (40 billion rows) shared via Delta Sharing that has only one column, 'payload', which contains JSON, and that is not partitioned.
Even though the payloads are not all identical, they share a common field, sessionId, that I need to extract so I can write my bronze table partitioned by this sessionId.
I am trying to use either limit(10) or sample(fraction=0.00000001, seed=42) on the df so I can process a small part of it just to check that my code works, but my write always launches the same number of tasks and would take too long to finish.
What would you suggest for writing an exploded small subset of this huge unpartitioned JSON-column-based df?
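For reference, the pattern I'm attempting looks roughly like this (table and column names are placeholders):

```python
# Shared table: single JSON string column 'payload', ~40 billion rows
df = spark.read.table("share_catalog.schema.payload_table")  # placeholder name

# Tiny slice just to validate the pipeline end to end
small = df.limit(10)  # or: df.sample(fraction=0.00000001, seed=42)

# This write still launches as many tasks as a full scan would
small.write.format("delta").mode("overwrite").saveAsTable("bronze.payload_sample")
```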
01-18-2025 06:16 AM
Hi @leymariv,
You can check the schema of the data in the Delta Sharing table using df.printSchema() to better understand the JSON structure, then use the from_json function to flatten or normalize the payload into proper columns.
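For example, a minimal sketch, assuming you only need sessionId pulled out of the payload (the schema, table, and column names here are placeholders to adapt to what printSchema() and your JSON actually contain):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Assumed minimal schema: only the field needed for partitioning
payload_schema = StructType([
    StructField("sessionId", StringType(), True),
])

# Parse the JSON string and surface sessionId as a top-level column
parsed = (df
          .withColumn("parsed", F.from_json(F.col("payload"), payload_schema))
          .withColumn("sessionId", F.col("parsed.sessionId"))
          .drop("parsed"))

# Write the bronze table partitioned by sessionId
(parsed.write
       .format("delta")
       .partitionBy("sessionId")
       .mode("append")
       .saveAsTable("bronze.payloads"))
```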
Additionally, you can understand how data is being loaded into the table by using the DESCRIBE HISTORY command. Look for append or merge operations in the operation column and refer to the operationMetrics column for data metrics.
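For instance (the table name is a placeholder, and on a Delta Sharing table this requires the provider to share history):

```python
# Inspect how the table has been written to over time
history = spark.sql("DESCRIBE HISTORY share_catalog.schema.payload_table")
history.select("version", "timestamp", "operation", "operationMetrics") \
       .show(truncate=False)
```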
If you notice that data is being loaded incrementally (append or merge) into the Delta Sharing table, you can read the data version by version or timestamp by timestamp using code like the sketch below.
Alternatively, you can specify a range of versions or timestamps to further narrow down the data you read.
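A rough sketch of what that could look like, assuming the share is read through a profile file and the provider shares table history (paths, versions, and timestamps are placeholders):

```python
table_url = "/path/to/config.share#share_name.schema_name.table_name"

# Read one specific version of the shared table
df_v10 = (spark.read
          .format("deltaSharing")
          .option("versionAsOf", 10)
          .load(table_url))

# Or read the table as of a given timestamp
df_ts = (spark.read
         .format("deltaSharing")
         .option("timestampAsOf", "2025-01-15 00:00:00")
         .load(table_url))

# Or, if change data feed is shared, narrow the read to a version range
df_range = (spark.read
            .format("deltaSharing")
            .option("readChangeFeed", "true")
            .option("startingVersion", 5)
            .option("endingVersion", 10)
            .load(table_url))
```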
Further, you can leverage Spark Structured Streaming to read data incrementally from the Delta Sharing table.
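A minimal streaming sketch under the same assumptions (history shared by the provider, placeholder names and paths); the availableNow trigger drains the existing backlog in batches and then stops:

```python
# Stream from the shared table, starting at a chosen version
stream = (spark.readStream
          .format("deltaSharing")
          .option("startingVersion", 10)
          .load(table_url))

query = (stream.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/bronze_payloads")
         .trigger(availableNow=True)
         .toTable("bronze.payloads_stream"))

query.awaitTermination()
```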
Regards,
Hari Prasad