Performance issue writing an extract of a huge unpartitioned single column dataframe

leymariv — Fri, 17 Jan 2025 23:23:39 GMT

I have a huge df (40 billion rows), shared via Delta Sharing, that has a single column, 'payload', containing JSON, and that is not partitioned.

Even though the payloads are not all the same, they share a common field, sessionId, which I need to extract so that I can write my bronze table and partition it by sessionId.

I am trying to use either limit(10) or sample(fraction=0.00000001, seed=42) on the df so that I can process a small part of it, just to check that my code runs, but the write always has the same number of tasks and would take too long to finish.

What would you suggest for writing an exploded small subset of this huge, unpartitioned, JSON-column-based df?

Re: Performance issue writing an extract of a huge unpartitioned single column dataframe

hari-prasad — Sat, 18 Jan 2025 14:16:52 GMT

Hi @leymariv,

You can check the schema of the data in the Delta Sharing table using df.printSchema() to better understand the JSON structure. Use the from_json function to flatten or normalize the data into separate columns.
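As a rough sketch of the extraction step: since only sessionId is needed to begin with, the example below uses the Spark SQL function get_json_object instead of from_json, which avoids declaring a schema for the whole payload. The column name payload and the field sessionId come from the question; the helper names are hypothetical, and the field is assumed to sit at the top level of the JSON.

```python
# Sketch only. Assumes `sessionId` is a top-level key of the JSON in `payload`.
# The helper names below are hypothetical, not part of any library.

def session_id_expr(column="payload", field="sessionId"):
    """Build a Spark SQL expression that pulls one top-level field out of a JSON string."""
    return f"get_json_object({column}, '$.{field}') AS {field}"


def extract_session_id(df, column="payload", field="sessionId"):
    """Return a DataFrame holding the extracted field next to the original JSON column."""
    return df.selectExpr(session_id_expr(column, field), column)


# Usage, given an existing DataFrame `df` with the single `payload` column:
#   small = extract_session_id(df.limit(10))
#   small.show(truncate=False)
```

get_json_object returns the value as a string, so a cast would be needed if sessionId should have another type. from_json remains the better fit once the full structure has to be flattened into typed columns.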

Additionally, you can understand how data is being loaded into the table by using the DESCRIBE HISTORY command. Look for append or merge conditions in the operation column and refer to the operationMetrics column for data metrics.
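A minimal sketch of that history check, assuming the shared table is visible under a catalog name and that the provider shared it with history (otherwise the command may not be available on the recipient side). The table name in the usage comment and the helper names are hypothetical; version, timestamp, operation and operationMetrics are columns of the DESCRIBE HISTORY output.

```python
# Sketch only. `spark` is an existing SparkSession; helper names are hypothetical.

def history_query(table_name):
    """Build the DESCRIBE HISTORY statement for a table."""
    return f"DESCRIBE HISTORY {table_name}"


def load_history(spark, table_name):
    """Return one row per table version with the columns useful for sizing each commit."""
    return spark.sql(history_query(table_name)).select(
        "version", "timestamp", "operation", "operationMetrics"
    )


# Usage (hypothetical table name):
#   load_history(spark, "my_catalog.my_schema.my_shared_table").show(truncate=False)
```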

If you notice that data is being loaded incrementally (append or merge) into the Delta Sharing table, you can read the data version by version or timestamp by timestamp.

Alternatively, you can specify a range for the timestamp or version to further narrow down the data read.
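The two kinds of read described above could look roughly like this. It is a sketch under assumptions: it uses the open-source Delta Sharing Spark connector (format "deltaSharing") with its versionAsOf, timestampAsOf, readChangeFeed, startingVersion and endingVersion options, which require the provider to share the table with history (and with change data feed enabled for the range read). The helper names and the table URL are hypothetical. Note that versionAsOf and timestampAsOf return the full snapshot at that point, while the change feed range returns only the rows changed between the two versions.

```python
# Sketch only. `spark` is an existing SparkSession; helper names are hypothetical.

def snapshot_options(version=None, timestamp=None):
    """Reader options for a snapshot pinned to one version or one timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": str(version)}
    return {"timestampAsOf": timestamp}


def change_range_options(start_version, end_version):
    """Reader options for the rows changed between two versions (inclusive)."""
    return {
        "readChangeFeed": "true",
        "startingVersion": str(start_version),
        "endingVersion": str(end_version),
    }


def read_shared(spark, table_url, options):
    """Batch-read a Delta Sharing table with the given options."""
    return spark.read.format("deltaSharing").options(**options).load(table_url)


# Usage (hypothetical URL of the form <profile-file>#<share>.<schema>.<table>):
#   df_v1 = read_shared(spark, table_url, snapshot_options(version=1))
#   df_changes = read_shared(spark, table_url, change_range_options(1, 3))
```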

Further, you can leverage Spark Structured Streaming to read data from the Delta Sharing table.
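A possible shape for the streaming read, again assuming the open-source Delta Sharing Spark connector and a table shared with history. startingVersion and maxFilesPerTrigger are options of that connector; the helper names, default values, checkpoint path and target table name are hypothetical. Capping the files per micro-batch is what keeps each write small, which is the property limit() and sample() did not provide here.

```python
# Sketch only. `spark` is an existing SparkSession; helper names are hypothetical.

def stream_options(starting_version=0, max_files_per_trigger=100):
    """Options that start the stream at a version and cap the files read per micro-batch."""
    return {
        "startingVersion": str(starting_version),
        "maxFilesPerTrigger": str(max_files_per_trigger),
    }


def read_shared_stream(spark, table_url, options):
    """Open a streaming DataFrame over a Delta Sharing table."""
    return spark.readStream.format("deltaSharing").options(**options).load(table_url)


# Usage (hypothetical checkpoint path and table name):
#   stream = read_shared_stream(spark, table_url, stream_options(0, 10))
#   (stream.writeStream
#          .option("checkpointLocation", "/tmp/checkpoints/bronze_sessions")
#          .trigger(availableNow=True)
#          .toTable("bronze_sessions"))
```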

Regards,
Hari Prasad
