Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Performance issue writing an extract of a huge unpartitioned single-column DataFrame

leymariv
New Contributor

I have a huge df (40 billion rows), shared via Delta Sharing, that has only one column 'payload' containing JSON and that is not partitioned:

leymariv_2-1737155764713.png

Even though those payloads are not all the same, they share a common field sessionId that I need to extract so I can write my bronze table partitioned by this sessionId.

I am trying to use either limit(10) or sample(fraction=0.00000001, seed=42) on the df to process only a small part of it, just to make sure my code runs, but my write always launches the same number of tasks and would take too long to finish.

leymariv_0-1737155486874.png

What would you suggest for writing an exploded small subset of this huge unpartitioned JSON-column DataFrame?

1 REPLY

hari-prasad
Valued Contributor II

Hi @leymariv,

You can check the schema of the data in the Delta Sharing table using df.printSchema() to better understand the JSON structure, then use the from_json function to flatten or normalize the data into separate columns.

Additionally, you can see how data is being loaded into the table by using the DESCRIBE HISTORY command. Look for append or merge operations in the operation column and refer to the operationMetrics column for row and byte counts.

hariprasad_0-1737209659726.png

If you notice that data is being loaded incrementally (append or merge) into the Delta Sharing table, you can read the data version by version or timestamp by timestamp using the code below.

hariprasad_1-1737209712122.png

Alternatively, you can specify a range for the timestamp or version to further narrow down the data read.

hariprasad_2-1737209739138.png
Further, you can leverage Spark Structured Streaming to read data from the Delta Sharing table.

 

hariprasad_4-1737209788947.png

 

Regards,
Hari Prasad



