Performance issue writing an extract of a huge unpartitioned single-column dataframe
01-17-2025 03:23 PM
I have a huge df (40 billion rows) shared via Delta Sharing that has only one column, 'payload', which contains JSON, and that is not partitioned.
Even though the payloads are not all identical, they share a common field, sessionId, that I need to extract so I can write my bronze table partitioned by this sessionId.
I am trying to use either limit(10) or sample(fraction=0.00000001, seed=42) on the df so I can process a small part of it just to check that my code works, but my write always launches the same number of tasks and would take too long to finish.
What would you suggest for writing an exploded small subset of this huge unpartitioned JSON-column-based df?
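For reference, the pattern I'm attempting looks roughly like this (table and column names are placeholders):

```python
# Shared table: single JSON string column 'payload', ~40 billion rows
df = spark.read.table("share_catalog.schema.payload_table")  # placeholder name

# Tiny slice just to validate the pipeline end to end
small = df.limit(10)  # or: df.sample(fraction=0.00000001, seed=42)

# This write still launches as many tasks as a full scan would
small.write.format("delta").mode("overwrite").saveAsTable("bronze.payload_sample")
```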
01-18-2025 06:16 AM
Hi @leymariv,
You can check the schema of the data in the Delta Sharing table using df.printSchema() to better understand the JSON structure, then use the from_json function to flatten or normalize the payload into proper columns.
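For example, a minimal sketch, assuming you only need sessionId pulled out of the payload (the schema, table, and column names here are placeholders to adapt to what printSchema() and your JSON actually contain):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Assumed minimal schema: only the field needed for partitioning
payload_schema = StructType([
    StructField("sessionId", StringType(), True),
])

# Parse the JSON string and surface sessionId as a top-level column
parsed = (df
          .withColumn("parsed", F.from_json(F.col("payload"), payload_schema))
          .withColumn("sessionId", F.col("parsed.sessionId"))
          .drop("parsed"))

# Write the bronze table partitioned by sessionId
(parsed.write
       .format("delta")
       .partitionBy("sessionId")
       .mode("append")
       .saveAsTable("bronze.payloads"))
```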
Additionally, you can understand how data is being loaded into the table by using the DESCRIBE HISTORY command. Look for append or merge operations in the operation column and refer to the operationMetrics column for data metrics.
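For instance (the table name is a placeholder, and on a Delta Sharing table this requires the provider to share history):

```python
# Inspect how the table has been written to over time
history = spark.sql("DESCRIBE HISTORY share_catalog.schema.payload_table")
history.select("version", "timestamp", "operation", "operationMetrics") \
       .show(truncate=False)
```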
If you notice that data is being loaded incrementally (append or merge) into the Delta Sharing table, you can read the data version by version or timestamp by timestamp using code like the sketch below.
Alternatively, you can specify a range of versions or timestamps to further narrow down the data you read.
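A rough sketch of what that could look like, assuming the share is read through a profile file and the provider shares table history (paths, versions, and timestamps are placeholders):

```python
table_url = "/path/to/config.share#share_name.schema_name.table_name"

# Read one specific version of the shared table
df_v10 = (spark.read
          .format("deltaSharing")
          .option("versionAsOf", 10)
          .load(table_url))

# Or read the table as of a given timestamp
df_ts = (spark.read
         .format("deltaSharing")
         .option("timestampAsOf", "2025-01-15 00:00:00")
         .load(table_url))

# Or, if change data feed is shared, narrow the read to a version range
df_range = (spark.read
            .format("deltaSharing")
            .option("readChangeFeed", "true")
            .option("startingVersion", 5)
            .option("endingVersion", 10)
            .load(table_url))
```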
Further, you can leverage Spark Structured Streaming to read data incrementally from the Delta Sharing table.
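A minimal streaming sketch under the same assumptions (history shared by the provider, placeholder names and paths); the availableNow trigger drains the existing backlog in batches and then stops:

```python
# Stream from the shared table, starting at a chosen version
stream = (spark.readStream
          .format("deltaSharing")
          .option("startingVersion", 10)
          .load(table_url))

query = (stream.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/bronze_payloads")
         .trigger(availableNow=True)
         .toTable("bronze.payloads_stream"))

query.awaitTermination()
```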
Regards,
Hari Prasad