
Need to see all the records in a Delta table. Exception: java.lang.OutOfMemoryError: GC overhead limit exceeded

AzureDatabricks
New Contributor III

truncate=False is not working when showing the Delta table.

df_delta.show(df_delta.count(), False)

Cluster size

Single Node - Standard_F4S - 8GB Memory, 4 cores

What is the maximum amount of data we can persist in a Delta table (as Parquet files), and how fast can we retrieve it?


8 REPLIES

-werners-
Esteemed Contributor III

A record count is very easy: first read the Delta table into a DataFrame and then call df.count().
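For example (a minimal sketch; the path /mnt/delta/your_table is just a placeholder for wherever your table lives):

df = spark.read.format("delta").load("/mnt/delta/your_table")  # placeholder path
print(df.count())  # the count is computed on the workers; only the number comes back to the driver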

How fast? That depends on the cluster and on the lineage of the DataFrame (which transformations are applied to it).

There is no way to tell in advance. But a single-node cluster with 4 cores will process 8 threads in parallel, I believe.

So depending on the amount of data, this will return within a few seconds, or take half an hour or more.
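If you want to check what your cluster can actually do in parallel, something like this (a small sketch, reusing the df_delta name from your post):

print(spark.sparkContext.defaultParallelism)  # tasks the cluster can run at once
print(df_delta.rdd.getNumPartitions())        # partitions in the DataFrame; each becomes a task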

The out-of-memory error is odd, as a record count is stored in the metadata of the table, so it does not take a lot of memory.

What exactly are you trying to do in your code? It seems you are trying to process a lot of data locally, not only a record count.

SailajaB
Valued Contributor III

Thank you for your reply.

We stored our processed data in Delta format.

Now, from a testing point of view, I am reading all the Parquet files into a DataFrame to apply the queries.

Here, we tried to see how many records we can display or show in Databricks, so we used the command below, since a normal display gives only the first 232 rows.

df_delta.show(df_delta.count(), False) -- we are trying to show/read 700,000 (7 lakh) records (df_delta.count()) with truncate set to False.

Thank you.

-werners-
Esteemed Contributor III

OK, I see the problem. The issue is not that Databricks is unable to show these records.

The show command runs on the driver, and for a lot of data this will give errors.

But there is a huge difference between showing data on screen and processing/writing it.

There is a reason there is a limit on the number of records shown: it is pretty expensive (and it cannot run in parallel either).

The display() command defaults to 1,000 records, which can be overridden to 100K (or even a million, I can't recall).
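If the goal is just to inspect the data, something along these lines keeps the driver out of trouble (a sketch, not the only option; the output path is a placeholder):

df_delta.show(1000, truncate=False)    # show a bounded number of rows instead of all of them
display(df_delta.limit(1000))          # Databricks notebook rendering of a sample

# If you really need the full result, write it out in parallel instead of showing it
df_delta.write.format("parquet").mode("overwrite").save("/mnt/output/df_delta_check")  # placeholder path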

Can you please let us know the limit of data that can be stored in a Delta table/Hive table or in a Parquet file?

-werners-
Esteemed Contributor III

As Parquet/Delta Lake is designed for big data: a lot! Think billions of records.

I don't think there is a hard limit, only the limits set by the cloud provider (CPU quotas, etc.).

AzureDatabricks
New Contributor III

thank you !!!

Hi @sujata birajdar,

Did @Werner Stinckens fully answer your question? If so, would you be happy to mark their answer as best so that others can quickly find the solution?

AzureDatabricks
New Contributor III

thank you !!!
