
Need to see all the records in a Delta table. Exception: java.lang.OutOfMemoryError: GC overhead limit exceeded

AzureDatabricks
New Contributor III

truncate=False is not working when showing the Delta table.

df_delta.show(df_delta.count(), False)

Cluster size

Single Node - Standard_F4S - 8GB Memory, 4 cores

What is the maximum amount of data we can persist in a Delta table (as Parquet files), and how fast can we retrieve it?


8 REPLIES

-werners-
Esteemed Contributor III

A record count is very easy: first read the Delta table into a DataFrame and then call df.count().
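For example (a minimal sketch; the path /mnt/delta/your_table is just a placeholder for wherever your table lives):

df = spark.read.format("delta").load("/mnt/delta/your_table")  # placeholder path
print(df.count())  # the count is computed on the workers; only the number comes back to the driver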

How fast? That depends on the cluster and on the lineage of the DataFrame (which transformations are applied to it).

There is no way to tell in advance. But a single-node cluster with 4 cores will process 8 threads in parallel, I believe.

So depending on the amount of data, this will return within a few seconds, or take half an hour or more.
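If you want to check what your cluster can actually do in parallel, something like this (a small sketch, reusing the df_delta name from your post):

print(spark.sparkContext.defaultParallelism)  # tasks the cluster can run at once
print(df_delta.rdd.getNumPartitions())        # partitions in the DataFrame; each becomes a task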

The out-of-memory error is odd, as a record count is stored in the metadata of the table, so it does not take a lot of memory.

What exactly are you trying to do in your code? It seems you are trying to process a lot of data locally, not only a record count.

SailajaB
Valued Contributor III

Thank you for your reply.

We stored our processed data in Delta format.

Now, from a testing point of view, I am reading all the Parquet files into a DataFrame to apply the queries.

Here, we tried to see how many records we can display or show in Databricks, so we used the command below, since a normal display gives only the first 232 rows.

df_delta.show(df_delta.count(), False) -- we are trying to show/read 700,000 (7 lakh) records (df_delta.count()) with truncate set to False.

Thank you.

-werners-
Esteemed Contributor III

OK, I see the problem. The issue is not that Databricks is unable to show these records.

The show command runs on the driver, and for a lot of data this will give errors.

But there is a huge difference between showing data on screen and processing/writing it.

There is a reason there is a limit on the number of records shown: it is pretty expensive (and it cannot run in parallel either).

The display() command defaults to 1,000 records, which can be overridden to 100K (or even a million, I can't recall).
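If the goal is just to inspect the data, something along these lines keeps the driver out of trouble (a sketch, not the only option; the output path is a placeholder):

df_delta.show(1000, truncate=False)    # show a bounded number of rows instead of all of them
display(df_delta.limit(1000))          # Databricks notebook rendering of a sample

# If you really need the full result, write it out in parallel instead of showing it
df_delta.write.format("parquet").mode("overwrite").save("/mnt/output/df_delta_check")  # placeholder path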

Can you please let us know the limit of data that can be stored in a Delta table/Hive table or in a Parquet file?

-werners-
Esteemed Contributor III

As Parquet/Delta Lake is designed for big data: a lot! Think billions of records.

I don't think there is a hard limit, only the limits set by the cloud provider (CPU quotas, etc.).

AzureDatabricks
New Contributor III

thank you !!!

Hi @sujata birajdar,

Did @Werner Stinckens fully answer your question? If so, would you be happy to mark their answer as best so that others can quickly find the solution?

AzureDatabricks
New Contributor III

thank you !!!
