11-21-2021 11:25 PM
Truncate False not working in Delta table.
df_delta.show(df_delta.count(), False)
Compute size:
Single Node - Standard_F4S - 8GB Memory, 4 cores
How much data can we persist at most in a Delta table (as Parquet files), and how fast can we retrieve it?
11-21-2021 11:45 PM
A record count is very easy: read the Delta table into a DataFrame and then call df.count().
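A minimal sketch of that (the table path /mnt/delta/processed is a placeholder for your own location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Databricks notebook

# Read the Delta table into a DataFrame and count the rows
df = spark.read.format("delta").load("/mnt/delta/processed")
print(df.count())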
How fast depends on the cluster and on the lineage of the DataFrame (what transformations are applied to it).
There is no way to tell in general, but a single-node cluster with 4 cores will process 8 threads in parallel, I believe.
So depending on the amount of data, this will return within a few seconds, or half an hour or more.
The out-of-memory error is odd, as the record count is stored in the metadata of the table, so counting does not take much memory.
What exactly are you trying to do in your code? It seems you are trying to process a lot of data locally, not just take a record count.
11-21-2021 11:55 PM
Thank you for your reply.
We stored our processed data in Delta format.
Now, for testing purposes, I am reading all the Parquet files into a DataFrame to apply queries against it.
Here, we tried to see how many records we can display or show in Databricks, so we used the command below, as the normal display gives only the first 232 rows:
df_delta.show(df_delta.count(), False) -- we are trying to show/read 7 lakh (700,000) records and setting truncate to False.
Thank you
11-22-2021 12:00 AM
OK, I see the problem. The issue is not that Databricks cannot show these records.
The show command runs on the driver, and for a lot of data this will give errors.
But there is a huge difference between showing data on screen and processing/writing it.
There is a reason there is a limit on the number of records shown, as showing is pretty expensive (it cannot run in parallel either).
The display() command defaults to 1,000 records, which can be overridden up to 100K (or even a million, I can't recall).
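A small sketch of the difference (the limits and output path here are illustrative, not from the thread):

# Bounded samples funnel through the driver, so keep them small
df_delta.show(100, truncate=False)
df_delta.limit(1000).collect()

# Writing runs in parallel on the executors, so the full 7 lakh rows are no problem
df_delta.write.format("parquet").mode("overwrite").save("/tmp/df_delta_out")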
11-22-2021 12:49 AM
Can you please let us know the limit on how much data can be stored in a Delta table/Hive table or in a Parquet file?
11-22-2021 01:12 AM
As Parquet/Delta Lake is designed for big data: a lot! Think billions of records.
I don't think there is a hard limit, only the limits set by the cloud provider (CPU quotas etc.).
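If you want to see how big a table has actually grown, a small sketch using Delta Lake's DESCRIBE DETAIL command (the table path is a placeholder):

spark.sql("DESCRIBE DETAIL delta.`/mnt/delta/processed`") \
    .select("numFiles", "sizeInBytes") \
    .show(truncate=False)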
11-22-2021 12:48 AM
thank you !!!
11-22-2021 12:16 PM
Hi @sujata birajdar,
Did @Werner Stinckens fully answer your question? If so, would you be happy to mark their answer as best so that others can quickly find the solution?
11-22-2021 07:47 PM
thank you !!!