11-21-2021 11:18 PM
How can we persist 300 million records? What is the best option for persisting the data: the Databricks Hive metastore, Azure storage, or a Delta table?
What limitations do Databricks Delta tables have in terms of data volume?
Our use case is that testers should be able to see the data as a table and query it.
11-21-2021 11:26 PM
You can certainly store 300 million records without any problem.
The best option kinda depends on the use case.
If you want to do a lot of online querying on the table, I suggest using Delta Lake, which can be optimized (using Z-ordering, bloom filters, partitioning and file pruning). With a Databricks SQL endpoint you can query the data.
If you want to use the data for data engineering (ETL jobs), I also suggest using delta lake so you can merge new/changed data incrementally.
You can use the same optimization techniques, but maybe using different columns (depending on which jobs read the table).
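For example, a rough sketch of those layout optimizations in a Databricks notebook (the table and column names are just placeholders for your own schema):

```python
# Run in a Databricks notebook, where `spark` is already defined.
# Compact small files and co-locate rows on the columns you filter on most,
# so that file pruning can skip most of the table at query time.
spark.sql("OPTIMIZE events ZORDER BY (customer_id, event_date)")

# A typical online query that benefits from the layout above:
spark.sql("""
    SELECT count(*)
    FROM events
    WHERE event_date = '2021-11-01'
      AND customer_id = 42
""").show()
```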
I do not know what the limits are concerning the amount of data. But billions of records should be no problem.
Of course everything depends on the cluster running your workload. A 4-node cluster will take longer to process this amount of data than a 20-node cluster.
So: if you can ingest your data incrementally, use Delta Lake; if you have to do a 300-million-record overwrite every day, plain Parquet is also OK.
11-21-2021 11:41 PM
Thank you for answering.
Can you please suggest the best cluster configuration for 300 million records,
i.e. node type, cluster mode and Databricks runtime version?
11-21-2021 11:54 PM
That I cannot do; there is no single ideal size/scenario.
However: the latest Databricks runtime is a good choice (10.0, or the latest LTS for production jobs).
For data (ETL) jobs, the storage optimized (Delta cache accelerated) nodes are a good choice as they can use the Delta cache.
For online querying: Databricks SQL.
I myself use the cheapest node type that handles the job, and that depends on which Spark program I run, so I use multiple cluster configurations.
I even run upsert jobs with a single worker on a table of over 300 million records; that works fine, depending on the amount of data that has to be rewritten.
It all depends on the filters, transformations etc. applied to those 300 million records.
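As an illustration, a minimal upsert sketch (the table and key names are placeholders, not from the original post); a merge only rewrites the files that contain matched rows, which is why a small cluster can cope:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events")   # existing Delta table with ~300M rows
updates = spark.table("events_staging")        # incremental batch of new/changed rows

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")  # `id` is an assumed business key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```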
11-22-2021 02:28 AM
As Delta is transactional, if there are no concurrent writes to the table, pure Parquet, correctly partitioned, may be more efficient than Delta for storing that volume.
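A quick sketch of that pattern (the path and partition column are hypothetical, and `df` stands for the DataFrame produced by your pipeline):

```python
# Full overwrite as plain Parquet, partitioned so downstream reads can prune files.
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .parquet("abfss://container@account.dfs.core.windows.net/tables/events"))
```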
11-22-2021 03:46 AM
Thank you for your reply!!
We have tested with 700,000+ (7+ lakh) records using different write options — saveAsTable, Parquet and Delta — to persist the data in Azure Databricks (ADB), so that testers can use the data to run their scripts.
Time taken to write in each of the above formats:
saveAsTable -- 31.44 min
Parquet -- 38.78 min
Delta -- 33.66 min
Is there any way to write the processed dataframe to persistent storage so that testers can view it as a table and run SQL queries? (We are looking for something where testers can also modify the data, e.g. sample updates and inserts.)
Thank you
11-22-2021 04:47 AM
You can always write a DataFrame to persistent storage.
Just use df.write.parquet (or whatever format you choose).
You can create a table (or view) on top of the Parquet files too if necessary and run SQL queries on it (from Databricks notebooks or over an ODBC connection).
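A minimal sketch of both options (paths, table and column names are hypothetical, and `df` is your processed DataFrame):

```python
path = "abfss://container@account.dfs.core.windows.net/test/events"

# Option 1: plain Parquet plus an external table for read-only SQL access.
df.write.mode("overwrite").parquet(path)
spark.sql(f"CREATE TABLE IF NOT EXISTS test_events USING PARQUET LOCATION '{path}'")

# Option 2: a managed Delta table, which also lets testers run updates/inserts.
df.write.format("delta").mode("overwrite").saveAsTable("test_events_delta")
spark.sql("UPDATE test_events_delta SET status = 'checked' WHERE id = 1")  # hypothetical columns
```

Note that UPDATE statements are not supported on plain Parquet tables; if the testers need to modify rows, the Delta variant is the easier route.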
11-22-2021 07:48 PM
thank you!!!